[Posted]: 2020-06-18 08:32:45
[Question]:
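- I have a log file that contains a mix of XML and JSON strings.
- I have to extract each of them to get the desired output.
- The xml_string contains elements whose content can span a line break.
- I first read the whole file, split it on Environment.NewLine to get log_file_lines, and then collect the matching lines into two IEnumerables, xml_lines and jason_lines.
- However, the problem arises when an XML record contains a line break: I end up with a truncated, malformed entry in xml_lines.
- Is there a way, with a regex or otherwise, to remove these line breaks from the log_file_string text itself before it is split into log_file_lines?
- Another option would be to use a regex to get the data between the <persons> and </persons> nodes directly from log_file_string, without looping, since the file is 3 MB :(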
log_file_string looks like this:
2020-06-10T10:58:07.0792762Z [data_type_jason] {"person_id":"101", "order_id":"123"}
2020-06-12T10:58:07.0792762Z [data_type_xml] <?xml version="1.0"?><persons><person id = "101"><name>"Thomas Edison"</name><age>"35"</age><phone>"7777777777"</phone><address>"62 Ross Road,
MARSHAM, NR10 6EA"</address><country>"England"</country></person></persons>
2020-06-13T10:58:07.0792762Z [data_type_jason] {"person_id":"102", "order_id":"140"}
2020-06-14T10:58:07.0792762Z [data_type_xml]<?xml version="1.0"?><persons><person id = "102"><name>"Louis Pasture"</name><age>"40"</age><phone>"99999999"</phone><address>"145 Thames Street, BOOSBECK, TS12 1AN"</address><country>"England"</country></person></persons>
Here is the full prototype:
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Xml.Linq;

namespace Test
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }//Form1

        private void Form1_Load(object sender, EventArgs e)
        {
            String file_folder = @"X:\VS 2019\C Sharp\Test";
            String file_path = Path.Combine(file_folder, "log_file.txt");
            process_log_file_data(file_folder, file_path);
        }//Form1_Load

        private void process_log_file_data(String file_folder, String file_path)
        {
            String log_file_string = read_all_lines_from_file(file_folder, file_path);
            String[] log_file_lines = log_file_string.Split(new String[] { Environment.NewLine }, StringSplitOptions.None);
            // Another option would be to use a regex to grab the data between the <persons> and </persons> nodes straight from log_file_string, but I am not regex savvy :(
            IEnumerable<String> xml_lines = from line in log_file_lines
                                            where line.Contains("data_type_xml")
                                            select line;
            IEnumerable<String> jason_lines = from line in log_file_lines
                                              where line.Contains("data_type_jason")
                                              select line;
            XDocument xml_document = new XDocument(new XDeclaration("1.0", "utf-8", "yes"), new XElement("xml_data"));
            foreach (var xml_line in xml_lines)
            {
                // Keep only the text after the last "[data_type_xml]" marker.
                String line = xml_line.Split(new String[] { "[data_type_xml]" }, StringSplitOptions.None).Last().Trim();
                // Here is the issue: because of the line break inside <address> in the log file, xml_line is cut off and only holds
                // 2020-06-12T10:58:07.0792762Z [data_type_xml] <?xml version="1.0"?><persons><person id = "101"><name>"Thomas Edison"</name><age>"35"</age><phone>"7777777777"</phone><address>"62 Ross Road,
                XDocument temp_xml_document = XDocument.Parse(line); // Throws: 'Unexpected end of file has occurred. The following elements are not closed: address, person, persons. Line 1, position 144.'
            }
            foreach (var jason_line in jason_lines)
            {
                //do something
            }
        }//process_log_file_data(String file_folder, String file_path)

        private String read_all_lines_from_file(String file_folder, String file_path)
        {
            FileInfo file_info = new FileInfo(file_path);
            if ((!file_info.Exists) || (file_info.Length == 0))
            {
                return String.Empty;
            }
            FileStream file_stream; StreamReader stream_reader; UTF8Encoding utf8_encoding; String file_text;
            file_stream = new FileStream(file_path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
            utf8_encoding = new UTF8Encoding(false);
            stream_reader = new StreamReader(file_stream, utf8_encoding);
            file_text = stream_reader.ReadToEnd();
            stream_reader.Close();
            file_stream.Close();
            return file_text;
        }//read_all_lines_from_file
    }//Form1 : Form
}//Test
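[Comments]:
- Could you show an example of what the original log file looks like?
- Hi Magnus. Thanks for looking into this. I have updated the question and added the log file text. Thanks again.
- If there is a line break inside the XML, does the continuation line still start with [data_type_xml]?
- Personally, I would process the lines in order. If a line does not start with datetime+type, just append it to the last line processed. Then do the actual interpretation in a second pass.
- Well, you do not need to read the whole file into memory anyway. You can do the JSON and XML work while reading the lines of text from the file.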
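The approach suggested in the comments (treat any line that does not start with a timestamp and type tag as a continuation of the previous record, and read the file line by line) could be sketched roughly as follows. This is only an illustration, not a tested answer: the class name LogRecordReader, the method ReadRecords and the regex are made up for this example, and the pattern assumes every record starts exactly like the sample lines in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class LogRecordReader
{
    // Assumption: a record starts with a UTC timestamp followed by "[data_type_...]",
    // as in the sample log. Anything else is treated as a continuation line.
    private static readonly Regex RecordStart = new Regex(
        @"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z\s+\[(?<type>data_type_\w+)\]\s*(?<payload>.*)$",
        RegexOptions.Compiled);

    // Yields (type, payload) pairs, merging wrapped lines back into one payload.
    public static IEnumerable<KeyValuePair<string, string>> ReadRecords(TextReader reader)
    {
        string type = null;
        StringBuilder payload = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Match match = RecordStart.Match(line);
            if (match.Success)
            {
                // A new record begins: emit the one collected so far, if any.
                if (type != null)
                {
                    yield return new KeyValuePair<string, string>(type, payload.ToString());
                }
                type = match.Groups["type"].Value;
                payload.Clear();
                payload.Append(match.Groups["payload"].Value);
            }
            else if (type != null)
            {
                // Continuation line: restore the line break that ReadLine removed.
                payload.Append('\n').Append(line);
            }
        }
        // Emit the last record once the input is exhausted.
        if (type != null)
        {
            yield return new KeyValuePair<string, string>(type, payload.ToString());
        }
    }
}
```

Passing a StreamReader (for example from File.OpenText(file_path)) means the 3 MB file never has to be held in memory as one string. Each payload whose type is data_type_xml should then be complete enough for XDocument.Parse, and the data_type_jason payloads can go to whatever JSON parser is in use.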
Tags: c# regex xml linq streamreader