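Parsing a txt file into multiple separate files depending on content

【Question title】: Parsing a txt file into multiple separate files depending on content
【Posted】: 2017-09-26 15:13:25
【Question description】:

I'm having trouble getting the last file created.

I have a tab-delimited text file that looks like this.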

KABEL   Provkanna for Windchill_NWF-TSNM    =2212.U001+++-X2    PXC.2400016             =2271.U004+++-X1    Test_Created_in_WT              =2212-W123  RXF 4x25    0000000440  Cable RXF 4x25
PART        01      1   1       
PART        02      2   2       
PART        03      3   3       
PART        04      4   4       
PART        SH      GND GND     
KABEL   Provkanna for Windchill_NWF-TSNM    =2212.U001+++-X2    PXC.2400016             =2271.U004+++-X1    Test_Created_in_WT              =2212-W124  RXF 4x35    0000000456  Cable RXF 4x35
PART        01  1   5   5       
PART        02  1   6   6       
PART        03  1   7   7       
PART        04  1   8   8       
PART        SH  1   GND GND     
KABEL   Provkanna for Windchill_NWF-TSNM    =2212.U001+++-X2    PXC.2400016             =2271.U004+++-X1    Test_Created_in_WT              =2212-W125  RXF 4x35    0000000456  Cable RXF 4x35
PART        01  1   9   9       
PART        02  1   10  10      
PART        03  1   11  11      
PART        04  1   12  12      
PART        SH  1   GND GND

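Basically, it is a line starting with the word KABEL, followed by a number of tab-separated columns. That line is followed by some lines starting with the word PART. The number of PART lines can vary.

Now I want to break this file up into several files.

Each parsed file should have a name that contains information from one of the columns of the line starting with KABEL. Every line starting with PART should be added to that file.

Then, when a line starting with KABEL appears again, a new file should be created and the following PART lines added to it... and so on.

After a lot of trial and error I finally found a way to get the first two files created correctly... but... the last file is never created.

My script reads, finds and displays the correct column, which should be the unique part of the last output file's name, but I don't see any file being written.

Any takers? I'm stuck, so I would really appreciate your help...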

{
    string line ="";
    string ColumnValue ="";
    string Starttext = "PART";
    string Kabeltext = "KABEL";
    int column = 16;     
    string FilenameWithoutCabelNumber = @"C:\Users\tsnm2171\Desktop\processed\LABB\OUTPUT - Provkanna for Windchill_NWF-TSNM_2212_CABLE_CONNECTION";
    string ExportfileIncCablenumber ="";
    string filecontent ="";

    using (System.IO.StreamReader reader = new System.IO.StreamReader(@"C:\Users\tsnm2171\Desktop\processed\LABB\Provkanna for Windchill_NWF-TSNM_2212_CABLE_CONNECTION.txt"))          
    {       
        line = reader.ReadLine();

        //Set columninnehåll till filnamn (String ColumnValue)   
        string [] words = line.Split();
        ColumnValue = words[column];

        MessageBox.Show (ColumnValue);

        while (line != null)                        
        {   
            line = reader.ReadLine();

            if (line.StartsWith(Kabeltext)) // if line starts with KABEL 
            {   
                ExportfileIncCablenumber =  (FilenameWithoutCabelNumber + "-" + ColumnValue + ".txt");
                System.IO.File.WriteAllText(ExportfileIncCablenumber, filecontent);

                filecontent = string.Empty;
                string [] words2 = line.Split();
                ColumnValue = words2[column];

                MessageBox.Show("Ny fil " + ColumnValue);
            }
            else if (line.StartsWith(Starttext)) // if line starts with PART
            {
                filecontent += ((line)+"\n");           //writes the active line                                
            }                   
        }
        ExportfileIncCablenumber =  (FilenameWithoutCabelNumber + "-" + ColumnValue + ".txt");
        System.IO.File.WriteAllText(ExportfileIncCablenumber, filecontent);                     filecontent = "";                                                                   
    }
}

Thanks in advance

Thomas

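【Question discussion】:

This is not a tab-delimited file. It is a file containing complex records. You need to write a parser that knows when each record starts and how to handle each line. You can't do that in a single loop. You should write functions/classes that recognize each type of line: for example, a line is a Header if it starts with KABEL and a PART if it starts with PART. After that it is much easier for each function to recognize its own fields; PART, for example, only has to check 3 fields.
By the way, there are tools such as ANTLR or FParsec that let you create parsers. Instead of writing a "recognizer" for each type of record, you use grammar rules.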
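
One possible reading of the record-oriented suggestion above, as a minimal sketch: every line is classified first, and PART lines are then grouped under the most recent KABEL header. The names LineKind, CableRecord and CableParser are hypothetical and do not come from the thread, and the sample lines in Main are shortened stand-ins for the real data.

```csharp
using System;
using System.Collections.Generic;

public enum LineKind { Header, Part, Other }

// One KABEL header line together with the PART lines that follow it.
public sealed class CableRecord
{
    public string Header { get; }
    public List<string> Parts { get; } = new List<string>();
    public CableRecord(string header) { Header = header; }
}

public static class CableParser
{
    // Decide what kind of line this is from its leading keyword.
    public static LineKind Classify(string line)
    {
        if (line.StartsWith("KABEL", StringComparison.Ordinal)) return LineKind.Header;
        if (line.StartsWith("PART", StringComparison.Ordinal)) return LineKind.Part;
        return LineKind.Other;
    }

    // Group the lines into records; a new record starts at every header line.
    public static List<CableRecord> Parse(IEnumerable<string> lines)
    {
        var records = new List<CableRecord>();
        CableRecord current = null;
        foreach (string line in lines)
        {
            switch (Classify(line))
            {
                case LineKind.Header:
                    current = new CableRecord(line);
                    records.Add(current);
                    break;
                case LineKind.Part:
                    // PART lines that appear before any KABEL line have no owner and are skipped.
                    current?.Parts.Add(line);
                    break;
            }
        }
        return records;
    }

    public static void Main()
    {
        // Shortened, made-up sample lines (assumption: the id sits in the second tab column here).
        var lines = new[] { "KABEL\t=2212-W123", "PART\t01", "PART\t02", "KABEL\t=2212-W124", "PART\t01" };
        foreach (CableRecord record in Parse(lines))
        {
            string id = record.Header.Split('\t')[1];
            Console.WriteLine(id + ": " + record.Parts.Count + " PART line(s)");
        }
    }
}
```

Because the last record is added to the list as soon as its header is seen, nothing special has to happen at the end of the file, which is exactly where the code in the question goes wrong.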

Tags: c# parsing writealltext

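【Solution 1】:

First of all, you should use the read-and-null-check pattern while ((line = reader.ReadLine()) != null), because it protects you from null references. In your loop, ReadLine() returns null at the end of the file and line.StartsWith(...) then throws, so the WriteAllText call after the loop is never reached and the last file is never written. Here is my version, which seems to work: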

{
        // Requires: using System; using System.IO; using System.Linq; using System.Windows.Forms;
        const string StartText                  = "PART";
        const string KabelText                  = "KABEL";
        const string FilenameWithoutCabelNumber = @"...\";

        string fileContent = "";
        int    fileNumber  = 0;

        using (StreamReader reader = File.OpenText(@"...\file.txt"))
        {
            // The first line is the first KABEL header; take the file name part from it
            string line = reader.ReadLine();
            string columnValue = GetParticularColumnName(line);
            MessageBox.Show(columnValue);

            var ExportfileIncCablenumber = "";
            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith(KabelText)) // if line starts with KABEL
                {
                    // A new header: write out the PART lines collected for the previous one
                    ExportfileIncCablenumber = $"{FilenameWithoutCabelNumber}-{columnValue}({fileNumber}).txt";

                    File.WriteAllText(ExportfileIncCablenumber, fileContent);

                    fileContent = string.Empty;
                    columnValue = GetParticularColumnName(line);
                    fileNumber++;
                }
                else if (line.StartsWith(StartText)) // if line starts with PART
                {
                    fileContent += line + Environment.NewLine;    // collects the active line
                }
            }

            // Write the last record, using the same naming scheme as inside the loop
            ExportfileIncCablenumber = $"{FilenameWithoutCabelNumber}-{columnValue}({fileNumber}).txt";
            File.WriteAllText(ExportfileIncCablenumber, fileContent);
        }
    }

    // Returns the last space-separated token of the line
    private static string GetParticularColumnName(string line)
    {
        return line.Split(' ').Last();
    }

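The problem you have when saving the files comes from a misunderstanding of how String.Split() works. See the docs for details, but in short:

If the separator parameter is null or contains no characters, the method treats white-space characters as the delimiters.

That is why you get an array containing both words and empty strings. column selects an empty string, which is why one file overwrites another. (The column value 16 is also wrong; there are actually 15 words.) All your lines end up joined together because Windows (Notepad, for example) does not treat a lone '\n' as an end-of-line sequence, which is why I used Environment.NewLine. Last but not least is your code style. Really, you should follow the common .NET coding conventions, because that makes your code consistent and more readable.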
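
To make the Split() behaviour concrete, here is a small sketch comparing the argument-less call with an explicit tab separator. The sample line is a shortened, made-up KABEL line (an assumption, not the real data), and SplitDemo is a hypothetical name.

```csharp
using System;

public static class SplitDemo
{
    // What line.Split() does with no arguments: every white-space character is a separator.
    public static string[] SplitOnWhitespace(string line) => line.Split();

    // Split on tabs only; empty columns are kept, so indexes match the file's columns.
    public static string[] SplitOnTabs(string line) => line.Split('\t');

    public static void Main()
    {
        // Two empty columns sit between "PXC" and "=2212-W123"; some fields contain spaces.
        string line = "KABEL\tProvkanna for Windchill\tPXC\t\t\t=2212-W123\tRXF 4x25";

        string[] byWhitespace = SplitOnWhitespace(line);
        string[] byTab = SplitOnTabs(line);

        Console.WriteLine(byWhitespace.Length); // 10: fields with spaces are broken up
        Console.WriteLine(byTab.Length);        // 7: one entry per tab-separated column
        Console.WriteLine(byTab[5]);            // =2212-W123
    }
}
```

With the white-space split, both the spaces inside a field and the empty columns shift the indexes, so a fixed column number can easily land on an empty string.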

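【Discussion】:

Coding conventions won't help the OP write a parser.
I think it is reasonable to point this out while the OP is still learning, before he really picks up bad habits and spreads them through his mature code. If you find any flaw in my explanation, please share it. I think it is obvious that the OP should look at what the code does and check the documentation when unsure of the result.
Then I suggest you look into parsers while you are learning. You will realize that this answer is not an answer. Experienced developers don't read one line at a time either; they use ReadLines, which returns IEnumerable<string>. This answer focuses on trivial mistakes rather than the actual, more challenging problem.
PS: people also rarely use Environment.NewLine, whatever the MSDN docs said 15 years ago. Who says your environment's newline is acceptable to the consumer of the file? Besides, the IO classes and methods handle a single \n just fine. Experienced developers don't append or split strings either, since that generates unnecessary temporary strings. They use StreamWriter or StringBuilder and write lines when necessary. They use regular expressions to parse lines and avoid wasting CPU and RAM on multiple splits.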
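
A rough sketch of what the last two comments describe: lines are streamed with File.ReadLines and each PART line is written straight to a writer instead of being appended to a string. CableSplitter, the openWriter callback and the idColumn parameter are hypothetical names introduced here; which tab column actually holds the cable id is an assumption the reader has to check against the real file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CableSplitter
{
    // Streams the lines once; opens a new writer at every KABEL line and
    // copies the following PART lines into it. Returns the number of writers opened.
    public static int Split(IEnumerable<string> lines, int idColumn, Func<string, TextWriter> openWriter)
    {
        int files = 0;
        TextWriter writer = null;
        try
        {
            foreach (string line in lines)
            {
                if (line.StartsWith("KABEL", StringComparison.Ordinal))
                {
                    writer?.Dispose();   // finish the previous cable's output
                    writer = null;
                    string[] columns = line.Split('\t');
                    // Fall back to a placeholder if the line has fewer columns than expected.
                    string id = idColumn < columns.Length ? columns[idColumn] : "unknown";
                    writer = openWriter(id);
                    files++;
                }
                else if (writer != null && line.StartsWith("PART", StringComparison.Ordinal))
                {
                    writer.WriteLine(line);
                }
            }
        }
        finally
        {
            writer?.Dispose();           // also closes the output of the last cable
        }
        return files;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 3)
        {
            Console.WriteLine("usage: <input.txt> <outputDir> <idColumn>");
            return;
        }
        string outputDir = args[1];
        int count = Split(File.ReadLines(args[0]), int.Parse(args[2]),
            id => new StreamWriter(Path.Combine(outputDir, id + ".txt")));
        Console.WriteLine(count + " file(s) written");
    }
}
```

Passing the writer factory in as a callback keeps the splitting logic independent of the file system. Two KABEL lines with the same id would still overwrite each other's file, so a real version would need a unique name per record. WriteLine uses the writer's NewLine property, which can be set explicitly if the consumer of the files expects a particular line ending.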