Select text between key words: answers

【Question title】: Select text between key words
【Posted】: 2019-10-25 15:09:02
【Question description】:

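This is a follow-up to "Select block of text and merge into new document".

I have an SGM document, and I have added comments to my .sgm file. I need to extract the strings between the start/stop comments so that I can put them in a temporary file for modification. Right now it selects everything, including the start/stop comments themselves and the data outside the start/stop comments.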

Dim DirFolder As String = txtDirectory.Text
Dim Directory As New IO.DirectoryInfo(DirFolder)
Dim allFiles As IO.FileInfo() = Directory.GetFiles("*.sgm")
Dim singleFile As IO.FileInfo
Dim Prefix As String
Dim newMasterFilePath As String
Dim masterFileName As String
Dim newMasterFileName As String
Dim startMark As String = "<!--#start#-->"
Dim stopMark As String = "<!--#stop#-->"
searchDir = txtDirectory.Text
Prefix = txtBxUnique.Text

For Each singleFile In allFiles
    If File.Exists(singleFile.FullName) Then
        Dim fileName = singleFile.FullName
        Debug.Print("file name : " & fileName)
        ' A backup first    
        Dim backup As String = fileName & ".bak"
        File.Copy(fileName, backup, True)

        ' Load lines from the source file in memory
        Dim lines() As String = File.ReadAllLines(backup)

        ' Now re-create the source file and start writing lines inside a block
        ' Evaluate all the lines in the file.
        ' Set insideBlock to false
        Dim insideBlock As Boolean = False
        Using sw As StreamWriter = File.CreateText(backup)
            For Each line As String In lines
                If line = startMark Then
                    ' start writing at the line below
                    insideBlock = True
                    ' Evaluate if the next line is <!Stop>
                ElseIf line = stopMark Then
                    ' Stop writing
                    insideBlock = False
                ElseIf insideBlock = True Then
                    ' Write the current line in the block
                    sw.WriteLine(line)
                End If
            Next
        End Using
    End If

Next

Here is sample text to test with.

<chapter id="Chapter_Overview"> <?Pub Lcl _divid="500" _parentid="0"> 
<title>Learning how to gather data</title>
<!--#start#-->
<section>
<title>ALTERNATE MISSION EQUIPMENT</title>
<para0 verdate="18 Jan 2019" verstatus="ver">
<title>
<applicabil applicref="xxx">
</applicabil>Three-Button Trackball Mouse</title>
<para>This is the example to grab all text between start and stop comments. 
</para></para0>
</section>
<!--#stop#-->

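Notes: the start and stop comments are always on their own line, and a document can have multiple start/stop sections.

I thought maybe a regular expression could be used: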

(<section>[\w+\w]+.*?<\/section>)\R(<\?Pub _gtinsert.*>\R<pgbrk pgnum.*?>\R<\?Pub /_gtinsert>)*

or maybe IndexOf and LastIndexOf, but I could not get it to work.
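The IndexOf idea mentioned above can be sketched as follows. This is only a minimal illustration, not code from the thread: the module name `MarkerExtraction` and the function `ExtractBlocks` are hypothetical, and it assumes the whole file has already been read into one string (for example with File.ReadAllText). It pairs each start marker with the next stop marker, so a document with several start/stop sections yields several blocks.

```vbnet
Imports System
Imports System.Collections.Generic

Module MarkerExtraction
    ' Hypothetical helper: returns the text of every start/stop block, in order.
    Function ExtractBlocks(text As String, startMark As String, stopMark As String) As List(Of String)
        Dim blocks As New List(Of String)
        Dim searchFrom As Integer = 0
        Do
            ' Find the next start marker at or after the current position.
            Dim startPos As Integer = text.IndexOf(startMark, searchFrom, StringComparison.Ordinal)
            If startPos < 0 Then Exit Do
            Dim contentStart As Integer = startPos + startMark.Length
            ' Find the stop marker that closes this block.
            Dim stopPos As Integer = text.IndexOf(stopMark, contentStart, StringComparison.Ordinal)
            If stopPos < 0 Then Exit Do ' unmatched start marker: the tail is ignored
            ' Trim removes the line breaks that surround the block content.
            blocks.Add(text.Substring(contentStart, stopPos - contentStart).Trim())
            searchFrom = stopPos + stopMark.Length
        Loop
        Return blocks
    End Function

    Sub Main()
        Dim nl As String = Environment.NewLine
        Dim sample As String = "<title>Intro</title>" & nl &
            "<!--#start#-->" & nl & "<section>A</section>" & nl & "<!--#stop#-->" & nl &
            "<para>skipped</para>" & nl &
            "<!--#start#-->" & nl & "<section>B</section>" & nl & "<!--#stop#-->"
        ' Prints <section>A</section> and then <section>B</section>.
        For Each block As String In ExtractBlocks(sample, "<!--#start#-->", "<!--#stop#-->")
            Console.WriteLine(block)
        Next
    End Sub
End Module
```

IndexOf with a moving start position is used here rather than LastIndexOf, because LastIndexOf would jump to the final stop marker and merge all sections into one.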

【Question comments】:

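If it selects everything, maybe your test line = startMark never matches. Have you carefully compared "line" to see whether it contains any non-printing characters? Or try line.StartsWith(startMark)?
It is the only text in the line. Each marker actually sits on its own line. If I could get from the position of the end of <!--#start#--> to the beginning of <!--#stop#-->, that is all I need.
If this is an SGML document (it looks like one), you could use an existing SGML/XML/HTML parser such as SgmlReader or HtmlAgilityPack, both of which are available as NuGet packages. The former is available as a NuGet package in version 1.8.11, while the GitHub repository has v1.8.12.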
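The first comment's suggestion (stray non-printing characters making line = startMark fail) can be tested with a more tolerant comparison. The sketch below is an assumption about what such a check might look like, not code from the thread: `MarkerLines`, `IsMarkerLine` and `LinesInsideBlocks` are hypothetical names. It trims whitespace and a stray byte-order mark (U+FEFF) before comparing, then reuses the same inside/outside state machine as the question's loop.

```vbnet
Imports System
Imports System.Collections.Generic

Module MarkerLines
    ' Hypothetical helper: True when the line equals the marker once surrounding
    ' whitespace and a stray byte-order mark (U+FEFF) are removed.
    Function IsMarkerLine(line As String, mark As String) As Boolean
        Dim bom As Char = Convert.ToChar(&HFEFF)
        Dim cleaned As String = line.Trim().Trim(bom).Trim()
        Return String.Equals(cleaned, mark, StringComparison.Ordinal)
    End Function

    ' Same state machine as the question's loop, using the tolerant comparison.
    ' Marker lines themselves are never kept.
    Function LinesInsideBlocks(lines As IEnumerable(Of String),
                               startMark As String,
                               stopMark As String) As List(Of String)
        Dim kept As New List(Of String)
        Dim insideBlock As Boolean = False
        For Each line As String In lines
            If IsMarkerLine(line, startMark) Then
                insideBlock = True
            ElseIf IsMarkerLine(line, stopMark) Then
                insideBlock = False
            ElseIf insideBlock Then
                kept.Add(line)
            End If
        Next
        Return kept
    End Function
End Module
```

With the question's variables, a call such as LinesInsideBlocks(File.ReadAllLines(backup), startMark, stopMark) would return the lines to write out.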

Tags: vb.net text-extraction

【Solution 1】:

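You can read the whole file and split it with the string array {"<!--#start#-->", "<!--#stop#-->"}, which produces an array like this:

Element 0: the text before "<!--#start#-->"
Element 1: the text between "<!--#start#-->" and "<!--#stop#-->"
Element 2: the text after "<!--#stop#-->"

Then take element 1 and write it to the backup.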

' StringSplitOptions.None keeps empty entries, so element 1 is always the text between the markers
Dim text = File.ReadAllText(backup).Split({startMark, stopMark}, StringSplitOptions.None)(1)
Using sw As StreamWriter = File.CreateText(backup)
    sw.Write(text)
End Using

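Edit to address the comments

I did make the original code a bit compact. It can be expanded to the following, which lets you add some validation.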

Dim text = File.ReadAllText(backup)
' StringSplitOptions.None keeps empty entries, so one start/stop pair always yields exactly 3 elements
Dim split = text.Split({startMark, stopMark}, StringSplitOptions.None)
If split.Length <> 3 Then Throw New Exception("File didn't contain one or more delimiters.")
text = split(1)
Using sw As StreamWriter = File.CreateText(backup)
    sw.Write(text)
End Using
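The Split approach above assumes exactly one start/stop pair, while the question notes that a document can contain several. One possible way to handle that case is a regular expression built from the two markers. This is a sketch under stated assumptions rather than part of the original answer: `RegexExtraction` and `ExtractWithRegex` are hypothetical names, and the file is assumed to have been read into a single string.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexExtraction
    ' Hypothetical helper: returns the content of every start/stop block.
    Function ExtractWithRegex(text As String, startMark As String, stopMark As String) As List(Of String)
        ' Regex.Escape makes the markers match literally (for example the "#").
        ' "(.*?)" is lazy, so each match ends at the nearest stop marker, and
        ' RegexOptions.Singleline lets "." match line breaks inside a block.
        Dim pattern As String = Regex.Escape(startMark) & "(.*?)" & Regex.Escape(stopMark)
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(text, pattern, RegexOptions.Singleline)
            ' Group 1 is the text between the markers; Trim drops surrounding line breaks.
            results.Add(m.Groups(1).Value.Trim())
        Next
        Return results
    End Function
End Module
```

The returned list could then be joined with Environment.NewLine and written to the backup file in place of split(1).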

【Comments】:

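I am getting System.IndexOutOfRangeException: 'Index was outside the bounds of the array.' I replaced the code from Dim insideBlock As Boolean through End Using and pasted your code there. Also, what code do I use to extract the element and write it to my backup?
An index-out-of-bounds error when indexing with 1 means the delimiter strings were not found, because only one element was returned (in that case the highest usable index is 0, right?). You can check whether the delimiter strings exist before parsing.
@MaxineHammett Please see my edit. Test with the expanded code. If it throws that exception, break on Throw New Exception ... and check whether text contains both delimiters.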
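The check suggested in the comments (confirm the delimiters exist before parsing) might look like the sketch below. It is an illustration only: `MarkerValidation`, `CountOccurrences` and `DescribeMarkerProblem` are hypothetical names, and the markers are assumed to be non-empty strings. It reports which marker is missing, which is easier to act on than an IndexOutOfRangeException.

```vbnet
Imports System

Module MarkerValidation
    ' Hypothetical helper: counts non-overlapping occurrences of a marker.
    ' Assumes mark is not empty (an empty mark would never advance the search).
    Function CountOccurrences(text As String, mark As String) As Integer
        Dim count As Integer = 0
        Dim pos As Integer = text.IndexOf(mark, StringComparison.Ordinal)
        While pos >= 0
            count += 1
            pos = text.IndexOf(mark, pos + mark.Length, StringComparison.Ordinal)
        End While
        Return count
    End Function

    ' Returns an empty string when both markers are present in equal numbers,
    ' otherwise a short description of what is wrong.
    Function DescribeMarkerProblem(text As String, startMark As String, stopMark As String) As String
        Dim starts As Integer = CountOccurrences(text, startMark)
        Dim stops As Integer = CountOccurrences(text, stopMark)
        If starts = 0 Then Return "No start marker found."
        If stops = 0 Then Return "No stop marker found."
        If starts <> stops Then
            Return String.Format("Unbalanced markers: {0} start, {1} stop.", starts, stops)
        End If
        Return String.Empty
    End Function
End Module
```

Calling DescribeMarkerProblem on the file text before the Split and skipping (or logging) the file when the result is not empty would avoid indexing into an array that is too short.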