Extracting word count from an XML file: answers

【Question title】: Extracting word count From an XML File
【Posted】: 2012-10-18 13:19:00
【Question】:

(This question is related to a previous question I posted on Stack Overflow: "Extracting Values From an XML File Either using XPath, SAX or DOM for this Specific Scenario".)

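With the scenario above in mind, my question is: what if I want the words that each participant wrote across all sentences, rather than the sentences themselves? For example, suppose the word "budget" is used 10 times in total: 7 times by participant "Dolske" and 3 times by others. So I need a list of all words together with how many times each participant wrote each one. I also need the list of words per turn.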

What is the best strategy to achieve this? Is there any sample code?

The XML is attached here (you can also see it in the referenced question):

"(495584) Firefox - search suggestions passes wrong previous result to form history"

<Turn>
  <Date>'2009-06-14 18:55:25'</Date>
  <From>'Justin Dolske'</From>
  <Text>
    <Sentence ID = "3.1"> Created an attachment (id=383211) [details] Patch v.2</Sentence>
    <Sentence ID = "3.2"> Ah. So, there's a ._formHistoryResult in the....</Sentence>
    <Sentence ID = "3.3"> The simple fix it to just discard the service's form history result.</Sentence>
    <Sentence ID = "3.4"> Otherwise it's trying to use a old form history result that no longer applies for the search string.</Sentence>
  </Text>
</Turn>

<Turn>
  <Date>'2009-06-19 12:07:34'</Date>
  <From>'Gavin Sharp'</From>
  <Text>
    <Sentence ID = "4.1"> (From update of attachment 383211 [details])</Sentence>
    <Sentence ID = "4.2"> Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
  </Text>
</Turn>

<Turn>
  <Date>'2009-06-19 13:17:56'</Date>
  <From>'Justin Dolske'</From>
  <Text>
    <Sentence ID = "5.1"> (In reply to comment #3)</Sentence>
    <Sentence ID = "5.2"> &amp;gt; (From update of attachment 383211 [details] [details])</Sentence> 
    <Sentence ID = "5.3"> &amp;gt; Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
    <Sentence ID = "5.4"> Good point.</Sentence>
    <Sentence ID = "5.5"> I renamed the one in the wrapper to _formHistResult. </Sentence>
    <Sentence ID = "5.6"> fhResult seemed maybe a bit too short.</Sentence>
  </Text>
</Turn>

..... and so on

Any help would be much appreciated...

【Comments】:

Tags: java dom jaxb sax

【Solution 1】:

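To get all the values, a StAX parser is the better choice; it is well suited to this kind of task. Then split all the sentences into words and do whatever you need with them. For example, create a model with a Turn class that stores the author and the sentences, write a service for that class, and go from there. :)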
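
The StAX step could look roughly like the sketch below. It is only an illustration, not code from the original answer: the Turn and TurnParser classes and their field names are made up for the example, and it assumes the <Turn> elements are wrapped in a single root element (called <Turns> here, a hypothetical name), since a well-formed XML document needs exactly one root.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical model class: one <Turn> with its author and its sentences.
class Turn {
    String author = "";
    final List<String> sentences = new ArrayList<>();
}

class TurnParser {
    // Reads every <Turn> in the XML and collects its <From> and <Sentence> text.
    static List<Turn> parse(String xml) {
        List<Turn> turns = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            Turn current = null;
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if (name.equals("Turn")) {
                    current = new Turn();
                    turns.add(current);
                } else if (current != null && name.equals("From")) {
                    // Drop surrounding whitespace and the single quotes around the name.
                    current.author = reader.getElementText().trim()
                            .replaceAll("^'|'$", "");
                } else if (current != null && name.equals("Sentence")) {
                    current.sentences.add(reader.getElementText().trim());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return turns;
    }

    public static void main(String[] args) {
        // Assumed wrapper element <Turns>; the sample in the question shows only the turns.
        String xml = "<Turns>"
                + "<Turn><From>'Justin Dolske'</From><Text>"
                + "<Sentence ID=\"5.4\"> Good point.</Sentence>"
                + "</Text></Turn>"
                + "</Turns>";
        for (Turn t : parse(xml)) {
            System.out.println(t.author + ": " + t.sentences);
        }
    }
}
```

For a real file you would pass a FileInputStream or Reader to createXMLStreamReader instead of a StringReader.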

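To split a sentence into words, use split() or StringTokenizer, though StringTokenizer is a legacy class whose use is discouraged in new code. To use split, create a temporary array, for example

String[] stringArray = sentence.toString().split(" ");

or something like sentence.getValue() instead of toString().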

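The method argument is where the regular expression goes. In your case it is a single space, because that is what separates the words in a sentence. Then you can loop over the words and count whatever you need.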

If you have an ArrayList, use List.toArray() to get your list as an array.
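
The splitting and counting described above might be sketched as follows. This is an illustration under stated assumptions rather than the answerer's code: WordCounter and addSentences are hypothetical names, words are lower-cased and stripped of leading and trailing punctuation (a choice the question does not specify), and the split uses "\\s+" instead of a single space so that repeated spaces do not produce empty tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCounter {
    // Adds the words of each sentence to counts: word -> (author -> occurrences).
    static void addSentences(Map<String, Map<String, Integer>> counts,
                             String author, List<String> sentences) {
        for (String sentence : sentences) {
            // Split on runs of whitespace.
            for (String token : sentence.trim().split("\\s+")) {
                // Normalize: lower-case, then trim punctuation at both ends.
                String word = token.toLowerCase(Locale.ROOT)
                        .replaceAll("^\\W+|\\W+$", "");
                if (word.isEmpty()) {
                    continue; // the token was empty or punctuation only
                }
                counts.computeIfAbsent(word, k -> new TreeMap<>())
                        .merge(author, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        addSentences(counts, "Justin Dolske", Arrays.asList(
                "Good point.",
                "fhResult seemed maybe a bit too short."));
        addSentences(counts, "Gavin Sharp", Arrays.asList(
                "Perhaps we should rename one of them to _fhResult just to reduce confusion?"));
        // One line per word, followed by the per-author counts.
        counts.forEach((word, byAuthor) -> System.out.println(word + " " + byAuthor));
    }
}
```

Because the method takes a List directly, an existing ArrayList of sentences can be passed in as is. For a per-turn word list, call addSentences with a fresh map for each turn.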

【Comments】:

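I already have the list of sentences for each participant, ArrayList sentenceList. Is there a way to get all the words from each sentence? Is it difficult? I just want to avoid writing the code again...
My friend, there are many ways to split text into words. I will edit my answer.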