Parsing a very large bz2 XML file (element by element) in Java and Python: answers

【Question title】: Parsing very large bz2 xml file (element by element) in java and python
【Posted】: 2015-10-14 20:39:00
【Question】:

I have a 20 GB bz2-compressed XML file. The format looks like this:

<doc id="1" url="https://www.somepage.com" title="some page">
text text text ....
</doc>

I need to process it into a TSV file in this format:

id<tab>url<tab>title<tab>processed_texts

What is the most efficient way to do this in Python and in Java, and how do the two differ in memory efficiency and speed? Basically, I want to do this:

read bz2 file
read the xml file element by element
for each element
    retrieve id, url, title and text
    print_to_file(id<tab>url<tab>title<tab>process(text))

Thanks in advance for your answers.

UPDATE1 (based on @Andreas's suggestion):

XMLInputFactory factory = XMLInputFactory.newFactory();
XMLStreamReader xmlReader = factory.createXMLStreamReader(in);
xmlReader.nextTag();
if (! xmlReader.getLocalName().equals("doc")) {
    xmlReader.nextTag();
}

String id      = xmlReader.getAttributeValue(null, "id");
String url     = xmlReader.getAttributeValue(null, "url");
String title   = xmlReader.getAttributeValue(null, "title");
String content = xmlReader.getElementText();
out.println(id + '\t' + content);

The problem is that I only get the first element.

UPDATE2 (I ended up using regular expressions):

if (str.startsWith("<doc")) {
    id = str.split("id")[1].substring(2).split("\"")[0];
    url = str.split("url")[1].substring(2).split("\"")[0];
    title = str.split("title")[1].substring(2).split("\"")[0];
} else if (str.startsWith("</doc")) {
    out.println(uniq_id + '\t' + content);
    content = "";
} else {
    content = content + " " + str;
}

【Question comments】:

Tags: java python xml

【Solution 1】:

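Note: the answer below works for parsing very large BZ2-compressed XML documents, but the OP's XML file is not well-formed because it has no root element, i.e. it is an XML fragment.

The built-in StAX parser does not support XML fragments, but the Woodstox XML processor should, according to this answer: Parsing multiple XML fragments with STaX.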
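
If switching parsers is not an option, one possible workaround with the built-in StAX parser is to surround the fragment stream with a synthetic root element before parsing. The following is a minimal sketch, not something verified against the 20 GB file. The class name FragmentWrapper, the helpers wrapWithRoot and toTsvLines, and the <root> element name are all made up for illustration. It assumes the input carries no XML declaration and that each <doc> element is otherwise well-formed XML. In real use, the stream passed in would be the decompressed BZ2 stream rather than an in-memory string.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class FragmentWrapper {

    /** Surrounds a rootless stream of <doc> elements with a synthetic <root> element (hypothetical name). */
    static InputStream wrapWithRoot(InputStream fragments) {
        InputStream open  = new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8));
        // SequenceInputStream reads the three streams one after another without buffering the whole input
        return new SequenceInputStream(Collections.enumeration(Arrays.asList(open, fragments, close)));
    }

    /** Streams over the wrapped input and builds one TSV line per <doc> element. */
    static List<String> toTsvLines(InputStream fragments) throws XMLStreamException {
        List<String> lines = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(wrapWithRoot(fragments));
        try {
            reader.nextTag(); // positions on the synthetic <root>
            while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
                String id    = reader.getAttributeValue(null, "id");
                String url   = reader.getAttributeValue(null, "url");
                String title = reader.getAttributeValue(null, "title");
                // Collapse whitespace so embedded tabs and newlines cannot break the TSV row
                String text  = reader.getElementText().trim().replaceAll("\\s+", " ");
                lines.add(id + '\t' + url + '\t' + title + '\t' + text);
            }
        } finally {
            reader.close();
        }
        return lines;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Small in-memory stand-in for the decompressed file contents
        String sample = "<doc id=\"1\" url=\"https://www.somepage.com\" title=\"some page\">\n"
                + "text text text\n</doc>\n"
                + "<doc id=\"2\" url=\"https://www.otherpage.com\" title=\"other page\">\n"
                + "more text\n</doc>\n";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (String line : toTsvLines(in)) {
            System.out.println(line);
        }
    }
}
```

Collecting the lines in a list is only for the demonstration. For the real file, each line would be written to the output inside the loop so that memory use stays flat.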

Java answer

As answered in this question (Uncompress BZIP2 archive), you need Apache Commons Compress™ to read BZ2 files.

You would then use the built-in StAX parser:

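import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.PrintWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

File xmlFile = new File("input.xml");
File textFile = new File("output.txt");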
try (InputStream in = new BZip2CompressorInputStream(new FileInputStream(xmlFile));
     PrintWriter out = new PrintWriter(new FileWriter(textFile))) {

    XMLInputFactory factory = XMLInputFactory.newFactory();
    XMLStreamReader xmlReader = factory.createXMLStreamReader(in);
    try {
        xmlReader.nextTag(); // Read root element, ignore it
        if (xmlReader.getLocalName().equals("doc"))
            throw new IllegalArgumentException("Expected root element, found <doc>");
        while (xmlReader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            if (! xmlReader.getLocalName().equals("doc"))
                throw new IllegalArgumentException("Expected <doc>, found <" + xmlReader.getLocalName() + ">");
            String id      = xmlReader.getAttributeValue(null, "id");
            String url     = xmlReader.getAttributeValue(null, "url");
            String title   = xmlReader.getAttributeValue(null, "title");
            String content = xmlReader.getElementText();
            // process content value
            out.println(id + '\t' + url + '\t' + title + '\t' + content);
        }
    } finally {
        xmlReader.close();
    }
}

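Because the parser streams the file one element at a time, this is fast and keeps the memory footprint low.

【Comments】: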

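The solution gave me this error: Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT. I tried several variations based on your code, but I can only parse the first doc element.
It might be easier to process it as text using regular expressions.
Oops, the ! was missing from the if statement. Also, if your XML is not well-formed and lacks a root element, you will get that error.
Also, regular expressions are generally a very poor choice for parsing XML. The XML format is too complex for regular expressions.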
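
To illustrate the point about text-based extraction, here is a small sketch comparing the split-based approach from UPDATE2 with the built-in StAX parser on a single opening tag. The class name SplitVersusParser, the two helper names, and the sample URL and title are invented for this example. The input is chosen so that the URL happens to contain the word "title" and the title contains an escaped ampersand.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class SplitVersusParser {

    /** The split-based extraction from UPDATE2, applied to the title attribute. */
    static String titleBySplit(String line) {
        return line.split("title")[1].substring(2).split("\"")[0];
    }

    /** The same attribute read through the built-in StAX parser. */
    static String titleByStax(String line) throws XMLStreamException {
        // Append a closing tag so the single line forms a complete document
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(line + "</doc>"));
        try {
            reader.nextTag(); // positions on <doc>
            return reader.getAttributeValue(null, "title");
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String line = "<doc id=\"7\" url=\"https://example.com/title-page\" title=\"Q&amp;A\">";
        System.out.println(titleBySplit(line)); // prints "age": the split hit "title" inside the URL
        System.out.println(titleByStax(line));  // prints "Q&A": attribute located by name, entity decoded
    }
}
```

The split version fails without any error, which is the main risk with text-based extraction on a large file: bad rows only show up later in the output.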
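I don't have a root element. It threw this error: Expected root element, found <doc>. Removing that if and its error message gives me this error: Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT, which is related to the missing root.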