Multithreading/Multiprocessing to parse single XML file? [duplicate]

【Question title】: Multithreading/Multiprocessing to parse single XML file? [duplicate]
【Posted】: 2017-09-12 13:43:23
【Question】:

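Can anyone tell me how to split the work across multiple threads to speed up parsing? For example, I have an XML file with 200k rows, and I would like to give each of 4 threads 50k rows and parse them with a SAX parser. What I have so far is 4 threads that each parse all 200k rows, which means 200k*4 = 800k duplicated results.

Any help is appreciated.

test.xml: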

<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>

My source code:

import threading
import time

from lxml import etree

def sax_parsing():

    t = threading.currentThread()

    for event, element in etree.iterparse("/home/xiang/Downloads/FYP/parallel-python/test.xml"):
        # Read the attributes of each <row> element
        if element.tag == 'row':
            print("Thread: %s" % t.getName())
            row_id = element.attrib.get('Id')
            row_post_id = element.attrib.get('PostId')
            row_vote_type_id = element.attrib.get('VoteTypeId')
            row_user_id = element.attrib.get('UserId')
            row_creation_date = element.attrib.get('CreationDate')
            print('ID: %s, PostId: %s, VoteTypeID: %s, UserId: %s, CreationDate: %s'% (row_id,row_post_id,row_vote_type_id,row_user_id,row_creation_date))
            element.clear()  

    return

if __name__ == "__main__":  

    start = time.time() #calculate execution time

    main_thread = threading.currentThread()
    no_threads = 4
    for i in range(no_threads):
        t = threading.Thread(target=sax_parsing)
        t.start()

    # Wait for every worker thread, not only the last one enumerated
    for t in threading.enumerate():
        if t is main_thread:
            continue
        t.join()

    end = time.time() #calculate execution time
    exec_time = end - start
    print('Execution time: %fs' % (exec_time))

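【Comments】:

Maybe try parsing first, then split the result and hand it to the threads.
When you run for event, element in etree.iterparse("/home/xiang/Downloads/FYP/parallel-python/test.xml"), you give every thread the same XML file to parse. Maybe split test.xml into 4 parts?
The preferred spelling is threading.current_thread(); currentThread() is an older camelCase alias, deprecated since Python 3.10.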
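The first two comments can be sketched roughly as follows: parse the file once, split the collected rows into equal chunks, and hand each chunk to a worker. This is only a sketch under some assumptions. It uses the standard library's xml.etree.ElementTree.iterparse instead of lxml so that it runs without extra packages, it reads an in-memory copy of test.xml, and read_rows, split_evenly, handle_chunk and process_in_threads are hypothetical names that do not come from the question.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# In-memory copy of the test.xml shown in the question.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>
"""


def read_rows(source):
    """Parse the document once and return each <row>'s attributes as a dict."""
    rows = []
    for _event, element in ET.iterparse(source):
        if element.tag == "row":
            rows.append(dict(element.attrib))  # copy before clearing
            element.clear()
    return rows


def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def handle_chunk(chunk):
    """Hypothetical per-row work; here it only collects the Ids as integers."""
    return [int(row["Id"]) for row in chunk]


def process_in_threads(source, workers=4):
    """Parse once, then let each worker handle its own chunk of rows."""
    chunks = split_evenly(read_rows(source), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_chunk, chunks))


if __name__ == "__main__":
    print(process_in_threads(io.BytesIO(SAMPLE_XML), workers=2))
```

With the four sample rows and two workers, each worker receives two rows and no row is handled twice. Note that in CPython the global interpreter lock generally keeps CPU-bound Python code from running in parallel across threads, so if the per-row work is heavy it may be worth trying a process pool in place of the thread pool.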

Tags: python multithreading sax python-multiprocessing python-multithreading

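【Solution 1】:

The simplest way is to have your parsing function take a start row and an end row, like this: def sax_parsing(start, end):

Then, when starting each thread: t = threading.Thread(target=sax_parsing, args=(i*50, (i+1)*50))

And change if element.tag == 'row': to if element.tag == 'row' and start <= int(element.attrib.get('Id')) < end: (the int() is needed because attribute values are strings).

That way each thread only handles the rows in the range it was given (I haven't actually tested this, so play around with it).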
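A minimal runnable sketch of the idea above, under a few assumptions: it uses the standard library's xml.etree.ElementTree.iterparse in place of lxml, it parses an in-memory copy of test.xml, and the source_bytes, out and first_id parameters as well as the parse_by_id_range helper are hypothetical additions (first_id exists because the sample Ids start at 1, not 0). Every thread still scans the whole document, so this removes the duplicated output but not the duplicated parsing work.

```python
import io
import threading
import xml.etree.ElementTree as ET

# In-memory copy of the test.xml shown in the question.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>
"""


def sax_parsing(source_bytes, start, end, out):
    """Scan the whole document, keeping only rows with start <= Id < end."""
    for _event, element in ET.iterparse(io.BytesIO(source_bytes)):
        if element.tag == "row":
            row_id = int(element.attrib["Id"])  # attribute values are strings
            if start <= row_id < end:
                out.append(row_id)
            element.clear()


def parse_by_id_range(source_bytes, no_threads, ids_per_thread, first_id=1):
    """Start one thread per Id range and return the Ids each thread kept."""
    results = [[] for _ in range(no_threads)]  # one private list per thread
    threads = []
    for i in range(no_threads):
        start = first_id + i * ids_per_thread
        end = first_id + (i + 1) * ids_per_thread
        t = threading.Thread(
            target=sax_parsing, args=(source_bytes, start, end, results[i])
        )
        t.start()
        threads.append(t)
    for t in threads:  # join every thread, not only the last one
        t.join()
    return results


if __name__ == "__main__":
    print(parse_by_id_range(SAMPLE_XML, no_threads=4, ids_per_thread=2))
```

Because the filter works on the Id attribute and not on the row position, gaps in the Ids (the sample has no Id 4) leave some threads with fewer rows than others, and the last range here ends up empty.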

【Comments】: