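【Question title】: Multithreading/Multiprocessing to parse single XML file? [duplicate]
【Posted】: 2017-09-12 13:43:23
【Question】:

Can anyone tell me how to distribute the work across multiple threads to speed up parsing? For example, I have an XML file with 200k lines, and I would like to give each of 4 threads 50k lines to parse with a SAX parser. What I have so far is 4 threads that each parse all 200k lines, which means 200k*4 = 800k duplicated results.

Any help is appreciated.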

test.xml:

<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>

My source code:

from lxml import etree
import threading
import time

def sax_parsing():

    t = threading.currentThread()

    for event, element in etree.iterparse("/home/xiang/Downloads/FYP/parallel-python/test.xml"):
        # Read the attributes of each <row> element
        if element.tag == 'row':
            print("Thread: %s" % t.getName())
            row_id = element.attrib.get('Id')
            row_post_id = element.attrib.get('PostId')
            row_vote_type_id = element.attrib.get('VoteTypeId')
            row_user_id = element.attrib.get('UserId')
            row_creation_date = element.attrib.get('CreationDate')
            print('ID: %s, PostId: %s, VoteTypeID: %s, UserId: %s, CreationDate: %s'% (row_id,row_post_id,row_vote_type_id,row_user_id,row_creation_date))
            element.clear()  

    return

if __name__ == "__main__":  

    start = time.time() #calculate execution time

    main_thread = threading.currentThread()
    no_threads = 4
    for i in range(no_threads):
        t = threading.Thread(target=sax_parsing)
        t.start()

    # Wait for every worker thread to finish before stopping the timer
    for t in threading.enumerate():
        if t is main_thread:
            continue
        t.join()

    end = time.time() #calculate execution time
    exec_time = end - start
    print('Execution time: %fs' % (exec_time))

【Comments】:

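  • Maybe try parsing first, then split the rows and hand them to threads.
  • When you run for event, element in etree.iterparse("/home/xiang/Downloads/FYP/parallel-python/test.xml"), you are giving every thread the same XML file to parse. Maybe split test.xml into 4 parts?
  • Prefer threading.current_thread(); currentThread() is only a legacy camelCase alias for it (deprecated since Python 3.10).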
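
The first suggestion above (parse once, then split the rows among workers) could be sketched roughly as follows. This is only a minimal sketch under some assumptions: it uses the standard library's xml.etree.ElementTree.iterparse in place of lxml so that it runs without extra packages, it reads the sample rows from an in-memory bytes string instead of the file path, and read_rows, handle_row and process are hypothetical names that do not appear in the question.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Sample data copied from test.xml in the question.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>
"""


def read_rows(xml_bytes):
    """Parse the document once and return each row's attributes as a dict."""
    rows = []
    for _event, element in ET.iterparse(io.BytesIO(xml_bytes)):
        if element.tag == "row":
            rows.append(dict(element.attrib))
            element.clear()  # free the element once its data is copied
    return rows


def handle_row(row):
    """Hypothetical per-row work: convert the fields of interest."""
    return (int(row["Id"]), int(row["PostId"]), row.get("UserId"))


def process(xml_bytes, workers=4):
    """Read the rows in one pass, then spread the per-row work over a pool."""
    rows = read_rows(xml_bytes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input rows.
        return list(pool.map(handle_row, rows))


if __name__ == "__main__":
    for result in process(SAMPLE_XML):
        print(result)
```

Because the file is read only once, each row is produced exactly once rather than four times. Whether threads actually shorten the run depends on what handle_row does: pure-Python CPU work is largely serialised by CPython's global interpreter lock, so for heavy per-row computation a ProcessPoolExecutor (which offers the same map interface) may be the better fit.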

Tags: python multithreading sax python-multiprocessing python-multithreading


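【Answer 1】:

The simplest way is to have your parsing function accept a start row and an end row, like this: def sax_parsing(start, end):

Then, when starting the threads: t = threading.Thread(target=sax_parsing, args=(i*50, (i+1)*50)) (here 50 is the number of rows per thread; note the parentheses around i+1)

And change if element.tag == 'row': to if element.tag == 'row' and start <= int(element.attrib.get('Id')) < end: (attribute values are strings, so Id has to be converted with int() before it is compared)

That way each thread only handles the rows in the range it was given. (I have not actually tested this, so play around with it.)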
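
Put together, the idea in this answer might look like the sketch below. The following are assumptions that do not come from the answer itself: the standard library's xml.etree.ElementTree stands in for lxml, the sample rows are read from memory rather than from the file, matching Ids are collected in a shared list guarded by a lock instead of being printed, and each range is shifted by one because the sample Ids start at 1. parse_in_ranges and chunk are hypothetical names.

```python
import io
import threading
import xml.etree.ElementTree as ET

# Sample data copied from test.xml in the question.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>
"""


def sax_parsing(xml_bytes, start, end, results, lock):
    """Record the Id of every row whose Id lies in [start, end)."""
    for _event, element in ET.iterparse(io.BytesIO(xml_bytes)):
        if element.tag == "row":
            row_id = int(element.attrib["Id"])  # attribute values are strings
            if start <= row_id < end:
                with lock:  # the results list is shared by all threads
                    results.append(row_id)
            element.clear()


def parse_in_ranges(xml_bytes, no_threads, chunk):
    """Start no_threads workers, each responsible for `chunk` consecutive Ids."""
    results = []
    lock = threading.Lock()
    threads = []
    for i in range(no_threads):
        # Assumption: Ids start at 1, so worker i covers [i*chunk + 1, (i+1)*chunk + 1).
        start, end = i * chunk + 1, (i + 1) * chunk + 1
        t = threading.Thread(target=sax_parsing,
                             args=(xml_bytes, start, end, results, lock))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()  # wait for every worker, not just the last one
    return sorted(results)


if __name__ == "__main__":
    print(parse_in_ranges(SAMPLE_XML, no_threads=4, chunk=2))
```

With the four sample rows, parse_in_ranges(SAMPLE_XML, no_threads=4, chunk=2) collects each Id exactly once. Every thread still reads and parses the whole document, though; only the per-row handling is divided, so this removes the duplicated output but probably will not make the parsing itself faster.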

【Comments】:
