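【Posted】: 2017-09-12 13:43:23
【Question】:
Can anyone tell me how to divide the work among multiple threads to speed up parsing? For example, I have an XML file with 200k lines, and I would like to give each of 4 threads 50k lines to parse with a SAX parser. What I have so far is 4 threads that each parse all 200k lines, which produces 200k*4 = 800k duplicated results.
Any help is appreciated.
test.xml: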
<?xml version="1.0" encoding="utf-8"?>
<votes>
<row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>
My source code:
import json
import xmltodict
from lxml import etree
import xml.etree.ElementTree as ElementTree
import threading
import time


def sax_parsing():
    t = threading.currentThread()
    for event, element in etree.iterparse("/home/xiang/Downloads/FYP/parallel-python/test.xml"):
        # read the attributes of each <row> element
        if element.tag == 'row':
            print("Thread: %s" % t.getName())
            row_id = element.attrib.get('Id')
            row_post_id = element.attrib.get('PostId')
            row_vote_type_id = element.attrib.get('VoteTypeId')
            row_user_id = element.attrib.get('UserId')
            row_creation_date = element.attrib.get('CreationDate')
            print('ID: %s, PostId: %s, VoteTypeID: %s, UserId: %s, CreationDate: %s' % (row_id, row_post_id, row_vote_type_id, row_user_id, row_creation_date))
        element.clear()
    return


if __name__ == "__main__":
    start = time.time()  # start of the timed section
    main_thread = threading.currentThread()
    no_threads = 4
    # every thread runs sax_parsing over the whole file
    for i in range(no_threads):
        t = threading.Thread(target=sax_parsing)
        t.start()
    # wait for all worker threads to finish
    for t in threading.enumerate():
        if t is main_thread:
            continue
        t.join()
    end = time.time()  # end of the timed section
    exec_time = end - start
    print('Execution time: %fs' % (exec_time))
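【Comments】:
-
Maybe parse the file first, then split the result and hand the pieces to the threads.
-
When you run for event, element in etree.iterparse("/home/xiang/Downloads/FYP/parallel-python/test.xml"), every thread is given the same XML file to parse. Maybe split test.xml into 4 parts?
-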
In the threading module, currentThread() is only the older camelCase alias. The preferred name is current_thread().
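The "parse first, then split" suggestion from the comments could look roughly like the sketch below. It is only an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree.iterparse instead of lxml so that it runs without extra packages, it reads an in-memory copy of test.xml rather than the path from the question, and the helper names (read_rows, split_into_chunks, handle_chunk, process_in_threads) are made up for this example. The file is parsed exactly once, so no row is reported more than once.

```python
import io
import threading
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# In-memory copy of test.xml from the question (assumption: a real run
# would pass the file path instead).
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
<row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>
"""


def read_rows(source):
    """Parse the document once and return one attribute dict per <row>."""
    rows = []
    for _event, element in ET.iterparse(source):
        if element.tag == "row":
            rows.append(dict(element.attrib))  # copy before clearing
        element.clear()  # free the element once it has been handled
    return rows


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def handle_chunk(chunk):
    """Hypothetical per-row work: tag each row with the worker thread's name."""
    name = threading.current_thread().name
    return [(name, row.get("Id"), row.get("UserId")) for row in chunk]


def process_in_threads(source, n_threads=4):
    """Parse once, then let each thread handle its own slice of the rows."""
    chunks = split_into_chunks(read_rows(source), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(handle_chunk, chunks))  # keeps chunk order
    return [item for part in parts for item in part]


if __name__ == "__main__":
    for thread_name, row_id, user_id in process_in_threads(io.BytesIO(SAMPLE_XML)):
        print("Thread: %s, ID: %s, UserId: %s" % (thread_name, row_id, user_id))
```

Note that in CPython the global interpreter lock keeps threads from running Python bytecode in parallel, so this layout mainly removes the duplicated output. If the per-row work is CPU-bound, concurrent.futures.ProcessPoolExecutor offers the same map interface and may be worth trying instead.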
Tags: python multithreading sax python-multiprocessing python-multithreading