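【Posted】: 2017-05-29 02:00:11
【Question】:
Good morning, everyone. I am trying to extract the text of SGML documents with the code below, but the output files come out empty. Here is my Python code: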
from os import listdir
from os import makedirs
from os.path import isfile, join
from re import sub
import ast
import numpy
import xml.etree.ElementTree as ElementTree
from lxml import etree
parser = etree.XMLParser(recover=True) # escaping malformed strings
pathCol="C:/Users/Desktop/FR"
pathExtr="C:/Users/Desktop/FRExt"
i=0
for f in listdir(pathCol):
    with open(join(pathCol,f), 'r') as f: # Reading file
        xml = f.read()
        xml = '<ROOT>' + xml + '</ROOT>' # Let's add a root tag
        root = etree.fromstring(xml, parser=parser)
        for doc in root:
            try :
                docNo=doc.find('DOCNO').text.strip()
            except :
                i+=1
                docNo="LATIMES"+str(i)
            try :
                text=doc.find('TEXT').text.strip()
            except :
                try :
                    text=doc.find('HEADLINE').text.strip()
                except :
                    try :
                        text=doc.find('GRAPHIC').text.strip()
                    except :
                        text=" "
            fi=open(join(pathExtr,docNo),'w')
            fi.write(text)
            fi.close()
            print("%s OK" %(docNo))
    f.close()
Here is the structure of a sample document:
<DOC>
<DOCNO> LA010189-0001 </DOCNO>
<DOCID> 1 </DOCID>
<DATE>
<P>
January 1, 1989, Sunday, Home Edition
</P>
</DATE>
<SECTION>
<P>
Book Review; Page 1; Book Review Desk
</P>
</SECTION>
<LENGTH>
<P>
1206 words
</P>
</LENGTH>
<HEADLINE>
<P>
NEW FALLOUT FROM CHERNOBYL;
</P>
<P>
THE SOCIAL IMPACT OF THE ...
</P>
</HEADLINE>
<BYLINE>
<P>
By James E. ...
</P>
</BYLINE>
<TEXT>
<P>
The onset of the new Gorbachev policy of glasnost,...
</P>
...
</TEXT>
</DOC>
<DOC>
... etc
</DOC>
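For each document between <DOC> and </DOC>, I want to get the content between the <TEXT> tags, but instead I end up with empty documents :(
Could anyone help me, please?
Thank you very much.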
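For reference, a minimal sketch of one likely cause, not a confirmed diagnosis. In the sample structure the text sits inside <P> children of <TEXT>, and an element's `.text` attribute only holds the text that comes before its first child. Here that is just a newline, so `.text.strip()` returns an empty string without raising an exception, and the HEADLINE and GRAPHIC fallbacks never run. `itertext()` walks the nested elements instead. The `SAMPLE` string and the `extract_text` helper are hypothetical and exist only for illustration. The sketch uses the standard-library `xml.etree.ElementTree` on a well-formed sample; `lxml` elements provide `itertext()` as well.

```python
import xml.etree.ElementTree as ElementTree

# Hypothetical, well-formed sample modeled on the document structure above.
SAMPLE = """
<DOC>
<DOCNO> LA010189-0001 </DOCNO>
<TEXT>
<P>
First paragraph.
</P>
<P>
Second paragraph.
</P>
</TEXT>
</DOC>
"""

def extract_text(doc, tag="TEXT"):
    """Return all text nested under the first <tag> child of doc, or "" if absent."""
    node = doc.find(tag)
    if node is None:
        return ""
    # itertext() yields the text of the element and of every descendant, in order.
    chunks = (chunk.strip() for chunk in node.itertext())
    # Drop whitespace-only chunks and join the rest, one per line.
    return "\n".join(chunk for chunk in chunks if chunk)

root = ElementTree.fromstring("<ROOT>" + SAMPLE + "</ROOT>")
doc = root.find("DOC")

direct = doc.find("TEXT").text.strip()  # only the whitespace before the first <P>
nested = extract_text(doc)              # text collected from the nested <P> elements

print(repr(direct))
print(repr(nested))
```

Because the helper returns an empty string both when the tag is missing and when it holds no text, the fallback chain could be written as `extract_text(doc, 'TEXT') or extract_text(doc, 'HEADLINE') or extract_text(doc, 'GRAPHIC')` instead of nested try/except blocks.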
【Comments】: