Python regex in xml text, find tags [duplicate]: answers

【Question title】: Python regex in xml text, find tags [duplicate]
【Posted】: 2018-08-22 18:40:58
【Question description】:

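I'm working on a project that uses Python to search the XML of research papers for particular strings. That part is done, but I also need the section heading that most closely precedes each search hit, that is, the TITLE and LABEL tags and their contents.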

#<..... some XML .....>

<sec id="aj387295s3">
<label>3.</label>
<title><italic>CHANDRA</italic> OBSERVATIONS</title>
<p>The 13 candidates were observed with the Advanced CCD Imaging 
Spectrometer (ACIS; Burke et&nbsp;al. <xref ref-type="bibr" 
rid="aj387295r8">1997</xref>) on board <italic>Chandra</italic> 
(Weisskopf et&nbsp;al. <xref ref-type="bibr" 
rid="aj387295r46">1996</xref>). We chose the S3 chip to image the 
sources because of its better low-energy sensitivity. The standard 
TIMED readout with a frame time of 3.2 s was used, and the data were 
collected in VFAINT mode. In 12 cases, our <italic>Chandra</italic> 
observations led us to conclude that the RASS detection was not of a 
candidate INS (see Table&nbsp;<xref ref-type="table" 
rid="aj387295t1">1</xref>; the <xref ref-type="sec" 
rid="aj387295app1">Appendix</xref> includes a case-by-case discussion 
of these sources).</p>

#<..... more XML ....>

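I have a regex that gets the line containing "Chandra", but I've been struggling to get "3. CHANDRA OBSERVATIONS". This is probably really obvious, but I haven't had much regex training. My regex for Chandra and the rest of the line is "(.*)(c|C)handra\b"

Thanks! -Jenny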

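【Question discussion】:

Don't parse XML with regular expressions. Use ElementTree or lxml.
I mostly use BeautifulSoup, because for some reason it cooperates better.
A life lesson on when not to use RegEx. Use the modules recommended by Daniel or Jenny instead.
@Jenny: What have you tried so far?
@EliasStrehle Yes, I'm the original poster, Jenny.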

Tags: python regex xml beautifulsoup

【Solution 1】:

Once you have found the right <sec> tag, you only need to get the text inside its <label> and <title>.

title = '{} {}'.format(sec.findtext('label'), ''.join(sec.find('title').itertext()))
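
To show how that one-liner might fit into a full search, here is a minimal sketch using the standard library's xml.etree.ElementTree. The SAMPLE document, the `<article>` wrapper, the second section, and the `section_headings` helper are all assumptions made for illustration; they are not from the question. The `&nbsp;` entities in the original excerpt are left out because ElementTree raises a ParseError on undefined entities, so real input would need them declared or replaced first.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the paper's XML. The &nbsp; entities
# from the question are omitted: ElementTree rejects undefined entities.
SAMPLE = """<article>
<sec id="aj387295s3">
<label>3.</label>
<title><italic>CHANDRA</italic> OBSERVATIONS</title>
<p>The 13 candidates were observed on board <italic>Chandra</italic>.</p>
</sec>
<sec id="aj387295s4">
<label>4.</label>
<title>DISCUSSION</title>
<p>No relevant text here.</p>
</sec>
</article>"""

# Same search term as in the question, without the leading "(.*)".
PATTERN = re.compile(r"[cC]handra\b")


def section_headings(root, pattern):
    """Return 'label title' for every <sec> that has a <p> matching pattern."""
    headings = []
    for sec in root.iter("sec"):
        # itertext() flattens nested markup such as <italic> into plain text.
        paragraphs = ("".join(p.itertext()) for p in sec.iter("p"))
        if any(pattern.search(text) for text in paragraphs):
            label = sec.findtext("label", default="")
            title_el = sec.find("title")
            title = "".join(title_el.itertext()) if title_el is not None else ""
            headings.append("{} {}".format(label, title).strip())
    return headings


root = ET.fromstring(SAMPLE)
print(section_headings(root, PATTERN))  # ['3. CHANDRA OBSERVATIONS']
```

The regex is still used here, but only on the plain text of each paragraph; finding the enclosing section and its heading is left to the XML parser.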

【Discussion】: