【Posted】: 2015-03-23 02:00:59
【Question】:
I am working with the NYT corpus in Python and trying to extract only the content of the "full_text" class from each .xml article file. For example:
<body.content>
<block class="lead_paragraph">
<p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
</block>
<block class="full_text">
<p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
</block>
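Ideally, I would like to parse out just the string, yielding "LEAD: Two police officers responding to a reported robbery..." but I am not sure what the best approach is. Is this something that can easily be parsed with a regular expression? If so, nothing I have tried seems to work.
Any suggestions would be greatly appreciated!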
【Comments】:
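Regular expressions tend to be fragile for nested markup, so an XML parser is usually the safer route here. Below is a minimal sketch using the standard library's `xml.etree.ElementTree`. The `SAMPLE` string and the `extract_full_text` helper are illustrative names, not part of the corpus or any library, and the sample assumes the excerpt above is closed with `</body.content>` so that it is well-formed.

```python
import xml.etree.ElementTree as ET

# Illustrative sample mirroring the excerpt in the question. The closing
# </body.content> tag is an assumption added to make the fragment well-formed.
SAMPLE = """<body.content>
<block class="lead_paragraph">
<p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
</block>
<block class="full_text">
<p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
</block>
</body.content>"""


def extract_full_text(root):
    """Join the text of every <p> found inside <block class="full_text">."""
    paragraphs = []
    for block in root.iter("block"):
        # Skip lead_paragraph and any other block classes.
        if block.get("class") != "full_text":
            continue
        for p in block.iter("p"):
            # itertext() also collects text nested in inline child tags.
            paragraphs.append("".join(p.itertext()).strip())
    return "\n".join(paragraphs)


root = ET.fromstring(SAMPLE)
full_text = extract_full_text(root)
print(full_text)
```

For a real article file, `ET.parse(path).getroot()` (with `path` pointing at the .xml file) would replace `ET.fromstring(SAMPLE)`; the rest should work unchanged as long as the files follow the structure shown in the question.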