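【Question title】: Retrieve info between paragraph tags with feedparser
【Posted】: 2014-07-12 19:11:43
【Question description】:

I have been reading the feedparser documentation but cannot find a solution: I only want to retrieve the string between <p> and </p>. An example of the feed excerpt I want to retrieve from is: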

<img alt="Dawsons" height="259" src="http://i.cbc.ca/1.2703554.1405073659!/fileImage/httpImage/image.jpg_gen/derivatives/16x9_460/dawsons.jpg" title="Kathy Dawson and her daughter Emily Dawson, 18, now have a complaint before the Alberta Human Rights Commission over a sexual education course Emily had to take last year. " width="460" /> <p>The Edmonton Public School Board has said it will tell teachers not to use an anti-abortion centre to teach part of its sexual education curriculum, after a McNally high school student filed a human rights complaint over what she was taught.</p>

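Note: this comes from the RSS feed at http://www.cbc.ca/cmlink/rss-topstories

which I retrieved with

# cbc is presumably the object returned by feedparser.parse() for the feed above
for item in cbc.entries:
    print(item.summary)

I know I could easily write something to parse this manually and return only what I want, but is there a way to get feedparser to do it for me?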

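【Comments】:

  • If you only want the text, BeautifulSoup can get it easily
  • Thanks! I was hoping not to involve Beautiful Soup, but it looks like a simple and effective solution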
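
For anyone who would also rather not pull in Beautiful Soup, here is a minimal sketch using only the standard library's html.parser (Python 3). The class name ParagraphCollector, the helper paragraphs_from_html, and the sample string are made up for illustration; they are not part of feedparser. It assumes <p> elements are not nested.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside each <p>...</p> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraphs, in document order
        self._buffer = None    # text chunks of the <p> currently open, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        # Keep text only while a <p> is open; text elsewhere is ignored.
        if self._buffer is not None:
            self._buffer.append(data)


def paragraphs_from_html(html):
    """Hypothetical helper: return the text of every <p> in an HTML fragment."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up sample shaped like the summary in the question.
sample = '<img alt="x" src="x.jpg" /> <p>First paragraph.</p><p>Second &amp; last.</p>'
print(paragraphs_from_html(sample))  # ['First paragraph.', 'Second & last.']
```

With a parsed feed you would call paragraphs_from_html(item.summary) for each item in the entries list.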

Tags: python rss feedparser


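【Solution 1】:

I don't see anything in the documentation about parsing by tag, but BeautifulSoup can get the text:

from bs4 import BeautifulSoup
import requests

# Fetch the raw feed and let BeautifulSoup parse it
r = requests.get("http://www.cbc.ca/cmlink/rss-topstories")
soup = BeautifulSoup(r.content)
# Join the text nodes of every <p> element
print([''.join(s.findAll(text=True)) for s in soup.findAll('p')])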

[u"Search teams are returning to the home of Kathy and Alvin Liknes today for another sweep of the property, close to two weeks after the couple and their grandson Nathan O'Brien were discovered missing in Calgary.", u"Israel widened its air assault against the Gaza Strip's Hamas militants on Saturday, hitting targets that included a mosque the Israeli military said was being used to conceal rockets. Meanwhile, there are reports Hamas has launched rockets at Tel Aviv.", u'The Sunni militant group ISIS, which wants to create an Islamic state spanning Iraq and Syria, has issued a recruitment video using the image and words of a dead Ontario man who had become a jihadist and joined the fighting in Syria.', u'A Hamilton-area man\u2019s dashcam may have saved him a pricey car insurance payout \u2013 and maybe even from falling victim to an insurance scam, an industry expert says.', u'Tommy Ramone, a co-founder of the seminal punk band the Ramones and the last surviving member of the original group, has died, a business associate said Saturday.', u"During high-stake police interrogations and on seemingly meaningless online dating profiles, some people find themselves lying. So, how can you tell if someone isn't telling you the truth?", u"Israeli strikes in Gaza have led to sleepless nights and anxious Palestinian children, CBC's Derek Stoffel reports from a refugee camp in Gaza City.", u'Saskatchewan Premier Brad Wall has been a vocal proponent of abolishing the Senate. With the Prime Minister now under pressure to fill vacancies in the upper chamber, Wall argues that not appointing new senators might be the way to get rid of the institution.', u"Bassist Charlie Haden, who helped change the shape of jazz more than a half-century ago as a member of Ornette Coleman's groundbreaking quartet and liberated the bass from its traditional rhythm section role, has died. 
He was 76.", u"Tracy Morgan has sued Wal-Mart over last month's highway crash that seriously injured him and killed a fellow comedian.", u'Buying pot is normally a subtle affair, but not for Mike Boyer, who camped out to become the first person to legally purchase marijuana in Washington state.', u"Monika Platek, CBC's lead producer for social media during the World Cup, looks at some of the standout moments so far from the 2014 World Cup", u'Our weekly round-up of remarkable photos includes scenes from Brazil, Spain, Germany, India and elsewhere around the world.', u'The European Union said on Saturday that it has extended sanctions to cover 11 leaders of the pro-Moscow rebellion in eastern Ukraine.', u'The Edmonton Public School Board has said it will tell teachers not to use an anti-abortion centre to teach part of its sexual education curriculum, after a McNally high school student filed a human rights complaint over what she was taught.']

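You can combine the two:

import feedparser
from bs4 import BeautifulSoup

# Let feedparser fetch the feed, then pass the joined entry summaries to BeautifulSoup
d = feedparser.parse("http://www.cbc.ca/cmlink/rss-topstories")
soup = BeautifulSoup("".join([item.summary for item in d.entries]))
print([''.join(s.findAll(text=True)) for s in soup.findAll('p')])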
[u"Search teams are returning to the home of Kathy and Alvin Liknes today for another sweep of the property, close to two weeks after the couple and their grandson Nathan O'Brien were discovered missing in Calgary.", u"Israel widened its air assault against the Gaza Strip's Hamas militants on Saturday, hitting targets that included a mosque the Israeli military said was being used to conceal rockets. Meanwhile, there are reports Hamas has launched rockets at Tel Aviv.", u'The Sunni militant group ISIS, which wants to create an Islamic state spanning Iraq and Syria, has issued a recruitment video using the image and words of a dead Ontario man who had become a jihadist and joined the fighting in Syria.', u'A Hamilton-area man\u2019s dashcam may have saved him a pricey car insurance payout \u2013 and maybe even from falling victim to an insurance scam, an industry expert says.', u'Tommy Ramone, a co-founder of the seminal punk band the Ramones and the last surviving member of the original group, has died, a business associate said Saturday.', u"During high-stake police interrogations and on seemingly meaningless online dating profiles, some people find themselves lying. So, how can you tell if someone isn't telling you the truth?", u"Israeli strikes in Gaza have led to sleepless nights and anxious Palestinian children, CBC's Derek Stoffel reports from a refugee camp in Gaza City.", u'Saskatchewan Premier Brad Wall has been a vocal proponent of abolishing the Senate. With the Prime Minister now under pressure to fill vacancies in the upper chamber, Wall argues that not appointing new senators might be the way to get rid of the institution.', u"Bassist Charlie Haden, who helped change the shape of jazz more than a half-century ago as a member of Ornette Coleman's groundbreaking quartet and liberated the bass from its traditional rhythm section role, has died. 
He was 76.", u"Tracy Morgan has sued Wal-Mart over last month's highway crash that seriously injured him and killed a fellow comedian.", u'Buying pot is normally a subtle affair, but not for Mike Boyer, who camped out to become the first person to legally purchase marijuana in Washington state.', u"Monika Platek, CBC's lead producer for social media during the World Cup, looks at some of the standout moments so far from the 2014 World Cup", u'Our weekly round-up of remarkable photos includes scenes from Brazil, Spain, Germany, India and elsewhere around the world.', u'The European Union said on Saturday that it has extended sanctions to cover 11 leaders of the pro-Moscow rebellion in eastern Ukraine.', u'The Edmonton Public School Board has said it will tell teachers not to use an anti-abortion centre to teach part of its sexual education curriculum, after a McNally high school student filed a human rights complaint over what she was taught.']
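
If you want to keep the paragraphs grouped by entry instead of joining every summary into one string, something along these lines might work. This is only a sketch for Python 3: the function names paragraph_texts and paragraphs_per_entry are made up for illustration, and passing "html.parser" is my choice for parsing the summary fragments, not something the feed requires.

```python
from bs4 import BeautifulSoup


def paragraph_texts(html):
    """Return the text of every <p> element in an HTML fragment."""
    # "html.parser" is the parser bundled with Python; naming it explicitly
    # avoids the warning bs4 emits when it has to guess a parser.
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]


def paragraphs_per_entry(url):
    """Return a list of (title, paragraphs) pairs, one per feed entry."""
    import feedparser  # imported here so paragraph_texts works without it

    feed = feedparser.parse(url)
    return [
        (entry.get("title", ""), paragraph_texts(entry.get("summary", "")))
        for entry in feed.entries
    ]
```

Calling paragraphs_per_entry("http://www.cbc.ca/cmlink/rss-topstories") would then give one list of paragraph strings per story, assuming the feed is still reachable and its summaries still contain <p> tags.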

【Discussion】:

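【Solution 2】:

I just import re and then do

# one list of <p> contents per entry; findall already returns the captured groups, so no .group(1)
justtheParagraphs = [re.findall("<p>(.*?)</p>", entry.summary) for entry in yourfeed.entries]

Hope that is a sensible example (it reads each entry's summary, as in the question). You could use re.search to get just the first match and read it with .group(1), but I find myself wanting every "<p>(.*?)</p>" match, which is what re.findall returns.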
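
To make the difference between the two calls concrete, here is a small self-contained sketch (Python 3). The summary string is a made-up sample, and the slightly stricter pattern (optional attributes on <p>, DOTALL so a paragraph can span lines) is my own assumption rather than anything feedparser requires. A regular expression like this is fine for simple fragments but can miss nested or malformed markup.

```python
import re

# Non-greedy match of everything between <p ...> and </p>.
# DOTALL lets "." cross newlines; IGNORECASE also accepts <P>.
PARAGRAPH_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL | re.IGNORECASE)

# Made-up sample shaped like an entry summary.
summary = '<img src="x.jpg" /> <p>First paragraph.</p> <p>Second\nparagraph.</p>'

# findall returns a plain list of the captured strings.
all_paragraphs = PARAGRAPH_RE.findall(summary)

# search returns a match object (or None), which is where .group(1) applies.
first_match = PARAGRAPH_RE.search(summary)
first_paragraph = first_match.group(1) if first_match else None

print(all_paragraphs)   # ['First paragraph.', 'Second\nparagraph.']
print(first_paragraph)  # First paragraph.
```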

【Discussion】:
