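【Question title】: Python scraping with BeautifulSoup: only scrape paragraphs containing a certain word
【Posted】: 2016-07-22 19:37:50
【Question】:

I am able to scrape entire chapters of the statutes with the code below. But suppose I only want to scrape the paragraphs that contain the word "agricultural". How would I do that?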

from __future__ import print_function

import io
import sys

import requests
from bs4 import BeautifulSoup

base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter{chapter:03d}/All"

# Raw string so the backslashes in the Windows path are taken literally;
# io.open with an encoding accepts unicode text on both Python 2.7 and 3.
with io.open(r'C:\Python27\projects\Florida\FL_finalexact.doc', 'w', encoding='utf-8') as f:
    for chapter in range(1, 40):
        url = base_url.format(chapter=chapter)
        try:
            r = requests.get(url)
        except requests.exceptions.RequestException as e:
            print("missing url")
            print(e)
            sys.exit(1)

        soup = BeautifulSoup(r.content, "html.parser")
        tableContents = soup.find('div', {'class': 'Chapters'})

        if tableContents is not None:
            # Chapter titles first, then the text of every section.
            for title in tableContents.find_all('div', {'class': 'Title'}):
                f.write(u'\n\n' + title.text + u'\n\n')

            for data in tableContents.find_all('div', {'class': 'Section'}):
                f.write(u'\n' + data.text + u'\n')

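Do I need to use a regular expression for this task?

【Comments】:

  • Your code has invalid syntax.
  • Fix the indentation and add the colon after the for loop.
  • Which part is invalid?
  • Try copy-pasting it into a text file and running it. You'll see.
  • Sorry, a line of code was missing; fixed now.

Tags: regex python-2.7 web-scraping beautifulsoup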


【Solution 1】:

You don't need a regular expression. BeautifulSoup is more powerful than that:

soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly, as in the question
soup.find_all(lambda tag: "agricultural" in tag.string if tag.string else False)

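is enough to give you a list of all elements containing the word "agricultural". You can then iterate over the list and pull out the relevant strings: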

results = soup.find_all(...) # function as before
scraped_paragraphs = map(lambda element: element.string, results)

Then write the elements of scraped_paragraphs wherever you like.
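As a rough sketch of that last step, the strings could be written to a UTF-8 text file like this. The sample strings, the helper name write_paragraphs and the output path are assumptions made for illustration; they are not part of the original answer.

```python
import io
import os
import tempfile

# Stand-in for the list built from soup.find_all(...); illustrative strings only.
scraped_paragraphs = [
    u"Crude turpentine gum shall be understood to be an agricultural product.",
    u"The terms \u201cagriculture\u201d and \u201cagricultural uses\u201d include aquaculture.",
]


def write_paragraphs(paragraphs, path):
    # io.open with an explicit encoding behaves the same on Python 2.7 and 3,
    # so curly quotes and other non-ASCII characters are written safely.
    with io.open(path, "w", encoding="utf-8") as out:
        for paragraph in paragraphs:
            out.write(paragraph.strip() + u"\n\n")


# Hypothetical output location; any writable path would do.
output_path = os.path.join(tempfile.gettempdir(), "agricultural_paragraphs.txt")
write_paragraphs(scraped_paragraphs, output_path)
```

Opening the file with an explicit encoding avoids having to call .encode("utf-8") on every string before writing it.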

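How this works

BeautifulSoup supports a find_all() function, which returns all tags matching the criterion passed to find_all(). That criterion can be a regular expression, a function, a list, or even just True. In this case, a suitable boolean function is enough.

More importantly, though, each HTML tag in the soup is indexed by various attributes. You can query an HTML tag for its attributes, children, siblings and, of course, the inner text it contains, which is exposed as string.

All this solution does is filter the parsed HTML for those elements whose string contains "agricultural". Because not every element has a usable string (tag.string is None when a tag has more than one child), it is necessary to check for one first, which is why we use if tag.string and return False when there is none.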
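The filtering described above can be tried on a small inline document. This is only a sketch: the HTML is made up and much simpler than the real statute pages, and the function name mentions_agricultural is an illustrative choice.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a statute page (not the real flsenate.gov markup).
html = """
<div class="Section">
  <span class="Text">Gum resin is an agricultural product.</span>
  <span class="Text">This paragraph is about <b>something</b> else.</span>
  <span class="Text">Fishing licences are issued annually.</span>
</div>
"""


def mentions_agricultural(tag):
    # tag.string is None when a tag has several children, so guard against it.
    return tag.string is not None and "agricultural" in tag.string


soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(mentions_agricultural)
paragraphs = [str(tag.string) for tag in matches]
print(paragraphs)
```

The outer div and the second span both have several children, so their string is None and the guard skips them instead of raising a TypeError.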

An example

Here is what this looks like for Chapter001:

soup.find_all(lambda tag: "agricultural" in tag.string if tag.string else False)
>>>> [<span class="Text Intro Justify" xml:space="preserve">Crude turpentine gum (oleoresin), the product of a living tree or trees of the
     pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural 
     products, farm products, and agricultural commodities.</span>, 
     <span class="Text Intro Justify" xml:space="preserve">Whenever the terms “agriculture,” “agricultural purposes,” “agricultural uses,” or 
     words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; 
     aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; 
     and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.
     </span>]

Calling the map function on results yields the inner strings without the surrounding span elements and their pesky attributes:

map(lambda element: element.string, soup.find_all(...))
>>>> [u'Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', 
      u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']

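【Comments】:

  • Thanks for the additional explanation. The code works perfectly.
  • What exactly does the lambda function do?
  • A lambda function is an anonymous function: it lets you define a function in one line without naming it. As for what it does here, soup.find_all() walks over all of its tags and passes each tag to the lambda function. If the function returns True, soup keeps the tag; if it returns False, soup moves on.
  • @TianMa: here is a nice tutorial on lambda functions
  • re.compile is typically used with find and find_all. Using a lambda can be slow, and if the OP wants an exact match, using in will also return false positives.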
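To make the lambda explanation in the comments concrete, here is a small sketch that writes the same predicate twice, once as a lambda and once as a named function. Plain strings (and None) stand in for tag.string values; the sample data and the names are hypothetical.

```python
# The predicate from the answer, written as a lambda...
has_word = lambda text: "agricultural" in text if text else False


# ...and the same logic as an ordinary named function.
def has_word_named(text):
    if not text:
        return False
    return "agricultural" in text


# Hypothetical tag.string values; None stands for a tag without a single string.
samples = ["agricultural products", None, "fishing licences"]

kept_by_lambda = [s for s in samples if has_word(s)]
kept_by_named = [s for s in samples if has_word_named(s)]
```

Both versions keep the same items; the lambda is simply a shorter spelling that can be passed straight to find_all().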
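【Solution 2】:

You don't want to search every tag. You can select just the span tags that contain the text and filter those, using a css selector to pick the tags. What you want is the text inside span class="Text Intro Justify":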

from bs4 import BeautifulSoup
import requests

base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter001/All"

soup = BeautifulSoup(requests.get(base_url).content, "html.parser")

# keep only the spans with all three classes whose text contains the word
text = [t.text for t in soup.select('div span.Text.Intro.Justify') if "agricultural" in t.text]

That will give you:

['Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']

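If you want the match to be case-insensitive, you need if "agricultural" in t.text.lower()

Also, if you want an exact match, you need to split the text or use a regular expression with word boundaries; otherwise you may end up with false positives on certain words.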

import re

soup = BeautifulSoup(requests.get(base_url).content, "html.parser")

# look for exact word
r = re.compile(r"\bagricultural\b", re.I)
# the class value is matched as the full attribute string, so it is written with spaces
text = [t.text for t in soup.find_all('span', {"class": 'Text Intro Justify'}, text=r)]

Using re.I will match both agricultural and Agricultural.

Using word boundaries means you won't match "foo" if the string contains "foobar".
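The difference between a plain substring test and a word-boundary regex can be seen without any HTML at all. A minimal sketch, with made-up sample sentences:

```python
import re

# Whole-word, case-insensitive pattern, as in the answer above.
word = re.compile(r"\bagricultural\b", re.I)

# Illustrative sentences; the second one only contains the word as a substring.
samples = [
    "Agricultural commodities are exempt.",
    "This covers nonagricultural land only.",
    "Used for agricultural purposes.",
]

substring_hits = [s for s in samples if "agricultural" in s.lower()]
whole_word_hits = [s for s in samples if word.search(s)]
```

Here substring_hits keeps all three sentences, while whole_word_hits drops the "nonagricultural" one, which is the kind of false positive the answer warns about.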

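Whichever approach you take, once you know the specific tags you want, you should search only those tags; searching every tag can give you matches that have nothing to do with what you really want.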
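A small sketch of why this matters, using made-up HTML: a tag whose only child is another tag inherits that child's string, so a filter applied to every tag also returns wrappers and unrelated elements such as the page title.

```python
from bs4 import BeautifulSoup

# Made-up page: the word appears in the <title> and in one statute span.
html = (
    '<html><head><title>Search results for agricultural</title></head><body>'
    '<div class="Section"><p><span class="Text Intro Justify">'
    'Gum resin is an agricultural product.</span></p></div>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")


def has_word(tag):
    return tag.string is not None and "agricultural" in tag.string


# Searching every tag also picks up the title and the wrappers around the span.
every_tag = [tag.name for tag in soup.find_all(has_word)]

# Restricting the search to the known span class keeps only the paragraph itself.
only_spans = [
    tag.name
    for tag in soup.select("span.Text.Intro.Justify")
    if "agricultural" in tag.text
]
```

With this input every_tag holds several names, including title and the enclosing p and div, whereas only_spans holds the single span.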

If you have a lot of parsing to do that needs filtering by text as above, you may find lxml very powerful; using an xpath expression we can filter very easily:

base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter001/All"

from lxml.etree import fromstring, HTMLParser
import requests
r = requests.get(base_url).content
xml = fromstring(r, HTMLParser())

print(xml.xpath("//span[@class='Text Intro Justify' and contains(text(),'agricultural')]//text()"))

That gives you:

['Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']

For a case-insensitive match with xpath, we need to translate A to a:

print(xml.xpath("//span[@class='Text Intro Justify' and contains(translate(text(), 'A','a'), 'agricultural')]//text()"))
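Translating only A to a covers "Agricultural" but not, say, "AGRICULTURAL". One way to handle any capitalisation is to pass the whole alphabet to translate(). This is a sketch on made-up markup, not the real statute page:

```python
import string

from lxml.etree import HTMLParser, fromstring

# Made-up markup with three capitalisation variants.
markup = (
    '<div>'
    '<span class="Text Intro Justify">AGRICULTURAL uses are defined here.</span>'
    '<span class="Text Intro Justify">Agricultural products include gum resin.</span>'
    '<span class="Text Intro Justify">Fishing licences are issued annually.</span>'
    '</div>'
)
tree = fromstring(markup, HTMLParser())

# translate() maps each upper-case ASCII letter to its lower-case counterpart
# before contains() is applied, so the comparison ignores case.
query = (
    "//span[@class='Text Intro Justify' and "
    "contains(translate(text(), '{upper}', '{lower}'), 'agricultural')]/text()"
).format(upper=string.ascii_uppercase, lower=string.ascii_lowercase)

matches = tree.xpath(query)
```

XPath 1.0 has no lower-case function, which is why translate() with both alphabets is the usual workaround.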

The \u201 you see is repr output; when you actually print the strings you will see the str output.

In [3]: s = u"Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture."

In [4]: print(s)
Whenever the terms “agriculture,” “agricultural purposes,” “agricultural uses,” or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.

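【Comments】:

  • I'll try this approach too and get back to you.
  • One thing I noticed is that in your printed results there are u' and \u201, \u201d mixed in with the text. I assume these are encoding issues? I tried to encode them with utf-8, but it resulted in an error.
  • @TianMa, that is just the repr output you are seeing; when you print the actual strings you will see the correct output. I added that to the answer.
  • So I noticed you said that if I want to ignore case I need to use in t.text.lower(). Does that mean the word "agricultural" can then be either upper or lower case?
  • What if I want to capture both "agriculture" and "agricultural"? Should I use a regular expression? How do I make it case-insensitive?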
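One possible answer to the last question, sketched with the standard re module only: an alternation inside word boundaries, compiled with re.IGNORECASE. The sample sentences are made up, and the pattern is just one way of writing it.

```python
import re

# Matches "agriculture" or "agricultural" as whole words, in any capitalisation.
pattern = re.compile(r"\bagricultur(?:e|al)\b", re.IGNORECASE)

# Illustrative sentences only.
samples = [
    "Agriculture is regulated by the department.",
    "Land used for AGRICULTURAL purposes.",
    "Aquaculture and floriculture are included.",
    "An agriculturalist filed the claim.",
]

hits = [s for s in samples if pattern.search(s)]
```

The first two sentences match; "Aquaculture" and "agriculturalist" do not, because of the word boundaries. A compiled pattern like this can be passed to find_all() in the same way as the regex in the answer.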