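You don't want to search every tag. Instead, select only the span tags that contain the text and filter those; a CSS selector will pick them out. What you want is the text inside span class="Text Intro Justify":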
base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter001/All"
from bs4 import BeautifulSoup
import requests
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(requests.get(base_url).content, "html.parser")
text = [t.text for t in soup.select('div span.Text.Intro.Justify') if "agricultural" in t.text]
This gives you:
['Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']
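If you want the match to be case-insensitive, you need if "agricultural" in t.text.lower()
Also, if you want exact matches, you need to split the text or use a regex with word boundaries; otherwise you can end up with false positives on words that merely contain the search term.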
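soup = BeautifulSoup(requests.get(base_url).content, "html.parser")
import re
# look for the exact word, ignoring case
r = re.compile(r"\bagricultural\b", re.I)
# string= was called text= before bs4 4.4; the class value is the attribute exactly as written in the page
text = [t.text for t in soup.find_all('span', {"class": 'Text Intro Justify'}, string=r)]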
Using re.I will match both agricultural and Agricultural.
Using word boundaries means "foo" will not match if the string contains "foobar".
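To make the difference concrete, here is a minimal offline sketch comparing a plain substring test, the word-boundary regex, and the split approach. The sample sentence is made up for illustration and does not come from the statutes page:

```python
import re

# Hypothetical sample text, not taken from the statutes page.
sentence = "Floricultural and Agricultural uses; nonagricultural land is excluded."

# Plain substring test: also fires inside "nonagricultural" (a false positive).
substring_hit = "agricultural" in "nonagricultural land"

# Word-boundary regex: only the standalone word matches, in any case.
word = re.compile(r"\bagricultural\b", re.I)
regex_hits = word.findall(sentence)

# Split approach: strip surrounding punctuation from each token, then compare.
tokens = [tok.strip(".,;:()\"'").lower() for tok in sentence.split()]
split_hit = "agricultural" in tokens

print(substring_hit, regex_hits, split_hit)
```

The regex is usually the less fragile of the two exact-match options, since the split approach depends on listing every punctuation character that can touch a word.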
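Whichever approach you take, once you know the specific tags you want, search only those tags; searching every tag can return matches that have nothing to do with what you actually want.
If you have a lot of parsing to do where you need to filter by text as above, you may find lxml very powerful; with an xpath expression the filtering is easy: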
base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter001/All"
from lxml.etree import fromstring, HTMLParser
import requests
r = requests.get(base_url).content
xml = fromstring(r, HTMLParser())
print(xml.xpath("//span[@class='Text Intro Justify' and contains(text(),'agricultural')]//text()"))
This gives you:
['Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']
To match regardless of case with xpath, we need to translate A to a:
print(xml.xpath("//span[@class='Text Intro Justify' and contains(translate(text(), 'A','a'), 'agricultural')]//text()"))
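Translating only A to a covers "Agricultural" but not, say, "AGRICULTURAL". A rough sketch of lower-casing the whole ASCII alphabet follows; the inline markup and the variable names upper and lower are made up for illustration, and the two strings are passed in as XPath variables through lxml's keyword arguments:

```python
from lxml.etree import fromstring, HTMLParser

# Hypothetical inline markup standing in for the downloaded page.
html = b"""
<div>
  <span class="Text Intro Justify">AGRICULTURAL uses are covered.</span>
  <span class="Text Intro Justify">Agricultural products are listed.</span>
  <span class="Text Intro Justify">Nothing relevant here.</span>
</div>
"""

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

tree = fromstring(html, HTMLParser())

# translate() maps each character of $upper to the character at the
# same position in $lower, which lower-cases ASCII letters.
expr = (
    "//span[@class='Text Intro Justify' and "
    "contains(translate(text(), $upper, $lower), 'agricultural')]/text()"
)
matches = tree.xpath(expr, upper=UPPER, lower=LOWER)
print(matches)
```

This only handles ASCII letters; XPath 1.0 has no lower-case function, so accented characters would need to be added to both strings.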
The \u201c you see is the repr output for “; when you actually print the string you will see the str output.
In [3]: s = u"Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture."
In [4]: print(s)
Whenever the terms “agriculture,” “agricultural purposes,” “agricultural uses,” or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.
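The escaped form in the list output above comes from Python 2, where the repr of a unicode string escapes non-ASCII characters. In Python 3 the repr normally shows the curly quotes directly, so the sketch below uses ascii() to reproduce the escaped form; the short string is just a fragment of the text above:

```python
# A short fragment of the statute text with curly quotes.
s = "\u201cagricultural purposes,\u201d"

escaped = ascii(s)  # escaped form, like the list output shown earlier
printed = str(s)    # what print() actually writes

print(escaped)
print(printed)
```

Either way the string itself is the same; only the way it is displayed differs.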