【Question title】: BeautifulSoup: HTML Extracting Bullet points but not navigation bar
【Posted】: 2019-02-13 14:28:57
【Question】:

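I'm using BeautifulSoup4 to do some HTML scraping. I'm trying to extract the important information, such as the title, metadata, paragraphs, and listed information.

My problem is that I can get the paragraphs like this: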

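import urllib.request
from bs4 import BeautifulSoup

def main():
    # Download the page and parse it with the built-in HTML parser
    response = urllib.request.urlopen('https://ecir2019.org/industry-day/')
    html = response.read()
    soup = BeautifulSoup(html, features="html.parser")
    # Collect the text of every <p> element, one per line
    text = [e.get_text() for e in soup.find_all('p')]
    article = '\n'.join(text)

    print(article)

main()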

But if the page has bullet points in its body, selecting them also pulls in the navigation bar, i.e. when I change p to li or ul.

For example, the output I want is:

The Industry Day's objectives are three-fold:

The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.
The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.
Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.

What I actually get: The Industry Day's objectives are three-fold:

The tags in the HTML source:

<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.</li>
<li>The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.</li>
<li>Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.</li>
</ol>

【Comments】:

    Tags: python html web-scraping beautifulsoup


    【Solution 1】:

    You can use the CSS "Or" syntax (a comma-separated selector list) so that the li elements are selected as well.

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://ecir2019.org/industry-day/'
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    items = [item.text for item in soup.select('p, ol li')]
    
    print(items)
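
To see what the comma-separated selector does without depending on the live page, here is a minimal offline sketch. It assumes the fragment from the question (list text shortened here for brevity) and uses the built-in html.parser instead of lxml, so no extra parser is required.

```python
from bs4 import BeautifulSoup

# Fragment modelled on the question's HTML; the list text is shortened.
html = """
<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>First objective</li>
<li>Second objective</li>
<li>Third objective</li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# "p, ol li" matches every <p> OR every <li> inside an <ol>,
# returned in document order.
items = [el.get_text() for el in soup.select("p, ol li")]

print("\n".join(items))
```

Because matches come back in document order, the paragraph stays ahead of its list items when the results are joined.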
    

    To get just that section:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://ecir2019.org/industry-day/'
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    items = [item.text for item in soup.select('.kg-card-markdown p:nth-of-type(2), .kg-card-markdown p:nth-of-type(2) + ol li')]
    
    print(items)
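
The selector above combines :nth-of-type(2) with the adjacent sibling combinator (+). The sketch below illustrates the idea on a made-up fragment; the layout inside .kg-card-markdown (an intro paragraph first, a second list later) is an assumption for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical post body: only the class name comes from the answer,
# the contents are invented for illustration.
html = """
<div class="kg-card-markdown">
  <p>Intro paragraph</p>
  <p>The Industry Day's objectives are three-fold:</p>
  <ol><li>First objective</li><li>Second objective</li></ol>
  <p>Closing paragraph</p>
  <ol><li>Unrelated item</li></ol>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

selector = (
    ".kg-card-markdown p:nth-of-type(2), "          # the second <p> among its siblings
    ".kg-card-markdown p:nth-of-type(2) + ol li"    # <li>s of the <ol> right after it
)
items = [el.get_text() for el in soup.select(selector)]

print(items)
```

This kind of positional selector is brittle: if the page gains or loses a paragraph before the target, the index needs adjusting.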
    

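    The page appears to have changed, so I'm using a cached version (this will only work until the cache is refreshed). You can restrict the selection to the post body with an additional class selector: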

    import requests
    from bs4 import BeautifulSoup
    
    url = 'http://webcache.googleusercontent.com/search?q=cache:https://ecir2019.org/industry-day'
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    items = [item.text for item in soup.select('.post-body p, .post-body ol li, .post-body ul li')]
    
    print(items)
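
A small offline sketch of why the .post-body prefix keeps the navigation out. The markup is an assumption: a navigation list with the hypothetical class nav sits outside the post body, and the body holds both a ul and an ol.

```python
from bs4 import BeautifulSoup

# Hypothetical page: "nav" is an invented class name; ".post-body" is the
# class used in the answer's selector.
html = """
<ul class="nav"><li>Home</li><li>Program</li></ul>
<div class="post-body">
  <p>Submissions should include:</p>
  <ul><li>Short company portrait (~100 words)</li></ul>
  <ol><li>First objective</li></ol>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every alternative is scoped to descendants of .post-body, so the
# navigation <li>s never match.
selector = ".post-body p, .post-body ol li, .post-body ul li"
items = [el.get_text() for el in soup.select(selector)]

print(items)
```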
    

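    【Discussion】:

    • Thanks, but suppose some of the text is inside ul li, e.g. <ul> <li>Short company portrait (~100 words)</li>. Since the navigation bar is also inside ul li, how can I get this?
    • Extend the selector and add , ul li at the end, e.g. p, ol li, ul li ... tailor as required.
    • But that would still grab the navigation bar list, wouldn't it?
    • My bottom example grabs only what you want, as shown in the question. To exclude the navigation, use the :not pseudo-class selector to exclude that bar. This assumes you are on a recent version of bs4. I'll update when I'm back at a computer.
    • The page appears to have changed. Using the cached Google page, I restricted the selection to the post body with an additional selector. Seems fine.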
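
The :not suggestion from the comments could look roughly like the sketch below. The class name nav-menu is hypothetical (the real navigation markup is not shown in this thread), and the code assumes bs4 4.7 or later, where select() is backed by the soupsieve library.

```python
from bs4 import BeautifulSoup

# Hypothetical page: "nav-menu" is an invented class for the navigation list.
html = """
<ul class="nav-menu"><li>Home</li><li>Program</li></ul>
<p>Submissions should include:</p>
<ul><li>Short company portrait (~100 words)</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Take paragraphs plus <li> children of any <ul> that does NOT carry
# the navigation class.
items = [el.get_text() for el in soup.select("p, ul:not(.nav-menu) > li")]

print(items)
```

Whether to exclude the navigation with :not or to scope the selection to a content container depends on which class names are stable on the target page.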