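【Posted】:2019-02-13 14:28:57
【Question】:
I am using BeautifulSoup4 to do some HTML scraping. I am trying to extract the important information, such as the title, metadata, paragraphs, and list items.
My problem is that I can get the paragraphs like this: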
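import urllib.request
from bs4 import BeautifulSoup

def main():
    # Download the page and parse it with the built-in HTML parser.
    response = urllib.request.urlopen('https://ecir2019.org/industry-day/')
    html = response.read()
    soup = BeautifulSoup(html, features="html.parser")
    # Collect the text of every <p> element, one per line.
    text = [e.get_text() for e in soup.find_all('p')]
    article = '\n'.join(text)
    print(article)

main()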
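But if the page has bullet points in its body and I change 'p' to 'li' or 'ul' to pick them up, the result also includes the navigation bar.
For example, the output I want is: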
The Industry Day's objectives are three-fold:
The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.
The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.
Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.
What I actually get:
The Industry Day's objectives are three-fold:
The tags in the HTML source:
<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.</li>
<li>The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.</li>
<li>Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.</li>
</ol>
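One possible direction, offered only as a minimal sketch: collect 'p' and 'li' elements together in document order, and skip any that sit inside a container that looks like page chrome. The helper name extract_text and the SKIP_PARENTS list are hypothetical, and the sketch assumes the navigation bar is wrapped in a nav, header, or footer element, which has not been checked against the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical list of container tags treated as page chrome, not content.
# Assumption: the site's navigation lives inside one of these elements.
SKIP_PARENTS = ["nav", "header", "footer"]


def extract_text(html):
    """Return the text of <p> and <li> elements outside SKIP_PARENTS."""
    soup = BeautifulSoup(html, features="html.parser")
    lines = []
    # Passing a list of names returns all matches in document order,
    # so paragraphs and list items stay interleaved as on the page.
    for tag in soup.find_all(["p", "li"]):
        # Skip elements nested anywhere inside a chrome container.
        if tag.find_parent(SKIP_PARENTS) is not None:
            continue
        text = tag.get_text(" ", strip=True)
        if text:
            lines.append(text)
    return "\n".join(lines)
```

In main() above, print(extract_text(html)) would replace the list comprehension. If the navigation is built from plain div elements instead, the filter would need to match a class or id, and a 'p' nested inside an 'li' would be emitted twice by this sketch.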
【Comments】:
Tags: python html web-scraping beautifulsoup