How to use web scraping to get visible text on the webpage? Answers

【Question title】: How to use web scraping to get visible text on the webpage?
【Posted】: 2021-04-08 03:52:29
【Question description】:

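This is the link to the page I want to scrape: https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html

I have also applied additional filters by clicking the circled headings (screenshot 1).

This is what the page looks like after clicking the headings (screenshot 2).

I want to get the names of all the places displayed on the page, but I am running into trouble because the URL does not change when the filters are applied. I am using Python's urllib for this. Here is my code: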

from urllib.request import urlopen

url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)                # fetch the page
html_bytes = page.read()           # raw response body
html = html_bytes.decode("utf-8")  # decode bytes to str
print(html)

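【Question comments】:

You could try calling the Tripadvisor API (developer-tripadvisor.com/content-api) to get the results.

Tags: python html python-3.x web-scraping urllib

【Solution 1】:

You can use bs4 (Beautiful Soup 4), a Python module for extracting specific content from a web page. The following gets the text from the page, using the html string downloaded in the question: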

from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
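
Depending on the bs4 version and the parser, get_text() may also return the contents of script and style elements, which are not visible on the page. The following is a minimal sketch of one way to drop them first. It assumes a small made-up HTML sample in place of the downloaded page and uses the built-in html.parser instead of html5lib; the helper name visible_text is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; it stands in for the `html` string downloaded earlier.
SAMPLE_HTML = """
<html>
  <head><style>p { color: red; }</style></head>
  <body>
    <h1>Places</h1>
    <p>First place</p>
    <script>console.log("not visible");</script>
  </body>
</html>
"""

def visible_text(html: str) -> str:
    """Return the page text with script/style/noscript content removed."""
    soup = BeautifulSoup(html, "html.parser")
    # soup([...]) is shorthand for soup.find_all([...])
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove the element and everything inside it
    # One text fragment per line, with surrounding whitespace stripped
    return soup.get_text(separator="\n", strip=True)

text = visible_text(SAMPLE_HTML)
print(text)
```

Note that this only covers text present in the HTML that was downloaded; elements hidden through CSS are not detected.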

If you want something other than the full text, such as elements with a particular tag, you can also use bs4:

soup.find_all('p')  # get all p tags
soup.find_all('p', class_='Title')  # get all p tags with a class of Title

Find out which tag and class the place names use, then apply the method above to get all of them.
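
As a rough illustration of that last step, here is a minimal sketch run on made-up markup. The tag a and the class place-name are hypothetical; the real tag and class names have to be looked up in the page source or the browser's developer tools.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tag ("a") and class ("place-name") are assumptions
# for illustration, not the real TripAdvisor structure.
SAMPLE_HTML = """
<div class="listing"><a class="place-name" href="/r1">Cafe One</a></div>
<div class="listing"><a class="place-name" href="/r2">Cafe Two</a></div>
<div class="listing"><a class="other" href="/ad">Sponsored</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect the text of every <a> whose class list contains "place-name"
names = [a.get_text(strip=True) for a in soup.find_all("a", class_="place-name")]
print(names)
```

The same find_all call can be run against the soup built from the downloaded page once the actual tag and class are known.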

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

【Comments】: