Extracting text from an overflowed span tag using BeautifulSoup in Python: answers

【Question title】: Extracting text from an overflowed span tag using BeautifulSoup in Python
【Posted】: 2020-05-23 19:55:21
【Question description】:

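I am new to Python and working through a practice problem. When I extract text from an HTML span tag, part of the text sits behind a "read more" link, and the span tag is not updated with the extra text unless I click that link. As a result, when I run BeautifulSoup's findAll for the tag and class, only the first part of the review, without the "read more" part, is returned. I cannot figure out how to approach this. It is a text mining exercise on hotel reviews. The code is below (not the complete script):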

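from bs4 import BeautifulSoup as soup  # assumed alias; url_html holds the HTML already fetched with urllib

url_soup = soup(url_html, "html.parser")
# every review card on the page
profiles = url_soup.findAll("div", {"class": "hotels-community-tab-common-Card__card--ihfZB hotels-community-tab-common-Card__section--4r93H"})
for profile in profiles:
    # review text as it appears in the static HTML (cut off before "read more")
    Review_Body = profile.findAll("q", {"class": "location-review-review-list-parts-ExpandableReview__reviewText--gOmRC"})
    if Review_Body:  # skip cards that contain no review text
        Review_Body = Review_Body[0].text.replace(",", "").replace("\r\n", "").strip(" ")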

[Image: page without clicking "read more"]
[Image: page after clicking "read more", when the entire text up to the end is visible]

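As mentioned, this returns only the part that is visible without clicking "read more", followed by "...". Please help. PS: I have not installed or used the Scrapy or Selenium modules. Would they make this easier?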

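【Comments】:

Do you have the actual link? Images don't help... Which library/import are you using? requests, urllib?
Yes, I used request from urllib. The link is [link]tripadvisor.in/… [link] All the reviews on this page have a dynamic "read more" button. Thanks for your help.
The link you provided returns the full data; I think you are using this link: tripadvisor.com/Profile/HollyABC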

Tags: python html text findall

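【Solution 1】:

I think the website you are working with (not the one you provided, but this one: link) leads to pages with different structures, so unfortunately I will not be of much help. However, if it gets you going, you could do the following and adapt the code for each iteration (I would be interested to know if there is a better solution). So, just in case: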


from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup as _BS

url_html = "https://www.tripadvisor.com/Profile/HollyABC"


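def get_web_request(url_to_open: str):
    # send a browser-like User-Agent instead of the default one used by requests
    my_header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
    # the timeout keeps the script from hanging if the server never answers
    response = requests.get(url=url_to_open, headers=my_header, timeout=30)
    return response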


web_page = get_web_request(url_to_open=url_html)
my_soup = _BS(web_page.text, "lxml")

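container_tag = my_soup.find_all('div', attrs={'id': 'content'})
# exactly one container is expected; zero matches would make container_tag[0] fail below
if len(container_tag) != 1:
    raise SystemExit('Error with the defined container: expected exactly one match, got %d.' % len(container_tag))
print('len of container_tag', len(container_tag))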

row_tags = container_tag[0].find_all('div', attrs={
    'class': 'social-section-core-CardSection__card_section--33UYa ui_card section'})
print('len of rows_tag', len(row_tags))

if not row_tags:  # find_all returns an empty list, never None
    raise SystemExit('No result found in container')

href_url_list = []
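for row_tag in row_tags:
    # find trip advisor href (assumed to be the third link in each card)
    href_tag = row_tag.find_all('a', href=True)
    if len(href_tag) < 3:
        continue  # card without the expected link; skip it
    href = href_tag[2].get('href')
    # hrefs may be relative, so resolve them against the profile URL
    href_url = urljoin(url_html, href)
    href_url_list.append(href_url)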

print(href_url_list)

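for href_url in href_url_list:
    web_page = get_web_request(url_to_open=href_url)
    my_soup = _BS(web_page.text, "lxml")
    # assuming it is always the 1st post box...
    text_tag = my_soup.find('div', attrs={'class': 'firstPostBox'})
    post_body = text_tag.find('div', attrs={'class': 'postBody'}) if text_tag is not None else None
    body_tag = post_body.find('p') if post_body is not None else None
    if body_tag is None:
        # this page uses a different structure; report it and move on
        print('Unexpected page structure:', href_url)
        continue
    print(body_tag.get_text())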

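This should give you the actual URL for each hotel, but you then have to handle each different page structure separately. I did it for the first one, but to be fair it does not look like a great solution. I hope it gets the ball rolling for you or the community. Best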
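One way to handle several page structures without editing the loop each time is to try a list of known layouts in order. The following is only a sketch of that idea: the two class names are copied from the snippets in this thread and may no longer match the live site, while `CANDIDATE_SELECTORS`, `extract_body` and the sample HTML strings are made up for illustration.

```python
from bs4 import BeautifulSoup

# (tag, class) pairs tried in order. Both class names are copied from the
# snippets in this thread; the site may have changed them since.
CANDIDATE_SELECTORS = [
    ("q", "location-review-review-list-parts-ExpandableReview__reviewText--gOmRC"),
    ("div", "postBody"),
]


def extract_body(html):
    """Return the text of the first layout that matches, or None."""
    page = BeautifulSoup(html, "html.parser")
    for tag_name, class_name in CANDIDATE_SELECTORS:
        tag = page.find(tag_name, class_=class_name)
        if tag is not None:
            return tag.get_text(" ", strip=True)
    return None


# Made-up pages: one per known layout, plus one that matches neither.
forum_page = '<div class="firstPostBox"><div class="postBody"><p>Great stay.</p></div></div>'
review_page = (
    '<q class="location-review-review-list-parts-ExpandableReview__reviewText--gOmRC">'
    "<span>Nice pool.</span></q>"
)
other_page = "<p>Unrelated layout.</p>"

for sample in (forum_page, review_page, other_page):
    print(extract_body(sample))
```

A page that matches none of the listed layouts yields `None`, which makes it easy to log the URL and add a new pair to the list later.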

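Notes:

I use "lxml", which you need to install with pip, but I think "html.parser" would work as well (that is not the issue here).
I do not think Selenium would solve the problem, because once you click you still end up with the different page structures. One option is to collect the href/URL (as I did) together with the partial text, and then look for the partial text in the final loop over the new URLs. That should work.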
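The partial-text idea from the second note could look roughly like the sketch below. It assumes the truncated snippet ends in "..." or "…" and that the full review sits in a single `<p>` element on the detail page; `normalise`, `strip_ellipsis`, `find_full_text` and the sample HTML are hypothetical names and data, not anything taken from the site.

```python
from bs4 import BeautifulSoup


def normalise(text):
    """Collapse runs of whitespace so snippets compare reliably."""
    return " ".join(text.split())


def strip_ellipsis(snippet):
    """Drop a trailing '...' or '…' left by the "read more" truncation."""
    snippet = normalise(snippet)
    for marker in ("...", "…"):
        if snippet.endswith(marker):
            snippet = snippet[: -len(marker)]
    return snippet.rstrip()


def find_full_text(partial, detail_html, tag_name="p"):
    """Return the first <tag_name> text that starts with the partial snippet."""
    prefix = strip_ellipsis(partial)
    if not prefix:
        return None
    page = BeautifulSoup(detail_html, "html.parser")
    for tag in page.find_all(tag_name):
        candidate = normalise(tag.get_text())
        if candidate.startswith(prefix):
            return candidate
    return None


# Made-up detail page and truncated snippet for illustration.
detail_html = """
<div class="postBody">
  <p>Thanks for reading.</p>
  <p>The room was clean and the staff were friendly. Breakfast was
     served until eleven.</p>
</div>
"""
partial = "The room was clean and the staff..."
print(find_full_text(partial, detail_html))
```

Matching on the prefix rather than on a class name means the lookup does not depend on which layout the detail page uses, as long as the full text appears somewhere in an element of the chosen tag.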

【Comments】: