Cannot scrape specific content from site - BeautifulSoup 4

【Question title】: Cannot scrape specific content from site - BeautifulSoup 4
【Posted】: 2015-02-01 11:42:10
【Question】:

I am having trouble scraping this link with Python 3 and BeautifulSoup 4:

http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining

I only want to get this section:

When you are in ...

Capitol City Grille
This downtown Lansing restaurant offers ...

Capitol City Grille Lounge
For a glass of wine or a ...

Room Service
If you prefer ...

I have this code:

for rest in dining_page_soup.select("div.copy_left p strong"):
    if rest.next_sibling is not None:
        if rest.next_sibling.next_sibling is not None:
            title = rest.text
            desc = rest.next_sibling.next_sibling
            print("Title:  " + title)
            print(desc)

But it gives me TypeError: 'NoneType' object is not callable

on the line desc = rest.next_sibling.next_sibling, even though I have an if statement that checks whether it is None.

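【Comments】:

Try using if not rest.next_sibling is None and if not rest.next_sibling.next_sibling is None in place of the two if statements above, and see whether you get any useful hints?
Can you post that as an answer?
Sorry, the community here downvotes any answer I am not sure about, so... just replace the two if conditions after for rest in dining_page_soup.select("div.copy_left p strong"): with the if conditions from the comment above, in the same order.
Still the same error.
Try moving the title = rest.text line below the desc = rest.next_sibling.next_sibling line?

Tags: python python-3.x beautifulsoup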

【Solution 1】:

Here is a very simple solution:

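from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining")
data = r.text
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
# Print the full text of every div with class "copy_left"
for found_text in soup.select('div.copy_left'):
    print(found_text.text)  # print() call, since the question uses Python 3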

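Update

Following the improvement to the question, here is a solution using re. The first paragraph, "When you are in ...", needs its own workaround because it does not follow the structure of the other paragraphs.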

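import re

# Match every tag whose name starts with "strong" (soup comes from the snippet above)
for tag in soup.find_all(re.compile("^strong")):
    title = tag.text
    # Skip one sibling after <strong> and take the node that follows it
    desc = tag.next_sibling.next_sibling
    print("Title:  " + title)
    print(desc)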

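Output

Title:  Capitol City Grille

This downtown Lansing restaurant offers delicious, modern American cuisine in an upscale yet relaxed setting. You can enjoy everything from fluffy pancakes to juicy filet mignon. A breakfast and lunch buffet is available, as well as an a la carte menu.

Title:  Capitol City Grille Lounge

For a glass of wine or a handcrafted cocktail and pleasant conversation, spend an afternoon or evening in the Capitol City Grille Lounge with friends or colleagues.

Title:  Room Service

If you prefer to dine in the comfort of your own room, order from the room service menu.

Title:  Menu

Breakfast Menu

Title:  Capitol City Grille Hours

Breakfast, 6:30-11 a.m.

Title:  Capitol City Grille Lounge Hours

Monday-Thursday, 11 a.m.-11 p.m.

Title:  Room Service Hours

Daily, 6:30 a.m.-2 p.m. and 5-10 p.m.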
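Hopping a fixed number of siblings is fragile, because next_sibling may be a whitespace string, a <br/> tag, or None depending on the markup. Below is a minimal sketch that instead walks the siblings and takes the first non-empty text node, and that also picks up the intro paragraph. The SAMPLE_HTML layout and the helper names first_text and extract_sections are assumptions made for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup that mimics the assumed layout of the dining page.
SAMPLE_HTML = """
<div class="copy_left">
  <p>When you are in ...<br/>
     <strong>Capitol City Grille</strong><br/>
     This downtown Lansing restaurant offers ...</p>
  <p><strong>Capitol City Grille Lounge</strong><br/>
     For a glass of wine or a ...</p>
  <p><strong>Room Service</strong><br/>
     If you prefer ...</p>
</div>
"""


def first_text(nodes):
    """Return the first non-empty text node in `nodes`, stripped, or None."""
    for node in nodes:
        if isinstance(node, NavigableString) and node.strip():
            return node.strip()
    return None


def extract_sections(html):
    """Return (intro, [(title, description), ...]) for the assumed layout."""
    soup = BeautifulSoup(html, "html.parser")
    strongs = soup.select("div.copy_left p strong")
    # Intro workaround: the text that precedes the first <strong> in its parent.
    intro = first_text(strongs[0].previous_siblings) if strongs else None
    sections = []
    for strong in strongs:
        # Skip <br/> tags and whitespace; stop at the first real text node.
        desc = first_text(strong.next_siblings)
        if desc is not None:
            sections.append((strong.get_text(strip=True), desc))
    return intro, sections


intro, sections = extract_sections(SAMPLE_HTML)
print("Intro:  " + str(intro))
for title, desc in sections:
    print("Title:  " + title)
    print(desc)
```

Because each title and description comes back as a separate tuple, the pairs can be stored in a DB later without re-parsing the printed text.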

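【Comments】:

I don't want to scrape it all in one go... I want to get Title and Description separately so I can store them in a DB later.
I downvoted you, read my comment... scraping it the way your answer does is of no use.
@Umair Maybe you need to improve your question instead of downvoting. You asked: "I only want to get this section." You could add a paragraph saying that you want to scrape the title and description separately so you can store them in a DB later...
You can see in my question that I store Title and Desc in separate variables... there is a reason for that.
@Umair Added another approach using re.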

【Solution 2】:

If you don't mind using XPath, this should work:

import requests
from lxml import html

url = "http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining"
page = requests.get(url).text
tree = html.fromstring(page)

xp_t = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/text()"
xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()[not(following-sibling::strong)]"

titles = tree.xpath(xp_t)
descriptions = tree.xpath(xp_d)  # still contains garbage like '\r\n'
descriptions = [d.strip() for d in descriptions if d.strip()]

for t, d in zip(titles, descriptions):
    print("{title}: {description}".format(title=t, description=d))

Here descriptions contains 3 elements: "This downtown ...", "For a glass ...", "If you prefer ...".

If you also need "When you are in ...", replace xp_d with:

xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()"
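To see what the [not(following-sibling::strong)] predicate changes, here is a small sketch that runs both description expressions against an inline sample instead of the live site. SAMPLE_HTML is an assumed layout, with the intro text sitting before the first <strong> in the same paragraph, and the helper name clean is hypothetical; the real page may differ.

```python
from lxml import html

# Hypothetical markup: the intro shares a <p> with the first <strong>.
SAMPLE_HTML = """<html><body>
<div class="copy_left">
<p>When you are in ...<br/><strong>Capitol City Grille</strong><br/>This downtown Lansing restaurant offers ...</p>
<p><strong>Capitol City Grille Lounge</strong><br/>For a glass of wine or a ...</p>
<p><strong>Room Service</strong><br/>If you prefer ...</p>
</div>
</body></html>"""

BASE = ("//*[@class='copy_left']/descendant-or-self::node()"
        "/strong[not(following-sibling::a)]")
XP_TITLES = BASE + "/text()"
# Keeps only text nodes that have no <strong> after them in the same parent.
XP_DESC_FILTERED = BASE + "/../text()[not(following-sibling::strong)]"
# Keeps every text node of the parent, including text before the <strong>.
XP_DESC_ALL = BASE + "/../text()"


def clean(strings):
    """Strip whitespace and drop strings that are empty afterwards."""
    return [s.strip() for s in strings if s.strip()]


tree = html.fromstring(SAMPLE_HTML)
titles = clean(tree.xpath(XP_TITLES))
filtered = clean(tree.xpath(XP_DESC_FILTERED))
unfiltered = clean(tree.xpath(XP_DESC_ALL))

for t, d in zip(titles, filtered):
    print("{title}: {description}".format(title=t, description=d))
print("Extra text without the predicate:", unfiltered[0])
```

Note that the unfiltered list has one more entry than titles, so zipping the two directly would pair every title with the wrong description; the intro has to be split off first.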
