【问题标题】:Beautiful Soup Selector returns an empty listBeautiful Soup Selector 返回一个空列表
【发布时间】:2020-07-27 07:54:03
【问题描述】:

所以我在做自动化无聊的东西课程,我试图为自动化无聊的东西书刮亚马逊价格,但无论如何它都会返回一个空字符串,因此在 elems[0] 处发生索引错误].text.strip() 我不知道该怎么办

def getAmazonPrice(productUrl):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'} # to make the server think its a web browser and not a bot
    res = requests.get(productUrl, headers=headers)
    res.raise_for_status()


    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('#mediaNoAccordion > div.a-row > div.a-column.a-span4.a-text-right.a-span-last')
    return elems[0].text.strip()


price = getAmazonPrice('https://www.amazon.com/Automate-Boring-Stuff-Python-2nd-ebook/dp/B07VSXS4NK/ref=sr_1_1?crid=30NW5VCV06ZMP&dchild=1&keywords=automate+the+boring+stuff+with+python&qid=1586810720&sprefix=automate+the+bo%2Caps%2C288&sr=8-1')
print('The price is ' + price)

【问题讨论】:

  • 另外,下载的 html 中包含的文本说“对不起,我们只需要确保您不是机器人。为了获得最佳效果,请确保您的浏览器接受 cookie。”,有一个提示你。

标签: python html python-3.x beautifulsoup python-requests


【解决方案1】:

您需要将解析器更改为lxml并使用headers = {'user-agent': 'Mozilla/5.0'}

def getAmazonPrice(productUrl):
    headers = {'user-agent': 'Mozilla/5.0'} # to make the server think its a web browser and not a bot
    res = requests.get(productUrl, headers=headers)
    res.raise_for_status()


    soup = bs4.BeautifulSoup(res.text, 'lxml')
    elems = soup.select_one('#mediaNoAccordion > div.a-row > div.a-column.a-span4.a-text-right.a-span-last')
    return elems.text.strip()


price = getAmazonPrice('https://www.amazon.com/Automate-Boring-Stuff-Python-2nd-ebook/dp/B07VSXS4NK/ref=sr_1_1?crid=30NW5VCV06ZMP&dchild=1&keywords=automate+the+boring+stuff+with+python&qid=1586810720&sprefix=automate+the+bo%2Caps%2C288&sr=8-1')
print('The price is ' + price)

快照


如果你想使用选择然后

def getAmazonPrice(productUrl):
    headers = {'user-agent': 'Mozilla/5.0'} # to make the server think its a web browser and not a bot
    res = requests.get(productUrl, headers=headers)
    res.raise_for_status()


    soup = bs4.BeautifulSoup(res.text, 'lxml')
    elems = soup.select('#mediaNoAccordion > div.a-row > div.a-column.a-span4.a-text-right.a-span-last')
    return elems[0].text.strip()


price = getAmazonPrice('https://www.amazon.com/Automate-Boring-Stuff-Python-2nd-ebook/dp/B07VSXS4NK/ref=sr_1_1?crid=30NW5VCV06ZMP&dchild=1&keywords=automate+the+boring+stuff+with+python&qid=1586810720&sprefix=automate+the+bo%2Caps%2C288&sr=8-1')
print('The price is ' + price)

试试这个。

def getAmazonPrice(productUrl):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}  # to make the server think its a web browser and not a bot
    res = requests.get(productUrl, headers=headers)
    res.raise_for_status()


    soup = bs4.BeautifulSoup(res.text, 'lxml')
    elems = soup.select('#mediaNoAccordion > div.a-row > div.a-column.a-span4.a-text-right.a-span-last')
    return elems[0].text.strip()


price = getAmazonPrice('https://www.amazon.com/Automate-Boring-Stuff-Python-2nd-ebook/dp/B07VSXS4NK/ref=sr_1_1?crid=30NW5VCV06ZMP&dchild=1&keywords=automate+the+boring+stuff+with+python&qid=1586810720&sprefix=automate+the+bo%2Caps%2C288&sr=8-1')
print('The price is ' + price)

【讨论】:

  • 我收到此错误:bs4.FeatureNotFound:找不到具有您请求的功能的树生成器:lxml。您需要安装解析器库吗?然后我尝试安装 lxml 模块并导入它,现在我得到 return elems.text.strip() AttributeError: 'NoneType' object has no attribute 'text'
  • @MahdeenSky :您需要使用pip install lxml 安装它
  • 我做了,但后来我收到一个新错误:AttributeError: 'NoneType' object has no attribute 'text'
  • @Buddy 检查我的代码:它是 select_one() 不是 select()
  • 我复制粘贴了那个确切的代码,但它不工作我仍然得到属性错误
【解决方案2】:

您的请求将触发亚马逊的 503 错误。也许是由于亚马逊的反刮削努力。所以也许你应该考虑一些其他的方法。

import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'} # to make the server think its a web browser and not a bot

productUrl = 'https://www.amazon.com/Automate-Boring-Stuff-Python-2nd-ebook/dp/B07VSXS4NK/ref=sr_1_1?crid=30NW5VCV06ZMP&dchild=1&keywords=automate+the+boring+stuff+with+python&qid=1586810720&sprefix=automate+the+bo%2Caps%2C288&sr=8-1'

res = requests.get(productUrl, headers=headers)

print (res)

输出:

<Response [503]>

【讨论】:

  • 你能给我一个我可以抓取的网站的例子吗?因为我卡在这一点上,在继续课程之前我需要一些有用的东西
猜你喜欢
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2022-01-11
  • 2022-12-03
  • 2013-03-27
  • 1970-01-01
  • 2020-12-16
  • 2021-09-07
相关资源
最近更新 更多