Web scraping from the Amazon website results in an HTTP error

【Question title】: Web Scraping from Amazon website is giving HTTP Error
【Posted】: 2019-03-13 10:09:36
【Question description】:

I am using Python 3.7.1 and want to scrape the user comments (customer reviews) for the iPhone from the Amazon website (link below).

Link (to be scraped): https://www.amazon.in/Apple-iPhone-Silver-64GB-Storage/dp/B0711T2L8K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1548335262&sr=1-1&keywords=iphone+X

When I run the code below, I get the following error:

Code:

# -*- coding: utf-8 -*-

#import the library used to query a website
import urllib.request         
from bs4 import BeautifulSoup  

#specify the url
scrap_link = "https://www.amazon.in/Apple-iPhone-Silver-64GB-Storage/dp/B0711T2L8K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1548335262&sr=1-1&keywords=iphone+X"
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

#Query the website and return the html to the variable 'page'
page = urllib.request.urlopen(scrap_link) 
#page = urllib.request.urlopen(wiki) 
print(page)

#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page)

print(soup.prettify())

Error:

  File "C:\Users\bsrivastava\AppData\Local\Continuum\anaconda3\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: Service Unavailable

Note: when I scrape the wiki link instead (shown in the code), it works fine.

So why do I get this error with the Amazon link, and how can I get past it?
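
One common reason for a 503 on sites like this is that the server rejects the default `urllib` User-Agent; whether that is the cause here is not confirmed in this thread. The following is only a minimal sketch under that assumption. The header values and the helper names `build_request` and `fetch_html` are illustrative, not part of the original code.

```python
import urllib.error
import urllib.request

# Assumption: a browser-like User-Agent is enough; the exact string is illustrative.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Wrap the URL in a Request object that carries explicit headers."""
    return urllib.request.Request(url, headers=headers)


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the server sends an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as page:
            return page.read()
    except urllib.error.HTTPError as err:
        # Report the status code instead of letting the traceback end the script.
        print(f"HTTP {err.code}: {err.reason}")
        return None
```

`fetch_html(scrap_link)` could then replace the bare `urllib.request.urlopen(scrap_link)` call, and the returned bytes can be passed to `BeautifulSoup(html, 'html.parser')`. The site may still refuse automated requests even with these headers.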

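Also, once I have the customer review data, I need to store it in a structured format like the one below. How do I do that? (I am completely new to NLP, so I need some guidance here.)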

 Structure:
a. Reviewer’s Name 
b. Date of review 
c. Color 
d. Size 
e. Verified Purchase (True or False) 
f. Rating 
g. Review Title 
h. Review Description
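
The eight fields listed above could be modelled as one record per review. This is a rough sketch, not a tested Amazon parser: the `Review` class, the helper names, and the assumed text formats ("4.0 out of 5 stars", "Verified Purchase") are assumptions, and the sample values are made-up placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass
class Review:
    """One customer review, mirroring fields a. to h. of the requested structure."""
    reviewer_name: str
    review_date: str
    color: str
    size: str
    verified_purchase: bool
    rating: float
    title: str
    description: str


def parse_rating(text):
    """Turn text such as '4.0 out of 5 stars' into 4.0 (assumed format)."""
    return float(text.strip().split()[0])


def parse_verified(text):
    """Return True only when the badge text reads 'Verified Purchase' (assumed format)."""
    return text.strip().lower() == "verified purchase"


# Placeholder values for illustration only, not real review data.
sample = Review(
    reviewer_name="Example Reviewer",
    review_date="1 January 2019",
    color="Silver",
    size="64GB",
    verified_purchase=parse_verified("Verified Purchase"),
    rating=parse_rating("5.0 out of 5 stars"),
    title="Example title",
    description="Example description text.",
)
record = asdict(sample)  # plain dict, convenient for CSV or JSON output
print(record)
```

Keeping the parsing helpers separate from the record makes it easier to adjust them if the page's text formats turn out to differ.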

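【Question discussion】:

The question has nothing to do with machine-learning - please don't spam the tag (removed).
Hi, I am trying to run sentiment analysis on this data, which is why I mistakenly added the machine-learning tag. Thanks for the correction.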

Tags: python-3.x web-scraping beautifulsoup nlp

【Solution 1】:

NLP? Are you sure?

import requests
from bs4 import BeautifulSoup

scrap_link = "https://www.amazon.in/Apple-iPhone-Silver-64GB-Storage/dp/B0711T2L8K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1548335262&sr=1-1&keywords=iphone+X"

# Fetch the page and parse it with Python's built-in HTML parser
req = requests.get(scrap_link)
soup = BeautifulSoup(req.content, 'html.parser')

# Each review is expected inside a div with this exact class string
container = soup.find_all('div', attrs={'class': 'a-section review aok-relative'})
data = []
for x in container:
    name_tag = x.find('span', attrs={'class': 'a-profile-name'})
    if name_tag is None:
        continue  # skip blocks without a reviewer name instead of raising AttributeError
    data.append({'ReviewersName': name_tag.text})
print(data)
# later save the list of dictionaries to csv
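
The "save to csv" step mentioned in the last comment is not shown in the answer. A minimal sketch of it, assuming `data` is a list of dictionaries like the one built above; the function name `write_reviews` and the placeholder rows are illustrative.

```python
import csv
import io


def write_reviews(rows, handle, fieldnames):
    """Write a list of dicts as CSV (header row first) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Placeholder rows in the same shape as `data`; not real reviewer names.
rows = [
    {"ReviewersName": "Example Reviewer A"},
    {"ReviewersName": "Example Reviewer B"},
]

# An in-memory buffer keeps the sketch free of file-system side effects.
buffer = io.StringIO()
write_reviews(rows, buffer, fieldnames=["ReviewersName"])
print(buffer.getvalue())
```

To write a real file, the same function can be given `open("reviews.csv", "w", newline="", encoding="utf-8")` as the handle; `newline=""` is what the `csv` module expects for files it writes to.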

【Discussion】: