JSON error in web scraping code: how to fix it? (Answers)

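【Question title】: JSON Error in Web Scraping code, How to fix?
【Posted】: 2020-06-18 19:41:30
【Question description】:

I am trying to use this code to collect reviews from the ConsumerAffairs review site. But I keep getting errors, specifically in the dateElements and jsonData part. Can someone help me fix this code so that it works with the site I want to scrape?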

from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 5):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []

    link = (f'https://www.consumeraffairs.com/online/allure-beauty-box.html?page={x}')
    print (link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('div', {'class':'rvw js-rvw'})
    for article in articles:
        names.append(article.find('strong', attrs={'class': 'rvw-aut__inf-nm'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class':'rvw-bd'}).text.strip())
        except:
            bodies.append('')

        try:
            ratings.append(article.find('div', attrs={'class':'stars-rtg stars-rtg--sm'}).text.strip())
        except:
            ratings.append('')
        dateElements = article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip()

        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])


    # Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': published, 'Updated Date':updated, 'Reported Date':reported})
    df = df.append(temp_df, sort=False).reset_index(drop=True)

print ('pass1')


df.to_csv('AllureReviews.csv', index=False, encoding='utf-8')
print ('excel done')

This is the error I get:

Traceback (most recent call last):
  File "C:/Users/Sara Jitkresorn/PycharmProjects/untitled/venv/Caffairs.py", line 37, in <module>
    jsonData = json.loads(dateElements)
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

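【Question comments】:

Tags: python json pandas web-scraping beautifulsoup

【Solution 1】:

dateElements does not contain a string that json.loads() can parse, because it is just plain text such as Original review: Feb. 15, 2020. The decoder fails on the very first character, which is why the error reads "Expecting value: line 1 column 1 (char 0)".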
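To see the failure in isolation, here is a minimal sketch that feeds json.loads() the kind of plain text described above. The try_parse helper and the JSON payload are hypothetical and exist only for contrast; they are not taken from the site.

```python
import json

# Plain text of the kind the answer describes; this is not JSON.
text = "Original review: Feb. 15, 2020"
# Hypothetical JSON string, shown only to contrast with the plain text.
payload = '{"publishedDate": "2020-02-15"}'


def try_parse(s):
    """Return the decoded object, or None if s is not valid JSON."""
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        return None


print(try_parse(text))     # None
print(try_parse(payload))  # {'publishedDate': '2020-02-15'}
```

Catching json.JSONDecodeError like this is only useful when some elements might really hold JSON; for this page the text is never JSON, so it is simpler to store the string directly.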

Change these lines to work around the problem:

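# find() returns None when the element is missing, so .text raises AttributeError
try:
    ratings.append(article.find('div', attrs={'class':'stars-rtg stars-rtg--sm'}).text.strip())
except AttributeError:
    ratings.append('')
dateElements = article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip()

# Store the date text as-is instead of passing it to json.loads()
published.append(dateElements)

temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': published})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, temp_df], sort=False).reset_index(drop=True)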

You also have to comment out these two lines:

# updated = []
# reported = []

You still get no data for Body and Rating, but your code now runs without errors.

Printing df gives:

    User Name   Body    Rating  Published Date
0   M. M. of Dallas, GA             Original review: Feb. 15, 2020
1   Malinda of Aston, PA            Original review: Sept. 21, 2019
2   Ping of Tarzana, CA             Original review: July 18, 2019
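
The Published Date column above still carries the "Original review:" prefix. If real dates are needed, a minimal sketch along the following lines could convert it. The helper parse_review_date is hypothetical, and it assumes every value follows the "Month day, year" pattern of the three rows shown; anything else returns None.

```python
import re
from datetime import date

# Month lookup keyed by the first three letters, so "Feb.", "Sept." and "July" all resolve.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def parse_review_date(text):
    """Return a datetime.date for text like 'Original review: Feb. 15, 2020', else None."""
    match = re.search(r"([A-Za-z]+)\.?\s+(\d{1,2}),\s*(\d{4})", text)
    if match is None:
        return None
    month = MONTHS.get(match.group(1)[:3].lower())
    if month is None:
        return None
    return date(int(match.group(3)), month, int(match.group(2)))


print(parse_review_date("Original review: Sept. 21, 2019"))  # 2019-09-21
```

A lookup table is used instead of datetime.strptime because the site mixes abbreviations ("Feb.", "Sept.") with full names ("July"), which no single strptime format covers.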

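【Comments】:

Hmm. Is it because I got the classes wrong for Body and Rating? I am not getting that data.
You can fix Body by changing the p tag to div in the corresponding line of code. Use this: bodies.append(article.find('div', attrs={'class':'rvw-bd'}).text.strip()). Then you get the body data.
If you scrape with requests, you should look at the site's raw HTML to identify tags, IDs, and classes correctly, rather than at the rendered page.
So now I am able to get the body data. But the data collected in Excel contains a lot of duplicates. The review site has 86 reviews, but I got 630 rows of data/reviews. Which part of the code is causing the collected reviews to be duplicated?
I suggest you post this as a new question with the modified code you have so far. If you just want to get rid of the duplicates, you can use pandas' drop_duplicates().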
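
As a rough illustration of the drop_duplicates() suggestion, the sketch below uses a small hand-made dataframe; the rows are hypothetical sample data, not scraped output. It removes exact duplicate rows only and does not explain why the duplicates appear in the first place.

```python
import pandas as pd

# Hypothetical sample rows; the second row repeats the first.
df = pd.DataFrame({
    "User Name": ["M. M. of Dallas, GA", "M. M. of Dallas, GA", "Ping of Tarzana, CA"],
    "Published Date": [
        "Original review: Feb. 15, 2020",
        "Original review: Feb. 15, 2020",
        "Original review: July 18, 2019",
    ],
})

# Keep the first occurrence of each identical row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(df), len(deduped))  # 3 2
```

Passing subset=[...] to drop_duplicates() restricts the comparison to chosen columns, which can help when two rows differ only in whitespace-free fields such as the name and date.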

【Solution 2】:

Building on the code above, we can get the ratings and non-duplicated data as follows:

from bs4 import BeautifulSoup
import requests
import pandas as pd
print('all imported successfully')

# Collect one dataframe per page, then combine them at the end
frames = []
for x in range(1, 5):
    names = []
    bodies = []
    ratings = []
    dateElements = []

    link = f'https://www.consumeraffairs.com/online/allure-beauty-box.html?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('div', {'class': 'rvw js-rvw'})
    for article in articles:
        names.append(article.find('strong', attrs={'class': 'rvw-aut__inf-nm'}).text.strip())
        # find() returns None when the element is missing, so .text raises AttributeError
        try:
            bodies.append(article.find('div', attrs={'class': 'rvw-bd'}).text.strip())
        except AttributeError:
            bodies.append('NA')

        # The rating is read from the content attribute of a <meta> tag, not from element text
        try:
            ratings.append(article.find('meta', attrs={'itemprop': 'ratingValue'})['content'])
        except (TypeError, KeyError):
            ratings.append('NA')
        dateElements.append(article.find('span', attrs={'class': 'ca-txt-cpt'}).text.strip())
    # Build this page's dataframe and keep it for the final concatenation
    frames.append(pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': dateElements}))

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat(frames, ignore_index=True)

print(df)
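
The rating lookup in the code above reads an attribute of a meta tag instead of element text. A minimal sketch of the difference follows; the HTML snippet is hypothetical markup loosely modeled on the selectors used in this thread, so the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may differ.
html = """
<div class="rvw js-rvw">
  <strong class="rvw-aut__inf-nm">M. M. of Dallas, GA</strong>
  <div class="stars-rtg stars-rtg--sm"><meta itemprop="ratingValue" content="1" /></div>
</div>
"""

article = BeautifulSoup(html, "html.parser").find("div", class_="rvw js-rvw")

# The stars container has no text of its own, so .text comes back empty.
stars = article.find("div", attrs={"class": "stars-rtg stars-rtg--sm"})
print(repr(stars.text.strip()))  # ''

# The value sits in the content attribute of the <meta> tag.
meta = article.find("meta", attrs={"itemprop": "ratingValue"})
print(meta["content"])  # 1
```

Under this assumed markup, reading .text from the stars container yields an empty string without raising an error, which would match the empty Rating column seen earlier.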

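【Comments】:

Could you briefly explain the changes you made?
Sure, @petezurich. Previously the except branches did not append any element to the list, so whenever an empty result came back it threw off the values that followed. Appending NA helps in that case. Also, the rating class selected the wrong element, so the rating had to be found through a different attribute. Together these changes help build a correct dataframe.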
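
The point about placeholders can be sketched without any scraping. In the hypothetical data below, the second review has no body; skipping it shifts later values onto the wrong reviewer, while appending "NA" keeps every list the same length. The variable names are illustrative only.

```python
# Hypothetical parsed reviews; the second one has no body text.
reviews = [
    {"name": "A", "body": "Great box"},
    {"name": "B"},
    {"name": "C", "body": "Arrived late"},
]

names, bodies_skipping, bodies_padded = [], [], []
for review in reviews:
    names.append(review["name"])
    body = review.get("body")
    if body is not None:
        bodies_skipping.append(body)  # missing values are skipped, so this list ends up shorter
    bodies_padded.append(body if body is not None else "NA")  # always one entry per review

# Skipping pairs reviewer B with reviewer C's text and drops C entirely.
print(list(zip(names, bodies_skipping)))  # [('A', 'Great box'), ('B', 'Arrived late')]
# Padding keeps each body next to its own reviewer.
print(list(zip(names, bodies_padded)))    # [('A', 'Great box'), ('B', 'NA'), ('C', 'Arrived late')]
```

With pandas, lists of unequal length passed as columns to pd.DataFrame raise a ValueError, so the placeholder also keeps the dataframe construction from failing.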