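【Posted】: 2019-04-21 22:14:16
【Question】:
I'm building a web scraper for one of my projects. The scraping itself works, and I can get all the data I need. My problem is building the DataFrame so I can save it to a CSV file.
I've searched for the error and tried many of the suggested solutions, but I keep getting the same error. Any advice on the code or the error is appreciated. Thanks.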
ValueError: cannot set a row with mismatched columns
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

max_results_per_city = 30
city_set = ['New+York', 'Chicago']
columns = ["city", "job_title", "company_name", "location", "summary"]
database = pd.DataFrame(columns=columns)

for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get('https://www.indeed.com/jobs?q=computer+science&l=' + str(city) + '&start=' + str(start))
        time.sleep(1)  # pause between requests
        soup = BeautifulSoup(page.text, "lxml")
        for div in soup.find_all(name="div", attrs={"class": "row"}):
            # next row label (the original referenced an undefined sample_df here)
            num = (len(database) + 1)
            job_post = []
            job_post.append(city)
            for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
                job_post.append(a["title"])
            company = div.find_all(name="span", attrs={"class": "company"})
            if len(company) > 0:
                for b in company:
                    job_post.append(b.text.strip())
            else:
                sec_try = div.find_all(name="span", attrs={"class": "result-link-source"})
                for span in sec_try:
                    job_post.append(span.text)
            c = div.findAll('div', attrs={'class': 'location'})
            for span in c:
                job_post.append(span.text)
            d = div.findAll('div', attrs={'class': 'summary'})
            for span in d:
                job_post.append(span.text.strip())
            database.loc[num] = job_post

database.to_csv("test.csv")
【Comments】:
-
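Thanks for the question. It looks like you're doing well overall, with a few things tripping you up. The main one is that job_post is a list whose length can differ from one result to the next, whereas a pandas DataFrame and a CSV need the same number of columns in every row. A dict or a tuple would be a better fit.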
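A minimal sketch of the dict approach suggested in this comment. It assumes that only the first match for each field should be kept and that a missing field should become an empty cell. The helper build_row and the sample values are hypothetical and not from the question; in the real scraper the lists would come from the find_all calls.

```python
import pandas as pd

COLUMNS = ["city", "job_title", "company_name", "location", "summary"]


def build_row(city, titles, companies, locations, summaries):
    """Return a dict with exactly one value per column.

    Each argument after `city` is the list of strings scraped for that field.
    An empty list becomes None; matches beyond the first are ignored.
    """
    def first(values):
        return values[0].strip() if values else None

    return {
        "city": city,
        "job_title": first(titles),
        "company_name": first(companies),
        "location": first(locations),
        "summary": first(summaries),
    }


# Hypothetical scraped results: the second posting has no company and no summary.
rows = [
    build_row("New+York", ["Data Scientist"], ["Acme"], ["New York, NY"], ["Build models."]),
    build_row("Chicago", ["Software Engineer"], [], ["Chicago, IL"], []),
]

# Every dict has the same keys, so every row has the same number of columns.
database = pd.DataFrame(rows, columns=COLUMNS)
```

Because each row is built from a fixed set of keys, the row length no longer depends on how many tags were found, and database.to_csv("test.csv") can then be called as in the question.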
Tags: python-3.x pandas dataframe beautifulsoup