【Posted】: 2018-01-22 19:01:33
【Question】:
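I am using Python 3 to scrape the pages listed in a Pandas data frame that I created from a csv file containing the source URLs of 63,067 web pages. The for loop is supposed to scrape the news articles for a project and write them to large text files for later cleaning.
I'm a bit rusty with Python, and this project is the reason I've started programming in it again. I haven't used BeautifulSoup before, so I've had some difficulty just getting the for loop to work on the Pandas data frame with BeautifulSoup.
This is one of three datasets I'm working with (the other two are loaded in the code below so that the same process can be repeated on them, which is why I mention it).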
from bs4 import BeautifulSoup as BS
import requests, csv
import pandas as pd

negativedata = pd.read_csv('negativedata.csv')
positivedata = pd.read_csv('positivedata.csv')
neutraldata = pd.read_csv('neutraldata.csv')

negativedf = pd.DataFrame(negativedata)
positivedf = pd.DataFrame(positivedata)
neutraldf = pd.DataFrame(neutraldata)

negativeURLS = negativedf[['sourceURL']]

for link in negativeURLS.iterrows():
    url = link[1]['sourceURL']
    negative = requests.get(url)
    negative_content = negative.text
    negativesoup = BS(negative_content, "lxml")
    for text in negativesoup.find_all('a', href = True):
        text.append((text.get('href')))
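As a side note, the inner loop calls `text.append(...)` on the `Tag` object itself rather than on a list, so the hrefs are not collected anywhere. A minimal sketch of gathering them into a list follows. The helper name `extract_links` and the sample HTML are hypothetical, and the sketch uses the standard-library `html.parser` backend instead of `lxml` so that it runs without extra dependencies:

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> tag in html that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchors that actually carry an href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


# Hypothetical sample page, used only to illustrate the call.
sample = (
    '<p><a href="https://example.com/a">A</a> '
    '<a>no href</a> <a href="/b">B</a></p>'
)
print(extract_links(sample))
```

In the loop, the same function could be called on `negative.text` and its result added to a list that lives outside the loop.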
I think I finally got the for loop working so that the code runs through all of the source URLs. However, I then get this error:
Traceback (most recent call last):
  File "./datacollection.py", line 18, in <module>
    negative = requests.get(url)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 140, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
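I know the problem occurs when I request the URLs, but because of the number of web pages in the data frame being iterated over, I'm not sure which URL is the problem, or whether it is a single URL at all. Is the problem one URL, or do I simply have too many and should use a different package such as scrapy?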
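One way to tell whether a single URL is at fault is to catch the exception per URL and record which requests failed, instead of letting the first failure end the run. The following is only a sketch under assumptions: the function name `find_failing_urls`, the injectable `get` parameter, and the 10-second timeout are hypothetical choices, not part of the original script.

```python
import requests


def find_failing_urls(urls, get=requests.get, timeout=10):
    """Try each URL and return (pages, failures).

    pages    maps url -> response text for requests that succeeded.
    failures maps url -> the exception raised while requesting it.
    """
    pages, failures = {}, {}
    for url in urls:
        try:
            response = get(url, timeout=timeout)
            # Treat 4xx/5xx responses as failures as well (an assumption;
            # drop this line to keep error pages in `pages`).
            response.raise_for_status()
        except requests.exceptions.RequestException as exc:
            # TooManyRedirects is a subclass of RequestException, so a
            # redirect loop on one URL is recorded instead of ending the run.
            failures[url] = exc
            continue
        pages[url] = response.text
    return pages, failures
```

Calling `find_failing_urls(negativedf['sourceURL'])` and then inspecting `failures` would show whether `TooManyRedirects` comes from one URL or from many.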
【Discussion】:
Tags: python pandas web-scraping beautifulsoup python-requests