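【Posted】: 2017-09-09 01:57:57
【Question】:
To learn Scrapy, I am trying to crawl some inner URLs from a list of start_urls. The problem is that not every element of start_urls has inner URLs (for those I would like to return NaN). How can I return the following 2-column dataframe (**):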
visited_link, extracted_link
https://www.example1.com, NaN
https://www.example2.com, NaN
https://www.example3.com, https://www.extracted-link3.com
So far I have tried:
In:
# -*- coding: utf-8 -*-
import pandas as pd
import scrapy

# ToyCrawlerItem is the project's Item class; its import path is not shown here.


class ToySpider(scrapy.Spider):
    name = "toy_example"
    allowed_domains = ["www.example.com"]
    start_urls = ['https:example1.com',
                  'https:example2.com',
                  'https:example3.com']

    def parse(self, response):
        # Anchors nested under the element with id="object"
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a")
        lis_ = []
        for l in links:
            item = ToyCrawlerItem()
            item['visited_link'] = response.url
            item['extracted_link'] = l.xpath('@href').extract_first()
            yield item
            lis_.append(item)
        # Build a dataframe from the items of this response and save it
        df = pd.DataFrame(lis_)
        print('\n\n\n\n\n', df, '\n\n\n\n\n')
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
However, the code above returned:
Out:
extracted_link,visited_link
https://www.extracted-link.com,https://www.example1.com
I tried to handle the None values like this:
if l == None:
    item['visited_link'] = 'NaN'
else:
    item['visited_link'] = response.url
But it does not work. Any idea how to get (**)?
(**) is a dataframe. I know I could export with -o, but I want to do dataframe operations on the result.
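One likely reason the None check never fires: when the XPath matches nothing, `links` is empty, so the body of `for l in links:` runs zero times and no item is created for that page. The following is a minimal sketch of handling the empty case outside the loop, in plain Python rather than Scrapy. The helper name `rows_for_page` is hypothetical, and `hrefs` stands in for the list of strings that an `@href` selection would presumably return via `.extract()` in the spider.

```python
import math


def rows_for_page(visited_url, hrefs):
    """Return one row per extracted href, or a single NaN row if there are none.

    `rows_for_page` is a hypothetical helper, not part of Scrapy.
    """
    if not hrefs:
        # A for loop over an empty list runs zero times, so the fallback
        # row has to be produced outside the loop.
        return [{"visited_link": visited_url, "extracted_link": math.nan}]
    return [{"visited_link": visited_url, "extracted_link": h} for h in hrefs]


rows = []
rows += rows_for_page("https://www.example1.com", [])
rows += rows_for_page("https://www.example3.com",
                      ["https://www.extracted-link3.com"])
for row in rows:
    print(row)
```

Inside `parse`, the same idea would amount to checking whether `links` is empty before looping and yielding one item with a NaN `extracted_link` in that case.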
Update
After reading @rrschmidt's answer, I tried:
def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    lis_ = []
    for l in links:
        item = ToyItem()
        if len(l) == 0:
            item['visited_link'] = 'NaN'
        else:
            item['visited_link'] = response.url
        #item['visited_link'] = response.url
        item['extracted_link'] = l.xpath('@href').extract_first()
        yield item
        print('\n\n\n Aqui:\n\n', item, "\n\n\n")
        lis_.append(item)
    df = pd.DataFrame(lis_)
    print('\n\n\n\n\n', df, '\n\n\n\n\n')
    df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
Even so, it still returned the same wrong output. Could someone help me understand what is going on?
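A second thing that may contribute to the single-row output: `parse` runs once per response, and each call writes to the same CSV path, so every call overwrites the file written by the previous one. Below is a rough sketch of accumulating rows across all pages and writing once at the end. It uses only the standard library so it runs on its own; `LinkTable`, `add_page`, and `write_csv` are hypothetical names. In a spider, the rows could presumably be kept on the spider instance and written from its `closed(reason)` method, or passed to `pd.DataFrame(rows)` at that point.

```python
import csv
import io


class LinkTable:
    """Accumulates (visited_link, extracted_link) rows across many pages.

    Hypothetical helper for illustration; not part of Scrapy or pandas.
    """

    def __init__(self):
        self.rows = []

    def add_page(self, visited_url, hrefs):
        # A page without links still contributes one row, marked "NaN".
        for href in hrefs or ["NaN"]:
            self.rows.append({"visited_link": visited_url,
                              "extracted_link": href})

    def write_csv(self, fileobj):
        # Write all accumulated rows in one pass, header first.
        writer = csv.DictWriter(
            fileobj, fieldnames=["visited_link", "extracted_link"])
        writer.writeheader()
        writer.writerows(self.rows)


table = LinkTable()
table.add_page("https://www.example1.com", [])
table.add_page("https://www.example2.com", [])
table.add_page("https://www.example3.com",
               ["https://www.extracted-link3.com"])

buffer = io.StringIO()
table.write_csv(buffer)
print(buffer.getvalue())
```

Writing once after all pages are processed keeps rows from earlier responses, which a per-response `to_csv` call to a fixed path would otherwise discard.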
【Comments】:
Tags: python pandas beautifulsoup scrapy web-crawler