Scraping a site to move data to multiple csv columns (answers)

【Question title】: scraping site to move data to multiple csv columns
【Posted】: 2014-05-24 04:40:36
【Question】:

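I am scraping a page with multiple categories into a csv. I managed to get the first category into one column, but the data for the second column is not written to the csv. The code I am using: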

import urllib2
import csv
from bs4 import BeautifulSoup
url = "http://digitalstorage.journalism.cuny.edu/sandeepjunnarkar/tests/jazz.html"
page = urllib2.urlopen(url)
soup_jazz = BeautifulSoup(page)
all_years = soup_jazz.find_all("td",class_="views-field views-field-year")
all_category = soup_jazz.find_all("td",class_="views-field views-field-category-code")
with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
    for years in all_years:
        year_won = years.string
        if year_won:
            csv_writer.writerow([year_won.encode('utf-8')])
    for categories in all_category:
        category_won = categories.string
        if category_won:
            csv_writer.writerow([category_won.encode('utf-8')])

It writes the column headers, but it does not write category_won into the second column.

Following your suggestion, I changed it to:

with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
for years, categories in zip(all_years, all_category):
    year_won = years.string
    category_won = categories.string
    if year_won and category_won:
        csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])

But I now get the following error:

    csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
ValueError: I/O operation on closed file

【Comments】:

Tags: python csv beautifulsoup

【Answer 1】:

You can zip() the two lists together:

for years, categories in zip(all_years, all_category):
    year_won = years.string
    category_won = categories.string
    if year_won and category_won:
        csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
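
As a self-contained illustration of the pairing, here is a minimal Python 3 sketch (the code in this thread is Python 2). The `write_pairs` helper and the sample lists are hypothetical stand-ins for the scraped cell strings, and an in-memory buffer replaces `jazz.csv`. Note that `zip()` stops at the shorter list, so it only pairs values correctly when both lists line up one to one.

```python
import csv
import io


def write_pairs(years, categories, out):
    """Write one CSV row per (year, category) pair, skipping incomplete pairs."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Year Won", "Category"])
    for year, category in zip(years, categories):
        # Both values go into the same row, so each lands in its own column.
        if year and category:
            writer.writerow([year, category])


# Hypothetical sample data; None imitates a cell whose .string is empty.
sample_years = ["2012", "2012", None, "1958"]
sample_categories = [
    "Best Improvised Jazz Solo",
    "Best Jazz Vocal Album",
    "Skipped because its year is missing",
    "Best Jazz Performance, Group",
]

buffer = io.StringIO()
write_pairs(sample_years, sample_categories, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The key point is that a single `writerow()` call receives both values; calling `writerow()` once per value, as in the question, puts every value in the first column of its own row.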

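Unfortunately, the HTML of that page is somewhat broken, so you cannot search for table rows as you would expect to.

The next best approach is to search for the year cells, then find the sibling cell for each: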

soup_jazz = BeautifulSoup(page)
with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
    for year_cell in soup_jazz.find_all('td', class_='views-field-year'):
        year = year_cell and year_cell.text.strip().encode('utf8')
        if not year:
            continue
        # Take the first following sibling <td> that carries the category class.
        # Text nodes between cells are not tags, hence the None default.
        category = next((e for e in year_cell.next_siblings
                         if getattr(e, 'name', None) == 'td' and
                            'views-field-category-code' in e.attrs.get('class', [])),
                        None)
        category = category and category.text.strip().encode('utf8')
        if year and category:
            csv_writer.writerow([year, category])

This produces:

Year Won,Category
2012,Best Improvised Jazz Solo
2012,Best Jazz Vocal Album
2012,Best Jazz Instrumental Album
2012,Best Large Jazz Ensemble Album
....
1960,Best Jazz Composition Of More Than Five Minutes Duration
1959,Best Jazz Performance - Soloist
1959,Best Jazz Performance - Group
1958,"Best Jazz Performance, Individual"
1958,"Best Jazz Performance, Group"
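
To try the sibling lookup without fetching the page, here is a minimal Python 3 sketch, assuming `bs4` is installed. The `SAMPLE_HTML` markup and the `year_category_pairs` helper are hypothetical; the markup only imitates the class names used on the page and is not a copy of it. The sketch uses Beautiful Soup's `find_next_sibling()` as a shorter alternative to the `next_siblings` generator in the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: year and category cells sit side by side as siblings.
SAMPLE_HTML = """
<table><tr>
<td class="views-field views-field-year"> 2012 </td>
<td class="views-field views-field-category-code">Best Jazz Vocal Album</td>
<td class="views-field views-field-year">1958</td>
<td class="views-field views-field-category-code">Best Jazz Performance, Group</td>
</tr></table>
"""


def year_category_pairs(html):
    """Return (year, category) tuples by walking from each year cell to its sibling."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for year_cell in soup.find_all("td", class_="views-field-year"):
        year = year_cell.get_text(strip=True)
        # First later sibling <td> that has the category class, or None.
        category_cell = year_cell.find_next_sibling(
            "td", class_="views-field-category-code"
        )
        if year and category_cell is not None:
            category = category_cell.get_text(strip=True)
            if category:
                pairs.append((year, category))
    return pairs


pairs = year_category_pairs(SAMPLE_HTML)
for year, category in pairs:
    print(year, category)
```

Like the generator version, this picks the first matching sibling after the year cell, so a year cell without its own category cell would be paired with a later one.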

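【Comments】:

Just tried it, and I have now listed the error I get above.
@user1922698: Then you are trying to run the loop outside of the with statement.
But the code above produces the same category again and again, even though they are all different categories.
@user1922698: Ah, that is because there was a bug there; I had left a test reference in the code.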
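
The closed-file error from the comments can be reproduced in isolation. This is a minimal Python 3 sketch, not the asker's exact code: the temporary path and the sample row are hypothetical. It shows that a writer created inside a `with` block stops working once the block ends, because the file is closed at that point.

```python
import csv
import os
import tempfile

# Hypothetical scratch file so the demo does not touch a real jazz.csv.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Wrong: the write happens after the with block has closed the file.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Year Won", "Category"])

try:
    writer.writerow(["2012", "Best Jazz Vocal Album"])  # file is already closed
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print("error:", error_message)

# Right: every write is indented inside the with block.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Year Won", "Category"])
    writer.writerow(["2012", "Best Jazz Vocal Album"])

with open(path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

In the question's second snippet the `for` loop sits at the same indentation as `with`, so it runs after the file is closed; indenting the loop one level fixes it.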