Scrapy: Drop Empty Rows From CSV (Answers)

【Question title】: Scrapy Drop Empty Rows From CSV
【Posted】: 2020-08-13 07:05:35
【Question】:

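With the help of pandas, I wrote a scrapy script that scrapes a table and returns a csv. However, the final csv always contains a few empty rows that I have to delete by hand.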

import scrapy
import pandas as pd

class XGSpider(scrapy.Spider):

    name = 'expectedGoals'

    start_urls = [
        'https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures',
    ]

    def parse(self, response):

        matches = []

        for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr'):

            match = {
                'home': row.xpath('td[4]//text()').extract_first(),
                'homeXg': row.xpath('td[5]//text()').extract_first(),
                'score': row.xpath('td[6]//text()').extract_first(),
                'awayXg': row.xpath('td[7]//text()').extract_first(),
                'away': row.xpath('td[8]//text()').extract_first()
            }

            matches.append(match)

        x = pd.DataFrame(
            matches, columns=['home', 'homeXg', 'score', 'awayXg', 'away'])

        yield x.to_csv("xG.csv", sep=",", index=False)

Working with the pandas DataFrame x, I tried x.dropna(), but it did not seem to remove any of the empty values. Here are the first 15 rows of the DataFrame:

print(x.head(15))

              home homeXg score awayXg             away
0        Liverpool    1.8   4–1    1.0     Norwich City
1         West Ham    0.8   0–5    3.0  Manchester City
2          Burnley    0.6   3–0    0.9      Southampton
3          Watford    1.0   0–3    0.6         Brighton
4      Bournemouth    1.1   1–1    1.0    Sheffield Utd
5   Crystal Palace    0.7   0–0    0.8          Everton
6        Tottenham    2.4   3–1    0.7      Aston Villa
7    Newcastle Utd    0.5   0–1    0.9          Arsenal
8   Leicester City    0.6   0–0    0.7           Wolves
9   Manchester Utd    2.3   4–0    0.9          Chelsea
10            None   None  None   None             None
11         Arsenal    0.9   2–1    1.4          Burnley
12     Southampton    1.6   1–2    1.5        Liverpool
13    Norwich City    1.5   3–1    0.7    Newcastle Utd
14        Brighton    1.8   1–1    0.8         West Ham

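I think the match object is returning None for any empty value that gets scraped. Do you know how to produce the final csv without the empty rows?

【Question comments】:

Tags: python pandas csv scrapy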

【Solution 1】:

You can first replace all None values with NaN (this applies if they are the None object and not the string 'None'):

import numpy as np
import pandas as pd

x = x.fillna(value=np.nan)

Then use dropna() to remove every row whose data is empty:

x.dropna(
    axis=0,
    how='all',  # use 'any' to also remove rows with even one empty value
    inplace=True
)

Read more about dropna() in the pandas documentation.
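A minimal, self-contained sketch of this approach follows. The sample rows are hypothetical stand-ins for the scraped data, and the variable names rows_before, cleaned and rows_after are introduced only for illustration. It also shows one possible reason the original x.dropna() call appeared to do nothing (an assumption, since the question does not show how it was called): dropna() returns a new DataFrame and leaves x unchanged unless the result is reassigned or inplace=True is passed.

```python
import pandas as pd

# Hypothetical sample mirroring the scraped rows; the middle row is a spacer.
matches = [
    {'home': 'Manchester Utd', 'homeXg': '2.3', 'score': '4-0',
     'awayXg': '0.9', 'away': 'Chelsea'},
    {'home': None, 'homeXg': None, 'score': None,
     'awayXg': None, 'away': None},
    {'home': 'Arsenal', 'homeXg': '0.9', 'score': '2-1',
     'awayXg': '1.4', 'away': 'Burnley'},
]
x = pd.DataFrame(
    matches, columns=['home', 'homeXg', 'score', 'awayXg', 'away'])

# Calling dropna() without using its return value does not modify x.
x.dropna(how='all')
rows_before = len(x)

# Reassigning the result (or passing inplace=True) actually drops the rows.
cleaned = x.dropna(how='all').reset_index(drop=True)
rows_after = len(cleaned)

print(rows_before, rows_after)
print(cleaned)
```

The call to reset_index(drop=True) is optional; it only renumbers the index so that no gap is left where the empty row used to be.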

【Comments】:

【Solution 2】:

You just need to skip any match whose values are None, for example by adding this condition:

if match['home']:
    matches.append(match)

Output:

              home homeXg score awayXg             away
0        Liverpool    1.8   4-1    1.0     Norwich City
1         West Ham    0.8   0-5    3.0  Manchester City
2          Burnley    0.6   3-0    0.9      Southampton
3          Watford    1.0   0-3    0.6         Brighton
4      Bournemouth    1.1   1-1    1.0    Sheffield Utd
5   Crystal Palace    0.7   0-0    0.8          Everton
6        Tottenham    2.4   3-1    0.7      Aston Villa
7    Newcastle Utd    0.5   0-1    0.9          Arsenal
8   Leicester City    0.6   0-0    0.7           Wolves
9   Manchester Utd    2.3   4-0    0.9          Chelsea
10         Arsenal    0.9   2-1    1.4          Burnley
11     Southampton    1.6   1-2    1.5        Liverpool
12    Norwich City    1.5   3-1    0.7    Newcastle Utd
13        Brighton    1.8   1-1    0.8         West Ham
14         Everton    0.9   1-0    1.1          Watford
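The filtering step can be tried outside a spider with a small sketch like the one below. The scraped list is hypothetical data shaped like the dictionaries built in parse(). Note that it swaps the answer's match['home'] check for any(match.values()), a slightly broader variant that keeps a row as long as at least one field is non-empty; which check is preferable depends on the page, so treat this as an assumption rather than the answer's exact method.

```python
# Hypothetical rows shaped like the dicts the spider's parse() builds.
scraped = [
    {'home': 'Leicester City', 'score': '0-0', 'away': 'Wolves'},
    {'home': None, 'score': None, 'away': None},  # spacer row: all None
    {'home': 'Arsenal', 'score': '2-1', 'away': 'Burnley'},
]

matches = []
for match in scraped:
    # Keep the row only if at least one field holds a non-empty value.
    if any(match.values()):
        matches.append(match)

for match in matches:
    print(match)
```

Filtering before the DataFrame is built means the empty rows never reach pandas, so no later dropna() call is needed.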

【Comments】:

【Solution 3】:

Simply exclude the spacer rows in the XPath (they have class="spacer partial_table"):

for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr[not(@class)]'):

【Comments】: