EM Dash in CSV causing issues with Pandas

【Question title】: EM Dash in CSV causing issues with Pandas
【Posted】: 2015-10-19 12:56:52
【Question】:

I have run into a problem with em dashes in my raw CSV data file: they prevent Pandas from reading the CSV.

I ran several variations of the following:

    datalocation = filepath
    df = pd.read_csv(datalocation)

This raised the error: 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)'

Other variations included:

    df = pd.read_csv(datalocation, encoding='utf-8')
    df = pd.read_csv(datalocation, encoding='utf-16')

This raised the error: 'UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte'

    df = pd.read_csv(datalocation, na_values=['—'])

This raised the error: 'line contains NULL byte'

If the read succeeded, the data frame should look like the sample table below.

+---------+------+----------+--------+
| Country | Date | Delivery | Region |
+---------+------+----------+--------+
| a       | —    | 10       | foo    |
| b       | —    | 30       | —      |
| c       | 2    | —50      | foo—   |
| —       | —    | 20       | —bar   |
| a       | —    | 40       | bar—   |
| —       | —    | —6—      | bar    |
| b       | —    | 90—      | foo    |
| c       | —    | 70       | bar    |
| a       | —    | 80       | foo    |
| c       | —    | 100      | foo—   |
+---------+------+----------+--------+

After spending time with the resources on SO, I understand that this has something to do with a conflict between Unicode, UTF-8 and ASCII.
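One way to narrow down which encoding is involved is to look at the first few bytes of the file for a byte order mark (BOM). The sketch below is only an illustration: `sniff_bom` is a hypothetical helper, not part of Pandas, and it returns `None` for files without a BOM, in which case the encoding still has to be worked out some other way. A byte 0xff at position 0, as in the 'utf8' error above, is what a UTF-16 little-endian BOM (FF FE) would look like.

```python
import codecs

# Byte order marks, longest first, so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head):
    """Return a Python codec name based on the BOM in `head`, or None.

    `head` is the first few bytes of the file, read in binary mode.
    """
    for bom, codec_name in _BOMS:
        if head.startswith(bom):
            return codec_name
    return None
```

To use it, open the file with `open(path, 'rb')`, pass the result of `read(4)` to `sniff_bom`, and try the returned name as the `encoding` argument of `pd.read_csv`.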

Is there a way to remove all the em dashes before running "pd.read_csv"? Keep in mind that I do not know exactly which cells of the raw CSV file contain em dashes.

【Comments】:

What happens if you try encoding='iso-8859-1'?
@DanielMartin I just tried that and got this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

Tags: python csv pandas unicode utf-8

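【Solution 1】:

I finally figured out how to do this by preprocessing the dataset into a new file before reading it. I'm sharing the approach for anyone else who runs into the same problem.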

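    import pandas as pd

    EMDASH = '\u2014'  # the em dash character

    # Assumption: the raw file is UTF-16 with a byte order mark, which
    # matches the encoding='utf_16_le' passed to read_csv below.
    # The scrubbed copy is written as UTF-16 LE without a BOM.
    with open('original_file_location', 'rt', encoding='utf-16') as infile, \
         open('scrubbed_file', 'wt', encoding='utf_16_le') as outfile:
        for line in infile:
            # Replace every em dash with a plain hyphen
            outfile.write(line.replace(EMDASH, '-'))

    # skipfooter (formerly skip_footer) requires the python engine
    df = pd.read_csv('scrubbed_file', engine='python',
                     encoding='utf_16_le',
                     names=['Country', 'Date', 'Delivery', 'Region'],
                     delimiter='\t',
                     quotechar='"',
                     skiprows=2, skipfooter=2, thousands=',')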
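The same scrub can also be done in memory, without writing a second file. This is a minimal sketch rather than the method used above: `scrub_em_dashes` and `scrub_file` are hypothetical helper names, and the default `encoding="utf-16"` is an assumption that has to match the real file.

```python
import io

EM_DASH = "\u2014"  # the em dash character


def scrub_em_dashes(raw, encoding="utf-16", replacement="-"):
    """Decode raw CSV bytes and return a text buffer without em dashes.

    `encoding` is an assumption about the input. The "utf-16" codec
    uses the byte order mark at the start of the data, if present,
    and strips it while decoding.
    """
    text = raw.decode(encoding)
    return io.StringIO(text.replace(EM_DASH, replacement))


def scrub_file(path, **kwargs):
    """Read `path` as bytes and scrub it with scrub_em_dashes."""
    with open(path, "rb") as infile:
        return scrub_em_dashes(infile.read(), **kwargs)
```

The buffer returned by `scrub_file('original_file_location')` can be passed to `pd.read_csv` in place of a filename, together with the same `names`, `delimiter`, `skiprows` and related options. The `encoding` argument is no longer needed in that call because the text has already been decoded.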

Hopefully this helps anyone who runs into this problem with their data frames.

【Discussion】: