【Question title】: How do I deal with non ascii character from CSV when using json.loads in Python?
【Posted】: 2017-06-27 03:29:48
【Question description】:

I've looked at a few answers, including this, but none of them seems to answer my question.

Here are some sample rows from the CSV:

_id category
ObjectId(56266da778d34fdc048b470b)  [{"group":"Home","id":"53cea0be763f4a6f4a8b459e","name":"Cleaning Services","name_singular":"Cleaning Service"}]
ObjectId(56266e0c78d34f22058b46de)  [{"group":"Local","id":"5637a1b178d34f20158b464f","name":"Balloon Dí©cor","name_singular":"Balloon Dí©cor"}]

Here is my code:

import csv
import sys

from sys import argv
import json


def ReadCSV(csvfile):
    with open('newCSVFile.csv','wb') as g:
        filewriter = csv.writer(g) #, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)

        with open(csvfile, 'rb') as f:
            reader = csv.reader(f) # create reader object
            next(reader) # skip first row

            for row in reader: # go through all the rows
                listForExport = [] # initialize list that will have two items: id and list of categories

                # ID section
                vendorId = str(row[0]) # pull the raw vendor id out of the first column of the csv
                vendorId = vendorId[9:33] # slice to remove the ObjectId label and parentheses
                listForExport.append(vendorId) # add vendor ID as first item in list


                # categories section
                tempCatList = []  # temporary list of categories for second item in listForExport

                # this is line 41, where the error stems from
                categories = json.loads(row[1]) #create's a dict with the categoreis from a given row

                for names in categories:  # loop through the category names using the key 'name'

                    print names['name']

Here is what I get:

Cleaning Services
Traceback (most recent call last):
  File "csvtesting.py", line 57, in <module>
    ReadCSV(csvfile)
  File "csvtesting.py", line 41, in ReadCSV
    categories = json.loads(row[1]) #create's a dict with the categoreis from a given row
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-10: invalid continuation byte

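So the code pulls out the first category, Cleaning Services, but fails as soon as it reaches the non-ASCII characters.

How do I deal with this? I'm happy to just remove all non-ASCII items.

【Comments】:

Have you tried your_string.encode('unicode_escape').decode('utf-8', 'ignore')?
No. Where in the code would I put it?
I'd guess that in this case your_string is names['name'].
Don't you need to pass delimiter=' ' to csv.reader?
@Coldspeed The error stems from two lines above that, at categories = json.loads(row[1])

Tags: python json unicode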

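【Solution 1】:

Since you open the input csv file in rb mode, I assume you are using Python 2.x. The good news is that the csv part is not the problem, because the csv reader reads plain bytes without trying to interpret them. The json module, however, insists on decoding the text to unicode, and it uses utf8 by default. Since your input file is not utf8 encoded, it chokes and raises a UnicodeDecodeError.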
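
To make the failure concrete, here is a minimal sketch that reproduces the same kind of error. The sample bytes are an assumption: the question does not show the file's real bytes, so the "é" is stored here as the single Latin1 byte 0xE9, and the byte position in the message will differ from the one in the traceback above.

```python
import json

# Hypothetical sample row: "é" stored as the single Latin1 byte 0xE9,
# which is not a valid UTF-8 sequence when followed by a plain "c".
raw = b'[{"name": "Balloon D\xe9cor"}]'

try:
    json.loads(raw.decode("utf-8"))
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The decoder stops at the offending byte and never reaches the JSON parser.
print(error)
```

The pure-ASCII row ("Cleaning Services") decodes fine under any of these encodings, which would explain why the first row printed before the crash.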

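Latin1 has a nice property: the unicode code point of any byte is simply the value of that byte, so decoding can never fail. Whether the result makes sense then depends on whether the actual encoding really is Latin1...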
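
A quick way to check that property for yourself (a small sketch, nothing specific to the asker's file):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(bytearray(range(256)))

# Decoding as Latin1 never raises, whatever the bytes are.
decoded = all_bytes.decode("latin-1")

# Each resulting character has the same code point as the byte it came from.
code_points = [ord(ch) for ch in decoded]
print(code_points == list(range(256)))
```

This only guarantees that decoding succeeds. If the file was actually written in some other encoding, the non-ASCII characters will come out as the wrong letters.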

So you could do:

categories = json.loads(row[1], encoding="Latin1")

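Alternatively, if you want to ignore non-ASCII characters, you can first convert the byte string to unicode while ignoring errors, and only then load the json:

categories = json.loads(row[1].decode(errors='ignore'))     # ignore all non ascii characters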
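
For comparison, here is a sketch of both options side by side on a made-up row. The sample bytes are an assumption (again "é" as the Latin1 byte 0xE9). The decoding is done explicitly before calling json.loads, so the same code also runs on Python 3, where json.loads no longer accepts an encoding argument (it was removed in 3.9).

```python
import json

# Hypothetical sample row; the real file's bytes may differ.
raw = b'[{"group":"Local","name":"Balloon D\xe9cor"}]'

# Option 1: treat the bytes as Latin1, so every byte maps to some character.
as_latin1 = json.loads(raw.decode("latin-1"))

# Option 2: decode as ASCII and silently drop every byte above 0x7F.
as_ascii = json.loads(raw.decode("ascii", errors="ignore"))

print(repr(as_latin1[0]["name"]))
print(repr(as_ascii[0]["name"]))
```

Option 1 keeps the character (correctly only if the file really is Latin1), while option 2 loses it entirely, so "Décor" turns into "Dcor".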

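【Comments】:

Awesome!! Thanks for your help. I went with ignore, because those characters would cause problems later in the code.

【Solution 2】:

Your csv content most likely contains some non-ASCII characters.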

import re

def remove_unicode(text):
    if not text:
        return text

    if isinstance(text, str):
        text = str(text.decode('ascii', 'ignore'))
    else:
        text = text.encode('ascii', 'ignore')

    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')

    return remove_ctrl_chars_regex.sub('', text)

...
vendorId = remove_unicode(row[0])
...
categories = json.loads(remove_unicode(row[1]))
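
Note that remove_unicode above relies on Python 2's str.decode. As a rough sketch of the same idea for Python 3 (the name strip_non_ascii is hypothetical, not part of this answer), something like this should behave similarly for both byte strings and text:

```python
import re

# Anything outside printable ASCII (space through tilde) gets removed.
_NON_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7e]")


def strip_non_ascii(text):
    """Return text with everything outside printable ASCII removed."""
    if isinstance(text, bytes):
        # Raw bytes: drop any byte that is not valid ASCII while decoding.
        text = text.decode("ascii", errors="ignore")
    return _NON_PRINTABLE_ASCII.sub("", text)


print(strip_non_ascii(b"Balloon D\xe9cor"))
print(strip_non_ascii(u"Balloon D\u00e9cor"))
```

Like the original, this also strips control characters such as tabs and newlines, which is worth keeping in mind if the delimiter of the file is a tab.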

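【Comments】:

Tried it. Now I get the following error: UnicodeDecodeError: 'utf8' codec can't decode bytes in position 67-68: invalid continuation byte, pointing at the line for row in reader:.
I think your csv contains some other characters besides unicode, so why not just remove them all?
I'd love to. How do I do that?
The problem is simply that the input encoding is not unicode, so I can't imagine how the unicodecsv module could fix it...
@dwstein I think reading the whole file and stripping the non-ASCII characters could be cumbersome. Ignore the non-ASCII characters while reading instead. I've updated my answer. See if it helps.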
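
One way to "ignore non-ASCII characters while reading" is to let the file object do it. This is only a sketch, assuming Python 3 and a comma-separated file with a quoted JSON column; the function name read_categories_ascii_only and the sample file contents are made up for illustration.

```python
import csv
import json
import os
import tempfile


def read_categories_ascii_only(path):
    """Yield (first_column, parsed_json) per data row, dropping non-ASCII bytes."""
    # errors="ignore" discards undecodable bytes before csv or json ever see them.
    with open(path, newline="", encoding="ascii", errors="ignore") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            yield row[0], json.loads(row[1])


# Hypothetical two-line file; "é" is stored as the Latin1 byte 0xE9.
sample = (
    b'_id,category\r\n'
    b'ObjectId(56266e0c78d34f22058b46de),"[{""name"":""Balloon D\xe9cor""}]"\r\n'
)
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(sample)

try:
    rows = list(read_categories_ascii_only(tmp.name))
finally:
    os.remove(tmp.name)

for vendor_id, categories in rows:
    print(vendor_id, [c["name"] for c in categories])
```

On Python 2, io.open accepts the same encoding and errors arguments, but the csv module there expects byte strings, so decoding each field after reading (as in Solution 1) is probably the simpler route.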