How to handle non-ASCII characters in column content when collecting with Spark SQL?

【Question title】: How to handle non-ascii characters in content of columns while collecting using Spark SQL?
【Posted】: 2018-02-16 04:58:08
【Question description】:

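I have a requirement to collect some columns to the Spark driver, and some of those columns contain non-ASCII characters. Collecting them fails with this error: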

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 187: ordinal not in range(128).
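For context, the same kind of error can be reproduced outside Spark. The sketch below is only an illustration: the sample string is hypothetical and not taken from the asker's data. It shows that 0xef is a typical lead byte of a three-byte UTF-8 sequence, which the ascii codec rejects and the utf-8 codec decodes correctly.

```python
# Hypothetical sample text (not from the original data): "abc" followed by
# a fullwidth colon, U+FF1A, whose UTF-8 encoding is the bytes EF BC 9A.
raw = "abc\uff1a".encode("utf-8")

try:
    raw.decode("ascii")          # the ascii codec only accepts bytes below 0x80
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)     # names the offending byte and its position

# Decoding with the codec that matches the data succeeds.
decoded = raw.decode("utf-8")

print(error_message)
```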

Any idea how to apply a UDF to the column content while fetching it, before it is collected to the driver?

I am using PySpark for this.
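One possible shape for such a UDF is sketched below. It is not from the thread: `to_ascii` and `collect_ascii` are hypothetical helper names, the code assumes Python 3, and the conversion is lossy (characters with no ASCII equivalent are dropped), so it only fits cases where an ASCII approximation is acceptable.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only approximation of text; None (SQL NULL) passes through."""
    if text is None:
        return None
    # NFKD splits accented letters into a base letter plus combining marks,
    # so dropping the non-ASCII code points keeps the base letter.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def collect_ascii(df, column):
    """Collect one string column of a DataFrame with to_ascii applied first.

    Hypothetical helper: df is assumed to be a pyspark.sql.DataFrame.
    """
    # Imported lazily so to_ascii can be tried without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    to_ascii_udf = udf(to_ascii, StringType())
    return df.select(to_ascii_udf(col(column)).alias(column)).collect()
```

With a DataFrame read from Hive, a call would look like `rows = collect_ascii(df, "some_column")`, where `some_column` is a placeholder name. If the original characters must be preserved, fixing the output encoding on the driver is the better route than stripping them.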

【Comments】:

How are you reading the data? If you read it from a file, you can set the encoding to utf-8 when reading.
I am reading the data from Hive.

Tags: apache-spark utf-8 pyspark apache-spark-sql ascii

【Solution 1】:

I ran into the same problem. This worked for me:

import sys
import codecs

# Python 2: wrap stdout and stderr so unicode strings are encoded as UTF-8 when written
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)

I found it here: https://chase-seibert.github.io/blog/2014/01/12/python-unicode-console-output.html
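The snippet above appears to target Python 2, where `sys.stdout` accepts byte strings. As a rough Python 3 counterpart (an assumption on my part, not something the answer states), the underlying binary stream can be wrapped in a UTF-8 text wrapper instead. `utf8_text_stream` and `use_utf8_console` are hypothetical names, and the demonstration writes to an in-memory buffer rather than the real console.

```python
import io
import sys


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that str written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


def use_utf8_console():
    """Replace stdout and stderr with UTF-8 wrappers (call once at startup)."""
    sys.stdout = utf8_text_stream(sys.stdout.buffer)
    sys.stderr = utf8_text_stream(sys.stderr.buffer)


# Demonstration on an in-memory buffer instead of the real console.
buffer = io.BytesIO()
writer = utf8_text_stream(buffer)
writer.write("caf\u00e9")
writer.flush()
encoded = buffer.getvalue()   # the UTF-8 bytes that reached the binary stream
```

On Python 3.7 and later, `sys.stdout.reconfigure(encoding="utf-8")` achieves a similar effect without replacing the stream object.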

【Comments】: