Python Reddit bot not correctly encoding special characters

【Question Title】: Python Reddit bot not correctly encoding special characters
【Posted】: 2015-08-05 14:58:41
【Question】:

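I have a Reddit bot that attempts to convert ASCII text into images. I have been running into problems encoding special characters, as described in this issue.

I have a repo dedicated to this project, but for brevity I will post only the relevant code. I tried switching to Python 3 (since I heard it handles Unicode more gracefully than Python 2), but that did not solve the problem.

This function fetches the comments from Reddit. As you can see, I encode everything to utf-8 as soon as I pull it, which is why I am confused.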

def comments_by_keyword(r, keyword, subreddit='all', print_comments=False, limit=1000):
    """Fetches comments from a subreddit containing a given keyword or phrase
    Args:
        r: The praw.Reddit class, which is required to access the Reddit API
        keyword: Keep only the comments that contain the keyword or phrase
        subreddit: A string denoting the subreddit(s) to look through, default is 'all' for r/all
        limit: The maximum number of posts to fetch, increase for more thoroughness at the cost of increased redundancy/running time
        print_comments: (Debug option) If True, comments_by_keyword will print every comment it fetches, instead of just returning filtered ones
    Returns:
        An array of comment objects whose body text contains the given keyword or phrase
    """

    output = []
    comments = r.get_comments(subreddit, limit=limit)

    for comment in comments:
        # ignore the case of the keyword and comments being fetched
        # Example: for keyword='RIP mobile users', comments_by_keyword would keep 'rip Mobile Users', 'rip MOBILE USERS', etc.
        if keyword.lower() in comment.body.lower():
            print(comment.body.encode('utf-8'))
            print("=====\n")
            output.append(comment)
        elif print_comments:
            print(comment.body.encode('utf-8'))
            print("=====\n")
    return output

The comment text is then converted to an image:

import html
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont

def str_to_img(str, debug=False):
    """Converts a given string to a PNG image, and saves it to the return variable"""
    # use 12pt Courier New for ASCII art
    font = ImageFont.truetype("cour.ttf", 12)

    # do some string preprocessing
    str = str.replace("\n\n", "\n") # Reddit requires double newline for new line, don't let the bot do this
    str = html.unescape(str)

    img = Image.new('RGB', (1,1))
    d = ImageDraw.Draw(img)

    str_by_line = str.split("\n")
    num_of_lines = len(str_by_line)

    line_widths = []
    for i, line in enumerate(str_by_line):
        line_widths.append(d.textsize(str_by_line[i], font=font)[0])
    line_height = d.textsize(str, font=font)[1]     # the height of a line of text should be unchanging

    img_width = max(line_widths)                                    # the image width is the largest of the individual line widths
    img_height = num_of_lines * line_height             # the image height is the # of lines * line height

    # creating the output image
    # add 5 pixels to account for lowercase letters that might otherwise get truncated
    img = Image.new('RGB', (img_width, img_height + 5), 'white')
    d = ImageDraw.Draw(img)

    for i, line in enumerate(str_by_line):
        d.text((0,i*line_height), line, font=font, fill='black')
    output = BytesIO()

    if (debug):
        img.save('test.png', 'PNG')
    else:
        img.save(output, 'PNG')

    return output

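Like I said, I am encoding everything to utf-8, but the special characters do not display correctly. I am also using Courier New from the official .ttf file, which should support a wide range of characters and symbols, so I am not sure what the problem is there either.

I feel like this is something obvious. Can anyone enlighten me? It's not ImageDraw, is it? On top of that, the whole topic of text encoding seems a bit ambiguous, so even after reading other StackOverflow posts (and blog posts about encoding), I am struggling to find a real solution.

【Comments】:

Tags: python bots reddit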

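【Solution 1】:

I can't run any tests myself at the moment, and I can't comment because of low rep, so I'm dropping a partial answer in the hope that it gives you some ideas to try. I'm also a bit rusty with Python 2, but let's try..

So, two things. First:

"I encode everything to utf-8 as soon as I pull it"

Do you?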

print(comment.body.encode('utf-8'))
print("=====\n")
output.append(comment)

You are encoding the text you print, but you append the original comment to the output list exactly as praw returned it. Does praw return unicode objects?
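A minimal sketch of that point, written for Python 3. `FakeComment` is a hypothetical stand-in for a praw comment and is not part of praw; the only assumption is that the comment exposes its text as a `body` attribute, as in the question's code. Calling `.encode('utf-8')` returns a new bytes object and leaves the original text untouched, so what ends up in the list is the unencoded comment.

```python
class FakeComment:
    """Hypothetical stand-in for a praw comment; only `body` matters here."""
    def __init__(self, body):
        self.body = body

comment = FakeComment("caf\u00e9 \u2603")   # text containing non-ASCII characters

encoded = comment.body.encode("utf-8")      # a new bytes object, used only for printing
output = [comment]                          # the list still stores the original comment

body_type = type(output[0].body).__name__   # type of the text kept in the list
encoded_type = type(encoded).__name__       # type of the copy that was printed
print(body_type, encoded_type)
```

So the encoding step in the question only affects what is printed, not what is later handed to the image code.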

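I ask because I think unicode objects are what the ImageDraw module wants. Looking at its source, it does not seem to have any idea about the encoding of the text you are trying to render. That means a utf8-encoded Python 2 string would probably be rendered one byte per character, which produces garbage in the output.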
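The kind of garbage this produces can be illustrated without any imaging library. The sketch below is only an illustration in Python 3: it uses Latin-1 as the single-byte interpretation, which is an assumption, since what a renderer actually draws for each byte depends on the font and its character map.

```python
text = "\u00e9"                       # 'é', a single character
utf8_bytes = text.encode("utf-8")     # encoded as two bytes: 0xC3 0xA9

# Treat every byte as its own character (Latin-1 chosen here as an assumption)
garbled = utf8_bytes.decode("latin-1")

print(len(text), len(utf8_bytes), len(garbled))
print(repr(garbled))
```

One character goes in and two unrelated characters come out, which matches the symptom of special characters turning into junk in the rendered image.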

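http://pillow.readthedocs.org/en/latest/reference/ImageFont.html#PIL.ImageFont.truetype mentions an "encoding" parameter, which defaults to unicode. It might be worth setting it explicitly just in case. It may raise an error if the font is not unicode compatible.

Encodings in Python 2 are no fun. But the one thing I would still try is making sure a unicode object is what gets passed to ImageDraw (try unicode(str) or str.decode("utf8")).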
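A rough Python 3 equivalent of that suggestion, as a sketch only. `ensure_text` is a hypothetical helper name, not something from praw or Pillow, and it assumes any bytes it receives are UTF-8. In Python 3 the text type is `str`, which plays the role of Python 2's `unicode`.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it arrived as bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

from_bytes = ensure_text("caf\u00e9".encode("utf-8"))   # bytes in, text out
from_text = ensure_text("caf\u00e9")                     # text passes through unchanged
print(from_bytes == from_text)
```

Calling such a helper on the comment body right before it is drawn would guarantee that the drawing code receives text rather than encoded bytes, whichever form the caller passed in.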

【Comments】: