【Posted】: 2015-08-05 14:58:41
【Question】:
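I have a Reddit bot that tries to convert ASCII text into an image. As described in this issue, I'm running into problems encoding special characters.
I have a repo dedicated to this project, but for brevity I'll post only the relevant code. I tried switching to Python 3 (because I've heard it handles Unicode more gracefully than Python 2), but that didn't solve the problem.
This function pulls comments from Reddit. As you can see, I encode everything as utf-8 as soon as I pull it, which is why I'm confused.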
def comments_by_keyword(r, keyword, subreddit='all', print_comments=False):
    """Fetches comments from a subreddit containing a given keyword or phrase

    Args:
        r: The praw.Reddit class, which is required to access the Reddit API
        keyword: Keep only the comments that contain the keyword or phrase
        subreddit: A string denoting the subreddit(s) to look through, default is 'all' for r/all
        limit: The maximum number of posts to fetch, increase for more thoroughness at the cost of increased redundancy/running time
        print_comments: (Debug option) If True, comments_by_keyword will print every comment it fetches, instead of just returning filtered ones
    Returns:
        An array of comment objects whose body text contains the given keyword or phrase
    """
    output = []
    comments = r.get_comments(subreddit, limit=1000)
    for comment in comments:
        # ignore the case of the keyword and comments being fetched
        # Example: for keyword='RIP mobile users', comments_by_keyword would keep 'rip Mobile Users', 'rip MOBILE USERS', etc.
        if keyword.lower() in comment.body.lower():
            print(comment.body.encode('utf-8'))
            print("=====\n")
            output.append(comment)
        elif print_comments:
            print(comment.body.encode('utf-8'))
            print("=====\n")
    return output
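As a side note on the `print(comment.body.encode('utf-8'))` calls above, here is a minimal sketch, assuming Python 3, of what that expression produces. The sample string is made up for illustration and does not come from a real comment.

```python
# Hypothetical sample text; any string with non-ASCII characters behaves the same way.
sample = "caf\u00e9 \u2588\u2588"

# In Python 3, str is already Unicode text; encode() turns it into raw bytes.
encoded = sample.encode("utf-8")

print(type(sample).__name__)    # str
print(type(encoded).__name__)   # bytes

# Printing a bytes object shows its repr (escape sequences), not the characters.
print(repr(encoded))

# Decoding the bytes gives back the original text unchanged.
round_trip = encoded.decode("utf-8")
print(round_trip == sample)
```

So under Python 3 the `encode` call only changes what gets printed to the console. The `comment` objects appended to `output` still carry their body as an ordinary `str`.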
The text is then converted to an image:
# imports needed by this snippet
import html
from io import BytesIO

from PIL import Image, ImageDraw, ImageFont


def str_to_img(str, debug=False):
    """Converts a given string to a PNG image, and saves it to the return variable"""
    # use 12pt Courier New for ASCII art
    font = ImageFont.truetype("cour.ttf", 12)
    # do some string preprocessing
    str = str.replace("\n\n", "\n")  # Reddit requires double newline for new line, don't let the bot do this
    str = html.unescape(str)
    img = Image.new('RGB', (1,1))
    d = ImageDraw.Draw(img)
    str_by_line = str.split("\n")
    num_of_lines = len(str_by_line)
    line_widths = []
    for i, line in enumerate(str_by_line):
        line_widths.append(d.textsize(str_by_line[i], font=font)[0])
    line_height = d.textsize(str, font=font)[1]  # the height of a line of text should be unchanging
    img_width = max(line_widths)  # the image width is the largest of the individual line widths
    img_height = num_of_lines * line_height  # the image height is the # of lines * line height
    # creating the output image
    # add 5 pixels to account for lowercase letters that might otherwise get truncated
    img = Image.new('RGB', (img_width, img_height + 5), 'white')
    d = ImageDraw.Draw(img)
    for i, line in enumerate(str_by_line):
        d.text((0,i*line_height), line, font=font, fill='black')
    output = BytesIO()
    if (debug):
        img.save('test.png', 'PNG')
    else:
        img.save(output, 'PNG')
    return output
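Like I said, I'm encoding everything as utf-8, but special characters don't display correctly. I'm also using Courier New from the official .ttf file, which should support a wide range of characters and symbols, so I'm not sure what the problem is there either.
I feel like it's something obvious. Can anyone enlighten me? It's not ImageDraw, is it? On top of that, text encoding in general seems rather murky to me, so even after reading other StackOverflow posts (and blog posts about encoding), I'm having trouble finding a real solution.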
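To narrow down whether the problem lies in the text or in the font, it may help to see exactly which non-ASCII characters reach `str_to_img`. Below is a minimal sketch using only the standard library. `list_non_ascii` and the sample input are hypothetical, introduced only for illustration, and the preprocessing simply mirrors the two string steps in `str_to_img`.

```python
import html
import unicodedata


def list_non_ascii(text):
    """Return (char, code point, Unicode name) for each distinct non-ASCII character."""
    # Same preprocessing as str_to_img: collapse double newlines, undo HTML escapes.
    text = html.unescape(text.replace("\n\n", "\n"))
    found = []
    for ch in sorted(set(text)):
        if ord(ch) > 127:
            name = unicodedata.name(ch, "<unnamed>")
            found.append((ch, "U+{:04X}".format(ord(ch)), name))
    return found


# Hypothetical input: three box-drawing characters, an HTML escape, and an accented letter.
report = list_non_ascii("\u2554\u2550\u2557\n\nR&amp;D caf\u00e9")
for ch, code_point, name in report:
    # Print only ASCII (code point and name) so the output is safe on any console.
    print(code_point, name)
```

Each reported code point can then be checked against the glyphs that the font file actually contains.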
【Comments】: