Getting rid of hex using regex

【Question title】: Getting rid of hex using regex
【Posted】: 2017-07-02 05:14:55
【Question】:

I'm trying to remove some hex sequences (e.g. \xc3) from a text string, and I planned to use a regex to strip them out. Here is my code:

import re
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'    
tweet1 = re.sub(r'\\x[a-f0-9]{2}', '', tweet)
print(tweet1)

However, instead of being removed, the hex sequences show up in the output as encoded characters. Here is my output:

b"[/Very seldom~ will someone enter your life] to questionÃ¢â¬Â¦ "

Does anyone know how I can get rid of those hex sequences? Thanks in advance.

【Comments】:

Tags: python regex hex

【Solution 1】:

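Actually, the problem was in how I had modeled the problem. tweet does not contain the literal characters \xc3\xa2...; those escape sequences are turned into the characters they stand for when the string is declared. So the regex is looking for the four-character text \xc3, but what tweet actually contains at that position is the single character Ã.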
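To make the distinction concrete, here is a minimal sketch (the variable names are illustrative, not from the original post) that compares an escape sequence in a normal string literal with the same text in a raw string, and runs the question's pattern against both:

```python
import re

escaped = '\xc3'   # escape sequence: ONE character, U+00C3 ("Ã")
literal = r'\xc3'  # raw string: FOUR characters: backslash, x, c, 3

# The pattern from the question: a literal backslash, "x", two hex digits
pattern = re.compile(r'\\x[a-f0-9]{2}')

print(len(escaped), len(literal))      # lengths differ: 1 vs 4
print(pattern.search(escaped))         # None: no backslash in the string
print(repr(pattern.sub('', literal)))  # the raw text is matched and removed
```

The pattern only ever matches the raw-string form, which is why it leaves the tweet in the question untouched.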

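The fix is to encode the string as UTF-8 first, convert the result back to a string, and only then use the regex to remove the hex. I got the lead from this post (see the first answer, by Martijn Pieters): python regex: how to remove hex dec characters from string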
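The exact code is not shown in this thread, so the following is only a sketch of one way the encode-then-regex idea could look in Python 3. It assumes the shortened sample string below and uses repr() to turn the UTF-8 bytes into text in which every non-ASCII byte is spelled out as \xNN. As a cross-check, it also removes the non-ASCII characters directly with a character class, which avoids the detour through bytes.

```python
import re

# Shortened sample; the escapes become real characters such as "Ã" and "â"
tweet = 'to question\xc3\xa2\xe2\x82\xac\xc2\xa6'

# Route 1: encode, take the textual form of the bytes, then strip "\xNN"
as_text = repr(tweet.encode('utf-8'))[2:-1]  # drop the b'...' wrapper
cleaned = re.sub(r'\\x[a-f0-9]{2}', '', as_text)

# Route 2: remove every character outside the ASCII range directly
direct = re.sub(r'[^\x00-\x7f]', '', tweet)

print(cleaned)
print(direct)
```

Note that repr() also escapes backslashes, quotes, and newlines, so route 1 can alter text that contains them; route 2 does not have that side effect.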

【Comments】:

【Solution 2】:

You can try something like this:

import re
import string

tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'    
tweet1 = re.sub(r'[^\w\s{}]'.format(string.punctuation), '', tweet)
print(tweet1)

Output:

b"[Very seldom~ will someone enter your life] to question"

Regex:

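[^\w\s{}] - matches everything that is not \w, \s, or a punctuation character (the {} placeholder is filled with string.punctuation).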
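One thing to keep in mind with this pattern: in Python 3, \w is Unicode-aware for str patterns, so letters such as Ã and â count as word characters and are kept. The sketch below is an assumption about how the pattern could be adapted, not part of the original answer: it passes re.ASCII so that \w and \s match ASCII only, and wraps string.punctuation in re.escape so that characters like ] and \ cannot break the character class. The sample string is shortened for illustration.

```python
import re
import string

tweet = '[/Very seldom~] to question\xc3\xa2\xe2\x82\xac\xc2\xa6'

# Same idea as the answer's pattern, with the punctuation escaped
pattern = r'[^\w\s{}]'.format(re.escape(string.punctuation))

unicode_aware = re.sub(pattern, '', tweet)               # Python 3 default
ascii_only = re.sub(pattern, '', tweet, flags=re.ASCII)  # \w, \s ASCII only

print(unicode_aware)  # accented letters survive
print(ascii_only)     # only ASCII letters, digits, whitespace, punctuation
```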

【Comments】:

I'm still getting the same kind of output. Here it is: b"[/Very seldom~ will someone enter your life] to questionÃâÂ " Any idea what I can do?

【Solution 3】:

Try tweet1.decode('ascii','ignore') after applying the regex.

【Comments】:

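I get this error: AttributeError: 'str' object has no attribute 'decode'. Should I encode instead?
Yes: tweet1.encode('ascii','ignore'). In Python 3.x, str no longer has a decode method. My mistake, though you should have mentioned in the tags that this is a Python 3.x question.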
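For reference, a minimal Python 3 sketch of the encode approach, using a shortened sample string. encode() returns a bytes object, so a decode('ascii') step is added here (an addition to the comment above) to get a str back:

```python
tweet1 = 'to question\xc3\xa2\xe2\x82\xac\xc2\xa6'

# errors='ignore' silently drops every character ASCII cannot represent
as_bytes = tweet1.encode('ascii', 'ignore')

# Decode the remaining pure-ASCII bytes to get a str again
as_text = as_bytes.decode('ascii')

print(as_bytes)
print(as_text)
```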