Other than text, how to remove numbers, punctuation, white spaces and special characters from text? [duplicate]

【Question title】: Other than text, how to remove numbers, punctuation, white spaces and special characters from text? [duplicate]
【Posted】: 2020-08-02 13:31:50
【Question description】:

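I just scraped text data from a website, and the data contains numbers, special characters and punctuation. After splitting the data I tried to keep only the plain words, but I still get spaces, numbers and special characters. How can I remove all of these and keep just the words?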

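import urllib.request

url = 'http://www.example.com'  # urlopen() needs a scheme such as http://
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)  # get_text() is not defined in the post
extracted_data = text.split()  # split on any run of whitespace
refined_data = []
SYMBOLS = '{}()[].,:;+-*/&|<>=~0123456789'
for i in extracted_data:
    # tests the whole token against SYMBOLS, not its individual characters
    if i not in SYMBOLS:
        refined_data.append(i)
print("\n", "$" * 50, "HEYAAA we got around: ", len(refined_data), " of keywords! Here are they: ", "$" * 50, "\n")
print(type(refined_data))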


Output:

1. My
2. system
3. showing
4. error
5. 404
6. I
7. don't
8. understand
9. why
10. it
11. showing ,
12. like
13. this?
14. 53251
15. $45

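【Comments on the question】:

Since what you are asking covers many cases, it would be better to show sample text and the desired output.
@ashishmishra I just added a sample output. The extracted text contains a lot more punctuation, spaces, numbers and special characters, so I want to clean all of that out of my text and keep it simple.

Tags: python-3.x urllib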

【Solution 1】:

extracted_data is the result of string.split().

Called this way, with no arguments, string.split() splits your text on any run of whitespace.
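
As a quick illustration of that splitting behaviour, here is a minimal sketch. The sample string is made up for this example and is not taken from the scraped page.

```python
# Hypothetical sample text with mixed whitespace (double space, tab, newline).
sample = "My  system\tshowing\nerror 404"

# With no arguments, split() treats any run of whitespace as one separator
# and never returns empty strings.
tokens = sample.split()
print(tokens)
```

So the tokens themselves never contain whitespace; whatever punctuation or digits were attached to a word stay attached to it.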

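The not in operator compares i (the whole token) against the sequence. Your sequence here is just a string, so for a one-character token it behaves like a list of that string's individual characters; for a longer token, Python checks whether the whole token occurs inside SYMBOLS as a substring.

So is 'system' in the sequence SYMBOLS? Put another way: does the string 'system' occur anywhere in SYMBOLS? No, it does not. So the body of your if statement runs and the token is appended to your result.

Is '53251' in SYMBOLS? No, it is not. So it is appended as well.

And so on.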
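
The checks above can be reproduced directly. This is only a sketch: SYMBOLS is copied from the question, while the token list is picked by hand for illustration.

```python
SYMBOLS = '{}()[].,:;+-*/&|<>=~0123456789'

# Hand-picked tokens: some from the question's output, plus two edge cases.
tokens = ['system', '53251', '$45', ',', '0123']

# For each token, record whether the question's filter would keep it.
kept = {token: token not in SYMBOLS for token in tokens}
for token, is_kept in kept.items():
    print(repr(token), 'kept' if is_kept else 'dropped')
```

Only tokens that appear inside SYMBOLS as a whole, such as a lone ',' or the run '0123', are dropped. Anything else, including pure numbers like '53251', passes the test and is appended.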

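A list comparison like this is unnecessary. You should use str.strip(), which removes the given characters from both ends of each token.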
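
One way this could look is sketched below. The function name clean_tokens and the sample sentence are assumptions made for illustration, and the set of characters to strip is taken from the standard string module rather than from the SYMBOLS string in the question, so that '?' and '$' are covered too.

```python
import string

# Assumption: strip every ASCII punctuation character and digit.
STRIP_CHARS = string.punctuation + string.digits

def clean_tokens(text, strip_chars=STRIP_CHARS):
    """Split text on whitespace and trim strip_chars from both ends of each token."""
    cleaned = []
    for token in text.split():
        word = token.strip(strip_chars)
        if word:  # skip tokens that were only punctuation or digits
            cleaned.append(word)
    return cleaned

# Hypothetical input resembling the output shown in the question.
sample = "My system showing error 404 I don't understand why it showing , like this? 53251 $45"
result = clean_tokens(sample)
print(result)
```

Note that str.strip() only trims the ends of a token, which is why the apostrophe inside "don't" survives. Removing characters from the middle of a token would need a different tool, such as str.translate() or re.sub().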

【Comments】:

Where should I use it?