【Posted】: 2020-07-23 21:30:32
【Question】:
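import re

# `stopwords` and `emoji_pattern` are not shown in the post; they are assumed to be defined elsewhere.

def _clean(text):
    text = text.lower()
    # Remove URLs first, before punctuation stripping breaks them apart.
    text = re.sub(r'https?://\S+', '', text)
    # Drop the retweet marker only when it is a standalone word.
    text = re.sub(r'\brt\b', '', text)
    # Decode the HTML entity for an ampersand.
    text = re.sub(r'&amp;', '&', text)
    # Remove placeholders such as <ed> or <u+00a0> (the text is lowercase by now).
    text = re.sub(r'<u\+[a-z0-9]+>|<[a-z0-9]+>', '', text)
    # Strip punctuation, symbols, and digits.
    text = re.sub(r'[?!.;:,#@$&+=0-9-]', '', text)
    # Remove one specific mojibake sequence left by mis-decoded emoji.
    text = re.sub(r'\ ã°â*', '', text)
    words = text.split()
    words = [w for w in words if w not in stopwords]
    text = " ".join(words)
    # Remove any emoji that survived as real Unicode characters.
    text = emoji_pattern.sub(r'', text)
    return text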
So far I have used the code above. I don't know how to clean up text like this:
happy birthday last friday night (tgif) ððððð last friday night tgif ff ......
【Discussion】:
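The stray "ð" characters in the sample look like emoji whose UTF-8 bytes were decoded as Latin-1 somewhere upstream, though the post does not confirm this. A minimal sketch under that assumption: first try to reverse the mis-decoding, and if that fails (for example because some bytes were already lost), fall back to dropping every non-ASCII character. The function names `repair_mojibake` and `strip_non_ascii` are hypothetical, and the fallback is only reasonable if the tweets are expected to be plain English.

```python
import re


def repair_mojibake(text):
    """Try to undo UTF-8 bytes that were mis-decoded as Latin-1.

    Returns the input unchanged if the round trip is not possible.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def strip_non_ascii(text):
    """Replace runs of non-ASCII characters with a space, then tidy whitespace."""
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Build a garbled string the way this kind of debris usually arises (an assumption):
# an emoji encoded as UTF-8, then decoded as Latin-1.
garbled = "tgif " + "\U0001F389".encode("utf-8").decode("latin-1")
print(repr(repair_mojibake(garbled)))

# Fallback for text where the original bytes can no longer be recovered.
print(strip_non_ascii("happy birthday (tgif) ððððð tgif ff"))
```

If the repair succeeds, the recovered emoji are real Unicode characters again, so a pattern such as the `emoji_pattern` in the question can remove them. Calling `strip_non_ascii` before `_clean` would also make the hand-written mojibake regex unnecessary.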