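[Posted]: 2021-09-10 11:40:53
[Question]:
I ran the following code on about 20k rows of data. The code works and I get the expected output, but it is very slow: it took almost 45 minutes to finish. Can someone suggest a suitable solution?
Code: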
import numpy as np
import pandas as pd
import re
def demoji(text):
    # Remove emoji and related symbols from a string.
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
df = pd.read_csv("data.csv")
print(df['Body'])
tweets=df.replace(to_replace=[r"\\t|\\n|\\r", "\t|/n|/r|w/|\n|w/|Quote::"], value=["",""], regex=True)
tweets[u'Body'] = tweets[u'Body'].astype(str)
tweets[u'Body'] = tweets[u'Body'].apply(demoji)
# Preprocessing: remove "RT @blablabla:"
tweets['tweetos'] = ''
# Add tweetos: the first word of each tweet
for i in range(len(tweets['Body'])):
    try:
        tweets['tweetos'][i] = tweets['Body'].str.split(' ')[i][0]
    except AttributeError:
        tweets['tweetos'][i] = 'other'
# Preprocessing tweetos: keep only values that contain '@'
for i in range(len(tweets['Body'])):
    if tweets['tweetos'].str.contains('@')[i] == False:
        tweets['tweetos'][i] = 'other'
# Remove URLs, RTs, and Twitter handles
for i in range(len(tweets['Body'])):
    tweets['Body'][i] = " ".join([word for word in tweets['Body'][i].split()
                                  if 'http' not in word and '@' not in word and '<' not in word])
This code removes special characters such as /n and Twitter mentions; it is basically text cleaning.
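A likely cause of the slowdown, going by the loops in the question: `tweets['Body'].str.split(' ')` and `tweets['tweetos'].str.contains('@')` are evaluated over the whole column on every iteration, so the work grows roughly quadratically with the number of rows. Below is a minimal sketch of the same cleaning done once per column. The `clean_tweets` helper name and the sample data are hypothetical, and the sketch assumes `Body` has already been run through `demoji`.

```python
import pandas as pd


def clean_tweets(tweets: pd.DataFrame) -> pd.DataFrame:
    """Vectorized take on the three loops (hypothetical helper, not the author's code)."""
    out = tweets.copy()
    body = out["Body"].astype(str)

    # First space-separated token of each tweet, computed once for the whole column.
    first = body.str.split(" ").str[0]
    # Keep the token only if it contains '@'; otherwise label the row 'other'.
    out["tweetos"] = first.where(first.str.contains("@", regex=False), "other")

    # Drop words that look like URLs, handles, or markup.
    def drop_noise(text: str) -> str:
        return " ".join(
            w for w in text.split()
            if "http" not in w and "@" not in w and "<" not in w
        )

    out["Body"] = body.map(drop_noise)
    return out


# Made-up sample rows, only to show the call.
sample = pd.DataFrame({"Body": ["@bob hello http://x.co world", "plain <b> text"]})
cleaned = clean_tweets(sample)
```

Working on a copy and assigning whole columns also avoids the chained `tweets['col'][i] = ...` assignments, which pandas may warn about.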
[Discussion]:
- I could be completely wrong, but you should look into threading in Python. Maybe that will help.
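
For reference, the threading idea could look roughly like the sketch below, which splits a list of texts into chunks and cleans them with `concurrent.futures.ThreadPoolExecutor`. The `clean_text` and `clean_in_threads` names, the chunk size, and the sample strings are assumptions for illustration. One caveat: in CPython, threads usually do not speed up CPU-bound work such as string and regex processing because of the global interpreter lock, so this may bring little benefit unless the per-row work is I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_text(text: str) -> str:
    """Hypothetical per-row cleaner: drop URLs, handles, and markup-like words."""
    return " ".join(
        w for w in text.split()
        if "http" not in w and "@" not in w and "<" not in w
    )


def clean_in_threads(texts, workers=4, chunk_size=1000):
    """Clean texts chunk by chunk in a thread pool, preserving input order."""
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input chunks.
        cleaned_chunks = pool.map(
            lambda chunk: [clean_text(t) for t in chunk], chunks
        )
        return [t for chunk in cleaned_chunks for t in chunk]


# Made-up sample strings, only to show the call.
result = clean_in_threads(
    ["RT @a: hi http://t.co/x", "no noise here"], workers=2, chunk_size=1
)
```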