【Question title】: Replace specific words by user dictionary and others by 0
【Posted】: 2019-05-10 01:48:15
【Question description】:

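So I have a review dataset containing reviews like

Simply the best. I bought this last year. Still using. No problems faced till date.Amazing battery life. Works fine in darkness or broad daylight. Best gift for any book lover.

(This is from the original dataset; in my processed dataset I have removed all punctuation and lowercased everything.)

What I want to do is replace some words with 1 (according to my dictionary) and all other words with 0. My dictionary is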

dict = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}

I want my output to look like this:

0010000000000001000000000100000

I used this code:

df['newreviews'] = df['reviews'].map(dict).fillna("0")

This always returns 0 as the output. That is not what I want, so I made the 1s and 0s strings, but I still get the same result. Any suggestions on how to fix this?

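【Comments】:

  • You aren't splitting the string anywhere for this mapping to work. You also shouldn't use dict as a variable name, as it masks Python's built-in dict type.
  • @AChampion How do I split the string to make the map work?
  • Post a testable snippet of your df['reviews']
  • You probably want to do something like: df.reviews.str.split().apply(lambda review: ''.join(d.get(word, '0') for word in review)), assuming you have already lowercased and removed all punctuation (and renamed dict to d).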
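
A minimal sketch of why looking up the whole review never matches, and of the per-word lookup suggested in the last comment. It uses plain Python so it runs without pandas; the shortened dictionary `d`, the sample `review`, and the helper name `encode` are illustrative assumptions, not part of the original post.

```python
# Shortened stand-in for the question's dictionary (assumption: a subset is enough here).
d = {"best": "1", "i": "1", "amazing": "1"}

# Assumed sample: already lowercased with punctuation removed, as the question describes.
review = "simply the best i bought this last year"

# Looking up the whole review as a single key finds nothing,
# which is why mapping the column directly falls through to the "0" fill value.
whole_lookup = d.get(review, "0")

def encode(text, mapping):
    """Return one character per word: the mapped value, or "0" if the word is not a key."""
    return "".join(mapping.get(word, "0") for word in text.split())

encoded = encode(review, d)
print(whole_lookup)  # 0
print(encoded)       # 00110000
```

With pandas, the same helper could be applied row by row, for example `df['reviews'].apply(lambda x: encode(x, d))`.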

Tags: python python-3.x pandas dictionary dataframe


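【Solution 1】:

First, don't use dict as a variable name, because it shadows Python's built-in dict type. Then use a list comprehension with get to replace unmatched words with 0.

Notice:

If the data contains text like date.Amazing, with no space after the punctuation, replace the punctuation with a space rather than removing it.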

import pandas as pd

df = pd.DataFrame({'reviews':['Simply the best. I bought this last year. Still using. No problems faced till date.Amazing battery life. Works fine in darkness or broad daylight. Best gift for any book lover.']})

d = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}

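# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['reviews']  = df['reviews'].str.replace(r'[^\w\s]+', ' ', regex=True).str.lower()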

df['newreviews'] = [''.join(d.get(y, '0')  for y in x.split()) for x in df['reviews']]

Alternative:

df['newreviews'] =  df['reviews'].apply(lambda x: ''.join(d.get(y, '0')  for y in x.split()))

print (df)
                                             reviews  \
0  simply the best  i bought this last year  stil...   

                        newreviews  
0  0011000000000001000000000100000  

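【Comments】:

  • Note: the OP claims to have already lowercased the text and removed punctuation, so you may be doing too much :). You also miss 'Amazing', because there is no space around the punctuation: '... date.Amazing ...'
  • @AChampion - Thanks, the solution should be to replace punctuation with a space.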
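
The difference between deleting punctuation and replacing it with a space can be sketched with the standard `re` module. The `raw` string is a fragment of the sample review; the variable names are illustrative assumptions.

```python
import re

raw = "No problems faced till date.Amazing battery life."

# Deleting punctuation fuses "date" and "Amazing" into a single token.
deleted = re.sub(r"[^\w\s]+", "", raw).lower().split()

# Replacing punctuation with a space keeps them as separate words.
spaced = re.sub(r"[^\w\s]+", " ", raw).lower().split()

print(deleted)  # ['no', 'problems', 'faced', 'till', 'dateamazing', 'battery', 'life']
print(spaced)   # ['no', 'problems', 'faced', 'till', 'date', 'amazing', 'battery', 'life']
```

Only the second variant leaves "amazing" as a word that the dictionary lookup can find.
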
【Solution 2】:

You can do this:

df.replace(repl, regex=True, inplace=True)

Here df is your dataframe and repl is your dictionary.
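
A caveat worth checking before relying on this: as far as I understand, with `regex=True` each dictionary key is treated as a pattern that can match inside longer words, and words that are not keys are left unchanged rather than turned into 0. The sketch below uses Python's `re` module instead of pandas to illustrate both points, then shows a one-pass alternative with a replacement callback. The small `repl` dictionary, the `text` sample, and the variable names are illustrative assumptions.

```python
import re

repl = {"best": "1", "i": "1"}
text = "simply the best i bought this"

# Applying each key as an unanchored pattern also hits the "i" inside other words,
# and leaves non-dictionary words as they are.
naive = text
for pattern, value in repl.items():
    naive = re.sub(pattern, value, naive)

# One pass over whitespace-delimited tokens: dictionary words become their value, the rest "0".
tokens = re.sub(r"\S+", lambda m: repl.get(m.group(), "0"), text)
encoded = "".join(tokens.split())

print(naive)    # s1mply the 1 1 bought th1s
print(encoded)  # 001100
```

The callback form looks up each whole token exactly once, so short keys such as "i" cannot match inside longer words.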

【Comments】:

【Solution 3】:

You can do it like this:

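# sent is the raw review string and mydict is the dictionary from the question

# clean the sentence: replace periods with spaces so that words on either
# side of a period (e.g. "date.Amazing") stay separate
import re
sent = re.sub(r'\.', ' ', sent)

# convert to list
sent = sent.lower().split()

# get values from dict using comprehension
new_sent = ''.join(['1' if x in mydict else '0' for x in sent])
print(new_sent)

0011000000000001000000000100000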
    

【Comments】:
