【Question title】: How to count each word in each row in Python
【Posted】: 2021-10-07 18:34:41
【Question description】:

I have a DataFrame:

0       2021-03-19 20:59:49+06  ...  I only need uxy to hit 20 eod to make up for a...
1       2021-03-19 20:59:51+06  ...                                 Oh this isn’t good
2       2021-03-19 20:59:51+06  ...  lads why is my account covered in more red ink...
3       2021-03-19 20:59:51+06  ...  I'm tempted to drop my last 800 into some stup...
4       2021-03-19 20:59:52+06  ...  The sell offs will continue until moral improves.

I want to use a Counter to count every occurrence of each word, and I want to make sure I only count strings. So I would start with

Counter()
Then, when a word occurs:
Counter(I:1,only:1,need:1....)
Then, when it sees the same word again, that occurrence is added to the word's previous count.

This is what I tried:

import enchant
import pandas as pd
import string
from collections import Counter

from nltk.corpus import stopwords
from stopwords import res

discussion = pd.read_csv('discussion_thread_data.csv', error_bad_lines=False, index_col=False, dtype='unicode')
discussion = discussion.drop_duplicates('text')
discussion = discussion[discussion['text'].notnull()]
print(discussion)
# print(discussion)
d = enchant.Dict("en_US")
stop = stopwords.words('english')
word_bin = Counter()

def clean_word(word):
    res = []
    [res.append(c) for c in word if c not in string.punctuation]
    return ''.join(res)

def word_extractor(text):
    global word_bin
    words = text.split()
    words = set([clean_word(word) for word in words])
    words = [word for word in words if (word != '' and not d.check(word)) and not ['A', 'IM']]
    # words = d.check(words)
    word_bin += Counter(words)
    print(word_bin)

discussion.text.apply(lambda x: word_extractor(x))
word_bin = [word for word, cnt in word_bin.most_common(100)]

print('end')
print(word_bin)

But it keeps printing an empty Counter() for every row. Please help.

【Question comments】:

    Tags: python pandas counter


    【Solution 1】:

    You can try this:

    from collections import Counter
    
    import pandas as pd
    
    # Toy dataframe
    df = pd.DataFrame.from_records(
        [
            ["2021-03-19 20:59:49+06", "I only need uxy to hit 20 eod to make up for a"],
            ["2021-03-19 20:59:51+06", "Oh this isn’t good"],
            ["2021-03-19 20:59:51+06", "lads why is my account covered in more red ink"],
            ["2021-03-19 20:59:51+06", "I'm tempted to drop my last 800 into some stup"],
            ["2021-03-19 20:59:52+06", "The sell offs will continue until moral improves"],
        ]
    )
    df.columns = ["date", "strings"]
    
    # Initialize a counter object
    records = Counter()
    
    # Count words in "strings" column, row by row
    for string in df["strings"].values:
        records.update(string.split(" "))
    
    print(records.most_common(10))
    # Outputs
    [
        ("to", 3),
        ("my", 2),
        ("I", 1),
        ("only", 1),
        ("need", 1),
        ("uxy", 1),
        ("hit", 1),
        ("20", 1),
        ("eod", 1),
        ("make", 1),
    ]
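
As a side note on why the code in the question prints an empty `Counter()` on every row: the filter ends with `and not ['A', 'IM']`, and a non-empty list is always truthy, so that condition is always `False` and every word is discarded. The sketch below reproduces this on a hypothetical word list. The `w not in ('A', 'IM')` form is only a guess at the intended check, not something the question confirms. Note also that `set(...)` in the question collapses repeated words within a row, so a word would be counted at most once per row.

```python
from collections import Counter

# A non-empty list is always truthy, so this expression is always False
print(not ['A', 'IM'])

# Hypothetical cleaned words from one row
words = ["uxy", "eod", "A", "IM"]

# Filter shaped like the one in the question: the last condition is always
# False, so no word survives
as_written = [w for w in words if w != '' and not ['A', 'IM']]

# Probable intent (an assumption): skip the literal tokens 'A' and 'IM'
intended = [w for w in words if w != '' and w not in ('A', 'IM')]

word_bin = Counter()
word_bin += Counter(as_written)  # adds nothing, word_bin stays empty
print(word_bin)

word_bin += Counter(intended)    # now the surviving words are counted
print(word_bin)
```
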
    

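    【Comments】:

    • How would you get rid of English words in this case?
    • So I want to remove words like "to" and "I".
    • That is a different question from the one you posted, but you can build on my answer and add a condition inside the for loop before updating the records, e.g. if string not in english_stopwords:, where english_stopwords is a list of the words you want to exclude.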
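
A minimal sketch of the filtering idea from the last comment, assuming the check is applied to each word rather than to the whole row string. The `english_stopwords` set here is a small hand-written example, not a real stop-word list; in practice it could be built from `nltk.corpus.stopwords.words('english')`, which the question already imports.

```python
from collections import Counter

# Hypothetical stop-word set for illustration only
english_stopwords = {"i", "to", "my", "a", "is", "in", "for", "the", "this", "will"}

# Two of the sample rows from the answer above
rows = [
    "I only need uxy to hit 20 eod to make up for a",
    "lads why is my account covered in more red ink",
]

records = Counter()
for text in rows:
    # Test each word against the stop-word set, lowercasing only for the
    # comparison so the original casing is kept in the counts
    records.update(w for w in text.split() if w.lower() not in english_stopwords)

print(records.most_common(5))
```

Using a set rather than a list keeps each membership test constant-time, which matters once the DataFrame has many rows.
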