【Question title】: pd.DataFrame.from_dict() does not give the expected result
【Posted】: 2019-06-15 08:53:48
【Question description】:

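I am new to Python programming. I want to get the count of each word in this Wikipedia dataset (people_wiki.csv). I was able to count each word, and the result comes back as a dictionary, but I cannot split the dictionary's key-value pairs into separate columns. I tried several approaches (from_dict, from_records, to_frame, pivot_table, etc.). Is this possible in Python? I would appreciate any help.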

Sample dataset:

URI                                           name             text

<http://dbpedia.org/resource/George_Clooney>  George Clooney   'george timothy clooney born may 6 1961 is an american actor writer producer director and activist he has received three golden globe awards for his work as an actor and two academy awards one for acting and the other for producingclooney made his...'

I tried:

clooney_word_count_table = pd.DataFrame.from_dict(clooney['word_count'], orient='index', columns=['word','count'])

I also tried:

clooney['word_count'].to_frame()

Here is my code:

import pandas as pd
from collections import Counter

people = pd.read_csv("people_wiki.csv")
clooney = people[people['name'] == 'George Clooney']

clooney['word_count'] = clooney['text'].apply(lambda x: Counter(x.split(' ')))

clooney_word_count_table = pd.DataFrame.from_dict(clooney['word_count'], orient='index', columns=['word','count'])
clooney_word_count_table

Output:

       word_count
35817   {'george': 1, 'timothy': 1, 'clooney': 9, 'ii': ...

I expect clooney_word_count_table to be an output DataFrame with 2 columns:

word      count
normalize  1
george     3
combat     1
producer   2

【Question discussion】:

    Tags: python dictionary dataframe word-count


    【Solution 1】:

    The problem is that clooney is a DataFrame (containing one row, with index 35817), so clooney['word_count'] is a Series holding a single value (your dictionary of counts) at index 35817.

    DataFrame.from_dict then treats this Series as equivalent to {35817: {'george': 1, ...}}, which is what produces the result that confused you.
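To make that structure concrete, here is a minimal sketch. The Series below is a hypothetical stand-in for clooney['word_count'] (made-up text, same index label as in the question), and building the table from `counts.items()` is one possible approach rather than the only one. The point is that the counts have to be pulled out of the one-element Series before they can become rows.

```python
from collections import Counter
import pandas as pd

# Hypothetical stand-in for clooney['word_count']:
# a one-row Series whose only value is a Counter.
word_count = pd.Series(
    [Counter("george clooney george".split(" "))],
    index=[35817],
)

# The Series has exactly one element; that element is the Counter itself.
counts = word_count.iloc[0]

# Each (word, count) pair becomes one row of a two-column DataFrame.
table = pd.DataFrame(list(counts.items()), columns=["word", "count"])
print(table)
```

With the real data, `clooney['word_count'].iloc[0]` would play the role of `counts`.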

    Adapting this to your example, and assuming you want to produce combined word counts across multiple entries:

    from collections import Counter
    import pandas as pd

    # Load the Wikipedia entries and select the ones we care about
    people = pd.read_csv("people_wiki.csv")
    people_to_process = people[people['name'] == 'George Clooney']

    # Accumulate word counts across all selected entries
    counts = Counter()
    for text in people_to_process['text']:
        counts.update(text.split(' '))

    # Transform the counter into a DataFrame indexed by word
    # (the `columns` argument requires pandas 0.23 or later)
    count_table = pd.DataFrame.from_dict(counts, orient='index', columns=['count'])
    count_table
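If the installed pandas is older than 0.23, `from_dict` does not accept `columns` and raises the TypeError mentioned in the discussion. The sketch below avoids that argument by going through a Series instead. The `texts` Series is made-up sample data standing in for people_to_process['text'], and the final sort is optional. It also produces the two-column word/count layout the question asks for.

```python
from collections import Counter
import pandas as pd

# Hypothetical sample texts standing in for people_to_process['text']
texts = pd.Series([
    "george timothy clooney born may 6 1961",
    "clooney is an american actor",
])

# Accumulate counts across every text
counts = Counter()
for text in texts:
    counts.update(text.split())  # split() with no argument also collapses repeated spaces

# Counter -> Series (words as index) -> two-column DataFrame
count_table = (
    pd.Series(counts, name="count")
    .rename_axis("word")
    .reset_index()
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)
print(count_table)
```

Passing the Counter to `pd.Series` uses the words as the index, and `rename_axis` plus `reset_index` turns that index into a regular `word` column.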
    

    【Discussion】:

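    • Thanks Joe, but the code still throws some errors. (1) c.update gives a word_count of 0. (2) from_dict gives: TypeError: from_dict() got an unexpected keyword argument 'columns'.
    • There was a typo in my answer, but I don't understand the problem you are running into. Based on what I understand, I'll try to add a complete solution to your question!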