Auto-extracting columns from nested dictionaries in pandas: answers

【Question title】: Auto-extracting columns from nested dictionaries in pandas
【Posted】: 2021-08-11 17:11:49
【Question description】:

I have a column, loaded from a jsonl file, whose cells hold several nested dictionaries, like this:

    `df['referenced_tweets'][0]`

which produces (output shortened):

  'id': '1392893055112400898',
  'public_metrics': {'retweet_count': 0,
   'reply_count': 1,
   'like_count': 2,
   'quote_count': 0},
  'conversation_id': '1392893055112400898',
  'created_at': '2021-05-13T17:22:37.000Z',
  'reply_settings': 'everyone',
  'entities': {'annotations': [{'start': 65,
     'end': 77,
     'probability': 0.9719000000000001,
     'type': 'Person',
     'normalized_text': 'Jill McMillan'}],
   'mentions': [{'start': 23,
     'end': 36,
     'username': 'usasklibrary',
     'protected': False,
     'description': 'The official account of the University Library at USask.',
     'created_at': '2019-06-04T17:19:12.000Z',
     'entities': {'url': {'urls': [{'start': 0,
         'end': 23,
         'url': '*removed*',
         'expanded_url': 'http://library.usask.ca',
         'display_url': 'library.usask.ca'}]}},
     'name': 'University Library',
     'url': '....',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1278828446026629120/G1w7t-HK_normal.jpg',
     'verified': False,
     'id': '1135959197902921728',
     'public_metrics': {'followers_count': 365,
      'following_count': 119,
      'tweet_count': 556,
      'listed_count': 9}}]},
  'text': 'Wonderful session with @usasklibrary Graduate Writing Specialist Jill McMillan who is walking SURE students through the process of organizing/analyzing a literature review! So grateful to the library -- our largest SURE: Student Undergraduate Research Experience partner!', 
...

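My goal is to write a function that automatically extracts specific fields (e.g. text, type) across the whole dataframe, not just one row. So I wrote this function: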

### x = df['referenced_tweets']

def extract_TextType(x):
    dic = {}
    for i in x:
        if i != " ":
            new_df= pd.DataFrame.from_dict(i)
            dic['refd_text']=new_df['text']
            dic['refd_type'] = new_df['type']
        else:
            print('none')
    return dic

But running the function:

df['referenced_tweets'].apply(extract_TextType)

raises the error:

ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.

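The point is to extract these two nested fields (text and type) from the original referenced_tweets column and match them to their original rows.

What am I doing wrong?

P.S. A snapshot of the original df is below: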

【Comments on the question】:

Tags: python-3.x pandas dictionary twitter jsonlines

【Solution 1】:

There are a few things to consider here. referenced_tweets contains a list, so the line new_df = pd.DataFrame.from_dict(i) most likely cannot parse the data the way you are passing it in.
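One plausible way this exact error arises can be reproduced in isolation. The sketch below assumes a single tweet dict whose top-level values mix a nested dict with a list; the `tweet` data and the `some_list_field` key are hypothetical stand-ins, since the real output was shortened. In that situation pandas cannot decide on an index and raises the ValueError. Reading the two keys directly sidesteps the problem.

```python
import pandas as pd

# Hypothetical, heavily trimmed stand-in for one referenced tweet.
tweet = {
    "id": "1",
    "public_metrics": {"like_count": 2},  # a nested dict value
    "some_list_field": ["a", "b"],        # a list value (assumed field name)
    "text": "hello",
    "type": "replied_to",
}

# Building a DataFrame from a dict that mixes dict values and list values
# is ambiguous, so pandas refuses.
try:
    pd.DataFrame.from_dict(tweet)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# No DataFrame is needed to read two fields from one tweet.
extracted = {"refd_text": tweet["text"], "refd_type": tweet["type"]}
print(extracted)
```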

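Also, since that list can hold more than one tweet, you are right to iterate over it, but you do not need to put each item into a df. Using .apply() will also create a new dict in every cell. If that is what you want, fine. If you really just want a new dataframe, you can adapt the following. I do not have access to referenced_tweets, so I am using entities as an example. Here is my example: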

# Keep only the rows whose entities cell is populated
ents = df[df.entities.notnull()]['entities']

# Collect one flat dict per hashtag, then build the dataframe once at the end
dict_hold_list = []
for ent in ents:
    # ent is the entities dict of one tweet; ent['hashtags'] is a list of dicts
    for htag in ent['hashtags']:
        dict_hold_list.append({'text': htag['text'], 'indices': htag['indices']})
df_hashtags = pd.DataFrame(dict_hold_list)

Since you did not provide a usable json sample or dataframe, I cannot test this, but your solution would probably look like this:

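# Keep only the rows whose referenced_tweets cell is populated
refs = df[df.referenced_tweets.notnull()]['referenced_tweets']

# Collect one flat dict per referenced tweet, then build the dataframe once
dict_hold_list = []
for ref in refs:
    # ref is the list of referenced tweets for one row
    for r in ref:
        # r is a single tweet dict; pull out only the two fields of interest
        dict_hold_list.append({'text': r['text'], 'type': r['type']})
df_ref_tweets = pd.DataFrame(dict_hold_list)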
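The loop above drops the link back to the original rows, which the question asks to keep. A minimal sketch of one way to preserve it follows, assuming each non-null cell is a list of dicts. The toy `df`, the `row` column, and the `refd_text` / `refd_type` names are illustrative choices, not part of the original data. It records the source row label for every extracted tweet and joins the result back on the index.

```python
import pandas as pd

# Toy stand-in for the real dataframe (hypothetical values).
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "referenced_tweets": [
        [{"text": "first", "type": "replied_to"}],
        None,
        [{"text": "second", "type": "quoted"},
         {"text": "third", "type": "retweeted"}],
    ],
})

# Skip rows that have no referenced tweets.
refs = df["referenced_tweets"].dropna()

rows = []
for row_label, ref_list in refs.items():
    for r in ref_list:
        # Remember which original row each referenced tweet came from.
        rows.append({
            "row": row_label,
            "refd_text": r.get("text"),
            "refd_type": r.get("type"),
        })

df_ref_tweets = pd.DataFrame(rows).set_index("row")

# Left join on the index: rows without referenced tweets get NaN, and rows
# with several referenced tweets are repeated once per tweet.
joined = df.join(df_ref_tweets)
print(joined[["id", "refd_text", "refd_type"]])
```

Using `r.get(...)` rather than `r[...]` is a defensive choice here: a tweet that lacks one of the keys yields a missing value instead of a KeyError.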

【Comments】: