How to extract certain parts of a string from a column to create other columns in Pandas: answers

【Question title】: How to extract certain parts of a string from column to create other columns in Pandas
【Posted】: 2021-04-16 04:09:02
【Question description】:

I have a DataFrame that looks like this:

Title	Ratings
Do schools kill creativity?	[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]
Simple designs to save a life	[{'id': 9, 'name': 'Ingenious', 'count': 269}, {'id': 3, 'name': 'Courageous', 'count': 92}, {'id': 7, 'name': 'Funny', 'count': 131}, {'id': 2, 'name': 'Confusing', 'count': 42}, {'id': 1, 'name': 'Beautiful', 'count': 91}, {'id': 8, 'name': 'Informative', 'count': 446}, {'id': 10, 'name': 'Inspiring', 'count': 397}, {'id': 22, 'name': 'Fascinating', 'count': 515}, {'id': 11, 'name': 'Longwinded', 'count': 45}, {'id': 21, 'name': 'Unconvincing', 'count': 49}, {'id': 24, 'name': 'Persuasive', 'count': 1234}, {'id': 25, 'name': 'OK', 'count': 73}, {'id': 23, 'name': 'Jaw-dropping', 'count': 139}, {'id': 26, 'name': 'Obnoxious', 'count': 21}]

I would like to parse the data in Ratings so that it looks like this:

Title	Rating	Count
Do schools kill creativity?	Funny	19645
Do schools kill creativity?	Beautiful	4573

I tried exploding the data using } as the delimiter:

#explode ratings by title
df['ratings'] = df['ratings'].str.split('}')
df_explode_ratings = df.explode('ratings').reset_index(drop=True)
cols = list(df_explode_ratings.columns)
cols.append(cols.pop(cols.index('title')))
df_explode_ratings = df_explode_ratings[cols]
df_explode_cols = ['title', 'ratings']
df_explode_ratings = df_explode_ratings.drop(columns=[col for col in df_explode_ratings if col not in df_explode_cols])

This works, but I still need to parse it further. I tried splitting again, but got NaN values in the Ratings column.

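【Comments】:

What happens before you get this DataFrame? It looks like the process that produces this data structure could be redesigned to give you a more useful file. If not, and if you don't have a huge number of rows, you might even be better off looping over the rows and using the json module to load the strings in Ratings.
Hey, thanks. This is a .csv from Kaggle, and it looks like it was dumped from json, so I have no control over the dataset structure in the file.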
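
One caveat on the json suggestion, as a minimal sketch: the strings shown in the question use single quotes (Python repr style), which json.loads rejects, while ast.literal_eval accepts them. The `raw` string below is a shortened copy of the first row's Ratings value, and the variable names are only illustrative.

```python
import json
from ast import literal_eval

# Shortened sample in the same single-quoted style as the Ratings column.
raw = "[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}]"

try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    # JSON requires double-quoted strings, so parsing this text fails.
    json_ok = False

# literal_eval only evaluates Python literals (lists, dicts, strings, numbers).
parsed = literal_eval(raw)

print(json_ok)
print(parsed[0]["name"], parsed[0]["count"])
```

If the file really contained valid JSON (double-quoted keys and values), json.loads would work just as well.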

Tags: python pandas dataframe

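【Solution 1】:

Is your Ratings column a string or a list of dictionaries? If it is a string, you can apply ast.literal_eval and then explode the column (if it is already a list of dictionaries, you can skip the literal_eval step):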

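from ast import literal_eval

# Turn each string into a real list of dicts (skip if it already is one).
df.Ratings = df.Ratings.apply(literal_eval)
# One row per dict; the original row index is repeated for each rating.
df = df.explode("Ratings")
# Pull the fields of interest out of each dict into their own columns.
df["Rating"] = df.apply(lambda x: x["Ratings"]["name"], axis=1)
df["Count"] = df.apply(lambda x: x["Ratings"]["count"], axis=1)
df = df.drop(columns="Ratings")
print(df)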

Prints:

                           Title        Rating  Count
0    Do schools kill creativity?         Funny  19645
0    Do schools kill creativity?     Beautiful   4573
0    Do schools kill creativity?     Ingenious   6073
0    Do schools kill creativity?    Courageous   3253
0    Do schools kill creativity?    Longwinded    387
0    Do schools kill creativity?     Confusing    242
0    Do schools kill creativity?   Informative   7346
0    Do schools kill creativity?   Fascinating  10581
0    Do schools kill creativity?  Unconvincing    300
0    Do schools kill creativity?    Persuasive  10704
0    Do schools kill creativity?  Jaw-dropping   4439
0    Do schools kill creativity?            OK   1174
0    Do schools kill creativity?     Obnoxious    209
0    Do schools kill creativity?     Inspiring  24924
1  Simple designs to save a life     Ingenious    269
1  Simple designs to save a life    Courageous     92
1  Simple designs to save a life         Funny    131
1  Simple designs to save a life     Confusing     42
1  Simple designs to save a life     Beautiful     91
1  Simple designs to save a life   Informative    446
1  Simple designs to save a life     Inspiring    397
1  Simple designs to save a life   Fascinating    515
1  Simple designs to save a life    Longwinded     45
1  Simple designs to save a life  Unconvincing     49
1  Simple designs to save a life    Persuasive   1234
1  Simple designs to save a life            OK     73
1  Simple designs to save a life  Jaw-dropping    139
1  Simple designs to save a life     Obnoxious     21
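
As the output shows, explode repeats the original row index (0, 0, ..., 1, 1, ...). A possible variant, sketched here on a shortened two-row sample rather than the real file, resets the index and lets pd.json_normalize build the columns from the dicts. It assumes every row holds a non-empty list; `exploded`, `details`, and `result` are illustrative names.

```python
from ast import literal_eval

import pandas as pd

# Shortened sample with the same shape as the question's data.
df = pd.DataFrame({
    "Title": ["Do schools kill creativity?", "Simple designs to save a life"],
    "Ratings": [
        "[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}]",
        "[{'id': 9, 'name': 'Ingenious', 'count': 269}]",
    ],
})

# Parse the strings, then produce one row per rating with a fresh 0..n-1 index.
df["Ratings"] = df["Ratings"].apply(literal_eval)
exploded = df.explode("Ratings").reset_index(drop=True)

# json_normalize turns the list of dicts into columns id, name, count.
details = pd.json_normalize(exploded["Ratings"].tolist())

# Both frames share the same RangeIndex, so a column-wise concat lines up.
result = pd.concat([exploded[["Title"]], details[["name", "count"]]], axis=1)
result = result.rename(columns={"name": "Rating", "count": "Count"})
print(result)
```

This avoids the row-wise apply calls, which may matter on larger files, but the result should be the same three columns.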

But as suggested in the comments, it is better to process/parse the data before creating the DataFrame.
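
A rough sketch of that idea, assuming the file is a regular comma-separated .csv with Title and Ratings columns: read it with the csv module, flatten each Ratings list into plain records, and only then build the DataFrame. The `csv_text` string is a hypothetical stand-in for the Kaggle file, whose real name and full column set are not given in the thread.

```python
import csv
import io
from ast import literal_eval

import pandas as pd

# Hypothetical stand-in for the downloaded file (shortened sample rows).
csv_text = (
    "Title,Ratings\n"
    "Do schools kill creativity?,\"[{'id': 7, 'name': 'Funny', 'count': 19645}, "
    "{'id': 1, 'name': 'Beautiful', 'count': 4573}]\"\n"
    "Simple designs to save a life,\"[{'id': 9, 'name': 'Ingenious', 'count': 269}]\"\n"
)

records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # Each Ratings cell is a Python-literal string holding a list of dicts.
    for rating in literal_eval(row["Ratings"]):
        records.append(
            {"Title": row["Title"], "Rating": rating["name"], "Count": rating["count"]}
        )

# The DataFrame is created already in the long Title/Rating/Count layout.
df = pd.DataFrame(records)
print(df)
```

With a real file, `io.StringIO(csv_text)` would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`.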

【Discussion】:

Great answer, I hadn't thought of applying literal_eval!