【Question title】: DataFrames: iterating over set-values to create multiple boolean columns?
【Posted】: 2020-04-27 14:12:57
【Question】:

The terms column stores a set of a few strings per row (drawn from a fixed set of roughly 1000 strings).

df = pd.DataFrame([[{'city', 'mouse'}], 
                   [{'mouse'}], 
                   [{'blue'}]], 
                  columns=['terms'])

Out[1]
           terms
0  {mouse, city}
1        {mouse}
2         {blue}

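I want to iterate over the rows and record, for each row, which of the unique terms occur in it, so I plan to create a boolean column for every term found. For example: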

           terms  has_mouse  has_city  has_blue
0  {mouse, city}          1         1         0
1        {mouse}          1         0         0
2         {blue}          0         0         1

I tried:

def count_terms_in_row(row):
    for term in row['terms']:
        row['has_{}'.format(term)] = 1

df.apply(count_terms_in_row, axis=1)

However, this did not work as planned. What is the right approach here?

【Comments】:

  • df.terms.apply(len)?
  • Thanks, please see the edit - each term needs to be counted separately.
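
A note on why the attempt in the question has no visible effect: count_terms_in_row returns nothing, so df.apply(..., axis=1) produces None for every row, and the assignments are made on the row object passed to the function rather than on df. A minimal sketch of one possible fix, keeping the row-wise shape, is to return the new values as a Series and join the result back. The names flags and result are illustrative, not from the question.

```python
import pandas as pd

df = pd.DataFrame({"terms": [{"city", "mouse"}, {"mouse"}, {"blue"}]})

def count_terms_in_row(row):
    # Return the new values instead of assigning into the row in place.
    return pd.Series({"has_{}".format(term): 1 for term in row["terms"]})

# Each row yields a Series; apply aligns them into one DataFrame,
# leaving NaN where a term is absent, which is then filled with 0.
flags = df.apply(count_terms_in_row, axis=1).fillna(0).astype(int)

result = df.join(flags)
print(result)
```

This is still a per-row Python loop, so it is meant to illustrate the fix rather than to be fast on large frames.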

Tags: python pandas dataframe data-processing


【Solution 1】:

You can do the following:

import pandas as pd
import numpy as np

df = pd.DataFrame([[{'city', 'mouse'}], 
                   [{'mouse'}], 
                   [{'blue'}]], 
                  columns=['terms'])


all_terms = set()
for idx, data in df.iterrows():
  all_terms = all_terms.union(data["terms"])

# find out all new columns
new_columns = []
term2idx = {}
for idx, term in enumerate(all_terms):
  new_columns.append("has_term_{}".format(term))
  term2idx[term] = idx

# add new data per new column
new_data = []
for idx, data in df.iterrows():
  _row = [0] * len(new_columns)
  for term in data["terms"]:
    _row[term2idx[term]] = 1
  new_data.append(_row)

# add new data to existing DataFrame
new_data = np.asarray(new_data)
for idx in range(len(new_columns)):
  df[new_columns[idx]] = new_data[:,idx]

print(df.head())

This results in (the column order may differ, because iterating over a set gives no guaranteed order):

           terms  has_term_city  has_term_blue  has_term_mouse
0  {city, mouse}              1              0               1
1        {mouse}              0              0               1
2         {blue}              0              1               0
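
The same idea can be written more compactly. The following is a sketch, not a replacement for the answer's code: it collects the terms with a set union, sorts them so the column order is deterministic, and attaches all indicator columns in one join. The names indicators and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"terms": [{"city", "mouse"}, {"mouse"}, {"blue"}]})

# Collect every distinct term; sorting fixes the column order
# (iterating a set directly gives an arbitrary order).
all_terms = sorted(set().union(*df["terms"]))

# One 0/1 row per original row, built in a single pass.
indicators = pd.DataFrame(
    [[int(term in terms) for term in all_terms] for terms in df["terms"]],
    index=df.index,
    columns=["has_term_{}".format(term) for term in all_terms],
)

# Attach all indicator columns at once instead of one assignment per column.
result = df.join(indicators)
print(result)
```

With around 1000 terms, building the indicator frame first and joining once avoids inserting columns into df one at a time.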

【Comments】:

    【Solution 2】:

    This is essentially get_dummies:

    # .sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
    df.join(pd.get_dummies(df.terms.apply(list).explode())
              .groupby(level=0).sum()
              .add_prefix('has_')
           )
    

    Output:

               terms  has_blue  has_city  has_mouse
    0  {mouse, city}         0         1          1
    1        {mouse}         0         0          1
    2         {blue}         1         0          0
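
The chain in this answer can be unpacked into three steps, which may make it easier to see what each call contributes. This is a sketch: the intermediate names (exploded, dummies, flags) are illustrative, dtype=int is added so the indicators are 0/1 integers, and the collapsing step is written as groupby(level=0).sum().

```python
import pandas as pd

df = pd.DataFrame({"terms": [{"city", "mouse"}, {"mouse"}, {"blue"}]})

# Step 1: one row per (original row, term); the original index is repeated.
exploded = df["terms"].apply(list).explode()

# Step 2: one-hot encode each term (dtype=int gives 0/1 rather than booleans).
dummies = pd.get_dummies(exploded, dtype=int)

# Step 3: collapse back to one row per original index by summing.
flags = dummies.groupby(level=0).sum().add_prefix("has_")

result = df.join(flags)
print(result)
```

Because each set holds a term at most once, the sums in step 3 are always 0 or 1.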
    

    【Comments】:

      【Solution 3】:

      You can try this:

      df['count'] = df['terms'].str.len()
      print(df)
      
                 terms  count
      0  {mouse, city}      2
      1        {mouse}      1
      2         {blue}      1
      

      【Comments】:

      • Thanks, please see the edit - each term needs to be counted separately.
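
If the separate per-term count the comment asks for is a total across all rows, one possible sketch uses collections.Counter from the standard library. This is an assumption about what is wanted, not part of the answer above, and the names term_counts and counts are illustrative.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({"terms": [{"city", "mouse"}, {"mouse"}, {"blue"}]})

# Count in how many rows each term appears (a set holds each term at most once).
term_counts = Counter(term for terms in df["terms"] for term in terms)

# Convert to a Series sorted by term for a stable, readable result.
counts = pd.Series(term_counts).sort_index()
print(counts)
```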