【Question title】: Python - Adding new column with mapped value from a dictionary containing a list of values
【Posted】: 2017-12-20 02:47:05
【Question】:

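I am trying to add at least one column, or even several columns, to a dataframe from a mapping dictionary. I have a dictionary keyed on product catalog numbers, where each value is a list of the standardized hierarchical nomenclature for that product number. Example below.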

import pandas as pd
dict = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame( {"product": [1, 2, 3]})
df['catagory'] = df['product'].map(dict)
print(df)

I get the following result:

    product      catagory
0        1  [a, b, c, d]
1        2  [w, x, y, z]
2        3           NaN

I would like to get the following:

     product     cat1     cat2     cat3     cat4
0       1          a        b       c         d
1       2          w        x       y         z
2       3         NaN      NaN     NaN       NaN

Or even better:

     product     category
0       1           d
1       2           z
2       3         NaN  

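I have been trying to parse out one of the items from the list in the dictionary and append it to the dataframe, but I have only found advice on mapping dictionaries whose values are a single item rather than a list, as in this EXAMPLE.

Any help is appreciated.

【Question comments】:

Tags: python pandas dictionary mapping lookup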


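【Solution 1】:

Note

Never use the names of built-ins such as list, type, dict ... as variable names, because doing so shadows the built-in functions.

So if you use: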

#dict is now a variable name and shadows the built-in
dict = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
#creating a dictionary with dict() is no longer possible, because dict now refers to the dictionary above
print (dict(a=1, b=2))

you get the error:

TypeError: 'dict' object is not callable

Debugging this is also very complicated. (Restart the IDE after testing.)

So use another variable name, such as d or categories:

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
print (dict(a=1, b=2))
{'a': 1, 'b': 2}
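
The shadowing described above can be reproduced without touching the global namespace. The following is only a minimal sketch: the function name demo_shadowing is hypothetical, and the shadowing is confined to a local variable so that nothing needs to be restarted afterwards. It also shows that the original type stays reachable through the standard builtins module.

```python
import builtins


def demo_shadowing():
    """Hypothetical demo: shadow dict locally and observe the failure."""
    # Inside this function, the name dict now refers to a dictionary object.
    dict = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}

    message = None
    try:
        # This calls the dictionary object, not the built-in type.
        dict(a=1, b=2)
    except TypeError as exc:
        message = str(exc)

    # The built-in type is still available through the builtins module.
    rebuilt = builtins.dict(a=1, b=2)
    return message, rebuilt


message, rebuilt = demo_shadowing()
print(message)
print(rebuilt)
```

Choosing a different variable name, as recommended above, remains the simpler fix; going through builtins is only a way out when the name has already been shadowed.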

I think you need DataFrame.from_dict with join:

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame( {"product": [1, 2, 3]})
print (df)
   product
0        1
1        2
2        3

df1 = pd.DataFrame.from_dict(d, orient='index')
df1.columns = ['cat' + (str(i+1)) for i in df1.columns]
print(df1)
  cat1 cat2 cat3 cat4
1    a    b    c    d
2    w    x    y    z

df2 = df.join(df1, on='product')
print (df2)
   product cat1 cat2 cat3 cat4
0        1    a    b    c    d
1        2    w    x    y    z
2        3  NaN  NaN  NaN  NaN
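
The from_dict and join steps above can be wrapped in a small function. This is a sketch under assumptions: the helper name add_category_columns and its key and prefix parameters are made up for illustration, and the second list is deliberately shortened to show that lists of unequal length are padded with missing values.

```python
import pandas as pd


def add_category_columns(df, mapping, key="product", prefix="cat"):
    """Hypothetical helper: join one column per list position onto df."""
    # One row per dictionary key, one column per position in the list.
    wide = pd.DataFrame.from_dict(mapping, orient="index")
    # Rename the positional columns 0, 1, 2, ... to cat1, cat2, cat3, ...
    wide.columns = [f"{prefix}{i + 1}" for i in wide.columns]
    # Left join on the key column; unmatched keys get missing values.
    return df.join(wide, on=key)


# The second list is intentionally shorter than the first.
d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x']}
df = pd.DataFrame({"product": [1, 2, 3]})

result = add_category_columns(df, d)
print(result)
```

Because the join is a left join on the key column, products that are missing from the dictionary keep their row and simply get missing values in every new column.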

Then it is possible to use melt or stack:

df3 = df2.melt('product', value_name='category').drop('variable', axis=1)
print (df3)
    product category
0         1        a
1         2        w
2         3      NaN
3         1        b
4         2        x
5         3      NaN
6         1        c
7         2        y
8         3      NaN
9         1        d
10        2        z
11        3      NaN

df2 = (df.set_index('product').join(df1)
         .stack(dropna=False)
         .reset_index(level=1, drop=True)
         .rename('category')
         .reset_index())
print (df2)
    product category
0         1        a
1         1        b
2         1        c
3         1        d
4         2        w
5         2        x
6         2        y
7         2        z
8         3      NaN
9         3      NaN
10        3      NaN
11        3      NaN
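
A similar long format can probably be reached more directly with explode. This is a sketch that assumes pandas 0.25 or later, where Series.explode and DataFrame.explode are available; the name long_df is only illustrative. Note one difference from the stack output above: a product without a mapping appears once with NaN instead of four times.

```python
import pandas as pd

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame({"product": [1, 2, 3]})

long_df = (
    df.assign(category=df["product"].map(d))  # lists, or NaN when unmapped
      .explode("category")                    # one row per list item
      .reset_index(drop=True)                 # replace the repeated index
)
print(long_df)
```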

If the category column is already in df, the solution is similar; you only need to remove the rows with NaN using DataFrame.dropna:

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame( {"product": [1, 2, 3]})
df['category'] = df['product'].map(d)
print(df)

df1 = df.dropna(subset=['category'])
df1 = pd.DataFrame(df1['category'].values.tolist(), index=df1['product'])
df1.columns = ['cat' + (str(i+1)) for i in df1.columns]
print(df1)
        cat1 cat2 cat3 cat4
product                    
1          a    b    c    d
2          w    x    y    z

df2 = df[['product']].join(df1, on='product')
print (df2)
   product cat1 cat2 cat3 cat4
0        1    a    b    c    d
1        2    w    x    y    z
2        3  NaN  NaN  NaN  NaN
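
For the "even better" output in the question, where only the last item of each list is wanted, one possible sketch is to reduce the dictionary first and then map it. The name last_item is an assumption for illustration, and the code assumes that no list in the dictionary is empty.

```python
import pandas as pd

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame({"product": [1, 2, 3]})

# Keep only the last entry of each list (assumes every list is non-empty).
last_item = {key: values[-1] for key, values in d.items()}

# Products that are not in the dictionary are mapped to NaN.
df["category"] = df["product"].map(last_item)
print(df)
```

Reducing the dictionary before mapping keeps the column scalar-valued, so no lists are ever stored in the dataframe.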

【Comments】:

    【Solution 2】:
    d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
    
    #Split product to 4 columns (assumes numpy is imported as np)
    df[['product']].join(
        df.apply(lambda x: pd.Series(d.get(x['product'],[np.nan])),axis=1)
          .rename(columns=lambda x: 'cat{}'.format(x+1))
        )
    Out[187]: 
       product cat1 cat2 cat3 cat4
    0        1    a    b    c    d
    1        2    w    x    y    z
    2        3  NaN  NaN  NaN  NaN
    
    #only take the last element
    df['catagory'] = df.apply(lambda x: d.get(x['product'],[np.nan])[-1],axis=1)
    
    df
    Out[171]: 
       product catagory
    0        1        d
    1        2        z
    2        3      NaN
    

    【Comments】:

      【Solution 3】:

      Let's use set_index, apply, add_prefix and reset_index:

      df_out = (df.set_index('product')['catagory']
        .apply(lambda x:pd.Series(x)))
      
      df_out.columns = df_out.columns + 1
      
      df_out.add_prefix('cat').reset_index()
      

      Output:

         product cat1 cat2 cat3 cat4
      0        1    a    b    c    d
      1        2    w    x    y    z
      2        3  NaN  NaN  NaN  NaN
      

      And for the "even better" step:

      (df.set_index('product')['catagory']
        .apply(lambda x:pd.Series(x))
        .stack(dropna=False)
        .rename('category')
        .reset_index()
        .drop('level_1',axis=1)
        .drop_duplicates()
      )
      

      Output:

         product category
      0        1        a
      1        1        b
      2        1        c
      3        1        d
      4        2        w
      5        2        x
      6        2        y
      7        2        z
      8        3      NaN
      

      【Comments】:
