【Question title】:Split a list from Dataframe column into specific column name [duplicate]
【Posted】:2021-06-02 15:00:49
【Question description】:

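I have a question about splitting a list stored in a DataFrame column into multiple columns, where each value from the list must land in its own specific column.

Suppose I have this DataFrame: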

date                   data
2020-01-01 00:00:00    [G07, G08, G10, G16]
2020-01-01 00:00:01    [G07, G08, G16]
2020-01-01 00:00:02    [G08, G10, G16, G20, G21]
2020-01-01 00:00:03    [G16, G20, G21, G26, G27, R02]
2020-01-01 00:00:04    [G07, G08, G26, G27]

I am looking for this kind of result:

date                   G07  G08  G10  G16  G20  G21  G26  G27  R02
2020-01-01 00:00:00    G07  G08  G10  G16  NaN  NaN  NaN  NaN  NaN
2020-01-01 00:00:01    G07  G08  NaN  G16  NaN  NaN  NaN  NaN  NaN
2020-01-01 00:00:02    NaN  G08  G10  G16  G20  G21  NaN  NaN  NaN
2020-01-01 00:00:03    NaN  NaN  NaN  G16  G20  G21  G26  G27  R02
2020-01-01 00:00:04    G07  G08  NaN  NaN  NaN  NaN  G26  G27  NaN

so that I can end up with this kind of matrix:

date                   G07  G08  G10  G16  G20  G21  G26  G27  R02
2020-01-01 00:00:00    1    1    1    1    0    0    0    0    0
2020-01-01 00:00:01    1    1    0    1    0    0    0    0    0    
2020-01-01 00:00:02    0    1    1    1    1    1    0    0    0    
2020-01-01 00:00:03    0    0    0    1    1    1    1    1    1    
2020-01-01 00:00:04    1    1    0    0    0    0    1    1    0    

By running this kind of command:

In [1] pd.DataFrame(self.df['data'].to_list())

Out [1] date                   1    2    3    4    5    6    
        2020-01-01 00:00:00    G07  G08  G10  G16
        2020-01-01 00:00:01    G07  G08  G16
        2020-01-01 00:00:02    G08  G10  G16  G20  G21
        2020-01-01 00:00:03    G16  G20  G21  G26  G27  R02
        2020-01-01 00:00:04    G07  G08  G26  G27

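I can only split the list into generic columns. I can't find a way to put each value into its specific column.

I have considered looping over every value for every date, but that is slow, and my dataset has more than 1,000,000 rows.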

【Question comments】:

Tags: python pandas dataframe


【Solution 1】:

Use MultiLabelBinarizer from sklearn:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# One 0/1 column per distinct label, aligned to the original index
s = pd.DataFrame(mlb.fit_transform(df['data']), columns=mlb.classes_, index=df.index)

df = df.join(s)

【Comments】:

【Solution 2】:

Another approach:

import pandas as pd

# One {label: 1} dict per row; labels missing from a row become NaN, then 0
x = (
    pd.DataFrame([{k: 1 for k in v} for v in df["data"]])
    .fillna(0)
    .astype(int)
)
print(pd.concat([df["date"], x], axis=1))

Prints:

                  date  G07  G08  G10  G16  G20  G21  G26  G27  R02
0  2020-01-01 00:00:00    1    1    1    1    0    0    0    0    0
1  2020-01-01 00:00:01    1    1    0    1    0    0    0    0    0
2  2020-01-01 00:00:02    0    1    1    1    1    1    0    0    0
3  2020-01-01 00:00:03    0    0    0    1    1    1    1    1    1
4  2020-01-01 00:00:04    1    1    0    0    0    0    1    1    0

【Comments】:

    【Solution 3】:

    Try the join(), strip(), get_dummies() and drop() methods:

    out=df.join(df['data'].astype(str).str.strip('[]').str.get_dummies(',')).drop(columns='data')
    

    Output of out:

    【Comments】:

    • Thank you very much for the quick reply. But it seems that columns G08 and G16 are duplicated in the output.
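The duplicated columns reported in the comment above most likely come from the separator: the items are separated by ", " but the split is on "," alone, so " G08" (with a leading space) and "G08" are treated as different labels. A minimal sketch of one way around this, assuming the data column holds real Python lists (the sample frame below is rebuilt from the question and is not the asker's actual data):

```python
import pandas as pd

# Sample frame rebuilt from the question; "data" holds real Python lists.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-01 00:00:00",
        "2020-01-01 00:00:01",
        "2020-01-01 00:00:02",
        "2020-01-01 00:00:03",
        "2020-01-01 00:00:04",
    ]),
    "data": [
        ["G07", "G08", "G10", "G16"],
        ["G07", "G08", "G16"],
        ["G08", "G10", "G16", "G20", "G21"],
        ["G16", "G20", "G21", "G26", "G27", "R02"],
        ["G07", "G08", "G26", "G27"],
    ],
})

# Join each list with a bare comma so no stray spaces end up in the labels,
# then one-hot encode on that same separator.
dummies = df["data"].str.join(",").str.get_dummies(",")

# Attach the indicator columns and drop the original list column.
out = df.drop(columns="data").join(dummies)
print(out)
```

If the column holds strings such as "[G07, G08]" rather than lists, the labels would need to be stripped of whitespace before encoding instead.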
    【Solution 4】:

    Adding one more approach:

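    # One row per (date, label) pair, plus a helper column of ones
    k = df.explode('data').assign(temp = 1)
    # Recent pandas versions require keyword arguments for pivot
    df = k.pivot(index='date', columns='data', values='temp').fillna(0)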
    

    Further conversion (if needed):

    df = df.rename_axis(columns=None).reset_index().convert_dtypes()
    

    Output:

                      date  G07  G08  G10  G16  G20  G21  G26  G27  R02
    0  2020-01-01 00:00:00    1    1    1    1    0    0    0    0    0
    1  2020-01-01 00:00:01    1    1    0    1    0    0    0    0    0
    2  2020-01-01 00:00:02    0    1    1    1    1    1    0    0    0
    3  2020-01-01 00:00:03    0    0    0    1    1    1    1    1    1
    4  2020-01-01 00:00:04    1    1    0    0    0    0    1    1    0
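One caveat with pivot: it raises a ValueError when the same (date, label) pair occurs more than once, for example if a label is repeated inside one list or a timestamp appears in two rows. A hedged variant using pd.crosstab, which counts occurrences instead of failing, is sketched below. The sample frame is rebuilt from the question, and the names long and result are illustrative only:

```python
import pandas as pd

# Sample frame rebuilt from the question; "data" holds real Python lists.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-01 00:00:00",
        "2020-01-01 00:00:01",
        "2020-01-01 00:00:02",
        "2020-01-01 00:00:03",
        "2020-01-01 00:00:04",
    ]),
    "data": [
        ["G07", "G08", "G10", "G16"],
        ["G07", "G08", "G16"],
        ["G08", "G10", "G16", "G20", "G21"],
        ["G16", "G20", "G21", "G26", "G27", "R02"],
        ["G07", "G08", "G26", "G27"],
    ],
})

# One row per (date, label) pair, with a fresh 0..n-1 index.
long = df.explode("data", ignore_index=True)

# Count each label per date; clip to 1 so repeated labels still give 0/1.
matrix = pd.crosstab(long["date"], long["data"]).clip(upper=1)

# Drop the "data" axis name and turn the date index back into a column.
result = matrix.rename_axis(columns=None).reset_index()
print(result)
```

Because crosstab returns integer counts, no fillna or dtype conversion is needed afterwards.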
    

    【Comments】:

      【Solution 5】:

      Doing it with pandas:

      (dataf
          .explode("data")
          .pivot(index="date", columns="data", values="data")
          .notna()
          .astype(int))
      

      We get the desired output format:

      data                 G07  G08  G10  G16  G20  G21  G26  G27  R02
      date                                                            
      2020-01-01 00:00:00    1    1    1    1    0    0    0    0    0
      2020-01-01 00:00:01    1    1    0    1    0    0    0    0    0
      2020-01-01 00:00:02    0    1    1    1    1    1    0    0    0
      2020-01-01 00:00:03    0    0    0    1    1    1    1    1    1
      2020-01-01 00:00:04    1    1    0    0    0    0    1    1    0
      

      We explode the data column, then pivot the table with date as the index and the data values as the columns. notna() then yields True/False for each cell, which we convert to int ;)

      Data and code
      
      import io
      import pandas as pd
      
      
      data = io.StringIO("""
      date|data
      2020-01-01 00:00:00|[G07, G08, G10, G16]
      2020-01-01 00:00:01|[G07, G08, G16]
      2020-01-01 00:00:02|[G08, G10, G16, G20, G21]
      2020-01-01 00:00:03|[G16, G20, G21, G26, G27, R02]
      2020-01-01 00:00:04|[G07, G08, G26, G27]
      """)
      
      # Strip the surrounding brackets and split on ", " so labels carry no stray spaces
      dataf = pd.read_csv(data, sep="|", parse_dates=["date"], converters={"data": lambda x: x[1:-1].split(", ")})
      

      【Comments】:
