[Question title]: How to convert a column of strings to Python literals and extract the values
[Posted]: 2021-01-16 00:35:55
[Question]:

I have a DataFrame that looks like this:

id  time          activity
4   1596213715048   [{"name":"STILL","conf":100}]
4   1596213739171   [{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]
4   1596213755797   [{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]
6   1596214842817   [{"name":"STILL","conf":100}]
6   1596214931090   [{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]
8   1596214957246   [{"name":"STILL","conf":100}]
9   1596215304418   [{"name":"STILL","conf":100}]

I want to split the activity column by name. The resulting DataFrame should look like this:

id  time          IN_VEHICLE  ON_BICYCLE ON_FOOT  WALKING  RUNNING  TILTING  STILL UNKNOWN 
4   1596213715048 0           0          0        0        0        0        100   0
4   1596213739171 8           9          19       19       0        0        54    3
4   1596213755797 1           0          0        0        0        0        97    2
6   1596214842817 0           0          0        0        0        0        100   0
6   1596214931090 28          8          15       15       0        0        34    3
8   1596214957246 0           0          0        0        0        0        100   0
9   1596215304418 0           0          0        0        0        0        100   0

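How can I perform this split? The result columns are fixed, and an error should be raised if an entry in the activity string does not exist as a column in the resulting DataFrame.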

[Comments]:

    Tags: python pandas dataframe


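    [Solution 1]:
    • For a dataframe with 100k rows, this answer is 8 times faster than the other solution
      • The other implementation works, but it uses .apply twice and a list comprehension, which is slow compared to vectorized methods.

    Explanation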

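    1. .apply(literal_eval) converts the 'activity' column from strings to python literals (e.g. dicts and lists): '[{"name":"STILL","conf":100}]' → [{"name":"STILL","conf":100}]
    2. .explode separates the dicts in each list into separate rows
    3. Extract the keys and values in the 'activity' column into separate columns, then .join the columns back to df
      • The timing analysis in the linked answer shows that the fastest way to extract a column of single-level dicts into a dataframe is pd.DataFrame(df.pop('activity').values.tolist())
    4. .pivot df into a wide format
    5. Change dfp.columns.name from 'name' to None - this is cosmetic and can be removed
    • This was executed in pandas 1.2.0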
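To make steps 1 to 3 easier to follow, here is a small sketch that runs them on a single hand-written row and prints the intermediate frames. The variable names raw, parsed, long_df and flat are illustrative only and do not appear in the original answer.

```python
from ast import literal_eval

import pandas as pd

# one raw cell from the 'activity' column (a string, not a list)
raw = '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2}]'

# step 1: the string becomes a Python list of dicts
parsed = literal_eval(raw)
print(type(parsed).__name__, type(parsed[0]).__name__)

# step 2: explode gives one row per dict, repeating 'id' and 'time'
df = pd.DataFrame({'id': [4], 'time': [1596213755797], 'activity': [parsed]})
long_df = df.explode('activity').reset_index(drop=True)
print(long_df)

# step 3: expand the dicts into 'name' and 'conf' columns and join them back
flat = long_df.join(pd.DataFrame(long_df.pop('activity').tolist()))
print(flat)
```

After step 3 the frame is in long format (one row per activity name), which is the shape that .pivot needs in step 4.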
    import pandas as pd
    from ast import literal_eval
    
    # test data
    data = {'id': [4, 4, 4, 6, 6, 8, 9], 'time': [1596213715048, 1596213739171, 1596213755797, 1596214842817, 1596214931090, 1596214957246, 1596215304418], 'activity': ['[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":100}]']}
    df = pd.DataFrame(data)
    
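    # function to transform the column of strings
    def test(df):
        # parse each string into a list of dicts; assign returns a new frame, so the caller's df is not modified
        df = df.assign(activity=df.activity.apply(literal_eval))
        # one row per dict
        df = df.explode('activity').reset_index(drop=True)
        # expand the dicts into 'name' and 'conf' columns and join them back
        df = df.join(pd.DataFrame(df.pop('activity').values.tolist()))
        # wide format: one column per activity name, 0 where a name is absent
        dfp = df.pivot(index=['id', 'time'], columns='name', values='conf').fillna(0).astype(int).reset_index()
        # cosmetic: remove the 'name' label from the columns index
        dfp.columns.rename(None, inplace=True)
        return dfp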
    
    
    # call the function
    test(df)
    
    # result
       id           time  IN_VEHICLE  ON_BICYCLE  ON_FOOT  STILL  UNKNOWN  WALKING
    0   4  1596213715048           0           0        0    100        0        0
    1   4  1596213739171           8           9       19     54        3       19
    2   4  1596213755797           1           0        0     97        2        0
    3   6  1596214842817           0           0        0    100        0        0
    4   6  1596214931090          28           8       15     34        3       15
    5   8  1596214957246           0           0        0    100        0        0
    6   9  1596215304418           0           0        0    100        0        0
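The result above only contains the activity names that occur in the data, while the question asks for a fixed set of columns and an error for any unknown name. One possible way to add that check to the pivoted frame is sketched below. The helper enforce_columns and the list EXPECTED are hypothetical names introduced here, and the small wide frame is hand-built for illustration rather than taken from the answer.

```python
import pandas as pd

# fixed result columns, in the order given in the question
EXPECTED = ['IN_VEHICLE', 'ON_BICYCLE', 'ON_FOOT', 'WALKING',
            'RUNNING', 'TILTING', 'STILL', 'UNKNOWN']


def enforce_columns(wide, expected, key_cols=('id', 'time')):
    """Raise on unknown activity names, then add missing columns as 0."""
    unexpected = set(wide.columns) - set(key_cols) - set(expected)
    if unexpected:
        raise ValueError(f"unexpected activity names: {sorted(unexpected)}")
    # reindex puts the columns in the fixed order and fills absent ones with 0
    return wide.reindex(columns=[*key_cols, *expected], fill_value=0)


# hand-built stand-in for the pivoted output
wide = pd.DataFrame({
    'id': [4, 4],
    'time': [1596213715048, 1596213739171],
    'IN_VEHICLE': [0, 8],
    'ON_BICYCLE': [0, 9],
    'ON_FOOT': [0, 19],
    'STILL': [100, 54],
    'UNKNOWN': [0, 3],
    'WALKING': [0, 19],
})

fixed = enforce_columns(wide, EXPECTED)
print(fixed.columns.tolist())
```

Checking the column set after the pivot keeps the validation in one place; the same check could instead be run on the long-format 'name' column before pivoting.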
    

    %%timeit test

    import numpy as np
    import random
    import pandas as pd
    import json
    from ast import literal_eval
    
    # test data; the row count is set by rows below
    np.random.seed(365)
    random.seed(365)
    rows = 1000000
    activity = ['[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":100}]']
    data = {'time': pd.bdate_range('2021-01-15', freq='s', periods=rows),
            'id': np.random.randint(10, size=(rows)),
            'activity': [random.choice(activity) for _ in range(rows)]}
    df = pd.DataFrame(data)
    
    # test the function in this answer
    %%timeit -r1 -n1 -q -o
    test(df)
    [out]:
    <TimeitResult : 31.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
    
    # test the implementation from the other answer
    def flatten_json_to_dict(s):
        return {obj['name']: obj['conf'] for obj in json.loads(s)}
    
    
    def nick(df):
        expanded = df['activity'].apply(flatten_json_to_dict).apply(pd.Series)
        df = df.join(expanded)
        df = df.drop('activity', axis=1)
        df = df.fillna(0)
        return df
    
    
    %%timeit -r1 -n1 -q -o
    nick(df)
    [out]:
    <TimeitResult : 4min 28s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
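The %%timeit cell magic is only available in IPython and Jupyter. As a rough sketch of how a similar single-run comparison could be made in a plain script, the snippet below uses time.perf_counter. The helper time_once, the function names pivot_version and apply_version, and the 1,000-row sample are assumptions made for illustration; the timings it prints will differ from the figures above.

```python
import json
import time
from ast import literal_eval

import pandas as pd

ACTIVITY = [
    '[{"name":"STILL","conf":100}]',
    '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]',
]


def pivot_version(df):
    # explode + pivot approach described in Solution 1
    df = df.assign(activity=df['activity'].apply(literal_eval))
    df = df.explode('activity').reset_index(drop=True)
    df = df.join(pd.DataFrame(df.pop('activity').tolist()))
    wide = df.pivot(index=['id', 'time'], columns='name', values='conf')
    return wide.fillna(0).astype(int).reset_index()


def apply_version(df):
    # .apply(pd.Series) approach described in Solution 2
    expanded = df['activity'].apply(
        lambda s: {obj['name']: obj['conf'] for obj in json.loads(s)}
    ).apply(pd.Series)
    return df.drop(columns='activity').join(expanded).fillna(0)


def time_once(func, frame):
    """Run func once on a copy of frame; return (elapsed seconds, result)."""
    work = frame.copy()  # protects frame in case func modifies its input
    start = time.perf_counter()
    result = func(work)
    return time.perf_counter() - start, result


rows = 1_000  # small illustrative size, not the size used in the answer
sample = pd.DataFrame({
    'id': [i % 10 for i in range(rows)],
    'time': list(range(rows)),  # unique per row, so (id, time) pairs are unique for pivot
    'activity': [ACTIVITY[i % len(ACTIVITY)] for i in range(rows)],
})

for func in (pivot_version, apply_version):
    elapsed, result = time_once(func, sample)
    print(f"{func.__name__}: {elapsed:.3f} s, shape={result.shape}")
```

A single run is noisy; for more stable numbers the standard-library timeit module can repeat the call several times.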
    

    [Discussion]:

      [Solution 2]:

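      I wrote an approach based on this answer. However, your JSON is formatted as a list of dictionaries rather than a single dictionary. To handle that, I defined the function flatten_json_to_dict(), which is then called on each row of the activity column.

      Unlike the original answer, I use a join rather than assignment to put the columns back into the original dataframe, which I think is less hacky.

      The last step is to replace the missing (NA) values with zero.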

      #!/usr/bin/env python3
      import pandas as pd
      import json
      
      def flatten_json_to_dict(s):
          return {obj['name']: obj['conf'] for obj in json.loads(s)}
      
      df = pd.read_csv('file.csv', sep=r'\s+')  # whitespace-delimited file
      df
      #    id           time                                           activity
      # 0   4  1596213715048                      [{"name":"STILL","conf":100}]
      # 1   4  1596213739171  [{"name":"STILL","conf":54},{"name":"ON_FOOT",...
      # 2   4  1596213755797  [{"name":"STILL","conf":97},{"name":"UNKNOWN",...
      # 3   6  1596214842817                      [{"name":"STILL","conf":100}]
      # 4   6  1596214931090  [{"name":"STILL","conf":34},{"name":"IN_VEHICL...
      # 5   8  1596214957246                      [{"name":"STILL","conf":100}]
      # 6   9  1596215304418                      [{"name":"STILL","conf":100}]
      
      expanded = df['activity'].apply(flatten_json_to_dict).apply(pd.Series)
      df = df.join(expanded)
      # Remove activity column
      df = df.drop('activity', axis=1)
      # Fill NA with 0
      df = df.fillna(0)
      df
      
      #    id           time  STILL  ON_FOOT  WALKING  ON_BICYCLE  IN_VEHICLE  UNKNOWN
      # 0   4  1596213715048  100.0      0.0      0.0         0.0         0.0      0.0
      # 1   4  1596213739171   54.0     19.0     19.0         9.0         8.0      3.0
      # 2   4  1596213755797   97.0      0.0      0.0         0.0         1.0      2.0
      # 3   6  1596214842817  100.0      0.0      0.0         0.0         0.0      0.0
      # 4   6  1596214931090   34.0     15.0     15.0         8.0        28.0      3.0
      # 5   8  1596214957246  100.0      0.0      0.0         0.0         0.0      0.0
      # 6   9  1596215304418  100.0      0.0      0.0         0.0         0.0      0.0
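The output above holds floats because the missing values are NaN before they are filled. A possible variant, sketched here under the assumption that every conf value is an integer, builds the expanded frame from a list of dicts in one step instead of calling .apply(pd.Series), and casts back to int after filling. It has not been timed here; the three-row frame is a cut-down copy of the question's data.

```python
import json

import pandas as pd


def flatten_json_to_dict(s):
    # '[{"name": ..., "conf": ...}, ...]' -> {name: conf, ...}
    return {obj['name']: obj['conf'] for obj in json.loads(s)}


df = pd.DataFrame({
    'id': [4, 4, 6],
    'time': [1596213715048, 1596213755797, 1596214842817],
    'activity': [
        '[{"name":"STILL","conf":100}]',
        '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]',
        '[{"name":"STILL","conf":100}]',
    ],
})

# one dict per row; pandas aligns the keys into columns and leaves gaps as NaN
records = [flatten_json_to_dict(s) for s in df['activity']]
expanded = pd.DataFrame(records, index=df.index).fillna(0).astype(int)

out = df.drop(columns='activity').join(expanded)
print(out)
```

Passing index=df.index keeps the expanded rows aligned with the original frame, so the join still works when df does not have a default 0..n-1 index.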
      

      [Discussion]:
