【Question title】: Reading a CSV file into pandas where one column is a list, creating new rows
【Posted】: 2016-04-30 23:21:56
【Question】:

I have a CSV file in the following format.

id  results_numbers results                                                  creation_time
9680    2           [(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')]        11/10/14 0:23
9690    3           [(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')]  12/10/14 0:10

I want to read it into pandas and convert it to the following:

id     results_numbers  new_id name              creation_time
9680    2               9394   lesbyfaye         11/10/14 0:23
9680    2                999   Kayts & Koilsby   11/10/14 0:23
9690    3               5968   Jacobsonl         12/10/14 0:10
9690    3                 47   SarHix            12/10/14 0:10
9690    3               8825   joy               12/10/14 0:10

【Comments】:

    Tags: list python-3.x pandas rows


    【Solution 1】:

    Assuming you can already read the data into a DataFrame whose results column holds lists of tuples:

    import pandas as pd

    df = pd.DataFrame({'id': [9680, 9690],
                       'results_number': [2, 3],
                       'results': [[(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')],
                                   [(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')]],
                       'creation_time': ["11/10/14 0:23", "12/10/14 0:10"]})
    
    >>> pd.DataFrame([[row.id, row.results_number, tup[0], tup[1], row.creation_time] 
                      for _, row in df.iterrows() 
                      for tup in row.results], 
                     columns=['id', 'results_numbers', 'new_id', 'name', 'creation_time'])
    
         id  results_numbers  new_id             name  creation_time
    0  9680                2    9394        lesbyfaye  11/10/14 0:23
    1  9680                2     999  Kayts & Koilsby  11/10/14 0:23
    2  9690                3    5968        Jacobsonl  12/10/14 0:10
    3  9690                3      47           SarHix  12/10/14 0:10
    4  9690                3    8825              joy  12/10/14 0:10
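
The answer above starts from a DataFrame whose results column already contains Python lists. When the file is read with `pd.read_csv`, that column arrives as plain text instead. A minimal sketch of one way to parse it, assuming the file is tab-separated and every cell is a complete, valid Python literal (the `csv_text` sample below is a hypothetical stand-in for the real file):

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for the real file; the tab separator is an assumption.
csv_text = (
    "id\tresults_numbers\tresults\tcreation_time\n"
    "9680\t2\t[(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')]\t11/10/14 0:23\n"
    "9690\t3\t[(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')]\t12/10/14 0:10\n"
)

# converters applies ast.literal_eval to every cell of the 'results' column,
# turning the text "[(9394, u'lesbyfaye'), ...]" into a real list of tuples.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep="\t",
    converters={"results": ast.literal_eval},
)

print(type(df.loc[0, "results"]))
```

`ast.literal_eval` only evaluates literals, so it is safer than `eval`, but it raises an exception on a cell that is cut off or otherwise malformed.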
    

    Edit

    If some of your data is malformed, try the following to separate the well-formed tuples from the rest:

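    good_data = []
    bad_data = []
    for _, row in df.iterrows():
        for n, tup in enumerate(row.results):
            if len(tup) == 2:
                # Well-formed (new_id, name) pair: keep it with the parent row's fields.
                good_data.append([row.id, row.results_number, tup[0], tup[1], row.creation_time])
            else:
                # list.append takes one argument, so store position and item as a single tuple.
                bad_data.append((n, tup))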
    

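    【Comments】:

    • Alexander, thanks. This works for the dataset in the question. However, when I apply it to the whole dataset I get: IndexError: string index out of range
    • OK, your first solution works well when the data is well formed. But I found that "results" is truncated to 512 characters. So, because of the truncation, I may have this kind of "results" at the end: [(47, u'SarHix'), (8825, u'joy'), .........., (6582 , u'tevez'), (135, u'tr
    【Solution 2】: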
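
For the truncated strings described in the comment above, one possible workaround is to pull out only the complete tuples with a regular expression and ignore the cut-off tail. This is a rough sketch, not part of the original answers: `PAIR_RE` and `parse_results` are hypothetical names, the sample string is a shortened illustration, and the pattern assumes names are single-quoted and contain no single quote themselves.

```python
import re

# Matches one complete "(number, u'name')" tuple; the u prefix is optional.
# Assumption: names are single-quoted and contain no single quote themselves.
PAIR_RE = re.compile(r"\(\s*(\d+)\s*,\s*u?'([^']*)'\s*\)")


def parse_results(text):
    """Return every complete (new_id, name) pair found in a possibly truncated string."""
    return [(int(num), name) for num, name in PAIR_RE.findall(text)]


# Shortened, hypothetical example of a cell cut off in the middle of a tuple.
truncated = "[(47, u'SarHix'), (8825, u'joy'), (6582 , u'tevez'), (135, u'tr"
print(parse_results(truncated))
# → [(47, 'SarHix'), (8825, 'joy'), (6582, 'tevez')]
```

The incomplete final tuple is silently dropped, so the number of parsed pairs can be smaller than the value in results_numbers for that row.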

    You can also do it without a loop:

    Original DF:

    In [184]: df
    Out[184]:
       creation_time    id                                         results  \
    0  11/10/14 0:23  9680     [(9394, lesbyfaye), (999, Kayts & Koilsby)]
    1  12/10/14 0:10  9690  [(5968, Jacobsonl), (47, SarHix), (8825, joy)]
    
       results_number
    0               2
    1               3
    

    Solution:

    In [189]: tmp = (pd.DataFrame.from_dict(df.results.to_dict(), orient='index')
       .....:          .stack()
       .....:          .reset_index(level=1, drop=True)
       .....:       )
    
    In [190]: idx = tmp.index
    
    In [191]: new = (pd.DataFrame(tmp.tolist(), columns=['new_id','name'], index=idx)
       .....:          .join(df.drop(['results'], axis=1))
       .....:       )
    

    Result:

    In [192]: new
    Out[192]:
       new_id             name  creation_time    id  results_number
    0    9394        lesbyfaye  11/10/14 0:23  9680               2
    0     999  Kayts & Koilsby  11/10/14 0:23  9680               2
    1    5968        Jacobsonl  12/10/14 0:10  9690               3
    1      47           SarHix  12/10/14 0:10  9690               3
    1    8825              joy  12/10/14 0:10  9690               3
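
On newer pandas versions (0.25 and later), `DataFrame.explode` offers another loop-free route to the same reshaping. The sketch below is an alternative to the `stack`-based solution above, not the answerer's method; it rebuilds the sample frame so it can run on its own, and the variable names `exploded` and `pairs` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [9680, 9690],
    "results_number": [2, 3],
    "results": [
        [(9394, "lesbyfaye"), (999, "Kayts & Koilsby")],
        [(5968, "Jacobsonl"), (47, "SarHix"), (8825, "joy")],
    ],
    "creation_time": ["11/10/14 0:23", "12/10/14 0:10"],
})

# explode() gives each list element its own row and repeats the other columns.
exploded = df.explode("results").reset_index(drop=True)

# Split each (new_id, name) tuple into two separate columns.
pairs = pd.DataFrame(exploded["results"].tolist(), columns=["new_id", "name"])

# Put the pieces side by side and order the columns as in the question.
new = pd.concat([exploded.drop(columns="results"), pairs], axis=1)
new = new[["id", "results_number", "new_id", "name", "creation_time"]]
print(new)
```

Resetting the index before the `concat` keeps the two frames aligned row by row; without it, the exploded frame would carry the repeated original index (0, 0, 1, 1, 1).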
    

    【Comments】:
