【Question title】: pandas read csv with regex
【Posted】: 2017-05-17 15:06:37
【Question】:

I have a folder trip_data containing many CSV files with dates in their names, as shown below:

trip_data/
├── df_trip_20140803_1.csv
├── df_trip_20140803_2.csv
├── df_trip_20140803_3.csv
├── df_trip_20140803_4.csv
├── df_trip_20140803_5.csv
├── df_trip_20140803_6.csv
├── df_trip_20140804_1.csv
├── df_trip_20140804_2.csv
├── df_trip_20140804_3.csv
├── df_trip_20140804_4.csv
├── df_trip_20140804_5.csv
├── df_trip_20140804_6.csv
├── df_trip_20140805_1.csv
├── df_trip_20140805_2.csv
├── df_trip_20140805_3.csv
├── df_trip_20140805_4.csv
├── df_trip_20140805_5.csv
├── df_trip_20140805_6.csv
├── df_trip_20140806_1.csv
├── df_trip_20140806_2.csv
├── df_trip_20140806_3.csv
├── df_trip_20140806_4.csv

Now I want to use Python pandas to load all of these files separately by date, i.e. into 4 DataFrames: df_trip_20140803, df_trip_20140804, df_trip_20140805, df_trip_20140806.

My code looks like this:

days = [20140803,20140804,20140805,20140806]

for day in days:
    ## Locate to the path
    path ='./trip_data/df_trip_%d*.csv' % day
    df = pd.read_csv(path, header=None, nrows=10,
                        names=['ID','lat','lon','status','timestamp']) 

This does not give the correct result. How should I do this?

【Comments】:

    Tags: python regex csv pandas data-processing


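    【Solution 1】:

    I would collect all of these CSVs into a dictionary of DataFrames with the following structure:

    dfs['20140803'] - a DataFrame containing the concatenated data from all df_trip_20140803_*.csv files.

    Solution: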

    import os
    import re
    import glob
    import pandas as pd
    
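    # Path template: the first {} is the date, the second {} is the part number
    fpattern = r'D:\temp\.data\41444939\df_trip_{}_{}.csv'
    # glob expands the wildcards; pd.read_csv does not
    files = glob.glob(fpattern.format('*','*'))
    
    # Extract the unique 8-digit dates from the file names
    dates = sorted(set([re.split(r'_(\d{8})_(\d+)\.(\w+)', f)[1] for f in files]))
    
    # One concatenated DataFrame per date
    dfs = {}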
    for d in dates:
        dfs[d] = pd.concat((pd.read_csv(f) for f in glob.glob(fpattern.format(d, '*'))), ignore_index=True)
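
The same grouping idea can be sketched as a reusable function that also applies the column names from the question. This is only a sketch under assumptions: `load_trips_by_date`, `FILE_RE` and `COLUMNS` are hypothetical names introduced here, and the files are assumed to have no header row, as the `header=None` in the question suggests.

```python
import re
from collections import defaultdict
from pathlib import Path

import pandas as pd

# Column names taken from the question; files are assumed to have no header row.
COLUMNS = ['ID', 'lat', 'lon', 'status', 'timestamp']

# Matches names such as df_trip_20140803_1.csv and captures date and part number.
FILE_RE = re.compile(r'df_trip_(\d{8})_(\d+)\.csv$')


def load_trips_by_date(folder):
    """Return {date string: DataFrame}, one concatenated frame per date."""
    groups = defaultdict(list)
    for path in Path(folder).glob('df_trip_*_*.csv'):
        match = FILE_RE.match(path.name)
        if match:  # skip files that do not follow the naming scheme
            groups[match.group(1)].append((int(match.group(2)), path))

    dfs = {}
    for date, parts in sorted(groups.items()):
        # Sort numerically by part number so that _2 comes before _10.
        ordered = sorted(parts, key=lambda item: item[0])
        frames = [pd.read_csv(p, header=None, names=COLUMNS) for _, p in ordered]
        dfs[date] = pd.concat(frames, ignore_index=True)
    return dfs
```

With the folder from the question, `dfs = load_trips_by_date('./trip_data')` would give `dfs['20140803']`, `dfs['20140804']` and so on. A dictionary keyed by date is used instead of separately named variables, so the set of dates does not have to be known in advance.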
    

    Test:

    In [95]: dfs.keys()
    Out[95]: dict_keys(['20140804', '20140805', '20140803', '20140806'])
    
    In [96]: dfs['20140803']
    Out[96]:
        a  b  c
    0   0  0  7
    1   3  7  1
    2   9  7  3
    3   7  4  7
    4   5  2  4
    5   0  0  4
    6   7  2  2
    7   8  4  1
    8   0  8  3
    9   3  9  0
    10  7  3  9
    11  1  9  8
    12  6  7  2
    13  3  8  1
    14  3  4  5
    15  0  9  2
    16  5  8  7
    17  8  5  4
    18  2  0  2
    19  9  6  6
    20  6  6  6
    21  2  6  9
    22  1  0  8
    23  3  1  1
    24  7  4  2
    25  7  4  2
    26  8  3  7
    27  7  3  2
    28  1  7  7
    29  3  6  5
    

    Setup:

    import numpy as np
    
    # a.txt lists one target CSV file name per line
    fn = r'D:\temp\.data\41444939\a.txt'
    base_dir = r'D:\temp\.data\41444939'
    files = open(fn).read().splitlines()
    for f in files:
        pd.DataFrame(np.random.randint(0, 10, (5, 3)), columns=list('abc')) \
          .to_csv(os.path.join(base_dir, f), index=False)
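
The setup above depends on a file list in a.txt that is not shown. A self-contained variant might look like the following sketch; `make_sample_files` and its parameters are hypothetical names introduced here, and the random three-column data only mimics the test frames above.

```python
import os

import numpy as np
import pandas as pd


def make_sample_files(base_dir, dates, parts_per_date=3, rows=5, seed=0):
    """Write random df_trip_<date>_<part>.csv files and return their paths."""
    rng = np.random.RandomState(seed)  # fixed seed so runs are repeatable
    paths = []
    for date in dates:
        for part in range(1, parts_per_date + 1):
            path = os.path.join(base_dir, 'df_trip_{}_{}.csv'.format(date, part))
            pd.DataFrame(rng.randint(0, 10, (rows, 3)), columns=list('abc')) \
              .to_csv(path, index=False)
            paths.append(path)
    return paths
```

Calling `make_sample_files(some_dir, ['20140803', '20140804'])` on an existing, writable directory would produce six small CSV files that follow the naming scheme from the question.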
    

    【Comments】:
