Match a dataframe with a list of column names: answers

【Question title】: Match the dataframe with the list of column names
【Posted】: 2016-08-26 15:17:12
【Question description】:

I have two files. The first contains a dataframe without column names:

2008-03-13 15  56   0  25  
2008-03-14 10  32  27  45  
2008-03-16 40   8  54  35  
2008-03-18 40   8  63  30  
2008-03-19 45  32  81  25

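and another file that contains the list of column names (excluding the datetime column) in the following format, shown here as the output of file.read():

List(Group, Age, Income, Location)

In my real data there are many more columns and column names. The dataframe's columns are in the same order as the list elements, i.e. the first column after the date corresponds to Group, the third to Income, the last to Location, and so on. So my goal is to name the columns of my dataframe using the elements contained in this file. For obvious reasons, the following does not work (the list does not include the datetime column, and it is not formatted as a Python list):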

with open(file2) as f:
    list_of_columns=f.read()
df=pd.read_csv(file1, sep='\t', names=list_of_columns)

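I have considered preprocessing the output of file2 by removing the word List and the parentheses and adding a datetime column at the head of the list, but if you have a more elegant and faster solution, let me know!

【Question discussion】:

Tags: python list pandas dataframe match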

【Solution 1】:

You can do it like this:

import re

import pandas as pd

fn = r'D:\temp\.data\36972593_header.csv'
with open(fn) as f:
    data = f.read()

# this also works if `List(...)` is not on the first line
cols = ['Date'] + re.sub(r'.*List\((.*)\).*', r'\1', data, flags=re.S|re.I|re.M).replace(' ', '').split(',')

fn = r'D:\temp\.data\36972593_data.csv'
# this will also parse `Date` column as `datetime`
df=pd.read_csv(fn, sep=r'\s+', names=cols, parse_dates=[0])

Result:

In [82]: df
Out[82]:
        Date  Group  Age  Income  Location
0 2008-03-13     15   56       0        25
1 2008-03-14     10   32      27        45
2 2008-03-16     40    8      54        35
3 2008-03-18     40    8      63        30
4 2008-03-19     45   32      81        25

In [83]: df.dtypes
Out[83]:
Date        datetime64[ns]
Group                int64
Age                  int64
Income               int64
Location             int64
dtype: object
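
As a variation on the regex above, here is a minimal sketch that wraps the header parsing in a small function and fails loudly when no `List(...)` entry is found. The function name `parse_column_names` and the `first_column` parameter are hypothetical, chosen only for illustration, and the header text is assumed to look like the sample in the question.

```python
import re


def parse_column_names(text, first_column="Date"):
    """Extract names from text like 'List(Group, Age)' and prepend first_column.

    Hypothetical helper: assumes a single List(...) entry, names separated
    by commas, and no parentheses inside the names themselves.
    """
    match = re.search(r"List\((.*?)\)", text, flags=re.I | re.S)
    if match is None:
        raise ValueError("no List(...) entry found in the header text")
    # Split on commas and trim the whitespace around each name
    names = [name.strip() for name in match.group(1).split(",")]
    # Drop empty entries, e.g. from a trailing comma
    names = [name for name in names if name]
    return [first_column] + names


header_text = "List(Group, Age, Income, Location)\n"
cols = parse_column_names(header_text)
print(cols)  # ['Date', 'Group', 'Age', 'Income', 'Location']
```

The resulting list can be passed as `names=cols` to `pd.read_csv`, exactly as in the answer above.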

【Discussion】:

【Solution 2】:

If the list of column names is a string in exactly this format, you can do this:

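import pandas as pd

with open(file2) as f:
    # strip the trailing newline so that [5:-1] removes exactly 'List(' and ')'
    list_of_columns = f.read().strip()
list_of_columns = ['date'] + list_of_columns[5:-1].split(',')
list_of_columns = [l.strip() for l in list_of_columns]  # remove leading/trailing whitespace
# '\t' assumes a tab-separated file; use sep=r'\s+' if fields are separated by runs of spaces
df = pd.read_csv(file1, sep='\t', names=list_of_columns)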
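
Since the question mentions that the real data has many more columns, it may be worth checking that the number of names matches the number of fields before reading the file. The sketch below is only an illustration: `check_column_count` is a hypothetical helper, and it assumes whitespace-separated fields as in the sample data.

```python
def check_column_count(first_data_line, column_names):
    """Compare the number of whitespace-separated fields with the number of names.

    Hypothetical helper: returns the field count, or raises ValueError on a mismatch.
    """
    n_fields = len(first_data_line.split())
    if n_fields != len(column_names):
        raise ValueError(
            f"data has {n_fields} fields but {len(column_names)} column names were given"
        )
    return n_fields


sample_line = "2008-03-13 15  56   0  25  "
names = ["date", "Group", "Age", "Income", "Location"]
n = check_column_count(sample_line, names)
print(n)  # 5
```

With real files, the first line of the data file would be passed in place of `sample_line`.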

【Discussion】: