Extracting columns from a dataframe into new dataframes by string contained in column name: answers

【Question title】: Extracting columns from a dataframe into new dataframes by string contained in column name
【Posted】: 2020-04-07 19:34:53
【Question description】:

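I have a dataframe showing generation and load data by country (the data can be downloaded from https://data.open-power-system-data.org/time_series/2019-06-05; I am using the 60-minute setting). I want to extract from this dataframe the columns relating to each country in the dataset, create a new dataframe for each country, and name each new dataframe with the corresponding country's abbreviation.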

So far I have read in the raw data, obtained a list of unique countries from the column headers of the dataframe, and saved them to a list called abbv.

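I am trying to use the abbv list to create a dataframe for each abbreviation (for each i in abbv), and to fill each created dataframe with the columns of the original dataframe whose names contain that country's abbreviation (the i in abbv).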

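So far I have tried a for loop, but I am not sure whether this is the right approach or whether I am using the loop correctly. Any help would be much appreciated. I am stuck at the nested for loop and do not know where to go from there. I know the code does not run as is; I left the errors in to show my thought process in tackling this problem. Thanks.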

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#read in
data_1h = pd.read_csv('/Users/xx/Downloads/opsd-time_series-2019-06-05/time_series_60min_singleindex.csv')

#get country abbreviations
abbv = [(i[:2]) for i in data_1h.columns]
abbv = pd.unique(abbv)

#create dataframe for each country
for i in range(len(abbv)):
    for i in abbv:
        i = pd.DataFrame()
        columns = [col for col in data_1h.columns if i in col]
        i = {columns: data_1h.column}

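【Comments】:

I would use a dictionary of dataframes, with your country abbreviations as the keys, rather than creating a new variable for each dataframe.
I hadn't thought of that, but I will give it a try. I have never created a dictionary of dataframes before, but after a quick search I see what you mean. How would I then fill each dataframe with the correct country data from the initial dataframe, based on abbv? I tried [col for col in data_1h.columns if i in col], hoping that the i in col would be recognised as a string.
How about dict_of_dfs = {col[:2]: df.loc[:, col] for col in df.columns}?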
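
The dictionary-of-dataframes idea from the comments can be sketched roughly as follows. This is only an illustration on a toy frame: the column names are made up, and it assumes the timestamp columns are in the index (or have been dropped) so that every remaining column name starts with a country code followed by "_". Note that a comprehension keyed on col[:2] alone, as in the last comment, keeps only one column per country, because later columns with the same key overwrite earlier ones; the sketch below selects all matching columns per key instead.

```python
import pandas as pd

# Toy stand-in for the OPSD frame; these column names are illustrative only.
data_1h = pd.DataFrame({
    "AL_load_actual": [1.0, 2.0],
    "AT_load_actual": [3.0, 4.0],
    "AT_solar_generation": [5.0, 6.0],
})

# Country code = the text before the first underscore in each column name.
prefixes = data_1h.columns.str.split("_", n=1).str[0]

# One sub-frame per country, keyed by its abbreviation.
dict_of_dfs = {
    code: data_1h.loc[:, prefixes == code]
    for code in prefixes.unique()
}

print(dict_of_dfs["AT"].columns.tolist())
```

Each country's frame is then available as dict_of_dfs["AT"], dict_of_dfs["AL"], and so on, without creating a separate variable per country.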

Tags: python-3.x pandas dataframe for-loop

【Solution 1】:

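My suggestion is to split the column headers at the first "_" character and turn them into a multi-level index with the country abbreviation as the first level. You can then either (1) split out each dataframe separately, or (2) select whichever country you want, just as you would with any other dataframe.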

In [3]: df.head()
Out[3]:
                                               AL_load_actual_entsoe_power_statistics  ...  UA_west_load_forecast_entsoe_transparency
utc_timestamp        cet_cest_timestamp                                                ...                              
2004-12-31T23:00:00Z 2005-01-01T00:00:00+0100                                     NaN  ...                                        NaN
2005-01-01T00:00:00Z 2005-01-01T01:00:00+0100                                     NaN  ...                                        NaN
2005-01-01T01:00:00Z 2005-01-01T02:00:00+0100                                     NaN  ...                                        NaN
2005-01-01T02:00:00Z 2005-01-01T03:00:00+0100                                     NaN  ...                                        NaN
2005-01-01T03:00:00Z 2005-01-01T04:00:00+0100                                     NaN  ...                                        NaN

[5 rows x 391 columns]

In [4]: df.columns = df.columns.str.split('_', n=1, expand=True)

In [5]: df
Out[5]:
                                                                               AL  ...                                     UA
                                              load_actual_entsoe_power_statistics  ... west_load_forecast_entsoe_transparency
utc_timestamp        cet_cest_timestamp                                            ...                                  
2004-12-31T23:00:00Z 2005-01-01T00:00:00+0100                                 NaN  ...                                    NaN
2005-01-01T00:00:00Z 2005-01-01T01:00:00+0100                                 NaN  ...                                    NaN
2005-01-01T01:00:00Z 2005-01-01T02:00:00+0100                                 NaN  ...                                    NaN
2005-01-01T02:00:00Z 2005-01-01T03:00:00+0100                                 NaN  ...                                    NaN
2005-01-01T03:00:00Z 2005-01-01T04:00:00+0100                                 NaN  ...                                    NaN
...                                                                           ...  ...                                    ...
2019-04-30T19:00:00Z 2019-04-30T21:00:00+0200                                 NaN  ...                                  487.0
2019-04-30T20:00:00Z 2019-04-30T22:00:00+0200                                 NaN  ...                                  447.0
2019-04-30T21:00:00Z 2019-04-30T23:00:00+0200                                 NaN  ...                                  410.0
2019-04-30T22:00:00Z 2019-05-01T00:00:00+0200                                 NaN  ...                                  400.0
2019-04-30T23:00:00Z 2019-05-01T01:00:00+0200                                 NaN  ...                                    NaN

[125593 rows x 391 columns]


In [7]: df['AL']
Out[7]:
                                               load_actual_entsoe_power_statistics
utc_timestamp        cet_cest_timestamp
2004-12-31T23:00:00Z 2005-01-01T00:00:00+0100                                  NaN
2005-01-01T00:00:00Z 2005-01-01T01:00:00+0100                                  NaN
2005-01-01T01:00:00Z 2005-01-01T02:00:00+0100                                  NaN
2005-01-01T02:00:00Z 2005-01-01T03:00:00+0100                                  NaN
2005-01-01T03:00:00Z 2005-01-01T04:00:00+0100                                  NaN
...                                                                            ...
2019-04-30T19:00:00Z 2019-04-30T21:00:00+0200                                  NaN
2019-04-30T20:00:00Z 2019-04-30T22:00:00+0200                                  NaN
2019-04-30T21:00:00Z 2019-04-30T23:00:00+0200                                  NaN
2019-04-30T22:00:00Z 2019-05-01T00:00:00+0200                                  NaN
2019-04-30T23:00:00Z 2019-05-01T01:00:00+0200                                  NaN

[125593 rows x 1 columns]
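
For readers without the dataset at hand, the same steps can be tried on a small frame. This is a minimal sketch, not the answerer's exact session: the column names are shortened stand-ins, and the frames dictionary at the end is one possible way to do option (1).

```python
import pandas as pd

# Toy frame with prefixed column names (shortened, illustrative names).
df = pd.DataFrame({
    "AL_load_actual": [1.0, 2.0],
    "UA_west_load_forecast": [3.0, 4.0],
    "UA_west_load_actual": [5.0, 6.0],
})

# Split each name at the first underscore only, giving a two-level column index.
df.columns = df.columns.str.split("_", n=1, expand=True)

# (2) Select one country like any other column key.
ua = df["UA"]

# (1) Or split everything into separate frames, one per country code.
frames = {code: df[code] for code in df.columns.get_level_values(0).unique()}

print(ua.columns.tolist())
print(list(frames))
```

Passing n=1 matters here: names such as UA_west_load_forecast contain several underscores, and limiting the split to the first one keeps the rest of the name intact in the second level.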

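【Discussion】:

This works! If needed, I can use the multi-level index and create dataframes for i in my abbv list. For now, though, I really just want to extract the data by country, so this works!
This is a great solution. +1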