Merging csv files with some columns same and others different in python: answers

【Question title】: Merging csv files with some columns same and others different in python
【Posted】: 2020-02-10 18:06:37
【Question】:

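I am new to coding and am having trouble merging csv files. I searched for similar questions but did not find a solution. Some relevant details: the CSV files cover cancer types (lung, colorectal, stomach, liver and breast) in different countries for the period 1950 - 2017. Below is a sample of the layout for lung cancer.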

 dlung.describe(include='all')   
 dlung


    Year    Cancer  Country     Gender  ASR     SE
0   1950    Lung    Australia   Male    13.89   0.56
1   1951    Lung    Australia   Male    14.84   0.57
2   1952    Lung    Australia   Male    17.19   0.61
3   1953    Lung    Australia   Male    18.21   0.62
4   1954    Lung    Australia   Male    19.05   0.63
5   1955    Lung    Australia   Male    20.65   0.65
6   1956    Lung    Australia   Male    22.05   0.67
7   1957    Lung    Australia   Male    23.93   0.69
8   1958    Lung    Australia   Male    23.77   0.68
9   1959    Lung    Australia   Male    26.12   0.71
10  1960    Lung    Australia   Male    27.08   0.72

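I would like to join all cancer types into one dataframe based on the shared columns (Year, Country). I have tried different approaches, but they all seem to duplicate Year and Country (see below).

This one is not bad, but I end up with two columns each for Year and Country: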

df_lung_colorectal = pd.concat([dlung, dcolorectal], axis = 1)

df_lung_colorectal 

Year    Cancer  Country Gender  ASR SE  Year    Cancer  Country Gender  ASR SE

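If I carry on like this, I will end up with 5 identical YEAR columns and 5 COUNTRY columns.

Any ideas on how to merge all the independent values (the cancer type with its associated ASR (standardised risk) and SE values) against a single set of YEAR, COUNTRY (and GENDER, if possible) columns?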

【Comments】:

Tags: python pandas csv merge

【Solution 1】:

Yes, this is possible with DataFrame.set_index, but the remaining column names are then duplicated:

print (dlung)
   Year Cancer    Country Gender    ASR    SE
0  1950   Lung  Australia   Male  13.89  0.56
1  1951   Lung  Australia   Male  14.84  0.57
2  1952   Lung  Australia   Male  17.19  0.61
3  1953   Lung  Australia   Male  18.21  0.62
4  1954   Lung  Australia   Male  19.05  0.63

print (dcolorectal)
    Year      Cancer    Country Gender    ASR    SE
6   1950  colorectal  Australia   Male  22.05  0.67
7   1951  colorectal  Australia   Male  23.93  0.69
8   1952  colorectal  Australia   Male  23.77  0.68
9   1953  colorectal  Australia   Male  26.12  0.71
10  1954  colorectal  Australia   Male  27.08  0.72

df_lung_colorectal = pd.concat([dlung.set_index(['Year','Country','Gender']), 
                                dcolorectal.set_index(['Year','Country','Gender'])], axis = 1)

print (df_lung_colorectal)
                      Cancer    ASR    SE      Cancer    ASR    SE
Year Country   Gender                                             
1950 Australia Male     Lung  13.89  0.56  colorectal  22.05  0.67
1951 Australia Male     Lung  14.84  0.57  colorectal  23.93  0.69
1952 Australia Male     Lung  17.19  0.61  colorectal  23.77  0.68
1953 Australia Male     Lung  18.21  0.62  colorectal  26.12  0.71
1954 Australia Male     Lung  19.05  0.63  colorectal  27.08  0.72
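If the side-by-side layout is still wanted, one way to avoid the repeated Cancer/ASR/SE names is to pass `keys` to `pd.concat`, which labels each block of columns with its cancer type. This is a minimal sketch and not part of the original answer; the two small frames are cut-down copies of the sample data above, and the variable names `id_cols`, `lung`, `colorectal` and `wide` are only illustrative.

```python
import pandas as pd

# Cut-down copies of the sample frames shown above
dlung = pd.DataFrame({
    "Year": [1950, 1951],
    "Cancer": ["Lung", "Lung"],
    "Country": ["Australia", "Australia"],
    "Gender": ["Male", "Male"],
    "ASR": [13.89, 14.84],
    "SE": [0.56, 0.57],
})
dcolorectal = pd.DataFrame({
    "Year": [1950, 1951],
    "Cancer": ["colorectal", "colorectal"],
    "Country": ["Australia", "Australia"],
    "Gender": ["Male", "Male"],
    "ASR": [22.05, 23.93],
    "SE": [0.67, 0.69],
})

id_cols = ["Year", "Country", "Gender"]

# The Cancer column becomes redundant once it is used as the column key
lung = dlung.set_index(id_cols).drop(columns="Cancer")
colorectal = dcolorectal.set_index(id_cols).drop(columns="Cancer")

# keys= adds an outer column level, so ASR/SE are no longer ambiguous
wide = pd.concat([lung, colorectal], axis=1, keys=["Lung", "colorectal"])
print(wide)
```

A single value can then be picked out with both index and column tuples, for example `wide.loc[(1950, "Australia", "Male"), ("colorectal", "ASR")]`.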

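However, I think it is better to first concatenate all the DataFrames with axis=0 (the default, so the argument can be omitted) and then reshape with DataFrame.set_index and DataFrame.unstack: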

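# stack the frames row-wise, then pivot the Cancer level into the columns
df = pd.concat([dlung, dcolorectal]).set_index(['Year','Country','Gender','Cancer']).unstack()
# flatten the two-level column names, e.g. ('ASR', 'Lung') -> 'ASR_Lung'
df.columns = df.columns.map('_'.join)
df = df.reset_index()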
print (df)
   Year    Country Gender  ASR_Lung  ASR_colorectal  SE_Lung  SE_colorectal
0  1950  Australia   Male     13.89           22.05     0.56           0.67
1  1951  Australia   Male     14.84           23.93     0.57           0.69
2  1952  Australia   Male     17.19           23.77     0.61           0.68
3  1953  Australia   Male     18.21           26.12     0.62           0.71
4  1954  Australia   Male     19.05           27.08     0.63           0.72
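The same reshape can be tried end to end starting from CSV text. The sketch below is only an illustration under stated assumptions: the two in-memory strings stand in for the real files (their contents are made up from the sample rows, and the colorectal data deliberately lacks 1951), and `long_df` and `wide` are hypothetical names. It suggests what to expect when the cancer types do not cover the same years: the missing combination shows up as NaN rather than dropping the row.

```python
import io

import pandas as pd

# Stand-ins for the real CSV files (assumed contents, not the actual data)
lung_csv = """Year,Cancer,Country,Gender,ASR,SE
1950,Lung,Australia,Male,13.89,0.56
1951,Lung,Australia,Male,14.84,0.57
"""
colorectal_csv = """Year,Cancer,Country,Gender,ASR,SE
1950,colorectal,Australia,Male,22.05,0.67
"""

# Read every file into a frame, then stack them row-wise (axis=0 is the default)
frames = [pd.read_csv(io.StringIO(text)) for text in (lung_csv, colorectal_csv)]
long_df = pd.concat(frames, ignore_index=True)

# Pivot the Cancer level into the columns and flatten the column names
wide = long_df.set_index(["Year", "Country", "Gender", "Cancer"]).unstack("Cancer")
wide.columns = wide.columns.map("_".join)
wide = wide.reset_index()
print(wide)
```

Note that `unstack` requires each Year/Country/Gender/Cancer combination to appear only once; duplicated rows would need to be aggregated or removed first.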

【Discussion】:

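I think @jezrael's answer covers your question very thoroughly, but you should also consider: is your format consistent with the "tidy data" format? Just my 50 cents: ibm.com/developerworks/community/blogs/jfp/entry/…
Thanks for your reply, jezrael. This is great! Thanks also to Quant Christo. I agree it is not the best format. I will reformat so that all the variables are grouped together rather than kept separate as above!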

【Solution 2】:

Concatenate with axis=0 to combine them row by row.

With axis=1, you are asking it to concatenate side by side.
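The difference between the two axes is easiest to see from the resulting shapes. A small sketch with two made-up two-row frames (the names `a`, `b`, `stacked` and `side` are illustrative only):

```python
import pandas as pd

a = pd.DataFrame({"Year": [1950, 1951], "ASR": [13.89, 14.84]})
b = pd.DataFrame({"Year": [1950, 1951], "ASR": [22.05, 23.93]})

# axis=0: rows of b are appended below the rows of a
stacked = pd.concat([a, b], axis=0, ignore_index=True)

# axis=1: columns of b are placed to the right of the columns of a
side = pd.concat([a, b], axis=1)

print(stacked.shape)       # (4, 2)
print(side.shape)          # (2, 4)
print(list(side.columns))  # ['Year', 'ASR', 'Year', 'ASR']
```

The repeated `Year` and `ASR` names in `side` are the duplication described in the question.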

【Discussion】: