Concatenating Pandas DataFrames Doubling Rows

【Question title】: Concatenating Pandas DataFrames Doubling Rows
【Posted】: 2019-10-18 06:28:48
【Question】:

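I am trying to concat() two DataFrames in pandas. One of them is just a set of columns that I took from the other and transformed, so both have the same rows. But when I concatenate them I get a message saying they cannot be joined properly, and they end up joined almost diagonally: the number of rows doubles (even though each has the same rows), and the number of columns is the columns of one plus the columns of the other.

Ideally, I want the number of rows to stay the same and the number of columns to be the columns of one plus the columns of the other. Here is my code: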

import pandas as pd
from sklearn import preprocessing

## In the below code I create new names for the scaled fields by adding SC_ to
## their existing names
SC_ExplanVars = []

for var in explan_vars:
    sc_var = "SC_" + var
    SC_ExplanVars.append(sc_var)

## Scale the columns from my dataframe that will be used as explanatory
## variables
X_Scale = preprocessing.scale(data[explan_vars])

## Put my newly scaled explanatory variables into a DataFrame with same headers
## but with SC_ in front
X_Scale = pd.DataFrame(X_Scale, columns=SC_ExplanVars)

## Concatenate scaled variables onto original dataset
datat = pd.concat([data, X_Scale], axis=1)

I get this warning:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\api.py:77: RuntimeWarning: '<' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects
  result = result.union(other)

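Edit

Below are the tables I described. This is just the first 10 rows, and I have cut it down to a single column, but it still seems to give me the same problem.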

Data=
    Col1
    297
    297
    297
    297
    275
    275
    275
    400
    400
    400

X_Scale = 
SC_Col1
-0.4644471998668502
-0.4644471998668502
-0.4644471998668502
-0.4644471998668502
-0.8849343767010354
-0.8849343767010354
-0.8849343767010354
1.5041973098568349
1.5041973098568349
1.5041973098568349

After concatenation:

datat = 
Col1     SC_Col1
297.0    NaN
297.0    NaN
297.0    NaN
297.0    NaN
275.0    NaN
275.0    NaN
275.0    NaN
400.0    NaN
400.0    NaN
400.0    NaN
NaN      -0.4644471998668502
NaN      -0.4644471998668502
NaN      -0.4644471998668502
NaN      -0.4644471998668502
NaN      -0.8849343767010354
NaN      -0.8849343767010354
NaN      -0.8849343767010354
NaN      1.5041973098568349
NaN      1.5041973098568349
NaN      1.5041973098568349

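【Comments】:

Could you show a sample of your DataFrames and post an MCVE? Since you have not said what explan_vars, data, and preprocessing are, the error cannot be reproduced...
Using the two DataFrames you posted in your edit works fine. I cannot reproduce your behavior: my concatenated DataFrame has two columns, ten rows, and no NaN. I can only think the problem lies somewhere earlier. Judging from the warning, maybe some of your integers are actually strings.
Have you tried doing the concat with ignore_index=True?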
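
A note on the ignore_index=True suggestion, as a minimal sketch rather than a statement about the asker's actual data: with axis=1, ignore_index applies to the concatenation axis, which here is the columns. The rows are still aligned on their index labels, so mismatched labels would still produce the diagonal result. The three-row frames below are a shortened, hypothetical sample.

```python
import pandas as pd

# Hypothetical shortened sample; df1 deliberately has non-default row labels.
df1 = pd.DataFrame({"Col1": [297, 275, 400]}, index=[10, 11, 12])
df2 = pd.DataFrame({"SC_Col1": [-0.464447, -0.884934, 1.504197]})

result = pd.concat([df1, df2], axis=1, ignore_index=True)

print(result.shape)          # (6, 2): the rows are still doubled
print(list(result.columns))  # [0, 1]: only the column labels were reset
```

So ignore_index=True is useful when stacking rows (axis=0), but it does not by itself fix row alignment when concatenating side by side.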

Tags: python-3.x pandas dataframe scikit-learn

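【Solution 1】:

The two DataFrames probably have different index labels. Try calling reset_index() on each DataFrame before concatenating:

For example, I have these two DataFrames with different index labels and try to concat them: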

import pandas as pd

d1 = {'Col1': [297, 297, 297, 297, 275, 275, 275, 400, 400, 400]}
d2 = {'SC_Col1': [-0.4644471998668502, -0.4644471998668502, -0.4644471998668502, -0.4644471998668502, -0.8849343767010354, -0.8849343767010354, -0.8849343767010354, 1.5041973098568349, 1.5041973098568349, 1.5041973098568349]}

# df1 gets explicit labels 10..19; df2 gets the default labels 0..9
df1 = pd.DataFrame(d1, index=[10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
df2 = pd.DataFrame(d2)
print(pd.concat([df1, df2], axis=1))

Output:

     Col1   SC_Col1
0     NaN -0.464447
1     NaN -0.464447
2     NaN -0.464447
3     NaN -0.464447
4     NaN -0.884934
5     NaN -0.884934
6     NaN -0.884934
7     NaN  1.504197
8     NaN  1.504197
9     NaN  1.504197
10  297.0       NaN
11  297.0       NaN
12  297.0       NaN
13  297.0       NaN
14  275.0       NaN
15  275.0       NaN
16  275.0       NaN
17  400.0       NaN
18  400.0       NaN
19  400.0       NaN

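After calling reset_index() with the argument drop=True before the concat() operation, the DataFrame looks like this:

# reset_index returns a new DataFrame, so the result must be assigned back
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
print(pd.concat([df1, df2], axis=1))

Output: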

   Col1   SC_Col1
0   297 -0.464447
1   297 -0.464447
2   297 -0.464447
3   297 -0.464447
4   275 -0.884934
5   275 -0.884934
6   275 -0.884934
7   400  1.504197
8   400  1.504197
9   400  1.504197
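
An alternative that avoids resetting anything is to give the scaled DataFrame the same index as the original when it is built. The sketch below assumes a hypothetical `data` frame with labels 10..19, and it replaces preprocessing.scale with an equivalent manual standardization (zero mean, population standard deviation) so that it runs without scikit-learn; neither detail comes from the asker's real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `data`, with a non-default index.
data = pd.DataFrame(
    {"Col1": [297, 297, 297, 297, 275, 275, 275, 400, 400, 400]},
    index=range(10, 20),
)
explan_vars = ["Col1"]

# Stand-in for preprocessing.scale: subtract the column mean and divide by
# the population standard deviation (NumPy's default, ddof=0).
values = data[explan_vars].to_numpy(dtype=float)
scaled = (values - values.mean(axis=0)) / values.std(axis=0)

# index=data.index makes the new frame line up with `data` row for row.
X_Scale = pd.DataFrame(
    scaled,
    columns=["SC_" + var for var in explan_vars],
    index=data.index,
)

datat = pd.concat([data, X_Scale], axis=1)
print(datat.shape)  # (10, 2)
```

This keeps the original row labels on the result, which matters if later code looks rows up by label.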

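Hope this helps :)

【Comments】:

Thanks ALFAFA, that seems to work. Do you know how the index could have changed?
There seems to be some whitespace before the Col1 column and none before the SC_Col1 column. To check whether the index labels differ, you can inspect data.index and X_Scale.index.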
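
As a sketch of that check, and of one common way an index ends up different: filtering rows keeps the original labels, so a frame that was filtered earlier no longer starts at 0. The `raw` frame and the filter condition below are hypothetical and only illustrate the idea; the asker's actual cause is not known from the thread.

```python
import pandas as pd

# Hypothetical: dropping the first row by a filter keeps labels 1, 2, 3.
raw = pd.DataFrame({"Col1": [0, 297, 275, 400]})
data = raw[raw["Col1"] > 0]
print(list(data.index))  # [1, 2, 3]

# A frame built from a plain array gets fresh labels 0, 1, 2.
X_Scale = pd.DataFrame({"SC_Col1": [-0.464447, -0.884934, 1.504197]})

# Index.equals reports whether the labels match exactly.
print(data.index.equals(X_Scale.index))  # False

# Only the shared labels would be aligned by pd.concat(..., axis=1).
print(len(data.index.intersection(X_Scale.index)))  # 2
```

If equals() returns False, either reset both indexes or build the second frame with the first one's index before concatenating.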