Merging two df results in NaN in appended columns

【Question title】: Merging two df results in NaN in appended columns
【Posted】: 2021-01-21 17:37:39
【Question】:

I have been struggling with the following problem for quite a while and would appreciate any help.

I want to merge df1 and df2 on 'country'.

df1.head()

+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+
|   |  loan_theme_id  | partner_id |               field_partner_name                |         loan_theme_type          |      location_name      |    lat    |    lon     | rural_pct |    city    |     region      |     country     |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+
| 0 | a1050000000wDrQ |        175 | Koret Israel Economic Development Funds (KIEDF) | Underserved                      | Abu Sanaan, Israel      | 32.958030 | 35.171969  | 0.0       | Abu Sanaan | Israel          | Israel          |
| 1 | a1050000007S5Kt |        485 | Building Markets                                | SME                              | Yangon, Myanmar (Burma) | 16.866069 | 96.195132  | NaN       | Yangon     | Myanmar (Burma) | Myanmar (Burma) |
| 2 | a1050000002YCWe |        369 | AsociaciÍ_n Chajulense de Mujeres (ACMUV)       | Artisan                          | Chajul, Guatemala       | 15.483483 | -91.037070 | NaN       | Chajul     | Guatemala       | Guatemala       |
| 3 | a1050000007qJuI |         77 | Al Majmoua                                      | Vulnerable Populations (Syrian)2 | Aley, Lebanon           | 33.810086 | 35.597326  | 43.0      | Aley       | Lebanon         | Lebanon         |
| 4 | a1050000006FnC9 |        357 | Alivio Capital                                  | Imagen Dental                    | Matamoros,Tamps, Mexico | 25.869029 | -97.502738 | 3.0       | Matamoros  | Tamps           | Mexico          |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+

Here are the column types of df1:

Int64Index: 100 entries, 108 to 549
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   loan_theme_id       100 non-null    category
 1   partner_id          100 non-null    category
 2   field_partner_name  100 non-null    string  
 3   loan_theme_type     100 non-null    category
 4   location_name       100 non-null    string  
 5   lat                 100 non-null    float64 
 6   lon                 100 non-null    float64 
 7   rural_pct           79 non-null     float64 
 8   city                100 non-null    string  
 9   region              100 non-null    string  
 10  country             100 non-null    string  
dtypes: category(3), float64(3), string(5)
memory usage: 19.2 KB

df2.head()

+---+-------------+-------------------------+----------+
|   |   country   |      world_region       |   MPI    |
+---+-------------+-------------------------+----------+
| 0 | Afghanistan | South Asia              | 0.309853 |
| 1 | Albania     | Europe and Central Asia | NaN      |
| 2 | Algeria     | Arab States             | NaN      |
| 3 | Armenia     | Europe and Central Asia | NaN      |
| 4 | Azerbaijan  | Europe and Central Asia | NaN      |
+---+-------------+-------------------------+----------+

Column types of df2:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102 entries, 0 to 101
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   country       102 non-null    string 
 1   world_region  102 non-null    object 
 2   MPI           78 non-null     float64
dtypes: float64(1), object(1), string(1)
memory usage: 3.2+ KB

To make sure there is at least some overlap:

display(df2[(df2.country == 'Guatemala')])

+----+-----------+-----------------------------+----------+
|    |  country  |        world_region         |   MPI    |
+----+-----------+-----------------------------+----------+
| 34 | Guatemala | Latin America and Caribbean | 0.113957 |
+----+-----------+-----------------------------+----------+

The merge:

df3 = pd.merge(df1, df2, on='country', how='left')
df3.head()

+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+--------------+-----+
|   |  loan_theme_id  | partner_id |               field_partner_name                |         loan_theme_type          |      location_name      |    lat    |    lon     | rural_pct |    city    |     region      |     country     | world_region | MPI |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+--------------+-----+
| 0 | a1050000000wDrQ |        175 | Koret Israel Economic Development Funds (KIEDF) | Underserved                      | Abu Sanaan, Israel      | 32.958030 | 35.171969  | 0.0       | Abu Sanaan | Israel          | Israel          | NaN          | NaN |
| 1 | a1050000007S5Kt |        485 | Building Markets                                | SME                              | Yangon, Myanmar (Burma) | 16.866069 | 96.195132  | NaN       | Yangon     | Myanmar (Burma) | Myanmar (Burma) | NaN          | NaN |
| 2 | a1050000002YCWe |        369 | AsociaciÍ_n Chajulense de Mujeres (ACMUV)       | Artisan                          | Chajul, Guatemala       | 15.483483 | -91.037070 | NaN       | Chajul     | Guatemala       | Guatemala       | NaN          | NaN |
| 3 | a1050000007qJuI |         77 | Al Majmoua                                      | Vulnerable Populations (Syrian)2 | Aley, Lebanon           | 33.810086 | 35.597326  | 43.0      | Aley       | Lebanon         | Lebanon         | NaN          | NaN |
| 4 | a1050000006FnC9 |        357 | Alivio Capital                                  | Imagen Dental                    | Matamoros,Tamps, Mexico | 25.869029 | -97.502738 | 3.0       | Matamoros  | Tamps           | Mexico          | NaN          | NaN |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+--------------+-----+

Column types of df3:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   loan_theme_id       100 non-null    category
 1   partner_id          100 non-null    category
 2   field_partner_name  100 non-null    string  
 3   loan_theme_type     100 non-null    category
 4   location_name       100 non-null    string  
 5   lat                 100 non-null    float64 
 6   lon                 100 non-null    float64 
 7   rural_pct           79 non-null     float64 
 8   city                100 non-null    string  
 9   region              100 non-null    string  
 10  country             100 non-null    string  
 11  world_region        0 non-null      object  
 12  MPI                 0 non-null      float64

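I really don't understand why world_region and MPI come out as NaN. I made sure that there are no NaN values in country in either df1 or df2, and that there is at least some overlap. The column types match as well.

Edit: Thanks to Paul, I tried to look up e.g. 'Guatemala' in df1. We can see in the table above that it is actually present in df1. However, running display(df1[(df1.country == 'Guatemala')]) returns an empty dataframe. So I tried running display(df1[(df1.country == ' Guatemala')]), with an additional space at the beginning, and now we get a result: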

+---+-----------------+------------+-------------------------------------------+-----------------+-------------------+-----------+-----------+-----------+--------+-----------+-----------+
|   |  loan_theme_id  | partner_id |            field_partner_name             | loan_theme_type |   location_name   |    lat    |    lon    | rural_pct |  city  |  region   |  country  |
+---+-----------------+------------+-------------------------------------------+-----------------+-------------------+-----------+-----------+-----------+--------+-----------+-----------+
| 2 | a1050000002YCWe |        369 | AsociaciÍ_n Chajulense de Mujeres (ACMUV) | Artisan         | Chajul, Guatemala | 15.483483 | -91.03707 | NaN       | Chajul | Guatemala | Guatemala |
+---+-----------------+------------+-------------------------------------------+-----------------+-------------------+-----------+-----------+-----------+--------+-----------+-----------+

Is there a function in pandas to check a df column for whitespace that would cause this kind of problem?
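One possible way to check for this, as a minimal sketch: compare each value with its stripped version using `Series.str.strip()`. The small `df1` below is a made-up stand-in for the real data, not the original dataframe.

```python
import pandas as pd

# Hypothetical sample standing in for df1; two values carry a leading space.
df1 = pd.DataFrame({"country": [" Guatemala", "Israel", " Mexico"]})

# True wherever stripping leading/trailing whitespace changes the value.
has_padding = df1["country"] != df1["country"].str.strip()

# Number of affected rows.
print(has_padding.sum())

# repr() puts quotes around each value, which makes the stray spaces visible.
print([repr(c) for c in df1.loc[has_padding, "country"]])
```

Printing values with `repr()` helps here because a plain `display()` or `head()` tends to hide leading and trailing spaces.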

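【Question comments】:

Tags: python pandas merge

【Solution 1】:

You are performing a left join, as specified by the how='left' keyword in the merge command. This means that if the right dataframe does not contain the country that a row of the left dataframe has, you get NaN in the columns coming from the right dataframe for that row.
For more information on join types and left joins, see the example here: https://www.w3schools.com/sql/sql_join_left.asp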
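To see which left-hand rows found no partner, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The following is a small sketch with made-up sample frames (the values are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical samples; the first country in df1 has a leading space.
df1 = pd.DataFrame({"city": ["Chajul", "Aley"],
                    "country": [" Guatemala", "Lebanon"]})
df2 = pd.DataFrame({"country": ["Guatemala", "Lebanon"],
                    "world_region": ["Latin America and Caribbean", "Arab States"]})

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for rows of df1 that had no match in df2.
merged = pd.merge(df1, df2, on="country", how="left", indicator=True)
print(merged[["country", "world_region", "_merge"]])

# Keys from df1 that did not match anything in df2.
unmatched = merged.loc[merged["_merge"] == "left_only", "country"]
print(unmatched.tolist())
```

If every row shows up as `left_only`, as in the question, the key values themselves differ between the two frames, even if they look identical when printed.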

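Edit:
This happens because in one of the dataframes the strings have an extra space around them. Before joining, you can remove the whitespace with a trim function (in pandas, Series.str.strip()).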
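A minimal sketch of that fix, assuming the only difference between the keys is surrounding whitespace. The frames are made-up one-row samples using the Guatemala values shown in the question:

```python
import pandas as pd

# Hypothetical one-row samples; df1's key has a leading space.
df1 = pd.DataFrame({"city": ["Chajul"], "country": [" Guatemala"]})
df2 = pd.DataFrame({"country": ["Guatemala"],
                    "world_region": ["Latin America and Caribbean"],
                    "MPI": [0.113957]})

# Strip leading/trailing whitespace from the join key on both sides.
df1["country"] = df1["country"].str.strip()
df2["country"] = df2["country"].str.strip()

df3 = pd.merge(df1, df2, on="country", how="left")
print(df3)
```

Stripping both sides is a cautious choice: it costs little and avoids having to know which dataframe carries the stray spaces.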

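【Comments】:

Hi Paul, thanks for your answer. But df2 has 'Guatemala', and df1 also has 'Guatemala' in row 3. Shouldn't the join work at least in this case? Thanks again for your time and the quick reply!
You are right. In this case it should work. Are you sure there is no extra space in df1's Guatemala, or some other typo? Does display(df1[(df1.country == 'Guatemala')]) return a row?
Hi Paul, thanks! You had the right idea. Do you know of a function I could use to check for this in a better way in the future?
Maybe you could use a trim function to remove leading and trailing spaces, and use .lower() to make everything lowercase so that you no longer have to worry about capitalization. However, this does not help against other kinds of typos.
Hey Paul, thanks, I'll give it a try. If you like, please post another suggestion about the extra whitespace so that I can mark it as the correct answer.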
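The normalization suggested in the comments (trim plus lowercase) could be sketched as follows. The helper `normalize_key` and the temporary column `country_key` are hypothetical names introduced for illustration, and the sample frames are made up:

```python
import pandas as pd

def normalize_key(s: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and lowercase, for use as a join key."""
    return s.str.strip().str.lower()

# Hypothetical samples with padding and inconsistent capitalization.
df1 = pd.DataFrame({"city": ["Chajul", "Aley"],
                    "country": [" Guatemala", "LEBANON "]})
df2 = pd.DataFrame({"country": ["Guatemala", "Lebanon"],
                    "world_region": ["Latin America and Caribbean", "Arab States"]})

# Join on a normalized helper column so the original values stay untouched.
left = df1.assign(country_key=normalize_key(df1["country"]))
right = df2.assign(country_key=normalize_key(df2["country"])).drop(columns="country")

df3 = left.merge(right, on="country_key", how="left").drop(columns="country_key")
print(df3)
```

Joining on a separate key column keeps the original spelling of `country` in the result. As noted in the comments, this does not catch other typos or alternative country names.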