pandas numpy error: ValueError: invalid literal for int() with base 10: '' - Answers

【Question title】: pandas numpy error: ValueError: invalid literal for int() with base 10: ''
【Posted】: 2022-01-05 21:40:08
【Question】:

I have a dataframe, which you can build with the following code:

import numpy as np
import pandas as pd
from io import StringIO
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df4s = """
    contract  RB  BeginDate  
0  A00118     46   20120705      
1  A00118     47   20121005      
2  A00253     48   0             
3  A00253     48   0                     
"""
# Raw string so that \s is passed to the regex engine unchanged
df4 = pd.read_csv(StringIO(df4s.strip()), sep=r'\s+',
                  dtype={"BeginDate": int}
                  )

The output is:

contract    RB  BeginDate
0   A00118  46  20120705
1   A00118  47  20121005
2   A00253  48  0
3   A00253  48  0

Now I want to create a new column 'first_month' based on 'BeginDate'. The logic is simple: if BeginDate equals 0, first_month should be 0; otherwise it should be the month part of BeginDate. My code is:

df4['first_month'] = np.where(df4['BeginDate'] != 0,
                                    df4['BeginDate'].astype(str).str[4:6:1].astype(int), 0)

The error is:

ValueError: invalid literal for int() with base 10: ''

The traceback is:

:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy)
    707         # work around NumPy brokenness, #1987
    708         if np.issubdtype(dtype.type, np.integer):
--> 709             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    710 
    711         # if we have a datetime/timedelta array of objects

pandas\_libs\lib.pyx in pandas._libs.lib.astype_intsafe()

pandas/_libs/src\util.pxd in util.set_value_at_unsafe()

ValueError: invalid literal for int() with base 10: ''

The output should be:

contract    RB  BeginDate  first_month
0   A00118  46  20120705   7
1   A00118  47  20121005   10
2   A00253  48  0          0
3   A00253  48  0          0

Can anyone help?

【Comments】:

Tags: python pandas dataframe numpy

【Solution 1】:

From the documentation of np.where:

x, y : array_like
Values from which to choose. x, y and condition need to be broadcastable to some shape.

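This means the expression you pass as x is not evaluated only where the condition is true. It is evaluated in full first, and np.where then merely selects from the result where the condition is true. So your operation runs on the whole dataframe, not on the subset (the first 2 rows). For a BeginDate of 0, astype(str) gives '0', the slice [4:6] gives the empty string '', and int('') fails. That is why the error is raised at the third row.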
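To make that concrete, here is a minimal sketch that reproduces the intermediate values without np.where at all. The Series name `begin` and the variable `month_text` are made up for illustration; the data mirrors the BeginDate column from the question.

```python
import pandas as pd

# Same values as the BeginDate column in the question
begin = pd.Series([20120705, 20121005, 0, 0], name="BeginDate")

# The slice runs on every row, before any condition could filter anything
month_text = begin.astype(str).str[4:6]
print(month_text.tolist())

# The empty strings produced by the 0 rows are what int() cannot parse
try:
    month_text.astype(int)
    failed = False
except ValueError as exc:
    failed = True
    print(exc)
```

The two rows holding 0 become '' after slicing, so the conversion fails regardless of what condition is later handed to np.where.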

Use zfill to pad '0' to '00000000', so that the slice [4:6] yields '00' instead of an empty string:

df4['first_month'] = df4['BeginDate'].astype(str).str.zfill(8).str[4:6:1].astype(int)
print(df4)

# Output
  contract  RB  BeginDate  first_month
0   A00118  46   20120705            7
1   A00118  47   20121005           10
2   A00253  48          0            0
3   A00253  48          0            0

【Comments】:

@William. I updated my answer. Can you check it? I hope my explanation is clear :)

【Solution 2】:

A pandas DataFrame can be filtered directly

df4[df4.BeginDate != 0]

or

df4.loc[df4.BeginDate != 0]

Then you can add your logic to extract the month

df4['first_month'] = df4.loc[df4.BeginDate != 0].BeginDate.astype(str).str[4:6:1].astype(int)

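Note that this puts NaN in rows 2 and 3 instead of 0 as in your example, because the assignment aligns on the index and the filtered-out rows have no value.

【Comments】:

Thanks for the reply, but I need rows 2 and 3 to be 0.
Replacing NaN with 0 should be simple. I'm not an expert on the best way to do the replacement, but this seems to work (it emits a warning): df4.first_month[df4.first_month.isnull().values] = 0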
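One possible way to do that replacement without the chained-assignment warning is fillna followed by a cast back to int. This is only a sketch: the frame is rebuilt inline so the snippet runs on its own, and the names `nonzero` and `months` are illustrative.

```python
import pandas as pd

# Rebuild the frame from the question so the snippet is self-contained
df4 = pd.DataFrame({
    "contract": ["A00118", "A00118", "A00253", "A00253"],
    "RB": [46, 47, 48, 48],
    "BeginDate": [20120705, 20121005, 0, 0],
})

# Extract the month only for rows with a real date
nonzero = df4["BeginDate"] != 0
months = df4.loc[nonzero, "BeginDate"].astype(str).str[4:6].astype(int)

# Assignment aligns on the index, so the filtered-out rows become NaN
df4["first_month"] = months

# Fill the gaps with 0 and restore an integer dtype (NaN forces float)
df4["first_month"] = df4["first_month"].fillna(0).astype(int)
print(df4)
```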

【Solution 3】:

Given that the BeginDate column contains string objects, you can get the desired result with:

df['FirstMonth'] = df.apply(lambda row: 0 if row['BeginDate'] == '0' else int(row['BeginDate'][4:6]), axis = 1)

This yields:

    Contract    RB  BeginDate   FirstMonth
0   A00118  46  20120705    7
1   A00118  47  20121005    10
2   A00253  48  0   0
3   A00253  48  0   0

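【Comments】:

Thanks for the reply, but the apply function runs very slowly, especially when the data is large.
That would have been a good piece of information to include in your question. I suppose that if the df you are using has more than 1000 rows, this could be a consideration.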
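On the speed concern, a vectorized alternative that avoids both apply and string conversion is plain integer arithmetic. This is not one of the posted answers, just a sketch, and it assumes BeginDate is an integer column in YYYYMMDD form (as in the question, where it is read with dtype int) with 0 as the only non-date value.

```python
import pandas as pd

# Rebuild the frame from the question so the snippet is self-contained
df4 = pd.DataFrame({
    "contract": ["A00118", "A00118", "A00253", "A00253"],
    "RB": [46, 47, 48, 48],
    "BeginDate": [20120705, 20121005, 0, 0],
})

# YYYYMMDD // 100 drops the day and leaves YYYYMM; % 100 then keeps MM.
# A BeginDate of 0 naturally maps to 0, so no special case is needed.
df4["first_month"] = (df4["BeginDate"] // 100) % 100
print(df4)
```

Because both operations work on the whole column at once, there is no per-row Python call as there is with apply.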