【Question title】: How to add another category in a DataFrame in python/pandas including only missing values?
【Posted】: 2018-07-30 15:52:54
【Question description】:

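I have a DataFrame with 7043 rows and two columns, "TotalCharges" and "Churn". Eleven cells in the "TotalCharges" column hold a missing value. I want to create 10 categories of TotalCharges plus one extra category named "MissingValues", but I cannot find a way to do it. My DataFrame looks like this: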

        TotalCharges Churn
0           29.85    No
1          1889.5    No
2          108.15   Yes
3         1840.75    No
4          151.65   Yes
5           820.5   Yes
6          1949.4    No
7           301.9    No
8         3046.05   Yes
9         3487.95    No
10         587.45    No
11          326.8    No
12         5681.1    No
13         5036.3   Yes
14        2686.05    No
15        7895.15    No
16        missing    No
17        7382.25    No
18         528.35   Yes
.... ....
.... ....

I would like to get something like this:

        TotalCharges Churn TotalChargesCategories
0           29.85    No    (18.799, 84.61]
1          1889.5    No    (947.38, 1400.55]
2          108.15   Yes    (84.61, 267.37]
3         1840.75    No    (1400.55, 2065.52]
4          151.65   Yes    (84.61, 267.37]
5           820.5   Yes    (552.82, 947.38]
6          1949.4    No    (1400.55, 2065.52]
7           301.9    No    (267.37, 552.82]
8         3046.05   Yes    (2065.52, 3132.75]
9         3487.95    No    (3132.75, 4471.44]
10         587.45    No    (552.82, 947.38]
11          326.8    No    (267.37, 552.82]
12         5681.1    No    (4471.44, 5973.69]
13         5036.3   Yes    (4471.44, 5973.69]
14        2686.05    No    (2065.52, 3132.75]
15        7895.15    No    (5973.69, 8684.8]
16        missing    No     MissingValues
17        7382.25    No    (5973.69, 8684.8]
18         528.35   Yes    (267.37, 552.82]
.... ....
.... .... 

If there were no missing values, this would be easy with the following code:

width_bin = (pd.qcut(df.TotalCharges,10))
df = df.assign(TotalChargesCat=width_bin)
df

But because of the 11 missing values, creating the categories fails, and this code raises the following error:

TypeError: unsupported operand type(s) for -: 'str' and 'str'

【Question comments】:

    Tags: python pandas numpy categories data-science


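    【Solution 1】:

    Simply coerce the "missing" entries to NaN (either by replacing them explicitly or by coercing the column to a numeric dtype), then bin as before. The example below uses pd.cut, which produces equal-width bins; pd.qcut, as used in the question, produces quantile-based bins and can be applied to the same coerced column: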

    df['TotalChargesCategories'] = pd.cut(pd.to_numeric(df['TotalCharges'], errors='coerce'),10)
    
    >>> df
       TotalCharges Churn TotalChargesCategories
    0         29.85    No       (21.985, 816.38]
    1        1889.5    No     (1602.91, 2389.44]
    2        108.15   Yes       (21.985, 816.38]
    3       1840.75    No     (1602.91, 2389.44]
    4        151.65   Yes       (21.985, 816.38]
    5         820.5   Yes      (816.38, 1602.91]
    6        1949.4    No     (1602.91, 2389.44]
    7         301.9    No       (21.985, 816.38]
    8       3046.05   Yes     (2389.44, 3175.97]
    9       3487.95    No      (3175.97, 3962.5]
    10       587.45    No       (21.985, 816.38]
    11        326.8    No       (21.985, 816.38]
    12       5681.1    No     (5535.56, 6322.09]
    13       5036.3   Yes     (4749.03, 5535.56]
    14      2686.05    No     (2389.44, 3175.97]
    15      7895.15    No     (7108.62, 7895.15]
    16      missing    No                    NaN
    17      7382.25    No     (7108.62, 7895.15]
    18       528.35   Yes       (21.985, 816.38]
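
    The output above leaves NaN in the new column rather than the "MissingValues" label the question asks for. A minimal sketch of one way to get that label follows. It is not part of the original answer, and it assumes the categorical accessor methods `cat.add_categories` and `fillna` are acceptable here. The small frame and the choice of 4 quantile bins are illustrative only; with the full 7043-row data one would pass 10.

```python
import pandas as pd

# Illustrative frame built from a few of the question's sample rows.
df = pd.DataFrame({
    "TotalCharges": ["29.85", "1889.5", "108.15", "1840.75", "151.65",
                     "820.5", "1949.4", "301.9", "3046.05", "3487.95",
                     "missing", "7382.25"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes",
              "No", "No", "Yes", "No", "No", "No"],
})

# Strings that cannot be parsed as numbers (here "missing") become NaN.
charges = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Quantile-based bins; rows whose input is NaN stay NaN in the result.
# 4 bins are used only because the sample is tiny (assumption).
bins = pd.qcut(charges, 4)

# A categorical column only accepts values that are declared categories,
# so register the label first and then fill the NaN rows with it.
df["TotalChargesCategories"] = (
    bins.cat.add_categories("MissingValues").fillna("MissingValues")
)

print(df)
```

    The original "TotalCharges" column is left untouched, so the "missing" marker stays visible next to its category, as in the desired output shown in the question.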
    

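    【Comments】:

    • Please post follow-up questions as an edit to your question or as a new question (but not as an answer). As a workaround for your next question, you can do what you are already doing, but instead of a = df.TotalChargesCategories use a = df.TotalChargesCategories.astype('str'), which should give you the output you want.
    • Thank you very much, that was helpful, and I apologize for the inconvenience.
    • No problem, glad to help! If everything works, please consider accepting the answer by clicking the check mark below the vote count.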
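
    The follow-up question that the first comment responds to is not shown here, so the following is only a sketch of what converting the interval categories to plain strings can look like. The sample values, the 2-bin split and the names `charges`, `bins` and `labels` are assumptions made for illustration.

```python
import pandas as pd

# A few values from the question's sample, including the "missing" marker.
raw = pd.Series(["29.85", "1889.5", "missing", "820.5", "3046.05"])
charges = pd.to_numeric(raw, errors="coerce")

# Equal-width bins, as in the answer; the NaN row gets no bin.
bins = pd.cut(charges, 2)

# Turn the interval labels into plain strings, then overwrite the rows
# that had no bin with an explicit placeholder label.
labels = bins.astype(str).where(bins.notna(), "MissingValues")

print(labels)
```

    Once the labels are plain strings they no longer need to be declared as categories, at the cost of losing the ordering of the intervals.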