Pandas error trying to convert string into integer

【Question Title】: Pandas error trying to convert string into integer
【Posted】: 2017-01-29 05:10:46
【Question Description】:

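Requirement:

One specific column in the DataFrame is of "mixed" type. It can hold values like "123456" or "ABC12345".

This DataFrame is being written to Excel using xlsxwriter.

For values like "123456", pandas converts them to 123456.0 (making them look like floats).

We need to put such a value into the xlsx as 123456 (i.e. as a +ve integer) when it is fully numeric.

Effort:

The code snippet is shown below: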

import pandas as pd
import numpy as np
import xlsxwriter
import os
import datetime
import sys
excel_name = str(input("Please Enter Spreadsheet Name :\n").strip())

print("excel entered :   "   , excel_name)
df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
         'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
         'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
         'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
         'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
          'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
          'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
first_pass_drop_duplicate = df_m_d.drop_duplicates(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType',
                                   'LicenseRightsDescription','FormatProfile','Start','End','PriceType','PriceValue','ContentID','ProductID',
                                   'AltID','ReleaseHistoryPhysicalHV','RatingSystem','RatingValue','CaptionIncluded'], keep=False) 
# We need to keep integer AltID  as is

first_pass_drop_duplicate.loc[first_pass_drop_duplicate['AltID']] =   first_pass_drop_duplicate['AltID'].apply(lambda x : str(int(x)) if str(x).isdigit() == True else x)

I have tried:

1. using `dataframe.astype(int).astype(str)` # works as long as the value is not alphanumeric
2. importing re and using pure Python `re.compile()` and `replace()` -- does not work
3. reading the DF row by row in a for loop !!! Kills the machine, as the DataFrame can have 300k+ records

Every time it fails with this error:

raise KeyError('%s not in index' % objarr[mask])
KEYERROR：'[102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.10711.102711.102711.102711.102711. 102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.10711. 102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711.102711。 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 537.533 \n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 21 24. 2124. 2124. 2124. 2124. 2124.\n 2124. 2124. 6643. 6643. 6643. 6643. 6643. 6643.\n 6643. 6643. 6643. 6643. 6643. 6643. \n64.663. 6643. 6643. 6643. 6643. 6643. 6643. 6643. 6643.\n 6643. 6643. 6643. 6643. 6643. 6643. 6643. 6643.] 不在索引中'

I am new to Python/pandas; any help or solution is much appreciated.
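
For reference, the KeyError above appears to come from `first_pass_drop_duplicate.loc[first_pass_drop_duplicate['AltID']]`: passing the column itself to `.loc` makes pandas look up every AltID value as a row label. The following is only a minimal sketch of assigning back to the column instead, assuming the numeric IDs have already been read as floats. The sample frame and the helper name `alt_id_to_text` are made up for illustration and are not part of the original code.

```python
import pandas as pd

# Hypothetical sample: numeric IDs arrive as floats, the rest as text.
df = pd.DataFrame({"AltID": [123456.0, "ABC12345", 5337.0, "2124"]})


def alt_id_to_text(value):
    """Return whole-number floats as digit strings; return anything else as text."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


# Assign to the column itself. df.loc[df["AltID"]] would instead treat the
# AltID values as row labels, which is what raises "... not in index".
df["AltID"] = df["AltID"].apply(alt_id_to_text)
print(df["AltID"].tolist())
```

Note that `str(x).isdigit()` is False for a value such as 123456.0, because its string form contains a decimal point, so the lambda in the question would leave those floats unchanged even with the indexing fixed. Missing values (NaN) are not handled specially in this sketch.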

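【Question Comments】:

So you need to convert only the numeric values to float, and not the non-numeric ones?
I need to make sure that it treats +ve integers as TEXT/STRING and does not add the .0 (decimal point) at the end, which actually shows up in Excel.
So you need to convert all values to type string? The problem is that Excel parses int values converted to string as float?
Exactly, any value in AltID should be treated as a string in pandas.
I tried your solution `first_pass_drop_duplicate.ix[first_pass_drop_duplicate.AltID.str.isdigit(), 'AltID'] = pd.to_numeric(first_pass_drop_duplicate.AltID, errors='coerce')`

Tags: python string pandas casting int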

【Solution 1】:

I think you need to_numeric:

df = pd.DataFrame({'AltID':['123456','ABC12345','123456'],
                   'B':[4,5,6]})

print (df)
      AltID  B
0    123456  4
1  ABC12345  5
2    123456  6

# .ix was removed in pandas 1.0; .loc with a boolean mask does the same job here
df.loc[df.AltID.str.isdigit(), 'AltID'] = pd.to_numeric(df.AltID, errors='coerce')

print (df)
      AltID  B
0    123456  4
1  ABC12345  5
2    123456  6

print (df['AltID'].apply(type))
0    <class 'float'>
1      <class 'str'>
2    <class 'float'>
Name: AltID, dtype: object

【Comments】:

Awesome! It didn't work on my series because the 4th element was already an int. pd.Series([1], dtype=object).str.isdigit() returns NaN. I had to do this: s.ix[s.str.isdigit().fillna(False)] = pd.to_numeric(s, errors='coerce'), and that worked well.
Also! This is almost certainly faster.
@piRSquared - Thanks. Another solution is df.ix[df.AltID.astype(str).str.isdigit(), 'AltID'] = pd.to_numeric(df.AltID, errors='coerce')
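
The point raised in these comments can be sketched as follows. This is only an illustration with a made-up series `s`, using `.loc` because `.ix` is no longer available in current pandas: `.str.isdigit()` yields NaN for elements that are not strings, so the mask is built from `astype(str)` first.

```python
import pandas as pd

# Hypothetical mixed series: two strings and one value that is already an int.
s = pd.Series(["123456", "ABC12345", 54], dtype=object)

# .str methods return NaN for non-string elements, so this mask has a gap.
mask_raw = s.str.isdigit()

# Casting to str first gives a clean boolean mask for every element.
mask = s.astype(str).str.isdigit()

# Convert only the all-digit entries; the alphanumeric one is left alone.
s.loc[mask] = pd.to_numeric(s[mask])
print(s.tolist())
```

`s.str.isdigit().fillna(False)`, as in the first comment, is another way to get a usable mask, though it would skip the element that is already an int.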

【Solution 2】:

Use apply and pd.to_numeric with the parameter errors='ignore'

Consider the pd.Series s

s = pd.Series(['12345', 'abc12', '456', '65hg', 54, '12-31-2001'])

s.apply(pd.to_numeric, errors='ignore')

0         12345
1         abc12
2           456
3          65hg
4            54
5    12-31-2001
dtype: object

Note the types

s.apply(pd.to_numeric, errors='ignore').apply(type)

0    <type 'numpy.int64'>
1            <type 'str'>
2    <type 'numpy.int64'>
3            <type 'str'>
4            <type 'int'>
5            <type 'str'>
dtype: object
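
As far as I know, `errors='ignore'` was deprecated for `pd.to_numeric` in pandas 2.2, with the recommendation to catch the exception explicitly instead. A minimal sketch of that variant on the same series, where the helper name `to_number_or_keep` is an assumption made for illustration:

```python
import pandas as pd


def to_number_or_keep(value):
    """Convert value with pd.to_numeric; return it unchanged if parsing fails."""
    try:
        return pd.to_numeric(value)
    except (ValueError, TypeError):
        return value


s = pd.Series(['12345', 'abc12', '456', '65hg', 54, '12-31-2001'])

# Element-wise conversion: digit strings become numbers, the rest stay as they are.
converted = s.apply(to_number_or_keep)
print(converted.tolist())
print(converted.apply(type).tolist())
```

The result is still an object series holding a mix of numbers and strings, just as with the `errors='ignore'` version above.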

【Comments】:

【Solution 3】:

In the end it worked by using the "converters" option of pandas read_excel

df_w02 = pd.read_excel(excel_name, names = df_header,converters = {'AltID':str,'RatingReason' : str}).fillna("")

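A converter "casts" a column to the type defined by the function/value I supply, so the integer is stored as a string without a decimal point being added.

【Comments】: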
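
The effect of `converters` can be sketched without an Excel file by using `pd.read_csv`, which accepts the same parameter. This is an analogy under that assumption rather than the exact `read_excel` call above, and the inline CSV text is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data: a numeric-looking ID column with one blank cell.
csv_text = "AltID,B\n123456,4\n,5\n5337,6\n"

# Default parsing: the blank cell becomes NaN, which forces a float column.
plain = pd.read_csv(io.StringIO(csv_text))

# With a converter, each raw cell is passed through str and kept as text.
as_text = pd.read_csv(io.StringIO(csv_text), converters={"AltID": str})

print(plain["AltID"].iloc[0])    # parsed as a float
print(as_text["AltID"].iloc[0])  # the original digits, kept as a string
```

Because the converter sees each value before type inference, the digits never pass through a float, which is why no ".0" appears later in the written file.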