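pandas automatically converting my string column to float

【Question title】: pandas automatically converting my string column to float
【Posted】: 2020-09-17 16:40:05
【Question description】:

How can I prevent pandas from converting my string values to float? The Billing Doc. and Sales Order columns contain 10-11 digit numbers, which are to be stored in a MySQL table in columns of data type CHAR(15). When I run the script below, I see .0 at the end of every number. I want them treated as strings/characters in our database. The Billing Doc. field contains numbers such as 3206790137, 3209056079, 3209763880, 3209763885, 3206790137, which end up stored in the DB as 3206790137.0, 3209056079.0, 3209763880.0, 3209763885.0, 3206790137.0. The Billing Doc. column in the database has data type CHAR(15).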

import datetime as DT

import pandas as pd

# gesdb_connection (the database connection) and compare_columns() are defined
# elsewhere in the original script and are not shown in the question.


def insert_billing(df):
    # Replace NaN/NaT with None so that missing values can be written as NULL.
    df = df.where((pd.notnull(df)), None)
    for row in df.to_dict(orient="records"):
        bill_item = row['Bill.Item']
        bill_qty = row['Billed Qty']
        bill_doct_date = row['Billi.Doc.Date']
        bill_doc = row['Billing Doc.']
        bill_net_value = row['Billi.Net Value']
        sales_order = row['Sales Order']
        import_date = DT.datetime.now().strftime('%Y-%m-%d')

        # Build the upsert statement by formatting the row values into it.
        query = "INSERT INTO sap_billing(" \
                "bill_item, " \
                "bill_qty, " \
                "bill_doc_date, " \
                "bill_doc, " \
                "bill_net_value, " \
                "sales_order, " \
                "import_date" \
                ") VALUES (" \
                "\"{}\", \"{}\", \"{}\", \"{}\"," \
                "\"{}\", \"{}\", \"{}\"" \
                ") ON DUPLICATE KEY UPDATE " \
                "bill_qty = VALUES(bill_qty), " \
                "bill_doc_date = VALUES(bill_doc_date), " \
                "bill_net_value = VALUES(bill_net_value), " \
                "import_date = VALUES(import_date) " \
                "".format(
                    bill_item,
                    bill_qty,
                    bill_doct_date,
                    bill_doc,
                    bill_net_value,
                    sales_order,
                    import_date
                    )
        # Turn the textual None/NaT placeholders into SQL NULL.
        query = query.replace('\"None\"', 'NULL')
        query = query.replace('(None', '(NULL')
        query = query.replace('\"NaT\"', 'NULL')
        query = query.replace('(NaT', '(NULL')

        try:
            q1 = gesdb_connection.execute(query)
        except Exception as e:
            print(bill_item, bill_doc, sales_order, e)


if __name__ == "__main__":
    engine_str = 'mysql+mysqlconnector://root:abc123@localhost/mydb'

    file_name = "tmp/dataload/so_tracking.XLSX"
    df = pd.read_excel(file_name)
    if df.shape[1] == 35 and compare_columns(list(df.columns.values)) == 1:
        insert_billing(df)
    else:
        print("Incorrect column count, column order or column headers.\n")

When I create a simple df and print it, the problem does not appear.

>>> import pandas as pd
>>> df = pd.DataFrame({'Sales Order': [1217252835, 1217988754, 1219068439],
...                    'Billing Doc.': [3222102723, 3209781889, 3214305818]})
>>> df
   Billing Doc.  Sales Order
0    3222102723   1217252835
1    3209781889   1217988754
2    3214305818   1219068439

However, when I read the Excel file and then print the column, it is read as float64.

file_name = "tmp/dataload/so_tracking.XLSX"
df = pd.read_excel(file_name)
print(df['Billing Doc.'])

680    3.252170e+09
681    3.252170e+09
682    3.252170e+09
683    3.252170e+09
684    3.252170e+09
685    3.252170e+09
686    3.252170e+09
687    3.252170e+09
688    3.252170e+09
689    3.252170e+09
690    3.252170e+09
.
.
.
694    3.251601e+09
695    3.251631e+09
696    3.252013e+09
697             NaN
698    3.252272e+09
699    3.252360e+09
700    3.252474e+09
.
.
Name: Billing Doc., dtype: float64

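【Question discussion】:

Can you distill this into a reproducible example? Nobody else has access to your database or spreadsheet, so any attempt to help is just a guess.
Pandas purists may not like this quick fix, but I use pd.read_csv('file.csv', dtype = object), which prevents pandas from converting numbers to float. I'm fairly sure you can swap read_csv() for the other DataFrame-creating functions.
@PaulH I have added an example.
@pshep123 Thanks for your input. This is an .xlsx file, and since I receive it from elsewhere I can't convert it to CSV.
No, I'm suggesting that you pass dtype = object as an argument to read_excel(). I haven't tested it, but it may be worth a try.

Tags: python python-3.x pandas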

【Solution 1】:

I found the solution myself and am posting it here for the record.

df = pd.read_excel(file_name, converters={'Billing Doc.' : str})
print(df['Billing Doc.'])

695    3251631331
696    3252012614
697           NaN
698    3252272451
699    3252359504
700    3252473894
701           NaN
702           NaN
703           NaN
704    3252652940
705           NaN
706           NaN
707           NaN
708           NaN
Name: Billing Doc., dtype: object
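
The effect can be reproduced without the spreadsheet. The sketch below is only an illustration: it uses read_csv on an in-memory string as a stand-in for the Excel file, and the sample rows (including the empty Billing Doc. cell) are made up. It shows that a numeric column with a missing value comes back as float64 by default, while asking for str on that column keeps the digits as text. read_excel accepts a dtype argument as well, so the same idea should carry over, though that is not tested here.

```python
import io

import pandas as pd

# Hypothetical sample data: the second row has an empty "Billing Doc." cell.
csv_text = (
    "Billing Doc.,Sales Order\n"
    "3251631331,1217252835\n"
    ",1217988754\n"
    "3252272451,1219068439\n"
)

# Default parsing: the missing value forces the whole column to float64.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["Billing Doc."].dtype)

# Requesting str for that column keeps the values as text, without ".0".
text_df = pd.read_csv(io.StringIO(csv_text), dtype={"Billing Doc.": str})
print(text_df["Billing Doc."].iloc[0])
```

The empty cell is still reported as missing in the second frame, which matches the NaN rows in the output above.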

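【Discussion】:

【Solution 2】:

Something similar happened to me because the index of the new column did not match the index of the original DataFrame. That produced NaN values, which in turn caused an automatic upcast to float. So check whether:

the indexes of the original DataFrame and the new column match
the new column contains NaN.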
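
A minimal sketch of that situation, using made-up values and a deliberately shifted index (both are assumptions for illustration, not taken from the question's data): assigning a Series aligns on the index, so the row without a match becomes NaN and the integer column is upcast to float64.

```python
import pandas as pd

df = pd.DataFrame({"Sales Order": [1217252835, 1217988754, 1219068439]})

# The new column's index is 1..3, while the DataFrame's index is 0..2.
new_col = pd.Series([3222102723, 3209781889, 3214305818], index=[1, 2, 3])

# Check 1: do the indexes match?
print(df.index.equals(new_col.index))

# Assignment aligns on the index; row 0 has no match and becomes NaN.
df["Billing Doc."] = new_col

# Check 2: does the new column contain NaN, and what dtype did it get?
print(df["Billing Doc."].isna().sum())
print(df["Billing Doc."].dtype)
```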

【Discussion】:

【Solution 3】:

Try this:

df = df.astype(str)

Note that this is very inefficient.

Alternatively, convert each value to int before inserting it into the query.
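
The per-value conversion could look roughly like the sketch below. to_id_string is a hypothetical helper, not part of pandas or of the original script. It assumes the IDs are whole numbers that happen to arrive as floats, and it goes through int first because str() of a float such as 3206790137.0 still carries the trailing .0.

```python
import math


def to_id_string(value):
    """Return an ID as a digit string, or None when the value is missing.

    Hypothetical helper for illustration; assumes IDs are whole numbers.
    """
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None  # missing cell, to be written as NULL
        if value.is_integer():
            return str(int(value))  # 3206790137.0 -> "3206790137"
    return str(value)


print(to_id_string(3206790137.0))
print(to_id_string(float("nan")))
```

In the question's loop this would be applied to row['Billing Doc.'] and row['Sales Order'] before the values are formatted into the query.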

【Discussion】: