Pandas read_excel function ignoring dtype: answers

【Question title】: Pandas read_excel function ignoring dtype
【Posted】: 2021-05-21 10:07:53
【Question】:

I am trying to read an Excel file with pd.read_excel(). The file has two columns, Date and Time, and I want to read both columns as str rather than as the Excel dtype.

Sample of the Excel file

I tried specifying the dtype or the converters parameter, to no avail.

df = pd.read_excel('xls_test.xlsx',
                   dtype={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object

df.head()
Date    Time
0   2020-03-08 00:00:00 10:00:00
1   2020-03-09 00:00:00 11:00:00
2   2020-03-10 00:00:00 12:00:00
3   2020-03-11 00:00:00 13:00:00
4   2020-03-12 00:00:00 14:00:00

As you can see, the Date column is not treated as str...

The same happens when using converters:

df = pd.read_excel('xls_test.xlsx',
                   converters={'Date':str,'Time':str})
df.dtypes
Date    object
Time    object
dtype: object

df.head()
Date    Time
0   2020-03-08 00:00:00 10:00:00
1   2020-03-09 00:00:00 11:00:00
2   2020-03-10 00:00:00 12:00:00
3   2020-03-11 00:00:00 13:00:00
4   2020-03-12 00:00:00 14:00:00

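I also tried other engines, but the result is always the same.

When reading a csv, the dtype parameter seems to work as expected.

What am I doing wrong here?

Edit: I forgot to mention that I am using the latest pandas version, 1.2.2, but I had the same problem before updating from 1.1.2.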

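【Comments】:

Looks like a bug. Have you tried the latest pandas version?
Yes, I am on the latest version, but I had the same problem on 1.1.2.
I ran into a similar problem on version 1.3.1.
Try this: stackoverflow.com/questions/32591466/…

Tags: python excel pandas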

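【Solution 1】:

The problem you are running into is that cells in Excel have data types. Here the data type is a date or a time, and the formatting is applied for display only. Loading it "directly" means loading a datetime type*.

This means that, whatever you do with the dtype= parameter, the data is loaded as a date and then converted to a string, which gives the result you are seeing: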

>>> pd.read_excel('test.xlsx').head()
        Date      Time            Datetime
0 2020-03-08  10:00:00 2020-03-08 10:00:00
1 2020-03-09  11:00:00 2020-03-09 11:00:00
2 2020-03-10  12:00:00 2020-03-10 12:00:00
3 2020-03-11  13:00:00 2020-03-11 13:00:00
4 2020-03-12  14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx').dtypes
Date        datetime64[ns]
Time                object
Datetime    datetime64[ns]
dtype: object
>>> pd.read_excel('test.xlsx', dtype='string').head()
                  Date      Time             Datetime
0  2020-03-08 00:00:00  10:00:00  2020-03-08 10:00:00
1  2020-03-09 00:00:00  11:00:00  2020-03-09 11:00:00
2  2020-03-10 00:00:00  12:00:00  2020-03-10 12:00:00
3  2020-03-11 00:00:00  13:00:00  2020-03-11 13:00:00
4  2020-03-12 00:00:00  14:00:00  2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx', dtype='string').dtypes
Date        string
Time        string
Datetime    string
dtype: object
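
The trailing 00:00:00 comes from Python's default string conversion, not from anything stored in the file. A minimal sketch, assuming the Excel reader hands back a datetime.datetime for a date cell and a datetime.time for a time cell (the variable names are illustrative, not part of pandas):

```python
import datetime

# Assumed stand-ins for the values an Excel reader returns for one row.
date_cell = datetime.datetime(2020, 3, 8)  # a date-only cell still carries a midnight time
time_cell = datetime.time(10, 0)

# str() uses the default ISO-like representation, not the Excel display format.
date_text = str(date_cell)
time_text = str(time_cell)

print(date_text)  # 2020-03-08 00:00:00
print(time_text)  # 10:00:00
```

This matches the Date and Time columns shown above: the conversion to text happens after the cell has already been parsed as a date.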

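Only in csv files is datetime data stored in the file as a string. There, loading it "directly" as a string makes sense. With an Excel file, you might as well load it as a date and format it with .dt.strftime().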
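
The load-then-format route might look like the sketch below. To keep it runnable without a workbook, the DataFrame is built by hand as a stand-in for what pd.read_excel('test.xlsx') returned above (a datetime64 Date column and a Time column of datetime.time objects); the chosen format strings are only examples.

```python
import datetime

import pandas as pd

# Stand-in for the result of pd.read_excel('test.xlsx'); built by hand so the sketch runs without a file.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-08", "2020-03-09"]),
    "Time": [datetime.time(10, 0), datetime.time(11, 0)],
})

# datetime64 columns expose the .dt accessor, so the whole column is formatted at once.
df["Date"] = df["Date"].dt.strftime("%Y-%m-%d")

# The Time column holds plain datetime.time objects (dtype object), so format element by element.
df["Time"] = df["Time"].map(lambda t: t.strftime("%H:%M:%S"))

print(df)
```

Both columns then hold plain text in whatever format was requested, independent of how Excel displayed the cells.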

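This is not to say that you cannot load the data the way it is formatted, but you need 2 steps:

Load the data
Re-apply the formatting

Some translation between the format types is needed, and you cannot do this directly in pandas - but you can use the engine that pandas uses as its backend: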

import datetime
import openpyxl
import re

date_corresp = {
    'dd': '%d',
    'mm': '%m',
    'yy': '%y',
    'yyyy': '%Y',
}

time_corresp = {
    'hh': '%H',  # 24-hour clock; '%h' is not a portable strftime hour code
    'mm': '%M',
    'ss': '%S',
}

def datecell_as_formatted(cell):
    if isinstance(cell.value, datetime.time):
        dfmt, tfmt = '', cell.number_format
    elif isinstance(cell.value, (datetime.date, datetime.datetime)):
        dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + ['']
    else:
        raise ValueError('Not a datetime cell')

    for fmt in re.split(r'\W', dfmt):
        if fmt:
            dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt)

    for fmt in re.split(r'\W', tfmt):
        if fmt:
            tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt)

    return cell.value.strftime(dfmt + tfmt)

You can then use it as follows:

>>> wb = openpyxl.load_workbook('test.xlsx')
>>> ws = wb.worksheets[0]
>>> datecell_as_formatted(ws.cell(row=2, column=1))
'08/03/20'

(If the _corresp dictionaries are incomplete, you can extend them with more date/time format items.)
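
One way to extend the mappings is sketched below. It is a simplified variant, not the function above: the helper name excel_to_strftime and the token tables are hypothetical, only a handful of Excel format codes are covered, and each run of letters is replaced in a single pass so that an already translated code is never rewritten a second time.

```python
import datetime
import re

# Hypothetical token tables; extend them with whatever format codes your workbooks use.
DATE_TOKENS = {"yyyy": "%Y", "yy": "%y", "mmmm": "%B", "mmm": "%b", "mm": "%m", "dd": "%d"}
TIME_TOKENS = {"hh": "%H", "mm": "%M", "ss": "%S"}


def excel_to_strftime(number_format, tokens):
    """Translate each run of letters via the token table; unknown runs are left unchanged."""
    return re.sub(
        r"[A-Za-z]+",
        lambda match: tokens.get(match.group(0).lower(), match.group(0)),
        number_format,
    )


date_pattern = excel_to_strftime("dd/mm/yy", DATE_TOKENS)
time_pattern = excel_to_strftime("hh:mm:ss", TIME_TOKENS)

print(date_pattern)                                       # %d/%m/%y
print(datetime.date(2020, 3, 8).strftime(date_pattern))   # 08/03/20
print(datetime.time(10, 0).strftime(time_pattern))        # 10:00:00
```

Date and time tables stay separate because mm means month in a date format and minutes in a time format, which is the same reason the answer keeps two dictionaries.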

* It is stored as a floating-point number, the number of days since 1 January 1900, as you can see by formatting a date as a number or on this excelcampus page.
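
The serial-number storage can be illustrated with a short conversion. This is a sketch under one stated assumption: the workbook uses Excel's default 1900 date system, where 1900-01-01 is day 1 and a nonexistent 29 February 1900 is also counted, so for modern dates the effective zero point is 1899-12-30. The helper name is hypothetical.

```python
import datetime

# Effective zero point of the 1900 date system for dates after February 1900 (assumption stated above).
EXCEL_EPOCH = datetime.datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Whole part of the serial = days since the epoch, fractional part = time of day."""
    return EXCEL_EPOCH + datetime.timedelta(days=serial)


midnight = excel_serial_to_datetime(43898)
noon = excel_serial_to_datetime(43898.5)

print(midnight)  # 2020-03-08 00:00:00
print(noon)      # 2020-03-08 12:00:00
```

Workbooks saved with the 1904 date system use a different zero point, so this helper would not apply to them unchanged.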

【Comments】:

【Solution 2】:

As the other comments say, this is most likely a bug.

Not ideal, but you could always do something like this:

import pandas as pd
#df = pd.read_excel('test.xlsx',dtype={'Date':str,'Time':str}) 
# this line can be then simplified to : 
df = pd.read_excel('test.xlsx')
df['Date'] = df['Date'].apply(lambda x: '"' + str(x) + '"')
df['Time'] = df['Time'].apply(lambda x: '"' + str(x) + '"')
print (df)
print(df['Date'].dtype)
print(df['Time'].dtype)

                     Date        Time
0   "2020-03-08 00:00:00"  "10:00:00"
1   "2020-03-09 00:00:00"  "11:00:00"
2   "2020-03-10 00:00:00"  "12:00:00"
3   "2020-03-11 00:00:00"  "13:00:00"
4   "2020-03-12 00:00:00"  "14:00:00"
5   "2020-03-13 00:00:00"  "15:00:00"
6   "2020-03-14 00:00:00"  "16:00:00"
7   "2020-03-15 00:00:00"  "17:00:00"
8   "2020-03-16 00:00:00"  "18:00:00"
9   "2020-03-17 00:00:00"  "19:00:00"
10  "2020-03-18 00:00:00"  "20:00:00"
11  "2020-03-19 00:00:00"  "21:00:00"
object
object

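【Comments】:

This is what I always do as well.
Why not use strftime? For example: df.Date.apply(lambda x: x.strftime('%Y-%m-%d'))

【Solution 3】:

Here is a simple solution. Even if you pass str in dtype, the column only comes back as object. Use the following code to read the columns as the string dtype.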

df= pd.read_excel("xls_test.xlsx",dtype={'Date':'string','Time':'string'})

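To learn more about the pandas string dtype, see the link below:

https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

Let me know if you have any problems with this!

【Comments】:

【Solution 4】:

Since version 1.0.0, pandas has two ways to store text data: object or StringDtype (source).

Since version 1.1.0, StringDtype works in all situations where astype(str) or dtype=str work (source).

All dtypes can now be converted to StringDtype.

You just need to specify dtype="string" when loading the data with pandas: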

>>> df = pd.read_excel('xls_test.xlsx', dtype="string")
>>> df.dtypes
Date    string
Time    string
dtype: object
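
If the data has already been loaded, the same dtype can be applied afterwards with astype("string"). The sketch below uses a hand-built DataFrame of text values as a stand-in for the loaded sheet, so it runs without an Excel file; it only shows the dtype change and does not alter how the dates were rendered.

```python
import pandas as pd

# Stand-in for an already loaded sheet whose columns hold text values.
df = pd.DataFrame({
    "Date": ["2020-03-08 00:00:00", "2020-03-09 00:00:00"],
    "Time": ["10:00:00", "11:00:00"],
})

# Convert every column to the dedicated string extension dtype.
converted = df.astype("string")

print(converted.dtypes)
```

The values themselves are unchanged; only the column dtype moves to pandas' StringDtype.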

【Comments】: