How to find the input line with mixed types (answers)

【Question title】: How to find the input line with mixed types
【Posted】: 2018-05-29 21:26:00
【Question description】:

I am reading a large CSV file with pandas:

features = pd.read_csv(filename, header=None,
                       names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort',
                              'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'],
                       usecols=['Duration','SrcDevice','DstDevice','Protocol','DstPort',
                                'SrcPackets','DstPackets','SrcBytes','DstBytes'])

I get:

sys:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  %!PS-Adobe-3.0

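How can I find the first line in the input that causes this warning? I need this to debug a problem with the input file, which should not contain mixed types.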

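【Comments】:

See a similar question here - stackoverflow.com/questions/24251219/…
@mm441 Thanks, but that doesn't seem to include an answer on how to find the line that causes the warning?
How big is your file? If it is small enough, checking it by eye may be the fastest way.
@MadPhysicist About 4 million lines.
Get an intern to do it :)

Tags: python pandas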

【Solution 1】:

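import pandas as pd

# Read the file in blocks of 1000 rows, forcing DstPort to int.
# A block containing a value that cannot be converted raises ValueError.
# Set the upper bound to the file's actual row count: a block that starts
# past the end of the file also raises (EmptyDataError is a ValueError).
for endrow in range(1000, 4000001, 1000):
    startrow = endrow - 1000
    rows = 1000
    try:
        pd.read_csv(filename, dtype={"DstPort": int}, skiprows=startrow, nrows=rows, header=None,
                names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort',
                       'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'],
                usecols=['Duration','SrcDevice', 'DstDevice', 'Protocol', 'DstPort',
                         'SrcPackets','DstPackets','SrcBytes','DstBytes'])
    except ValueError:
        print(f"Error is from row {startrow} to row {endrow}")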

This reads the file as a series of DataFrames of 1000 rows each, to see which row range contains the mixed-type value that causes the problem.
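
Once a failing block is known, the exact row can probably be pinned down by re-reading only that block as strings and testing each value. The following is a minimal sketch, assuming the suspect column should contain only numbers; `find_bad_rows` is a hypothetical helper written for illustration, not part of pandas, and the column is addressed by position (6 would be DstPort in the question's names list).

```python
import io
import pandas as pd

def find_bad_rows(source, startrow, nrows, column_index):
    """Return the 0-based file row numbers, within the block that starts at
    startrow, whose value in column_index does not parse as a number.
    Missing values are reported as bad too."""
    # dtype=str keeps the raw text, so no type inference hides the problem.
    block = pd.read_csv(source, header=None, skiprows=startrow, nrows=nrows,
                        usecols=[column_index], dtype=str)
    # errors="coerce" turns anything non-numeric into NaN instead of raising.
    numeric = pd.to_numeric(block[column_index], errors="coerce")
    offsets = numeric.isna().to_numpy().nonzero()[0]
    return [int(startrow + offset) for offset in offsets]

# Small in-memory demo; with a real file, pass its path instead of StringIO.
sample = "a,1\nb,2\nc,x\nd,4\n"
print(find_bad_rows(io.StringIO(sample), 0, 4, 1))
```

For the loop above, this would be called as `find_bad_rows(filename, startrow, 1000, 6)` inside the `except` branch.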

【Discussion】:

【Solution 2】:

Once pandas has finished reading the file, you can NOT find out which lines were problematic (see this answer for the reason).

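This means you need a way to find them while the file is being read. For example, read the file line by line and check the types in each line; if any of them does not match the expected type, you have found the line you are looking for.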
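
The line-by-line idea can be sketched without pandas at all, using the standard `csv` module. This is only an illustration under the assumption that the suspect column should hold integers; `first_non_integer_row` is a hypothetical name, and the column is given by position.

```python
import csv
import io

def first_non_integer_row(lines, column_index):
    """Return (row_number, value) for the first row whose field at
    column_index is not an integer, or None if every row is fine.
    Row numbers are 0-based; a missing field is reported as None."""
    for row_number, fields in enumerate(csv.reader(lines)):
        value = fields[column_index] if column_index < len(fields) else None
        try:
            int(value)
        except (TypeError, ValueError):
            return row_number, value
    return None

# Small in-memory demo; for a real file use open(path, newline="").
sample = io.StringIO("a,80\nb,http\nc,443\n")
print(first_non_integer_row(sample, 1))  # prints (1, 'http')
```

Because it stops at the first mismatch and never builds a DataFrame, this tends to be a reasonable choice when only the first offending line is needed.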

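To do this with pandas, you can pass chunksize=1 to pd.read_csv() to read the file in chunks (DataFrames of N rows each; here N = 1). See the documentation if you want to know more.

The code looks something like this: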

import pandas as pd

# read the file in chunks of size 1. This returns a reader rather than a DataFrame.
# header=None because the file in the question has no header line.
reader = pd.read_csv(filename, header=None, chunksize=1)

# get the first chunk (DataFrame), to calculate the "true" expected types
first_row_df = reader.get_chunk()
expected_types = [type(val) for val in first_row_df.iloc[0]]  # a list of the expected types.

i = 1  # the current index. Start from 1 because we've already read the first row.
for row_df in reader:
    row_types = [type(val) for val in row_df.iloc[0]]
    if row_types != expected_types:
        print(i)  # this row is the wanted one
        break
    i += 1

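Note that this code assumes the first row has the "true" types. It is also really slow, so I suggest checking only the columns you suspect are problematic (although that will not improve performance by much).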
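
A possible variation on the above is to read only the suspect column, in much larger chunks, and test each chunk in one vectorized step. This is a sketch rather than a measured improvement: `first_bad_row_chunked` is a hypothetical helper, the chunk size is an arbitrary choice, and "bad" here means "does not parse as a number" (missing values included).

```python
import io
import pandas as pd

def first_bad_row_chunked(source, column_index, chunksize=100_000):
    """Scan one column chunk by chunk and return the 0-based row number of
    the first value that is not numeric, or None if there is none."""
    reader = pd.read_csv(source, header=None, usecols=[column_index],
                         dtype=str, chunksize=chunksize)
    for chunk in reader:
        # Non-numeric text becomes NaN instead of raising.
        numeric = pd.to_numeric(chunk[column_index], errors="coerce")
        # The chunk index keeps counting across chunks, so it is the row number.
        bad = chunk.index[numeric.isna().to_numpy()]
        if len(bad) > 0:
            return int(bad[0])
    return None

# Small in-memory demo with a tiny chunk size to exercise several chunks.
sample = "a,1\nb,2\nc,3\nd,x\ne,5\n"
print(first_bad_row_chunked(io.StringIO(sample), 1, chunksize=2))  # prints 3
```

Reading with `dtype=str` sidesteps type inference entirely, so the result does not depend on the first row having the "true" types.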

【Discussion】: