Find the 'strongest' type per column in a CSV

【Question title】: Find 'strongest' type per column in CSV
【Posted】: 2018-10-06 01:46:13
【Question】:

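I need to scan a CSV column by column, find the strongest data type in each column, and then apply that type to the whole column.

For example, if I have a CSV that looks like this (yes, I know there are no commas...):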

    + C1 + C2 + C3 + C4
R1  | i  | s  | i  | f
R2  | i  | f  | i  | i
R3  | i  | i  | s  | f

# i = int
# f = float
# s = str

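The 'strongest' type for C1 is i, for C2 it is s, for C3 it is s, and for C4 it is f.

So the order of 'strength' is str > float > int.

Why? Because the file type I'm writing these values to explicitly requires that the data type specified for a field (its column) match the data in it (i.e. if the field is set to FLOAT, I can't put a str column in it, otherwise the file is invalid).

To do this, I'm doing the following: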

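1. For each file, read it row by row and check every column; store the 'strongest' type found.
2. Create a new container holding the rows cast to the new types.

Item 2 is simple with a dictionary and a list comprehension: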

types = {header: None for header in r.fieldnames}
# read file and store "strongest" found in 'types[header]' per column
# ...
typed = [[types[header](row[header]) for header in types] for row in rows]
# note: types[header] value is a function alias (i.e. int vs int())

Item 1 is where most of the heavy lifting happens:

for row in r: # r is a csv.DictReader
    rows.append(row) # list of OrderedDicts since r is a generator
    # problematic because I have to keep checking just to append...
    if all(types[header] is str for header in types):
        continue # all 'str' so stop checking

    for header in types:
        if types[header] is str:
            continue # whole column can be bypassed from now on

        # function just type casts 'int' or 'float' on string by ValueError
        t = self.find_type(row[header])
        if (types[header] is int) and (t is float):
            types[header] = t # float > int since all int's can be represented as float
        elif (types[header] is float) and (t is int):
            pass # int < float so do nothing
        else:
            types[header] = t # if 'str' will be caught later by first if

The worst case for this is the number of rows in the CSV, because the very last row might be the one holding a value that tests as str.
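The single-pass scan described above can be sketched with only the standard library. This is a minimal sketch, not the original code: `find_type` is reconstructed from its one-line description in the question, while `RANK`, `strongest_types` and the sample data are hypothetical names introduced for illustration. It assumes every row has a value for every header, and that a column with no rows defaults to int.

```python
import csv
import io

# Rank of each type: a higher rank is "stronger" (str > float > int).
RANK = {int: 0, float: 1, str: 2}


def find_type(value):
    """Return int, float or str, whichever cast succeeds first."""
    for cast in (int, float):
        try:
            cast(value)
            return cast
        except ValueError:
            continue
    return str


def strongest_types(lines):
    """Scan CSV lines once; return ({header: type}, list of row dicts)."""
    reader = csv.DictReader(lines)
    types = {header: int for header in reader.fieldnames}  # assumed default
    rows = []
    pending = set(types)  # columns not yet promoted to str
    for row in reader:
        rows.append(row)  # rows are always kept, even when nothing is pending
        for header in list(pending):
            t = find_type(row[header])
            if RANK[t] > RANK[types[header]]:
                types[header] = t
            if t is str:
                pending.discard(header)  # str is final, stop checking this column
    return types, rows


sample = io.StringIO("C1,C2,C3,C4\n1,a,6,8.0\n2,4.,7,9\n3,5,b,10.0\n")
types, rows = strongest_types(sample)
typed = [[types[h](row[h]) for h in types] for row in rows]
```

Tracking the still-undecided columns in a set avoids re-testing every header on every row, and the inner loop becomes a no-op once all columns are str, so the rows keep being collected without a `break`.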

Is there a more efficient way to do this, perhaps with pandas (which I haven't used much so far)?

Solution:

from numpy import issubdtype
from numpy import integer
from numpy import floating
from pandas import read_csv
from shapefile import Writer # PyShp library

df = read_csv('/some/file', low_memory = False)

rows = df.values.tolist() # fastest access over df.iterrows()

w = Writer(5, True)

# This is the core of the question
# I can access df[col].dtype but I didn't want to use str == str
# If this can be done better than subtype check let me know
for col in df:
    if issubdtype(df[col].dtype, integer):
        w.field(col, 'N', 20, 0)
    elif issubdtype(df[col].dtype, floating):
        w.field(col, 'F', 20, 10)
    else:
        w.field(col, 'C', 40, 0)

# Alternatively (1):
# from numpy import int64
# from numpy import float64
# for col in df:
#     if df[col].dtype.type is int64: 
#         w.field(col, 'N', 20, 0)
#     elif df[col].dtype.type is float64:
#         w.field(col, 'F', 20, 10)
#     else:
#         w.field(col, 'C', 40, 0)

# Alternatively (2):
# Don't import numpy directly in namespace
# for col in df:
#     if df[col].dtype == 'int64': 
#         w.field(col, 'N', 20, 0)
#     elif df[col].dtype == 'float64':
#         w.field(col, 'F', 20, 10)
#     else:
#         w.field(col, 'C', 40, 0)


lon = df.columns.get_loc('LON')
lat = df.columns.get_loc('LAT')

for row in rows:
    w.point(row[lon], row[lat])
    w.record(*row)

w.save('/some/outfile')
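As one possible answer to the "can this be done better than a subtype check" comment in the code above, pandas ships its own dtype predicates in `pandas.api.types`. The sketch below is only an illustration: `field_spec` and the sample frame are hypothetical, and the returned tuples simply mirror the field arguments used above.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def field_spec(dtype):
    """Map a pandas dtype to (field type, size, decimals)."""
    if is_integer_dtype(dtype):
        return ("N", 20, 0)
    if is_float_dtype(dtype):
        return ("F", 20, 10)
    return ("C", 40, 0)  # anything else is written as text


# Hypothetical sample frame standing in for the CSV that was read.
df = pd.DataFrame({"ID": [1, 2], "LON": [10.5, 11.25], "NAME": ["a", "b"]})
specs = {col: field_spec(df[col].dtype) for col in df}
```

With a helper like this, the loop body above would reduce to something like `w.field(col, *field_spec(df[col].dtype))`.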

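【Comments】:

Hmm... shouldn't `continue # all 'str' so stop checking` be a break, so that you actually stop checking?
For this to be treated as a CSV, there must be one single delimiter across all rows. Your header uses '+' as the delimiter, while the other rows use '|'. Neither pandas nor the csv module can read it as is.
@JonClements Yes, it should, but that's a flaw I have right now. Otherwise rows would not contain the remaining rows from the DictReader. So I just skip the header checks and keep copying the OrderedDicts it yields.
@akshat It is a CSV; that is just a visual representation. Hence my remark in the body, "(yes, I know there are no commas...)". If I opened the file in Excel to visualize it, there would be no (visible) single delimiter either. It is only there to help you understand my problem set.
How big is your CSV?

Tags: python python-3.x pandas csv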

【Answer 1】:

An example DataFrame:

In [11]: df
Out[11]:
    C1  C2 C3    C4
R1   1   a  6   8.0
R2   2  4.  7   9.0
R3   3   5  b  10.0

I wouldn't try to be clever with any short-circuit evaluation. I'd just take the type of every entry:

In [12]: df_types = df.applymap(type)

In [13]: df_types
Out[13]:
               C1             C2             C3               C4
R1  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>
R2  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>
R3  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>

If you enumerate these types, you can use max:

In [14]: d = {ch: i for i, ch in enumerate([int, float, str])}

In [15]: d_inv = {i: ch for i, ch in enumerate([int, float, str])}

In [16]: df_types.applymap(d.get)
Out[16]:
    C1  C2  C3  C4
R1   0   2   2   1
R2   0   2   2   1
R3   0   2   2   1

In [17]: df_types.applymap(d.get).max()
Out[17]:
C1    0
C2    2
C3    2
C4    1
dtype: int64

In [18]: df_types.applymap(d.get).max().apply(d_inv.get)
Out[18]:
C1      <class 'int'>
C2      <class 'str'>
C3      <class 'str'>
C4    <class 'float'>
dtype: object

Now you can iterate over each column and update it in df (to the max type):

In [21]: for col, typ in df_types.applymap(d.get).max().apply(d_inv.get).items():
             df[col] = df[col].astype(typ)


In [22]: df
Out[22]:
    C1  C2 C3    C4
R1   1   a  6   8.0
R2   2  4.  7   9.0
R3   3   5  b  10.0

In [23]: df.dtypes
Out[23]:
C1      int64
C2     object
C3     object
C4    float64
dtype: object

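If you have many columns, this could be slightly more efficient if you group by type and update the columns in batches (e.g. all the string columns at once).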
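One way to batch the update, swapped in here in place of grouping the columns by hand, is to pass a column-to-type mapping to `DataFrame.astype`, which converts all listed columns in a single call. This is a sketch under assumptions: the `strongest` mapping stands for the per-column result computed above, and the sample frame holds strings only, as the csv module would deliver them.

```python
import pandas as pd

# Hypothetical frame of raw strings, e.g. as read by the csv module.
df = pd.DataFrame(
    {
        "C1": ["1", "2", "3"],
        "C2": ["a", "4.", "5"],
        "C3": ["6", "7", "b"],
        "C4": ["8.0", "9", "10.0"],
    },
    index=["R1", "R2", "R3"],
)

# Assumed result of the max step: the strongest type found per column.
strongest = {"C1": int, "C2": str, "C3": str, "C4": float}

# A single astype call converts every column to its own target type.
converted = df.astype(strongest)
```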

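【Comments】:

Actually, though, in pandas, when you read this in, it will convert the int columns to int64 and the float columns to float64, so you only need to worry about the object columns.
The OP's source data is a CSV... so I guess just using .read_csv(..., low_memory=False), then accessing the resulting frame's .dtypes and treating object as str would do...
My limited pandas experience may be holding me back here. But when read in with the csv module, a CSV is nothing but string values. So a row that looks like 1, 2, 3, 4.5, abc, 1.1 is really '1', '2', '3', '4.5', 'abc', '1.1'. Does pandas do automatic type testing in .read_csv()?
@pstatix pandas' read_csv tries to infer the types, similar to what you are doing by hand... in fact, it applies the same precedence as you do: any string stays a string, and floats trump ints.
One thing it won't do is convert [1, 2, 'c'] to ['1', '2', 'c'], i.e. object columns are not necessarily converted to strings.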
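The two points made in these comments can be checked with a small experiment. This is a minimal sketch using made-up sample data; the variable names are hypothetical. The first part reads a CSV from memory and classifies the inferred dtypes, the second shows that a mixed object column keeps its Python objects until it is cast explicitly.

```python
import io

import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

csv_text = "C1,C2,C3,C4\n1,a,6,8.0\n2,4.,7,9\n3,5,b,10.0\n"
df = pd.read_csv(io.StringIO(csv_text))

# Classify what read_csv inferred; anything non-numeric is treated as str.
inferred = {
    col: "int" if is_integer_dtype(dt) else "float" if is_float_dtype(dt) else "str"
    for col, dt in df.dtypes.items()
}

# A mixed column built in memory keeps its Python objects as they are...
mixed = pd.Series([1, 2, "c"])
mixed_types = [type(v) for v in mixed]

# ...until it is cast explicitly.
as_text = mixed.astype(str).tolist()
```

For the sample above, the inferred classes follow the str > float > int order from the question: one stray string makes the whole column non-numeric, and one float makes an otherwise integer column float.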