Python: Reading CSV using np.genfromtxt resulting in different number of columns

【Question title】: Python: Reading CSV using np.genfromtxt resulting in different number of columns
【Posted】: 2018-04-03 03:58:17
【Question】:

I am reading a csv using np.genfromtxt. I am not sure why it raises ValueError(errmsg) on this data. When I open the file in Excel, it shows a total of 23 columns for all 33 rows in the file.

Here is the code and the error:

import numpy as np
csv = np.genfromtxt(fname, delimiter=",", names=True)

Here is a snippet of the csv records:

,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_NN__alpha,param_NN__hidden_layer_sizes,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,split3_test_score,split3_train_score,split4_test_score,split4_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.34166226387023924,0.0010362625122070312,0.842927342927343,0.8468980402379758,0.1,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25,0.8420706295240185,0.8475292052871167,0.8398771660451854,0.8463774474853288,0.845360824742268,0.846158065046893,0.8385256691531373,0.8486892618185806,0.8488040377441299,0.8457362215519605,0.05093153997183547,0.00018195987247183776,0.0037378988316037944,0.0010747322296072162
1,0.5543142318725586,0.0018250465393066407,0.8465250965250966,0.8527554135893668,0.1,"(25, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 7)}",5,0.846018863785918,0.8530137662480118,0.846018863785918,0.8589919376953875,0.8479929809168677,0.8496681840618658,0.8400614304519526,0.851486234506965,0.8525345622119815,0.8506169454346038,0.10835399357094619,0.00018853748087819175,0.004013613789285713,0.003306836154659678
2,0.5266880512237548,0.0013680458068847656,0.8437609687609687,0.8478413817137904,0.1,"(11, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (11, 7)}",17,0.842509322219785,0.8479679701639884,0.8354902390875192,0.8431964021280096,0.8455801710901514,0.8520265452750507,0.8433523475208424,0.851595919710431,0.8518762343647136,0.8444200712914725,0.1041624682160838,0.0003233587082439388,0.005278162504355272,0.0036030369022985215
3,0.49459095001220704,0.0011162281036376954,0.8406458406458407,0.845428443186931,0.1,"(7, 5)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7, 5)}",32,0.8383417416100022,0.848461580650469,0.8429480149155516,0.8501617945483464,0.8468962491774512,0.8514780891789612,0.8312856516015796,0.8381046396841066,0.8437568575817423,0.8389361118727722,0.10397613499936685,0.00018889068500539376,0.005421511394261151,0.005726975087304059
4,0.6175418376922608,0.0024899959564208983,0.8449017199017199,0.8508140227747922,0.1,"(25, 11, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 11, 7)}",11,0.8414125904803685,0.8493939560138211,0.8427286685676684,0.8546591345362804,0.8501864443957008,0.8519716996654417,0.8459850811759544,0.8564769112646704,0.8441957428132544,0.8415684123937482,0.1940231074769015,0.00047604030307216253,0.003049662553913791,0.005209439647677219

Error received:

ValueError: Some errors were detected !
    Line #2 (got 26 columns instead of 22)
    Line #3 (got 26 columns instead of 22)
    Line #4 (got 26 columns instead of 22)
    Line #5 (got 26 columns instead of 22)
    Line #6 (got 28 columns instead of 22)
    Line #7 (got 26 columns instead of 22)
    Line #8 (got 28 columns instead of 22)
    Line #9 (got 26 columns instead of 22)
    Line #10 (got 26 columns instead of 22)
    Line #11 (got 26 columns instead of 22)
    Line #12 (got 26 columns instead of 22)
    Line #13 (got 26 columns instead of 22)
    Line #14 (got 28 columns instead of 22)
    Line #15 (got 26 columns instead of 22)
    Line #16 (got 28 columns instead of 22)
    Line #17 (got 26 columns instead of 22)
    Line #18 (got 26 columns instead of 22)
    Line #19 (got 26 columns instead of 22)
    Line #20 (got 26 columns instead of 22)
    Line #21 (got 26 columns instead of 22)
    Line #22 (got 28 columns instead of 22)
    Line #23 (got 26 columns instead of 22)
    Line #24 (got 28 columns instead of 22)
    Line #25 (got 26 columns instead of 22)
    Line #26 (got 26 columns instead of 22)
    Line #27 (got 26 columns instead of 22)
    Line #28 (got 26 columns instead of 22)
    Line #29 (got 26 columns instead of 22)
    Line #30 (got 28 columns instead of 22)
    Line #31 (got 26 columns instead of 22)
    Line #32 (got 28 columns instead of 22)
    Line #33 (got 26 columns instead of 22)

【Question comments】:

Tags: python csv numpy genfromtxt

【Solution 1】:

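You are passing , as the delimiter, while many of your column values contain commas themselves. You would need to specify an explicit quotechar for this to work.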
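To make the mismatch concrete, here is a minimal sketch using only the standard library. The row is a shortened, illustrative version of the first data row (an assumption, not the full record). Splitting on every comma also counts the commas inside the quoted fields, while csv.reader treats each quoted field as a single value.

```python
import csv
import io

# Shortened, illustrative version of the first data row (assumption: the real
# row has more numeric columns, but the two quoted fields are the same).
line = """0,0.34,0.1,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25"""

# Splitting on every comma ignores the quotes, so the commas inside
# "(7,)" and inside the params dict each start a new "column".
naive_fields = line.split(",")

# csv.reader honors the default quotechar '"' and keeps quoted fields whole.
quoted_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), len(quoted_fields))
print(quoted_fields[3], "|", quoted_fields[4])
```

The naive split finds three more fields than the quote-aware parse, which is consistent with the inflated column counts reported in the error message.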

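Fortunately, pandas handles this very well without much hand-holding. You can try loading the data with read_csv and then converting the loaded DataFrame to an array.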

import pandas as pd
array = pd.read_csv(name, index_col=[0]).values

The loaded DataFrame (what you get before calling .values) looks like this:

df = pd.read_csv(name, index_col=[0])
print(df)

   mean_fit_time  mean_score_time  mean_test_score  mean_train_score  \
0       0.341662         0.001036         0.842927          0.846898   
1       0.554314         0.001825         0.846525          0.852755   
2       0.526688         0.001368         0.843761          0.847841   
3       0.494591         0.001116         0.840646          0.845428   
4       0.617542         0.002490         0.844902          0.850814   

   param_NN__alpha param_NN__hidden_layer_sizes  \
0              0.1                         (7,)   
1              0.1                      (25, 7)   
2              0.1                      (11, 7)   
3              0.1                       (7, 5)   
4              0.1                  (25, 11, 7)   

                                              params  rank_test_score  \
0  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               25   
1  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...                5   
2  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               17   
3  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               32   
4  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               11   

   split0_test_score  split0_train_score       ...         split2_test_score  \
0           0.842071            0.847529       ...                  0.845361   
1           0.846019            0.853014       ...                  0.847993   
2           0.842509            0.847968       ...                  0.845580   
3           0.838342            0.848462       ...                  0.846896   
4           0.841413            0.849394       ...                  0.850186   

   split2_train_score  split3_test_score  split3_train_score  \
0            0.846158           0.838526            0.848689   
1            0.849668           0.840061            0.851486   
2            0.852027           0.843352            0.851596   
3            0.851478           0.831286            0.838105   
4            0.851972           0.845985            0.856477   

   split4_test_score  split4_train_score  std_fit_time  std_score_time  \
0           0.848804            0.845736      0.050932        0.000182   
1           0.852535            0.850617      0.108354        0.000189   
2           0.851876            0.844420      0.104162        0.000323   
3           0.843757            0.838936      0.103976        0.000189   
4           0.844196            0.841568      0.194023        0.000476   

   std_test_score  std_train_score  
0        0.003738         0.001075  
1        0.004014         0.003307  
2        0.005278         0.003603  
3        0.005422         0.005727  
4        0.003050         0.005209  

[5 rows x 22 columns]

And yes, the columns are automatically converted to the appropriate data types.

print(df.dtypes)

mean_fit_time                   float64
mean_score_time                 float64
mean_test_score                 float64
mean_train_score                float64
param_NN__alpha                 float64
param_NN__hidden_layer_sizes     object
params                           object
rank_test_score                   int64
split0_test_score               float64
split0_train_score              float64
split1_test_score               float64
split1_train_score              float64
split2_test_score               float64
split2_train_score              float64
split3_test_score               float64
split3_train_score              float64
split4_test_score               float64
split4_train_score              float64
std_fit_time                    float64
std_score_time                  float64
std_test_score                  float64
std_train_score                 float64
dtype: object
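Two of the columns are object-typed (text), so an array built from the whole frame would presumably end up with dtype object as well. If a purely numeric array is what is needed, one possible approach is to keep only the numeric columns before converting. The sketch below assumes a shortened sample with just a few of the original columns; the variable names are illustrative.

```python
import io

import pandas as pd

# Shortened, illustrative sample with the same mix of column types
# (assumption: only a few of the original columns are kept).
sample = """,mean_fit_time,param_NN__alpha,param_NN__hidden_layer_sizes,params,rank_test_score
0,0.3417,0.1,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25
1,0.5543,0.1,"(25, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 7)}",5
"""
df = pd.read_csv(io.StringIO(sample), index_col=[0])

# Keep only the int/float columns; the two text columns are dropped.
numeric = df.select_dtypes(include="number")

# With only numeric columns left, the array gets a numeric dtype.
arr = numeric.to_numpy()
print(list(numeric.columns))
print(arr.dtype, arr.shape)
```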

Statutory warning: due to its nature, this data may be more useful to you as a Python list than as a numpy array (which is optimized for scalars).
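As a rough sketch of the list-based route, the standard library alone can read the rows into a list of dicts and turn the params text back into real Python objects. The sample is a shortened, illustrative version of the file (an assumption), and the use of ast.literal_eval is a suggestion here, not part of the original answer.

```python
import ast
import csv
import io

# Shortened, illustrative sample (assumption: only a few columns are kept).
sample = """,mean_fit_time,param_NN__alpha,param_NN__hidden_layer_sizes,params,rank_test_score
0,0.3417,0.1,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25
1,0.5543,0.1,"(25, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 7)}",5
"""

rows = []
for record in csv.DictReader(io.StringIO(sample)):
    # The params column holds the repr of a Python dict; literal_eval turns it
    # back into a dict without executing arbitrary code.
    record["params"] = ast.literal_eval(record["params"])
    rows.append(record)

# All other values are still strings at this point; convert them as needed.
print(rows[1]["params"]["NN__hidden_layer_sizes"])
```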

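【Discussion】:

I don't have any missing values, since the data frame is the output of grid search results..
@EB Note that your delimiter is a comma, but many of your cell values contain commas too ;-) pandas handles this well, by the way.
@COLDSPEED, I understand that, but can't I specify that the other fields are enclosed in quotes, if that is possible.. I was going to read the file and add double quotes around all fields to make them consistent, and then pass it to genfromtxt, but I couldn't find a way to tell genfromtxt that my columns are enclosed in double quotes.. if that is possible at all
@EB No, you would need np.genfromtxt to support quotechar functionality, and I don't believe it is able to do that. What is wrong with importing pandas and using it?
@COLDSPEED, yes, the pandas implementation works fine too.. I was just wondering about the strange behaviour of genfromtxt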