【Question title】: Python: Reading a CSV with np.genfromtxt results in a different number of columns
【Posted】: 2018-04-03 03:58:17
【Question】:

I am reading a CSV with np.genfromtxt, and I am not sure why it raises ValueError(errmsg) on this data. When I open the file in Excel, it shows 23 columns in total for all 33 rows.

Here are the code and the error:

import numpy as np
csv = np.genfromtxt(fname, delimiter=",", names=True)

Here is a snippet of the CSV records:

,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_NN__alpha,param_NN__hidden_layer_sizes,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,split3_test_score,split3_train_score,split4_test_score,split4_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.34166226387023924,0.0010362625122070312,0.842927342927343,0.8468980402379758,0.1,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25,0.8420706295240185,0.8475292052871167,0.8398771660451854,0.8463774474853288,0.845360824742268,0.846158065046893,0.8385256691531373,0.8486892618185806,0.8488040377441299,0.8457362215519605,0.05093153997183547,0.00018195987247183776,0.0037378988316037944,0.0010747322296072162
1,0.5543142318725586,0.0018250465393066407,0.8465250965250966,0.8527554135893668,0.1,"(25, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 7)}",5,0.846018863785918,0.8530137662480118,0.846018863785918,0.8589919376953875,0.8479929809168677,0.8496681840618658,0.8400614304519526,0.851486234506965,0.8525345622119815,0.8506169454346038,0.10835399357094619,0.00018853748087819175,0.004013613789285713,0.003306836154659678
2,0.5266880512237548,0.0013680458068847656,0.8437609687609687,0.8478413817137904,0.1,"(11, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (11, 7)}",17,0.842509322219785,0.8479679701639884,0.8354902390875192,0.8431964021280096,0.8455801710901514,0.8520265452750507,0.8433523475208424,0.851595919710431,0.8518762343647136,0.8444200712914725,0.1041624682160838,0.0003233587082439388,0.005278162504355272,0.0036030369022985215
3,0.49459095001220704,0.0011162281036376954,0.8406458406458407,0.845428443186931,0.1,"(7, 5)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7, 5)}",32,0.8383417416100022,0.848461580650469,0.8429480149155516,0.8501617945483464,0.8468962491774512,0.8514780891789612,0.8312856516015796,0.8381046396841066,0.8437568575817423,0.8389361118727722,0.10397613499936685,0.00018889068500539376,0.005421511394261151,0.005726975087304059
4,0.6175418376922608,0.0024899959564208983,0.8449017199017199,0.8508140227747922,0.1,"(25, 11, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 11, 7)}",11,0.8414125904803685,0.8493939560138211,0.8427286685676684,0.8546591345362804,0.8501864443957008,0.8519716996654417,0.8459850811759544,0.8564769112646704,0.8441957428132544,0.8415684123937482,0.1940231074769015,0.00047604030307216253,0.003049662553913791,0.005209439647677219

The error I get:

ValueError: Some errors were detected !
    Line #2 (got 26 columns instead of 22)
    Line #3 (got 26 columns instead of 22)
    Line #4 (got 26 columns instead of 22)
    Line #5 (got 26 columns instead of 22)
    Line #6 (got 28 columns instead of 22)
    Line #7 (got 26 columns instead of 22)
    Line #8 (got 28 columns instead of 22)
    Line #9 (got 26 columns instead of 22)
    Line #10 (got 26 columns instead of 22)
    Line #11 (got 26 columns instead of 22)
    Line #12 (got 26 columns instead of 22)
    Line #13 (got 26 columns instead of 22)
    Line #14 (got 28 columns instead of 22)
    Line #15 (got 26 columns instead of 22)
    Line #16 (got 28 columns instead of 22)
    Line #17 (got 26 columns instead of 22)
    Line #18 (got 26 columns instead of 22)
    Line #19 (got 26 columns instead of 22)
    Line #20 (got 26 columns instead of 22)
    Line #21 (got 26 columns instead of 22)
    Line #22 (got 28 columns instead of 22)
    Line #23 (got 26 columns instead of 22)
    Line #24 (got 28 columns instead of 22)
    Line #25 (got 26 columns instead of 22)
    Line #26 (got 26 columns instead of 22)
    Line #27 (got 26 columns instead of 22)
    Line #28 (got 26 columns instead of 22)
    Line #29 (got 26 columns instead of 22)
    Line #30 (got 28 columns instead of 22)
    Line #31 (got 26 columns instead of 22)
    Line #32 (got 28 columns instead of 22)
    Line #33 (got 26 columns instead of 22)

【Question discussion】:

    Tags: python csv numpy genfromtxt


    【Solution 1】:

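    You are passing , as the delimiter, but many of your column values themselves contain commas. The parser would need to honor an explicit quotechar for this to work.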
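To make the column counts in the error concrete, here is a minimal sketch using only the standard library. It compares a plain comma split with a quote-aware parse of one shortened record shaped like the question's data (the record is abbreviated for illustration). In the full 23-field rows, a one- or two-element tuple adds three stray commas (26 fields) and a three-element tuple adds five (28 fields), which matches the counts in the traceback.

```python
import csv
import io

# One shortened record in the same shape as the question's data:
# the two quoted fields contain commas of their own.
line = """0,0.1,"(25, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 7)}",5"""

# A plain split treats every comma as a field boundary.
naive_fields = line.split(",")

# csv.reader honors the double-quote character, so commas inside
# quoted fields stay part of the field.
quoted_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields))   # 8
print(len(quoted_fields))  # 5
print(quoted_fields[2])    # (25, 7)
```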

    Fortunately, pandas handles this quite well without much hand-holding. You can load the data with read_csv and then convert the resulting DataFrame to an array.

    import pandas as pd
    array = pd.read_csv(name, index_col=[0]).values
    

    The loaded DataFrame (what you get before calling .values) looks like this:

    df = pd.read_csv(name, index_col=[0])
    print(df)
    
       mean_fit_time  mean_score_time  mean_test_score  mean_train_score  \
    0       0.341662         0.001036         0.842927          0.846898   
    1       0.554314         0.001825         0.846525          0.852755   
    2       0.526688         0.001368         0.843761          0.847841   
    3       0.494591         0.001116         0.840646          0.845428   
    4       0.617542         0.002490         0.844902          0.850814   
    
       param_NN__alpha param_NN__hidden_layer_sizes  \
    0              0.1                         (7,)   
    1              0.1                      (25, 7)   
    2              0.1                      (11, 7)   
    3              0.1                       (7, 5)   
    4              0.1                  (25, 11, 7)   
    
                                                  params  rank_test_score  \
    0  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               25   
    1  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...                5   
    2  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               17   
    3  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               32   
    4  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               11   
    
       split0_test_score  split0_train_score       ...         split2_test_score  \
    0           0.842071            0.847529       ...                  0.845361   
    1           0.846019            0.853014       ...                  0.847993   
    2           0.842509            0.847968       ...                  0.845580   
    3           0.838342            0.848462       ...                  0.846896   
    4           0.841413            0.849394       ...                  0.850186   
    
       split2_train_score  split3_test_score  split3_train_score  \
    0            0.846158           0.838526            0.848689   
    1            0.849668           0.840061            0.851486   
    2            0.852027           0.843352            0.851596   
    3            0.851478           0.831286            0.838105   
    4            0.851972           0.845985            0.856477   
    
       split4_test_score  split4_train_score  std_fit_time  std_score_time  \
    0           0.848804            0.845736      0.050932        0.000182   
    1           0.852535            0.850617      0.108354        0.000189   
    2           0.851876            0.844420      0.104162        0.000323   
    3           0.843757            0.838936      0.103976        0.000189   
    4           0.844196            0.841568      0.194023        0.000476   
    
       std_test_score  std_train_score  
    0        0.003738         0.001075  
    1        0.004014         0.003307  
    2        0.005278         0.003603  
    3        0.005422         0.005727  
    4        0.003050         0.005209  
    
    [5 rows x 22 columns]
    

    And yes, the columns are automatically cast to the appropriate data types.

    print(df.dtypes)
    
    mean_fit_time                   float64
    mean_score_time                 float64
    mean_test_score                 float64
    mean_train_score                float64
    param_NN__alpha                 float64
    param_NN__hidden_layer_sizes     object
    params                           object
    rank_test_score                   int64
    split0_test_score               float64
    split0_train_score              float64
    split1_test_score               float64
    split1_train_score              float64
    split2_test_score               float64
    split2_train_score              float64
    split3_test_score               float64
    split3_train_score              float64
    split4_test_score               float64
    split4_train_score              float64
    std_fit_time                    float64
    std_score_time                  float64
    std_test_score                  float64
    std_train_score                 float64
    dtype: object
    

    Statutory warning: given its nature, this data may be more useful to you as a Python list than as a NumPy array (which is optimized for scalars).
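As a rough illustration of the list-based alternative, the sketch below reads a shortened sample with csv.DictReader and turns the text in the tuple and dict columns back into Python objects with ast.literal_eval. The sample keeps only three of the original columns, the scores are truncated, and the `records` structure is an assumption made for this example, not part of the original answer.

```python
import ast
import csv
import io

# Shortened sample in the shape of the question's file (fewer columns,
# truncated scores); the first header field is empty, as in the original.
raw_text = """\
,mean_test_score,param_NN__hidden_layer_sizes,params
0,0.8429,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}"
1,0.8465,"(25, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 7)}"
"""

records = []
for row in csv.DictReader(io.StringIO(raw_text)):
    records.append({
        "mean_test_score": float(row["mean_test_score"]),
        # literal_eval safely parses Python literals such as tuples and dicts.
        "hidden_layer_sizes": ast.literal_eval(row["param_NN__hidden_layer_sizes"]),
        "params": ast.literal_eval(row["params"]),
    })

print(records[1]["hidden_layer_sizes"])  # (25, 7)
```

Each entry in `records` is then an ordinary dict holding a float, a tuple, and a nested dict, which a NumPy array of scalars cannot represent directly.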

    【Discussion】:

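    • I do not have any missing values, since the DataFrame is the output of a grid search..
    • @EB Note that your delimiter is a comma, but many of your cell values also contain commas ;-) pandas handles this just fine, by the way.
    • @COLDSPEED, I understand that, but is there no way to specify that the other fields are enclosed in quotes? I was planning to read the file, add double quotes around every field to make it consistent, and then pass it to genfromtxt, but I could not find a way to tell genfromtxt that my columns are wrapped in double quotes.. if that is possible at all.
    • @EB No, you would need np.genfromtxt to support a quotechar feature, and I do not believe it is capable of that. What is wrong with importing pandas and using it?
    • @COLDSPEED, yes, the pandas implementation works fine too.. I was just wondering about the odd behavior of genfromtxt.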
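On the question raised in the comments about getting genfromtxt to accept quoted fields: one possible workaround, which is not part of the original answer, is to let the standard csv module resolve the quoting first and then hand genfromtxt lines re-joined with a delimiter that does not occur in the data. The sketch below assumes no field contains a tab, uses a small hypothetical sample (the column names idx, alpha, sizes, and rank are made up), and gives the index column a name, which the question's file does not.

```python
import csv
import io

import numpy as np

# Hypothetical sample; the quoted "sizes" column contains commas.
raw = io.StringIO(
    'idx,alpha,sizes,rank\n'
    '0,0.1,"(7,)",25\n'
    '1,0.1,"(25, 7)",5\n'
)

# csv.reader resolves the quoting; re-join each record with tabs,
# assuming no field contains a tab character.
tab_lines = ["\t".join(fields) for fields in csv.reader(raw)]

# genfromtxt accepts a list of strings. usecols skips the non-numeric
# "sizes" column, so the remaining columns parse as floats.
data = np.genfromtxt(tab_lines, delimiter="\t", names=True, usecols=(0, 1, 3))

print(data.dtype.names)  # ('idx', 'alpha', 'rank')
print(data["rank"])
```

This keeps genfromtxt in the pipeline, at the cost of dropping the tuple-valued column; pandas.read_csv, as the answer suggests, avoids the extra step.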