【Question title】:Pandas HDF5 store unicode error on select query
【Posted】:2021-04-02 14:14:49
【Question】:

I have unicode data that I read from this file:

Mdt,Doccompra,OrgC,Cen,NumP,Criadopor,Dtcriacao,Fornecedor,P,Fun
400,8751215432,2581,,1,MIGRAÇÃO,01.10.2004,75852214,,TD
400,5464282154,9874,,1,MIGRAÇÃO,01.10.2004,78995411,,FO

I have two problems:

  1. When I try to query this unicode data, I get a UnicodeDecodeError:

    Traceback (most recent call last):
      File "<ipython-input-1-4423dceb2b1d>", line 1, in <module>
        runfile('C:/Users/u5en/Documents/SAP/Programação/Problema HDF.py', wdir='C:/Users/u5en/Documents/SAP/Programação')
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
        execfile(filename, namespace)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 48, in execfile
        exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
    
      File "C:/Users/u5en/Documents/SAP/Programação/Problema HDF.py", line 15, in <module>
        store.select("EKKA", "columns=['Mdt', 'Fornecedor']")
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 665, in select
        return it.get_result()
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
        results = self.func(self.start, self.stop, where)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 658, in func
        columns=columns, **kwargs)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 3968, in read
        if not self.read_axes(where=where, **kwargs):
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 3201, in read_axes
        a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2058, in convert
        self.data, nan_rep=nan_rep, encoding=encoding)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4359, in _unconvert_string_array
        data = f(data)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 1700, in __call__
        return self._vectorize_call(func=func, args=vargs)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 1769, in _vectorize_call
        outputs = ufunc(*inputs)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4358, in <lambda>
        f = np.vectorize(lambda x: x.decode(encoding), otypes=[np.object])
    
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 7: unexpected end of data
    

How can I store and query my unicode data in HDF5?

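  2. I have many tables whose column names I do not know in advance and which are not valid PyTables names (NaturalNameWarning). I want users to be able to query these columns, so how can I query them when their names get in the way? I saw that this used to have no easy fix, so if that is still the case I will strip the offending characters from the headers.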

    import csv
    import pandas as pd
    dados = pd.read_csv("EKKA - Cópia.csv")
    print(dados)
    store= pd.HDFStore('teste.h5' , encoding="utf-8")
    store.append("EKKA", dados, format="table", data_columns=True)
    store.select("EKKA", "columns=['Mdt', 'Fornecedor']")
    store.close()
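
For the second problem, the fallback mentioned above (stripping offending characters from the headers) could look roughly like the sketch below. It assumes that a "natural" name is an ASCII Python identifier that is not a reserved word; `to_natural_name`, `rename_map` and the sample headers are hypothetical names made up for illustration, not part of pandas or PyTables.

```python
import keyword
import re


def to_natural_name(name):
    """Return a version of `name` that is a valid ASCII Python identifier."""
    # Replace every character that is not an ASCII letter, digit or underscore
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", name)
    # Identifiers cannot be empty or start with a digit
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    # Avoid reserved words such as "for" or "class"
    if keyword.iskeyword(cleaned):
        cleaned += "_"
    return cleaned


def rename_map(columns):
    """Map each original header to a unique sanitized name."""
    mapping = {}
    used = set()
    for col in columns:
        base = to_natural_name(col)
        candidate = base
        suffix = 1
        # Two different headers may sanitize to the same name; disambiguate
        while candidate in used:
            candidate = f"{base}_{suffix}"
            suffix += 1
        used.add(candidate)
        mapping[col] = candidate
    return mapping


# Hypothetical headers, chosen to exercise each rule
headers = ["Mdt", "Dt criação", "Dt-criacao", "2nd col", "for"]
mapping = rename_map(headers)
print(mapping)
```

The resulting dict can be passed to `DataFrame.rename(columns=mapping)` before appending to the store, and kept around so that the names users type can be translated to the stored ones.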
    

Would it be better to do this in sqlite?
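
For comparison only, here is a minimal sketch of what the sqlite route could look like with the standard-library `sqlite3` module. It is not a recommendation either way. It relies on two things: SQLite accepts arbitrary column names when they are double-quoted, and Python's `sqlite3` stores `str` values as text. The table name `ekka`, the helper `quote_ident` and the header "Criado por" (with a space) are assumptions made up for illustration.

```python
import sqlite3


def quote_ident(name):
    """Quote an SQL identifier: double embedded quotes, wrap in double quotes."""
    return '"' + name.replace('"', '""') + '"'


# "Criado por" is a hypothetical awkward header (the sample file has "Criadopor")
columns = ["Mdt", "Doccompra", "Criado por", "Fornecedor"]
rows = [
    (400, "8751215432", "MIGRAÇÃO", "75852214"),
    (400, "5464282154", "MIGRAÇÃO", "78995411"),
]

conn = sqlite3.connect(":memory:")
col_sql = ", ".join(quote_ident(c) for c in columns)
conn.execute(f"CREATE TABLE ekka ({col_sql})")

# Values always go through placeholders; only identifiers are interpolated
placeholders = ", ".join("?" for _ in columns)
conn.executemany(f"INSERT INTO ekka VALUES ({placeholders})", rows)

query = (
    f"SELECT {quote_ident('Mdt')}, {quote_ident('Criado por')} "
    f"FROM ekka WHERE {quote_ident('Criado por')} = ?"
)
result = conn.execute(query, ("MIGRAÇÃO",)).fetchall()
conn.close()
print(result)
```

Quoting identifiers this way lets user-supplied column names be used as they are, at the cost of building the SQL text by hand.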

Environment:

  • Windows 7 64-bit
  • pandas 0.15.2
  • NumPy 1.9.2

【Question comments】:

    Tags: unicode pandas hdf5


    【Solution 1】:

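    On Windows 7 with Python 2.7 and pandas 0.15.2, everything works as expected and no encoding is needed. On Python 3.4, however, the following works for me. Apparently some of the characters are not representable in 'utf-8'; the 'latin1' encoding usually solves these issues. Note that I had to read the csv with this encoding in the first place.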

    >>> df = pd.read_csv('../../test.csv',encoding='latin1')
    >>> df
       Mdt   Doccompra  OrgC  Cen  NumP Criadopor   Dtcriacao  Fornecedor   P Fun
    0  400  8751215432  2581  NaN     1  MIGRAÇ\xc3O  01.10.2004    75852214 NaN  TD
    1  400  5464282154  9874  NaN     1  MIGRAÇ\xc3O  01.10.2004    78995411 NaN  FO
    

    Furthermore, the encoding must be specified not when opening the store, but in the append/put call:

    >>> df.to_hdf('test.h5','df',format='table',mode='w',data_columns=True,encoding='latin1')
    
    >>> pd.read_hdf('test.h5','df')
       Mdt   Doccompra  OrgC  Cen  NumP Criadopor   Dtcriacao  Fornecedor   P Fun
    0  400  8751215432  2581  NaN     1  MIGRAÇ\xc3O  01.10.2004    75852214 NaN  TD
    1  400  5464282154  9874  NaN     1  MIGRAÇ\xc3O  01.10.2004    78995411 NaN  FO
    

    Once the data has been written with an encoding, there is no need to specify the encoding when reading.
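
The byte offset in the question's traceback (byte 0xc3 at position 7, "unexpected end of data") is also consistent with a UTF-8 string that was cut short, which may be why 'latin1' helps. The sketch below reproduces that message in plain Python. It assumes, without confirmation from the source, that the stored field was 8 bytes wide, the character length of MIGRAÇÃO; the `width` variable is that assumption.

```python
value = "MIGRAÇÃO"

utf8 = value.encode("utf-8")      # Ç and Ã take two bytes each in UTF-8
latin1 = value.encode("latin1")   # every character is one byte in latin1
print(len(value), len(utf8), len(latin1))

# Assumption: the field is only as wide as the number of characters
width = len(value)
truncated = utf8[:width]          # cuts the two-byte sequence for Ã in half

try:
    truncated.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
print(error_message)

# The latin1 bytes fit in the same width, so they survive the round trip
round_trip = latin1[:width].decode("latin1")
print(round_trip)
```

Under that assumption, any single-byte encoding that covers the data would avoid the truncation, which fits the observation that writing with `encoding='latin1'` works.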

    【Comments】:
