【Question title】: Ignore backslash when reading tsv file in python
【Posted】: 2016-08-12 18:28:14
【Question description】:

I have a large tsv with sep="|" whose address field contains a bunch of values like the following:

...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...

which results in:

line1)  ...xxx|yyy|Level 1 2 xxx Street\
line2)  (MYCompany)|...

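I tried running read_table in Pandas with quote=2 to turn non-numeric values into strings, but it still treats the backslash as a new line. What is an efficient way to handle rows where a field value contains a backslash that escapes to a new line? Is there a way to ignore the new line after the \?

Ideally, this would prepare the data file so that it can be read into a dataframe in pandas.

Update: showing 5 lines, with the breakage on line 3.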

1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie

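【Question discussion】:

  • Could you provide 3-4 sample lines from your tsv along with the code you are currently running?
  • Sure, I added 4 sample lines showing what the tsv looks like, including the row that the backslash splits, with the rest of the row continuing on a new line.
  • Why are you adding those newlines there? It doesn't make any sense to me; just keep one newline per actual tsv row and you avoid this whole mess.
  • I didn't add it; it's in the file I'm reading and I'm trying to ignore it :)
  • It's a DB dump; I'm dealing with tsv madness.

Tags: python csv python-3.x pandas dataframe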


【Solution 1】:

Here is another solution, using a regular expression:

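import pandas as pd
import re

# Read the whole file into memory
with open('input.tsv') as f:
    fl = f.read()

# Replace a backslash followed by a newline with a single backslash,
# which rejoins the rows that were split in two
fl = re.sub('\\\\\n', '\\\\', fl)
with open('input_fix.tsv', 'w') as o:
    o.write(fl)

cols = range(1, 17)
# Prime the number of columns by specifying names for each column
# This takes care of the issue of variable number of columns
# Read the repaired file (read_csv expects a path or file object, not the text)
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)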

This produces a DataFrame in which the split row is rejoined into a single record.
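
Since the question mentions a large file, the same repair can probably be done one line at a time instead of reading everything into memory. The following is only a sketch: `join_continued_lines` is a hypothetical helper name, and unlike the regex above it drops the backslash as well as the newline, which is a choice rather than a requirement.

```python
import io


def join_continued_lines(src, dst):
    """Copy src to dst, merging any line that ends with a backslash
    with the line that follows it (the backslash itself is dropped)."""
    for line in src:
        if line.endswith("\\\n"):
            # Drop the trailing backslash and newline so that the next
            # line is appended directly to this one.
            dst.write(line[:-2])
        else:
            dst.write(line)


# Small in-memory demonstration; with real files, pass open file objects.
src = io.StringIO("a|b|Street\\\n(My Company)|c\nd|e|f|g\n")
dst = io.StringIO()
join_continued_lines(src, dst)
fixed = dst.getvalue()
print(fixed)
```

With real data, `src` and `dst` would be the opened 'input.tsv' and 'input_fix.tsv' from the code above, and the repaired file can then be passed to read_csv with sep='|'.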

【Discussion】:

    【Solution 2】:

    I think you can first try read_csv with a sep character that does NOT occur in the values; it seems to read the data correctly:

    import pandas as pd
    import io
    
    temp=u"""
    49  XXX  Ave|Australia
    u7  38-46 South Street|Australia
    XXX Margaret Street\
    New South Wales|Australia
    Po box ZZZ|Australia"""
    # after testing, replace io.StringIO(temp) with the filename
    df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
    print(df)
                                                  0
    0                        49  XXX  Ave|Australia
    1              u7  38-46 South Street|Australia
    2  XXX Margaret StreetNew South Wales|Australia
    3                          Po box ZZZ|Australia
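
One caveat about the test string `temp` above: in an ordinary (non-raw) Python string literal, a backslash followed by a newline is a line continuation, so the two lines are already joined before pandas sees them. A file on disk still contains both characters, so it may not behave the same way. A minimal sketch to check this, using a raw string to model the file contents (the variable names `plain` and `raw` are illustrative only):

```python
# Ordinary literal: backslash + newline is consumed by the Python parser.
plain = """Street\
(My Company)|Australia"""

# Raw literal: the backslash and the newline are both kept, which is
# closer to what a file on disk actually contains.
raw = r"""Street\
(My Company)|Australia"""

print(repr(plain))
print(repr(raw))
```

When testing a fix against an inline sample, a raw string is therefore the safer stand-in for the real file.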
    

    Then you can create a new file with to_csv and read it back with read_csv using sep="|":

    df.to_csv('myfile.csv', header=False, index=False)

    print(pd.read_csv('myfile.csv', sep="|", header=None))
                                        0          1
    0                        49  XXX  Ave  Australia
    1              u7  38-46 South Street  Australia
    2  XXX Margaret StreetNew South Wales  Australia
    3                          Po box ZZZ  Australia
    

    The next solution does not create a new file; it writes to the variable output instead, which is then passed to read_csv through io.StringIO:

    import pandas as pd
    import io
    
    temp=u"""
    49  XXX  Ave|Australia
    u7  38-46 South Street|Australia
    XXX Margaret Street\
    New South Wales|Australia
    Po box ZZZ|Australia"""
    # after testing, replace io.StringIO(temp) with the filename
    df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
    print(df)
                                                  0
    0                        49  XXX  Ave|Australia
    1              u7  38-46 South Street|Australia
    2  XXX Margaret StreetNew South Wales|Australia
    3                          Po box ZZZ|Australia
    
    output = df.to_csv(header=False, index=False)
    print(output)
    49  XXX  Ave|Australia
    u7  38-46 South Street|Australia
    XXX Margaret StreetNew South Wales|Australia
    Po box ZZZ|Australia
    
    print(pd.read_csv(io.StringIO(output), sep="|", header=None))
                                        0          1
    0                        49  XXX  Ave  Australia
    1              u7  38-46 South Street  Australia
    2  XXX Margaret StreetNew South Wales  Australia
    3                          Po box ZZZ  Australia
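
The regular-expression repair and the io.StringIO approach can probably be combined, so that neither an intermediate file nor a second pass through pandas is needed. This is a sketch under assumptions: the sample is a raw string standing in for the file contents, and the backslash is removed together with the newline.

```python
import io
import re

import pandas as pd

# Raw string so the backslash and the newline after it are kept,
# modelling what the file on disk contains (sample rows from the answer).
text = r"""49  XXX  Ave|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""

# Remove each backslash-newline pair so the broken row becomes one line.
fixed = re.sub(r"\\\n", "", text)

df = pd.read_csv(io.StringIO(fixed), sep="|", header=None)
print(df)
```

For a real file, `text` would come from reading the file, which holds the whole dump in memory at once.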
    

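    When I test this on your data, it seems that rows 1 and 2 have 14 fields, while the next two rows have 15 fields.

    So I removed the last item from both rows (3 and 4); maybe it is just a typo (I hope so):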

    import pandas as pd
    
    import io
    
    temp=u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
    1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
    1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
    (My Company)|Australia|New South Wales|2000|Sydney
    1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""
    # after testing, replace io.StringIO(temp) with the filename
    df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
    print(df)
                                                       0
    0  1788768|1831171|208434489|2014-08-14 13:40:02|...
    1  1788772|1831177|202234489|2014-08-14 13:41:37|...
    2  1788776|1831182|205234489|2014-08-14 13:42:41|...
    3  1788780|1831186|202634489|2014-08-14 13:43:46|...
    
    output = df.to_csv(header=False, index=False)

    print(pd.read_csv(io.StringIO(output), sep="|", header=None))
            0        1          2                    3    4  5   6        7   \
    0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c NaN  Desktop   
    1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c NaN      iOS   
    2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c NaN  Desktop   
    3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c NaN  Desktop   
    
           8                                      9          10               11  \
    0  coupon                           49  XXX  Ave  Australia         Victoria   
    1     NaN                 u7  38-46 South Street  Australia  New South Wales   
    2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales   
    3     NaN                             Po box ZZZ  Australia  New South Wales   
    
         12         13  
    0  3025  Melbourne  
    1  2116     Sydney  
    2  2000     Sydney  
    3  2444  NSW Other  
    

    But if the data is correct as given, add the parameter names=range(15) to read_csv:

    print(pd.read_csv(io.StringIO(output), sep="|", names=range(15)))
            0        1          2                    3    4  5   6        7   \
    0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c NaN  Desktop   
    1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c NaN      iOS   
    2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c NaN  Desktop   
    3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c NaN  Desktop   
    
           8                                      9          10               11  \
    0  coupon                           49  XXX  Ave  Australia         Victoria   
    1     NaN                 u7  38-46 South Street  Australia  New South Wales   
    2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales   
    3     NaN                             Po box ZZZ  Australia  New South Wales   
    
         12         13              14  
    0  3025  Melbourne             NaN  
    1  2116     Sydney             NaN  
    2  2000     Sydney          Sydney  
    3  2444  NSW Other  Port Macquarie  
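
The value 15 was found by inspecting the sample rows. For a large file, the widest row could instead be found with a quick scan before calling read_csv. A minimal sketch, assuming the repaired data has no quoted fields that contain the separator; `max_field_count` is a hypothetical helper, not part of pandas:

```python
import io


def max_field_count(lines, sep="|"):
    """Return the largest number of sep-separated fields on any non-blank line."""
    return max(
        line.rstrip("\n").count(sep) + 1
        for line in lines
        if line.strip()
    )


# Tiny stand-in for the repaired data: one 4-field row and one 5-field row.
sample = io.StringIO(
    "1|2|3|Melbourne\n"
    "4|5|6|Sydney|Sydney\n"
)
n_cols = max_field_count(sample)
print(n_cols)
```

The result can then be passed as names=range(n_cols), so that shorter rows are padded with NaN as in the output above.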
    

    【Discussion】:
