Ignoring backslashes when reading a TSV file in Python

【Question title】: Ignore backslash when reading tsv file in python
【Posted】: 2016-08-12 18:28:14
【Question description】:

I have a large sep="|" tsv whose address field contains a number of values like the following:

...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...

This ends up split across two lines:

line1)  ...xxx|yyy|Level 1 2 xxx Street\
line2)  (MYCompany)|...

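I tried running read_table in Pandas with quote=2 to turn non-numeric values into strings, but it still treats the backslash as a new line. What is an efficient way to handle rows where a field contains a backslash followed by a line break? Is there a way to ignore the new line after \ ?

Ideally, the fix would prepare the data file so that it can be read into a dataframe in pandas.

Update: 5 lines shown below, with the break at line 3.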

1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie

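【Question discussion】:

Could you provide 3-4 sample lines from your tsv, along with the code you are currently running?
Sure, I've added 4 sample rows showing what the tsv looks like, including the row where the backslash splits the line and the rest of the row continues on a new line.
Why are you adding those newlines in there? It makes no sense to me; just keep one newline per actual tsv row and you avoid this whole mess.
I'm not adding them; they're in the file I'm reading, and I'm trying to ignore them :)
This is a DB dump, and I'm dealing with the craziness of the tsv.

Tags: python csv python-3.x pandas dataframe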

【Solution 1】:

Here is another solution using a regular expression:

import re
import pandas as pd

# Read the whole file into memory
with open('input.tsv') as f:
    fl = f.read()

# Replace backslash + newline with a single backslash, rejoining the split rows
fl = re.sub('\\\\\n', '\\\\', fl)

# Write the repaired content to a new file
with open('input_fix.tsv', 'w') as o:
    o.write(fl)

cols = range(1, 17)
# Prime the number of columns by specifying names for each column.
# This takes care of the issue of a variable number of columns.
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)

This produces the following result:

【Discussion】:
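
A variant of the same idea, as a minimal sketch rather than a tested drop-in: instead of loading the whole file and running a regex, the continuation lines can be joined while streaming, which may matter for a large dump. The helper name `join_continued_lines` and the shortened sample rows are made up for illustration. Note that, unlike the regex above, this sketch drops the trailing backslash instead of keeping it.

```python
import io

import pandas as pd


def join_continued_lines(lines):
    """Yield logical rows, merging each physical line that ends with a
    backslash with the line that follows it (the backslash is dropped)."""
    pending = ""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.endswith("\\"):
            # Row continues on the next physical line: buffer it.
            pending += line[:-1]
            continue
        yield pending + line
        pending = ""
    if pending:
        # File ended right after a continuation backslash.
        yield pending


# Shortened, hypothetical sample in the same shape as the question's data.
sample = (
    "1|Desktop|49  XXX  Ave|Australia\n"
    "2|Desktop|Level XXX Margaret Street\\\n"
    "(My Company)|Australia\n"
    "3|Desktop|Po box ZZZ|Australia\n"
)

fixed = "\n".join(join_continued_lines(io.StringIO(sample)))
df = pd.read_csv(io.StringIO(fixed), sep="|", header=None)
print(df)
```

For a real file, an open file object (for example `open('input.tsv')`) can be passed to `join_continued_lines` in place of `io.StringIO(sample)`, since both yield one physical line at a time.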

【Solution 2】:

I think you can first try read_csv with a sep character that does NOT occur in the values; it seems to read the data correctly:

import pandas as pd
import io

temp=u"""
49  XXX  Ave|Australia
u7  38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# After testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
print(df)
                                              0
0                        49  XXX  Ave|Australia
1              u7  38-46 South Street|Australia
2  XXX Margaret StreetNew South Wales|Australia
3                          Po box ZZZ|Australia

Then you can use to_csv and read_csv with sep="|" to create a new file:

df.to_csv('myfile.csv', header=False, index=False)

print(pd.read_csv('myfile.csv', sep="|", header=None))
                                    0          1
0                        49  XXX  Ave  Australia
1              u7  38-46 South Street  Australia
2  XXX Margaret StreetNew South Wales  Australia
3                          Po box ZZZ  Australia

The next solution does not create a new file; it writes to the variable output and then passes that to read_csv through io.StringIO:

import pandas as pd
import io

temp=u"""
49  XXX  Ave|Australia
u7  38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# After testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
                                              0
0                        49  XXX  Ave|Australia
1              u7  38-46 South Street|Australia
2  XXX Margaret StreetNew South Wales|Australia
3                          Po box ZZZ|Australia

output = df.to_csv(header=False, index=False)
print(output)
49  XXX  Ave|Australia
u7  38-46 South Street|Australia
XXX Margaret StreetNew South Wales|Australia
Po box ZZZ|Australia

print(pd.read_csv(io.StringIO(u""+output), sep="|", header=None))
                                    0          1
0                        49  XXX  Ave  Australia
1              u7  38-46 South Street  Australia
2  XXX Margaret StreetNew South Wales  Australia
3                          Po box ZZZ  Australia
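
One caveat about the `temp` test strings above is worth checking before relying on this on a real file. In a non-raw Python string literal, a backslash immediately followed by a line break is a line continuation and is removed by Python itself, so the joined row in the output may come from the string literal rather than from pandas. With default settings (no `escapechar`), pandas is not expected to treat a backslash specially. A minimal sketch of the difference, using the hypothetical names `plain` and `raw`:

```python
# Non-raw literal: Python drops the backslash and the line break.
plain = """Street\
(My Company)"""

# Raw literal: the backslash and the line break are both kept,
# which is what a file read from disk would actually contain.
raw = r"""Street\
(My Company)"""

print(repr(plain))  # 'Street(My Company)'
print(repr(raw))    # 'Street\\\n(My Company)'
```

To reproduce the on-disk situation in a test, a raw string (or an explicit `"\\\n"`) is the safer choice.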

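If I test it on your data, it seems that rows 1 and 2 have 14 fields and the next two have 15 fields.

So I removed the last item from both of those rows (3 and 4); maybe it is just a typo (I hope so):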

import pandas as pd

import io

temp=u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""
# After testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
                                                   0
0  1788768|1831171|208434489|2014-08-14 13:40:02|...
1  1788772|1831177|202234489|2014-08-14 13:41:37|...
2  1788776|1831182|205234489|2014-08-14 13:42:41|...
3  1788780|1831186|202634489|2014-08-14 13:43:46|...

output = df.to_csv(header=False, index=False)

print(pd.read_csv(io.StringIO(u""+output), sep="|", header=None))
        0        1          2                    3    4  5   6        7   \
0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c NaN  Desktop   
1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c NaN      iOS   
2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c NaN  Desktop   
3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c NaN  Desktop   

       8                                      9          10               11  \
0  coupon                           49  XXX  Ave  Australia         Victoria   
1     NaN                 u7  38-46 South Street  Australia  New South Wales   
2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales   
3     NaN                             Po box ZZZ  Australia  New South Wales   

     12         13  
0  3025  Melbourne  
1  2116     Sydney  
2  2000     Sydney  
3  2444  NSW Other

But if the data is correct, add the parameter names=range(15) to read_csv:

print(pd.read_csv(io.StringIO(u""+output), sep="|", names=range(15)))
        0        1          2                    3    4  5   6        7   \
0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c NaN  Desktop   
1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c NaN      iOS   
2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c NaN  Desktop   
3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c NaN  Desktop   

       8                                      9          10               11  \
0  coupon                           49  XXX  Ave  Australia         Victoria   
1     NaN                 u7  38-46 South Street  Australia  New South Wales   
2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales   
3     NaN                             Po box ZZZ  Australia  New South Wales   

     12         13              14  
0  3025  Melbourne             NaN  
1  2116     Sydney             NaN  
2  2000     Sydney          Sydney  
3  2444  NSW Other  Port Macquarie
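
To decide whether the extra field is a typo or real data, it may help to count the fields in each logical row first. A minimal sketch, assuming the backslash-split lines have already been rejoined; `field_counts` is a hypothetical helper name, and the rows are the four from the question:

```python
from collections import Counter


def field_counts(rows, sep="|"):
    """Map 'number of fields' -> 'how many rows have that many fields'."""
    return Counter(len(row.split(sep)) for row in rows if row)


rows = [
    "1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne",
    "1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney",
    "1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street(My Company)|Australia|New South Wales|2000|Sydney|Sydney",
    "1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie",
]

counts = field_counts(rows)
print(sorted(counts.items()))  # [(14, 2), (15, 2)]

# The widest row gives a safe value for names=range(n) in read_csv.
max_fields = max(counts)
```

Here `max_fields` is 15, which matches the names=range(15) used above.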

【Discussion】: