Import text data having spaces using pandas.read_csv - Answers

【Question title】: Import text data having spaces using pandas.read_csv
【Posted】: 2018-12-19 10:12:04
【Question description】:

I want to use pandas.read_csv to import a text file:

1541783101     8901951488  file.log             12345  123456
1541783401     21872967680  other file.log       23456     123
1541783701     3  third file.log 23456     123

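The difficulty here is that the columns are separated by one or more spaces, but one column contains file names that themselves contain spaces. So I can't use sep=r"\s+" to identify the columns, because that fails at the first file name containing a space. The file format does not have fixed column widths.

However, every file name ends with ".log". I could write a separate regular expression to match each column. Is it possible to use these to identify the columns to import? Or is it possible to write a separator regular expression that selects all characters not matched by any of the column regexes?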
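
A minimal sketch of the per-column regex idea described above, assuming every line has exactly five fields and the file name is the only field that may contain spaces. The names COLUMN_PATTERNS, ROW_RE and parse_line are illustrative only and are not part of pandas; the parsing itself uses just the standard re module.

```python
import re

# Hypothetical per-column patterns, in column order. Only the file name
# (third field) is allowed to contain spaces; it must end in ".log".
COLUMN_PATTERNS = [
    r"\d+",       # first numeric field
    r"\d+",       # second numeric field
    r".*?\.log",  # file name, may contain spaces
    r"\d+",       # fourth field
    r"\d+",       # fifth field
]

# Wrap each pattern in a capture group and join them with "one or more whitespace".
ROW_RE = re.compile(
    r"\s*" + r"\s+".join(f"({p})" for p in COLUMN_PATTERNS) + r"\s*"
)


def parse_line(line):
    """Return the five fields of one line as strings, or raise ValueError."""
    match = ROW_RE.fullmatch(line)
    if match is None:
        raise ValueError(f"unparseable line: {line!r}")
    return list(match.groups())


sample = """\
1541783101     8901951488  file.log             12345  123456
1541783401     21872967680  other file.log       23456     123
1541783701     3  third file.log 23456     123
"""

# Skip blank lines and parse the rest.
rows = [parse_line(line) for line in sample.splitlines() if line.strip()]
```

The resulting rows list can be handed to pd.DataFrame(rows, columns=[...]); all values are still strings at that point, so numeric columns would need converting, for example with pd.to_numeric. pandas' Series.str.extract also accepts a pattern with capture groups, so the same pattern could in principle be applied to a Series of raw lines instead.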

【Comments】:

Use the delimiter parameter in read_csv?
Have you double-checked that '\t' isn't used as the delimiter in the raw data?

Tags: python pandas regex-negation

【Solution 1】:

Answer to the updated question -

Here is code that does not fail regardless of the data widths. You can modify it to suit your needs.

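import numpy as np
import pandas as pd

# Read every line into a single string column (assumes the file contains no tabs,
# which are read_table's default separator)
df = pd.read_table('file.txt', header=None)

# Replace uneven runs of whitespace with a single space
df = df[0].apply(lambda x: ' '.join(x.split()))

# An empty object-dtype dataframe to hold the output (values are stored as strings)
out = pd.DataFrame(np.nan, index=df.index, columns=['col1', 'col2', 'col3', 'col4', 'col5'], dtype=object)

n_cols = 5      # number of columns
for i in range(n_cols-2):
    if i == 0 or i == 1:
        # First two columns: peel one field off the left end
        out.iloc[:, i] = df.str.partition(' ').iloc[:,0]
        df = df.str.partition(' ').iloc[:,2]
    else:
        # Last two columns: peel fields off the right end; what remains is the file name
        out.iloc[:, 4] = df.str.rpartition(' ').iloc[:,2]
        df = df.str.rpartition(' ').iloc[:,0]
        out.iloc[:,3] = df.str.rpartition(' ').iloc[:,2]
        out.iloc[:,2] = df.str.rpartition(' ').iloc[:,0]

print(out)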

+---+------------+-------------+----------------+-------+--------+
|   |    col1    |    col2     |      col3      | col4  |  col5  |
+---+------------+-------------+----------------+-------+--------+
| 0 | 1541783101 |  8901951488 | file.log       | 12345 | 123456 |
| 1 | 1541783401 | 21872967680 | other file.log | 23456 |    123 |
| 2 | 1541783701 |           3 | third file.log | 23456 |    123 |
+---+------------+-------------+----------------+-------+--------+

Note - the code is hardcoded for 5 columns. It can be generalized as well.
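
One way the generalization might look, as a sketch rather than the answerer's own code: assume each line has a known number of space-free fields on the left and on the right, with exactly one free-form field (the file name) in between. The function name split_row and its parameters n_left and n_right are hypothetical; only the built-in str.split and str.rsplit are used.

```python
def split_row(line, n_left=2, n_right=2):
    """Split a line into n_left leading fields, one free-form middle field
    (which may contain spaces), and n_right trailing fields."""
    # split(None, n) splits on runs of whitespace, at most n times, from the left.
    parts = line.split(None, n_left)
    if len(parts) <= n_left:
        raise ValueError(f"too few fields: {line!r}")
    left, rest = parts[:n_left], parts[n_left]

    # rsplit(None, n) does the same from the right.
    tail = rest.rsplit(None, n_right)
    if len(tail) <= n_right:
        raise ValueError(f"too few fields: {line!r}")
    middle, right = tail[0], tail[1:]

    # Collapse any uneven whitespace inside the middle field.
    return left + [" ".join(middle.split())] + right


row = split_row("1541783401     21872967680  other file.log       23456     123")
```

Applied to every line of the file, this yields a list of rows that can be passed to pd.DataFrame(rows, columns=[...]). Unlike the regex approach, it does not rely on the ".log" suffix, but it does rely on the field counts on each side being constant.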

Previous answer -

Use pd.read_fwf() to read fixed-width files.

In your case:

pd.read_fwf('file.txt', header=None)

+---+----------+-----+-------------------+-------+--------+
|   |    0     |  1  |         2         |   3   |   4    |
+---+----------+-----+-------------------+-------+--------+
| 0 | 20181201 |   3 | file.log          | 12345 | 123456 |
| 1 | 20181201 |  12 | otherfile.log     | 23456 |    123 |
| 2 | 20181201 | 200 | odd file name.log | 23456 |    123 |
+---+----------+-----+-------------------+-------+--------+

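【Discussion】:

Might work here - however, the sample dataset doesn't look fixed-width...
The solution was built using the dataset provided by the OP.
I know - as I said, it might work, because it does for these three lines. But they aren't nicely fixed-width, so there's no telling what structure shows up in the rest of the data file that we can't see here.
Agreed. Let the OP check; if it needs changes, I'll make them.
The original file format is actually not fixed-width. Each line has a file size, timestamp, checksum, and so on.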