Reading a tab-delimited .txt file in Python 3: answers

【Question title】: Tab delineated python 3 .txt file reading
【Posted】: 2012-03-29 03:35:41
【Question description】:

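I'm having some trouble getting started on an assignment. We were given a tab-delimited .txt file with 6 columns and about 50 rows of data. I need help creating a list to store this data so I can refer to it later. Ultimately, I need to be able to list everything in any particular column and sort it, count it, and so on. Any help would be greatly appreciated.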

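Edit: I haven't really done much beyond researching this kind of thing. I know I should be looking at csv, and I have worked with single-column .txt files before, but I'm not sure how to approach this case. How would I assign names to the individual columns? And how do I tell the program where one row ends and the next begins?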

【Question discussion】:

Take a look at the csv module.

Tags: python csv tabs tuples

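【Solution 1】:

The DataFrame structure in pandas does essentially exactly what you want. It is very similar to the data frame in R, if you are familiar with that. It has built-in options for subsetting, sorting, and otherwise manipulating tabular data.

It reads directly from CSV and can even pick up the column names automatically. You would call: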

import pandas as pd

df = pd.read_csv(yourfilename,
                 sep='\t',     # makes it tab delimited
                 header=0)     # makes the first row the header row (rows are 0-indexed)

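This works in Python 3.

【Discussion】:

My data has no header in the .txt file. Can I create headers for the 6 columns?
Yes, in that case you would just call read_csv(yourfilename, sep='\t', names=['header1', 'header2', ...]). The documentation is here: pandas.sourceforge.net/generated/…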
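
As a rough illustration of the `names=` suggestion above, the sketch below reads headerless tab-delimited data and then sorts and counts individual columns. The sample data, the column names, and the use of `io.StringIO` in place of a real file are all assumptions made for the example; with a real file you would pass its path to `read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: 3 rows, 6 tab-separated columns, no header.
sample = (
    "3\ta\t1.5\tx\t10\tred\n"
    "1\tb\t2.5\ty\t20\tblue\n"
    "2\ta\t3.5\tz\t30\tred\n"
)

# Assumed column names; the original file does not define any.
names = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']

# header=None states explicitly that the first line is data, not column names.
df = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=names)

# All values of one column, sorted.
col1_sorted = df['col1'].sort_values().tolist()

# How often each value occurs in a column.
col6_counts = df['col6'].value_counts().to_dict()

print(col1_sorted)
print(col6_counts)
```

Because pandas infers a type for each column, `col1` is sorted numerically here rather than as text.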

【Solution 2】:

Suppose you have a tab-delimited file like the one below.

 1       2       3       4       5       6
 1       2       3       4       5       6
 1       2       3       4       5       6
 1       2       3       4       5       6
 1       2       3       4       5       6

You can read the rows into dictionaries like this:

>>> import csv
>>> fieldnames = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
>>> # newline='' is what the csv docs recommend when opening files for the csv module
>>> with open('test.csv', newline='') as f:
...     for row in csv.DictReader(f, fieldnames=fieldnames, dialect='excel-tab'):
...         print(row)   # Python 3.8+ yields plain dicts in field order
...
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}

That said, the pandas library may be better suited to this: http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files

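【Discussion】:

Thank you, this is really helpful! But my data has no headers. Is there a way to assign headers to the columns without changing the original .txt file?
Yes, DictReader has an optional argument for passing in field names. I have edited the answer to reflect that.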
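
The question also asks how to list, sort, and count a single column once the rows are loaded. Here is a minimal sketch of one way to do that with the standard library alone. The sample data, the column names, and the `column` helper are hypothetical, and `io.StringIO` stands in for a real file so the example is self-contained.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the real file: 3 rows, 6 tab-separated columns, no header.
sample = (
    "3\ta\t1.5\tx\t10\tred\n"
    "1\tb\t2.5\ty\t20\tblue\n"
    "2\ta\t3.5\tz\t30\tred\n"
)

# Assumed column names, supplied here because the file has no header row.
fieldnames = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']

# With a real file this would be: open('data.txt', newline='')
with io.StringIO(sample, newline='') as f:
    rows = list(csv.DictReader(f, fieldnames=fieldnames, delimiter='\t'))


def column(rows, name):
    """Return every value stored under one column name (hypothetical helper)."""
    return [row[name] for row in rows]


# The csv module returns every field as a string, so sort numerically via key=int.
col1_sorted = sorted(column(rows, 'col1'), key=int)

# Count how often each value appears in a column.
col6_counts = Counter(column(rows, 'col6'))

print(col1_sorted)
print(dict(col6_counts))
```

The reader handles row boundaries itself: each line of the file becomes one dictionary, so there is no need to detect where a row ends.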

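【Solution 3】:

This sounds like a job better suited to a database. You could import the CSV data into a table using something like PostgreSQL's COPY FROM, then use Python + SQL for all your sorting, searching, and matching needs.

If you feel a full database server is overkill, there are also options such as SQLite and Berkeley DB, both of which have Python modules.

Edit: the Berkeley DB module (bsddb) is deprecated, but anydbm (renamed dbm in Python 3) is conceptually similar.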

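【Discussion】:

Yes, I could do this very easily in ArcGIS, but I need to use Python 3 for this. Any ideas?
You can still use Python to drive the queries. If you want to use only Python, you are simply implementing a database in Python, which is hardly an efficient use of time or resources. What do you think of the berkeleydb/sqlite options?
Ah, it has to be in Python 3. I know there are much more efficient ways to do this kind of thing, haha. But unfortunately it has to be done with Python 3... :/
Then I have to say your question is too broad and your requirements are confusing. The anydbm module, and the dbm.dumb module in particular, are Python modules (dumbdbm is even written in Python code). They are "in Python 3" in every sense. The "database" they use is just a file on your disk.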
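
To make the SQLite option concrete, here is a small sketch that uses the `sqlite3` module from the Python 3 standard library. The table name, column names, column types, and sample data are assumptions for illustration, and an in-memory database is used so nothing is written to disk.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the real file: 3 rows, 6 tab-separated columns, no header.
sample = (
    "3\ta\t1.5\tx\t10\tred\n"
    "1\tb\t2.5\ty\t20\tblue\n"
    "2\ta\t3.5\tz\t30\tred\n"
)

conn = sqlite3.connect(':memory:')  # use a file path instead to persist the data

# Assumed schema; the declared types let SQLite convert the text fields on insert.
conn.execute(
    'CREATE TABLE records '
    '(col1 INTEGER, col2 TEXT, col3 REAL, col4 TEXT, col5 INTEGER, col6 TEXT)'
)

# csv.reader yields one list of 6 strings per line, which maps onto the 6 placeholders.
reader = csv.reader(io.StringIO(sample, newline=''), delimiter='\t')
conn.executemany('INSERT INTO records VALUES (?, ?, ?, ?, ?, ?)', reader)

# Sorting and counting become plain SQL queries.
col1_sorted = [row[0] for row in
               conn.execute('SELECT col1 FROM records ORDER BY col1')]
col6_counts = dict(conn.execute(
    'SELECT col6, COUNT(*) FROM records GROUP BY col6'))

conn.close()

print(col1_sorted)
print(col6_counts)
```

For about 50 rows this is more machinery than strictly needed, but it stays entirely within Python 3 and its standard library.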

【Solution 4】:

I think using a database for 50 rows and 6 columns is overkill, so here is my idea:

from __future__ import print_function
import os
from operator import itemgetter


def get_records_from_file(path_to_file):
    """
    Read a tab-delimited file and return a
    list of dictionaries representing the data.
    """
    records = []
    with open(path_to_file, 'r') as f:
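        # Use the first line to get names for columns
        # (strip the trailing newline so it does not stick to the last name)
        fields = [e.lower() for e in f.readline().rstrip('\n').split('\t')]

        # Iterate over the rest of the lines and store records
        for line in f:
            record = {}
            # Strip the trailing newline so it does not end up in the last field
            for i, field in enumerate(line.rstrip('\n').split('\t')):
                record[fields[i]] = field
            records.append(record)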

    return records


if __name__ == '__main__':
    path = os.path.join(os.getcwd(), 'so.txt')
    records = get_records_from_file(path)

    print('Number of records: {0}'.format(len(records)))

    # Assumes the header row has a column named 'id'; values are compared as strings
    s = sorted(records, key=itemgetter('id'))
    print('Sorted: {0}'.format(s))

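To store the records for later use, take a look at Python's pickle library, which lets you save them as Python objects.

Also, note that I don't have Python 3 installed on the computer I'm using right now, but I'm fairly sure this works in either Python 2 or 3.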
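
The pickle suggestion above could look roughly like the following sketch. The record contents and the file name are hypothetical, and a temporary directory is used only so the example cleans up easily; in practice you would choose your own path.

```python
import os
import pickle
import tempfile

# Hypothetical records in the same shape the function above returns:
# a list of dictionaries, one per row.
records = [
    {'id': '2', 'name': 'b'},
    {'id': '1', 'name': 'a'},
]

# Assumed location for the saved data.
path = os.path.join(tempfile.mkdtemp(), 'records.pkl')

# Pickle files must be opened in binary mode.
with open(path, 'wb') as f:
    pickle.dump(records, f)

# Later, possibly in another run of the program, load them back.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == records)
```

Only unpickle files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.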
