Merge two almost identical rows in a csv.DictReader in Python 3: Answers

【Question title】: Merge two almost identical rows in a csv.DictReader in Python 3
【Posted】: 2017-11-30 16:03:29
【Question description】:

I have the following data and just can't figure out a way to merge it in Python.

The data looks like this:

ID    OFFSET    TEXT
1     1         This text is short
2     1         This text is super long and got cut by the database s
2     2000      o it will come out like this
3     1         I'm short too

I have been trying to use csv.DictReader and csv.DictWriter.

【Question comments】:

What have you tried? We can only help if you show us your code.

Tags: python python-3.x export-to-csv

【Solution 1】:

Use itertools.groupby to group the rows by ID, then join the text:

import itertools
import operator

# dr is the csv.DictReader; groupby only merges adjacent rows, so dr must be ordered by ID
for dbid, rows in itertools.groupby(dr, key=operator.itemgetter('ID')):
    print(dbid, ''.join(row['TEXT'] for row in rows))
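The snippet above only prints the merged text. Since the question mentions csv.DictWriter as well, here is a minimal sketch of the full round trip. It assumes the data is available as comma-separated text (the sample string below is made up from the question's data) and that only the ID and TEXT columns are wanted in the output; the names `source` and `merged` are placeholders for real file objects.

```python
import csv
import io
import itertools
import operator

# Hypothetical sample input; a real script would open a file instead.
source = io.StringIO(
    "ID,OFFSET,TEXT\n"
    "1,1,This text is short\n"
    "2,1,This text is super long and got cut by the database s\n"
    "2,2000,o it will come out like this\n"
    "3,1,I'm short too\n"
)

reader = csv.DictReader(source)

# In-memory stand-in for the output file.
merged = io.StringIO()
writer = csv.DictWriter(merged, fieldnames=["ID", "TEXT"], lineterminator="\n")
writer.writeheader()

# groupby only merges *adjacent* rows with the same key, so the input
# must already be ordered by ID (and by OFFSET within each ID).
for dbid, rows in itertools.groupby(reader, key=operator.itemgetter("ID")):
    writer.writerow({"ID": dbid, "TEXT": "".join(row["TEXT"] for row in rows)})

print(merged.getvalue())
```

Joining with an empty string matters here: the database cut the text mid-word, so any separator would end up inside a word.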

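【Comments】:

【Solution 2】:

Iterating over a pandas groupby yields tuples in which the first element is the ID and the second holds the TEXT items for that ID.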

txt="""ID,OFFSET,TEXT
1,     1,         This text is short
2,     1,         This text is super long and got cut by the database s
2,     2000,      o it will come out like this
3,     1,         I'm short too
"""

from io import StringIO

import pandas as pd

f = StringIO(txt)
df = pd.read_csv(f)

df.set_index('ID', inplace=True)

# each my_tuple is (ID, Series of the TEXT values for that ID)
for my_tuple in df.groupby(df.index)['TEXT']:
    lst = [item.strip() for item in my_tuple[1]]
    print(". ".join(lst))
    print("\n")

Output:

This text is short

This text is super long and got cut by the database s. o it will come out like this

I'm short too

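【Comments】:

【Solution 3】:

The csv.DictReader and csv.DictWriter classes are meant for CSV files. You could probably get them to read a fixed-column file like the one you show, but that isn't necessary and would complicate things.

Assuming the records are in order, all you need to do is:

Read each line (discarding the first one)
Read the ID, offset, and text (discarding the offset)
If the ID is new, store a mapping from the ID to the text
If the ID is not new, append the text

Python can do all of this without any modules.

Here's a first approach: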

text="""
ID    OFFSET    TEXT
1     1         This text is short
2     1         This text is super long and got cut by the database s
2     2000      o it will come out like this
3     1         I'm short too
""".strip()

lines = text.splitlines()
columns = lines.pop(0)  # don't need the columns
result = dict()

for line in lines:
    # the maxsplit arg is important to keep all the text
    id, offset, text = line.split(maxsplit=2)
    if id in result:
        result[id] += text
    else:
        result[id] = text

print("Result:")
for id, text in result.items():
    print(f"ID {id} -> '{text}'")

This uses Python 3.6 f-strings, but you can get the same result without them if you prefer, for example:

...
    print("ID %s -> '%s'" % (id, text))

Either way, the result is:

Result:
ID 1 -> 'This text is short'
ID 2 -> 'This text is super long and got cut by the database so it will come out like this'
ID 3 -> 'I'm short too'

The conditional check if id in result is "ok", but you can avoid it by using a defaultdict:

from collections import defaultdict

result = defaultdict(str)
for line in lines:
    id, offset, text = line.split(maxsplit=2)
    result[id] += text  # <-- much better

print("Result:")
for id, text in result.items():
    print(f"ID {id} -> '{text}'")

The collections module has many handy utilities.

【Comments】:
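Everything above relies on the records being in order. If that can't be guaranteed, one option is to keep the offset instead of discarding it and sort each ID's pieces before joining. The following is only a sketch under that assumption: the sample lines are deliberately shuffled, and the names `chunks`, `row_id`, and `chunk` are illustrative, not part of the original answer.

```python
from collections import defaultdict

# Hypothetical sample with the two pieces of ID 2 deliberately out of order.
lines = [
    "2     2000      o it will come out like this",
    "1     1         This text is short",
    "2     1         This text is super long and got cut by the database s",
    "3     1         I'm short too",
]

# Collect (offset, text) pairs per ID instead of concatenating right away.
chunks = defaultdict(list)
for line in lines:
    row_id, offset, chunk = line.split(maxsplit=2)
    chunks[row_id].append((int(offset), chunk))

# Sort each ID's pieces by numeric offset, then join them.
result = {
    row_id: "".join(chunk for _, chunk in sorted(parts))
    for row_id, parts in chunks.items()
}

for row_id, text in result.items():
    print(f"ID {row_id} -> '{text}'")
```

Converting the offset with int() keeps the sort numeric, so an offset of 2000 sorts before 10000 rather than after it as it would when comparing strings.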