Read only one random row from CSV file and move to another CSV

【Question title】: Read only one random row from CSV file and move to another CSV
【Posted】: 2016-10-21 08:00:36
【Question】:

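I am having trouble reading random rows from a large CSV file and moving them to another CSV file, using pandas 0.18.1 and Python 2.7.10 on Windows.

I want to load only the randomly selected rows into memory and move them to another CSV. I do not want to load the entire contents of the first CSV into memory.

This is the code I used: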

import random

file_size = 100
f = open("customers.csv",'r')
o = open("train_select.csv", 'w')
for i in range(0, 50):
    offset = random.randrange(file_size)
    f.seek(offset)
    f.readline()
    random_line = f.readline()
    o.write(random_line)

The current output looks like this:

2;flhxu-name;tum-firstname; 17520;buo-city;1966/04/24;wfyz-street;   96;GA;GEORGIA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA

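My problem is twofold:

I also want to see the header in the second CSV, not just the rows.
A row should be selected by the random function only once.

The output should look like this: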

id;name;firstname;zip;city;birthdate;street;housenr;stateCode;state
2;flhxu-name;tum-firstname; 17520;buo-city;1966/04/24;wfyz-street;   96;GA;GEORGIA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA

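【Comments】:

You are not selecting random lines. When you seek to a random offset in the file, you will most likely land somewhere in the middle of a line.
@OskarSkog: no: the partial first line is discarded. But it is still clumsy. See my answer.
You say you use pandas, but I don't see it here. It looks like what you really want is to split the dataset (for ML purposes, I guess); pandas has df.sample. That would solve both of your problems.
I think the dataset is really large. pandas may not help.
@Jean-FrançoisFabre: yes, you are right, I didn't notice the f.readline() call that discards the partial line.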
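
A minimal sketch of the behaviour discussed in the comments above, assuming a tiny in-memory file in place of customers.csv (the helper name `line_after_offset` and the sample rows are made up for illustration; the logic mirrors the loop in the question):

```python
import io

def line_after_offset(f, offset):
    """Seek to a byte offset, drop the (probably partial) line found there,
    and return the next full line. Hypothetical helper for illustration."""
    f.seek(offset)
    f.readline()          # discard the rest of the line we landed in
    return f.readline()   # empty bytes if we landed in the last line

# In-memory stand-in for a small CSV file (assumed sample data).
data = io.BytesIO(b"id;name\n1;alice\n2;bob\n3;carol\n")

# Try a few offsets and show which row each one ends up selecting.
for offset in (0, 3, 8, 15, 20, 27):
    print(offset, line_after_offset(data, offset))
```

Several offsets map to the same row, which is how duplicates arise; a row that follows a long line is hit more often than one that follows a short line; and an offset inside the last line yields an empty result. Note also that with file_size = 100, as in the question, only offsets within the first 100 bytes are ever tried.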

Tags: python csv random

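【Solution 1】:

You can do it more simply than that:

First, read the customers file fully; the title line is a special case, so keep it aside.
Shuffle the list of lines (that is what you are looking for).
Write back the title + the shuffled lines.

Code: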

import random

with open("customers.csv",'r') as f:
    title = f.readline()
    lines = f.readlines()

random.shuffle(lines)

with open("train_select.csv", 'w') as f:
    f.write(title)
    f.writelines(lines)
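
If only a fixed number of rows is wanted (the loop in the question asks for 50), a variant of the code above could use random.sample, which draws without replacement, so no row is written twice. This is only a sketch: the function name `sample_rows` and the parameter `k` are illustrative, and like the code above it still holds every line in memory.

```python
import random

def sample_rows(src_path, dst_path, k):
    """Copy the header plus k distinct random rows from src_path to dst_path.
    Illustrative helper; it loads all lines into memory."""
    with open(src_path, "r") as f:
        title = f.readline()
        lines = f.readlines()
    # make sure the last row ends with a newline before it gets moved around
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    # random.sample picks without replacement; cap k at the number of rows
    chosen = random.sample(lines, min(k, len(lines)))
    with open(dst_path, "w") as f:
        f.write(title)
        f.writelines(chosen)

# usage (file names taken from the question):
# sample_rows("customers.csv", "train_select.csv", 50)
```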

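Edit: if you don't want to hold the whole file in memory, here is an alternative. The only drawback is that you have to read the file once (without storing it in memory) in order to compute the line offsets: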

import random

input_file = "customers.csv"

line_offsets = list()

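# read the title, then record the offset at which each data line starts
with open(input_file,'r') as f:
    title = f.readline()
    while True:
        # remember where the next line starts before reading it
        offset = f.tell()
        line = f.readline()
        if line=="":
            # end of file: there is no line at this offset, do not record it
            break
        line_offsets.append(offset)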

    # now shuffle the offsets
    random.shuffle(line_offsets)

    # and write the output file
    with open("train_select.csv", 'w') as fw:
        fw.write(title)
        for offset in line_offsets:
            # seek to a line start
            f.seek(offset)
            fw.write(f.readline())
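
The same offset idea can be wrapped in a function that writes only a subset of the rows. The sketch below is an assumption-laden variant, not the answer's code: the name `sample_rows_by_offset` and the parameter `k` are illustrative, and the files are opened in binary mode so that the stored offsets are plain byte positions.

```python
import random

def sample_rows_by_offset(src_path, dst_path, k):
    """Write the header plus k distinct random rows, keeping only line
    offsets (integers) in memory. Illustrative helper, not from the answer."""
    offsets = []
    with open(src_path, "rb") as f, open(dst_path, "wb") as fw:
        fw.write(f.readline())            # copy the header line
        while True:
            pos = f.tell()                # start of the next row
            if not f.readline():          # end of file reached
                break
            offsets.append(pos)
        # draw k row starts without replacement, then fetch each row
        for pos in random.sample(offsets, min(k, len(offsets))):
            f.seek(pos)
            line = f.readline()
            if not line.endswith(b"\n"):  # last row may lack a newline
                line += b"\n"
            fw.write(line)

# usage (file names taken from the question):
# sample_rows_by_offset("customers.csv", "train_select.csv", 50)
```

Memory use is one integer per row rather than one string per row; the file is still scanned once to build the index.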

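【Comments】:

Sorry, I forgot to mention that it is a large CSV file, so I cannot load the entire contents of the CSV into memory. I only want to load the randomly selected rows into memory.
My alternative solution still needs to read the file once (but does not store it). Tell me whether it works for you.
The script does not run. When I execute it, it runs indefinitely without showing any result. I am using the PyCharm IDE.
That means the file is really big: it takes a while but it will finish. Can you try with a smaller file? Or print the progress (number of lines)?
I tried with a CSV of 50 lines. The script still runs indefinitely.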

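【Solution 2】:

At the OP's request, and since my 2 previous implementations have to read the input file, here is a more complex implementation in which the file is not read in advance.

It uses bisect to store the line offsets, and a minimum line length (to be configured) to keep the list of random offsets from becoming needlessly long.

Basically, the program generates offsets in random order, ranging from the offset of the second line (skipping the title line) to the end of the file, with a step of the minimum line length.

For each offset, it checks whether the line has already been read (using bisect, which is fast, although the follow-up tests are complex because of the corner cases).
- If not read: skip back to find the previous linefeed (this means reading the file, it cannot be done otherwise), write the line to the output file, and store the start/end offsets in the list of pairs.
- If already read: skip.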
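
The "already read" check described above can be sketched in isolation. The helper names `is_covered` and `add_interval` are made up for illustration, and the sketch assumes that each stored pair is (start, end) with an exclusive end:

```python
import bisect

def is_covered(intervals, offset):
    """True if offset lies in one of the (start, end) pairs.
    intervals must be sorted and non-overlapping; end is exclusive."""
    # count the pairs whose start is <= offset
    i = bisect.bisect_right(intervals, (offset, float("inf")))
    # only the closest pair to the left can contain the offset
    return i > 0 and offset < intervals[i - 1][1]

def add_interval(intervals, start, end):
    """Insert a new pair while keeping the list sorted."""
    bisect.insort(intervals, (start, end))

seen = []
add_interval(seen, 40, 55)
add_interval(seen, 10, 25)
print(is_covered(seen, 10))   # True: first byte of a stored line
print(is_covered(seen, 24))   # True: last byte of that line
print(is_covered(seen, 25))   # False: first byte after it
print(is_covered(seen, 39))   # False: gap between the two lines
```

Because the pairs never overlap, looking at the single pair to the left of the insertion point is enough, so each lookup needs only one binary search.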

Code:

import random,os,bisect

input_file = "csv2.csv"

input_size = os.path.getsize(input_file)

smallest_line_len = 4

# sorted list of (line_start, line_end) pairs for the lines already written;
# line_end is the offset just past the line terminator
line_offsets = []

# binary mode, so that offsets are plain byte positions on every platform
with open(input_file,'rb') as f, open("train_select.csv", 'wb') as fw:
    # read title and write it back
    title = f.readline()
    fw.write(title)

    # generate offset list, starting from current pos to the end of file
    # with a step of min line len to avoid generating too many numbers
    # (this can be 1 but that will take a while)
    offset_list = list(range(f.tell(),input_size,smallest_line_len))
    # shuffle the list at random
    random.shuffle(offset_list)

    # now loop through the offsets
    for offset in offset_list:
        # number of known lines that start at or before this offset
        insertion_point = bisect.bisect_right(line_offsets,(offset,input_size))

        if insertion_point > 0 and offset < line_offsets[insertion_point-1][1]:
            # offset falls inside a line that has already been processed: skip
            continue

        # offset is not known: rewind until the byte just before the
        # candidate line start is a line terminator (the title line
        # guarantees that there is one)
        line_start = offset
        while True:
            f.seek(line_start-1)
            if f.read(1)==b"\n":
                break
            line_start -= 1
        # the file position is now at line_start: read the line fully
        line = f.readline()
        line_end = f.tell()
        # insert the line boundaries in the sorted list
        line_offsets.insert(insertion_point,(line_start,line_end))
        if not line.endswith(b"\n"):
            # last line of the file without terminator: add one
            line += b"\n"
        fw.write(line)


【Comments】:

The problem of duplicate rows persists, i.e. the same row is moved to the second CSV file several times.
id;name;firstname;zip;city;birthdate;street;housenr;stateCode;state
1;jwcdf-name;fsj-firstname;13520;oem-city;1954/02/07;amrb-street;145;AK;ALASKA
2;flhxu-name;tum-firstname;17520;buo-city;1966/04/24;wfyz-street;96;GA;GEORGIA
3;xthfg-name;gfe-firstname;12560;vtz-city;1990/01/11;doxx-street;46;NJ;NEW JERSEY
4;ulzrz-name;bnl-firstname;11620;prz-city;1966/08/02;bxqn-street;104;NY;NEW YORK
This is part of the input CSV (csv2.csv in your example). When I feed it to your script, this is the output I get:
id;name;firstname;zip;city;birthdate;street;housenr;stateCode;state
4;ulzrz-name;bnl-firstname;11620;prz-city;1966/08/02;bxqn-street;104;NY;NEW YORK
4;ulzrz-name;bnl-firstname;11620;prz-city;1966/08/02;bxqn-street;104;NY;NEW YORK
3;xthfg-name;gfe-firstname;12560;vtz-city;1990/01/11;doxx-street;46;NJ;NEW JERSEY
2;flhxu-name;tum-firstname;17520;buo-city;1966/04/24;wfyz-street;96;GA;GEORGIA
1;jwcdf-name;fsj-firstname;13520;oem-city;1954/02/07;amrb-street;145;AK;ALASKA
As you can see, the same row is written more than once to the output CSV file, i.e. train_select.csv.
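
A report like the one above could be checked mechanically rather than by eye. The following is a small sketch, assuming the output file has a header line and one record per line; the function name `duplicated_rows` is made up for illustration:

```python
from collections import Counter

def duplicated_rows(path):
    """Return {row: count} for rows that appear more than once.
    The header line is skipped and blank lines are ignored."""
    with open(path, "r") as f:
        f.readline()                      # skip the header line
        counts = Counter(
            line.rstrip("\r\n") for line in f if line.strip()
        )
    return {row: n for row, n in counts.items() if n > 1}

# usage (file name taken from the thread):
# print(duplicated_rows("train_select.csv"))
```

An empty dictionary means every row in the output file is unique.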