Optimize Python code with basic libraries: answers

【Question title】: Optimize Python code with basic libraries
【Posted】: 2021-09-20 21:31:39
【Question】:

I am trying to do a non-equi self join with basic Python on a table that has 1.7 million rows and 4 variables. The data looks like this:

product     position_min     position_max      count_pos
A.16        167804              167870              20
A.18        167804              167838              15
A.15        167896              167768              18
A.20        238359              238361              33
A.35        167835              167837              8

Here is the code I used:

import csv
from collections import defaultdict
import sys
import os

list_csv=[]
l=[]
with open(r'product.csv', 'r') as file1:
    my_reader1 = csv.reader(file1, delimiter=';')
    for row in my_reader1:
        list_csv.append(row)
with open(r'product.csv', 'r') as file2:
    my_reader2 = csv.reader(file2, delimiter=';') 
    with open('product_p.csv', "w") as csvfile_write:
        ecriture = csv.writer(csvfile_write, delimiter=';',
                                quotechar='"', quoting=csv.QUOTE_ALL)
        for row in my_reader2:
            res = defaultdict(list)
            for k in range(len(list_csv)):
                comp= list_csv[k]
                try:
                    if int(row[1]) >= int(comp[1]) and int(row[2]) <= int(comp[2]) and row[0] != comp[0]:
                        res[row[0]].append([comp[0],comp[3]]) 
                except:
                    pass
            


            if res:
                for key, value in res.items():
                    sublists = defaultdict(list)
                    for sublist in value:
                        sublists[sublist[0]].append(int(sublist[1]))
                    # keep the matching product with the smallest count_pos
                    l = [str(key) + ";" + str(min(sublists.keys(), key=(lambda k: sublists[k])))]
                    ecriture.writerow(l)

I should get this in the "product_p.csv" file:

'A.18'; 'A.16'
'A.15'; 'A.18'
'A.35'; 'A.18'

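What the code does is read the same file twice: the first time it is read in full and converted into a list; the second time it is read row by row, i.e. for each product (first variable) it looks up all the products it belongs to according to the condition on position_min and position_max, and then selects only one by keeping the product with the minimum count_pos.

I tried it on a sample of the original data and it works, but with 1.7 million rows it ran for hours without producing any result. Is there a way to do this without loops, or with fewer loops? Can anyone help optimize it using basic Python libraries?

Thanks in advance.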

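【Comments】:

Can you roughly explain what your code is doing? I don't understand what exactly your code does!
@Kshitiz Thanks for your reply. What the code does is read the same file twice: the first time in full, converting it into a list, and the second time row by row, i.e. for each product (first variable) it finds all the products it belongs to according to the condition on position_min and position_max, and then selects just one by keeping the product with the minimum count_pos.
Didn't you say you want to get A16 A35, while you actually get A35 A16? Is that OK?
@Kshitiz, I corrected what I should get when running the code.
I still don't fully understand what you are trying to do, but I tried doing it in pandas, and pandas was even slower than your code. I tested with a data set of 2000 rows, and your code was faster than mine. If I fully understood what you are doing, I could try other approaches too! I also noticed that you don't need to read the file twice for exactly the same data; the second time you can reuse the data you already read. I haven't checked whether that makes your code faster, but I noticed it.

Tags: python loops optimization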

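【Solution 1】:

I removed some unused libraries and tried to simplify the code's behaviour as much as possible.

The most important objects in the code are the list input_data, which stores the data from the input csv file, and the dictionary out_dict, which stores the output of the comparisons.

Put simply, what the code does is:

Read product.csv (without the header) into the list input_data
Iterate over input_data, comparing each row with every other row
- If the reference product's range lies within the comparison product's range, we check a further condition: is there already an entry for the reference product in out_dict?
  - If yes, we replace it with the new comparison product if that one has a lower count_pos
  - If no, we add the comparison product regardless
Write the information in out_dict to the output file product_p.csv, but only for products that have a valid comparison product

Here it is: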

import csv

input_data = []
with open('product.csv', 'r') as csv_in:
    reader = csv.reader(csv_in, delimiter=';')
    next(reader)
    for row in reader:
        input_data.append(row)


out_dict = {}
for ref in input_data:
    for comp in input_data:
        if ref == comp:
            continue
        elif int(ref[1]) >= int(comp[1]) and int(ref[2]) <= int(comp[2]):
            if not out_dict.get(ref[0], False) or int(comp[3]) < out_dict[ref[0]][1]:
                out_dict[ref[0]] = (comp[0], int(comp[3]))
                # print(f"In '{ref[0]}': placed '{comp[0]}'")


with open('product_p.csv', "w", newline='') as csv_out:
    ecriture = csv.writer(csv_out, delimiter=';', quotechar='"', quoting=csv.QUOTE_ALL)
    for key, value in out_dict.items():
        if value[0]:
            ecriture.writerow([key, value[0]])

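Also, I commented out a print line that can show you, using a sample file with only a few rows, what the script is doing.

Note: I believe your expected output is wrong. Either that, or I am missing something in the explanation. If so, please let me know. The code provided takes this into account.

From the sample input: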

product;position_min;position_max;count_pos
A.16;167804;167870;20
A.18;167804;167838;15
A.15;167896;167768;18
A.20;238359;238361;33
A.35;167835;167837;8

The expected output is:

"A.18";"A.16"
"A.15";"A.35"
"A.35";"A.18"

because for "A.15", "A.35" satisfies the same condition as "A.16" and "A.18" and has a smaller count_pos.
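
This reasoning can be checked directly on the five sample rows. The snippet below is only a sketch: it hard-codes the sample as tuples instead of reading product.csv, and the helper names `candidates` and `best_match` are made up for illustration. It also skips the reference row by product name, which assumes product names are unique.

```python
# Sample rows from the question: (product, position_min, position_max, count_pos)
SAMPLE = [
    ("A.16", 167804, 167870, 20),
    ("A.18", 167804, 167838, 15),
    ("A.15", 167896, 167768, 18),
    ("A.20", 238359, 238361, 33),
    ("A.35", 167835, 167837, 8),
]


def candidates(ref, rows):
    """Return (product, count_pos) for every other row whose range contains ref's."""
    return [
        (comp[0], comp[3])
        for comp in rows
        if comp[0] != ref[0] and ref[1] >= comp[1] and ref[2] <= comp[2]
    ]


def best_match(ref, rows):
    """Pick the candidate with the smallest count_pos, or None if there is none."""
    found = candidates(ref, rows)
    return min(found, key=lambda item: item[1])[0] if found else None


matches = {row[0]: best_match(row, SAMPLE) for row in SAMPLE}
for product, match in matches.items():
    if match is not None:
        print(f"{product};{match}")
```

For "A.15" the candidates are "A.16" (20), "A.18" (15) and "A.35" (8), so "A.35" is kept. Note that the "A.15" sample row has a position_min greater than its position_max, which is why three different ranges satisfy the condition for it.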

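【Comments】:

Just updated the answer because, as I see it, there is no real need for defaultdict; a simple get is enough. To create a key with the [] notation, no default value needs to exist beforehand. Also, out_dict.get(ref[0]) alone would be enough, since it returns None, which is falsy, although that is not best practice.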
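
To make the point about get and the [] notation concrete, here is a small sketch. The helper `update_best` and the product name "A.99" are hypothetical and only serve the illustration; the explicit `not in` test is one possible alternative to relying on the truthiness of the stored value.

```python
out_dict = {}

# dict.get returns None for a missing key unless a default is supplied.
missing = out_dict.get("A.18")
missing_with_default = out_dict.get("A.18", False)

# Assigning with [] creates the key; no defaultdict is needed for that.
out_dict["A.18"] = ("A.16", 20)


def update_best(best, ref_name, comp_name, comp_count):
    """Store comp as the match for ref if it is the first or has a lower count."""
    if ref_name not in best or comp_count < best[ref_name][1]:
        best[ref_name] = (comp_name, comp_count)


update_best(out_dict, "A.18", "A.99", 25)  # ignored: 25 is not lower than 20
update_best(out_dict, "A.35", "A.16", 20)  # new key, stored as is
update_best(out_dict, "A.35", "A.18", 15)  # replaces the entry: 15 < 20
```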

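【Solution 2】:

I think a different approach is needed here, because comparing every product with every other product will always give O(n^2) time complexity.

I sorted the product list by ascending position_min (and descending position_max, just in case) and reversed the check from the answer above: instead of checking whether comp "contains" ref, I did the opposite. This way each product only has to be checked against products with a higher position_min, and the search can stop as soon as a comp is found whose position_min is higher than the ref's position_max.

To test this solution I generated a random list of 100 products and ran one function copied from the answer above and one function based on my suggestion. The latter performs about 1,000 comparisons instead of 10,000 and, according to timeit, is about 4 times faster despite the overhead of the initial sort.

The code follows: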

##reference function
def f1(basedata):
    outd={}
    for ref in basedata:
        for comp in basedata:
            if ref == comp:
                continue
            elif ref[1] >= comp[1] and ref[2] <= comp[2]:
                if not outd.get(ref[0], False) or comp[3] < outd[ref[0]][1]:
                    outd[ref[0]] = (comp[0], comp[3])
    return outd

##optimized(?) function
def f2(basedata):
    outd={}
    sorteddata = sorted(basedata, key=lambda x:(x[1],-x[2]))
    runs = 0
    for i,ref in enumerate(sorteddata):
        toohigh=False
        j=i
        while j < len(sorteddata)-1 and not toohigh:
            j+=1
            runs+=1
            comp=sorteddata[j]
            if comp[1] > ref[2]:
                toohigh=True
            elif comp[2] <= ref[2]:
                if not outd.get(comp[0], False) or ref[3] < outd[comp[0]][1]:
                    outd[comp[0]] = (ref[0], ref[3])
    print(runs)
    return outd
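
A harness along the following lines could be used to check that the two strategies agree and to count comparisons. It is a sketch, not the code used for the measurements above: `brute_force`, `sorted_scan` and `make_products` are hypothetical names, and the generated data deliberately satisfies three assumptions (position_min <= position_max in every row, no two rows with the same range, and distinct count_pos values), because ties or inverted ranges can make the two strategies pick different matches.

```python
import random


def brute_force(rows):
    """Compare every row with every other row: n * (n - 1) comparisons."""
    best, comparisons = {}, 0
    for ref in rows:
        for comp in rows:
            if ref is comp:
                continue
            comparisons += 1
            if ref[1] >= comp[1] and ref[2] <= comp[2]:  # comp contains ref
                if ref[0] not in best or comp[3] < best[ref[0]][1]:
                    best[ref[0]] = (comp[0], comp[3])
    return best, comparisons


def sorted_scan(rows):
    """Sort by (position_min asc, position_max desc) and stop scanning early."""
    best, comparisons = {}, 0
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    for i, ref in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            comp = ordered[j]
            comparisons += 1
            if comp[1] > ref[2]:
                break  # every later row starts even further to the right
            if comp[2] <= ref[2]:  # ref contains comp
                if comp[0] not in best or ref[3] < best[comp[0]][1]:
                    best[comp[0]] = (ref[0], ref[3])
    return best, comparisons


def make_products(n, seed=0):
    """Generate n rows with valid, distinct ranges and distinct count_pos."""
    rng = random.Random(seed)
    intervals = set()
    while len(intervals) < n:
        start = rng.randint(0, 5000)
        intervals.add((start, start + rng.randint(0, 500)))
    counts = rng.sample(range(1, 10 * n), n)
    rows = [
        (f"P.{i}", low, high, count)
        for i, ((low, high), count) in enumerate(zip(sorted(intervals), counts))
    ]
    rng.shuffle(rows)
    return rows


products = make_products(100)
slow, slow_comparisons = brute_force(products)
fast, fast_comparisons = sorted_scan(products)
print(slow == fast, slow_comparisons, fast_comparisons)
```

The brute-force count is always n * (n - 1); the sorted scan can never exceed n * (n - 1) / 2, and how far below that it lands depends on how much the ranges overlap.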

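【Comments】:

Neat. Good idea to sort the data beforehand. For small cases it should make no difference, but since the OP is talking about millions of rows, it is really relevant. One thing that has been bothering me, maybe just for completeness: after reading from the csv we still need the int conversion, right?
Yes, in a real application the conversion is needed. I generated the values for my test with randint, so I did not need it.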
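
One way to do that conversion once, up front, is sketched below, so that functions such as f1 and f2 receive integers. The function name `load_products` is hypothetical, and the example parses an in-memory string rather than the real product.csv to stay self-contained.

```python
import csv
import io


def load_products(lines, delimiter=";"):
    """Parse rows into (product, position_min, position_max, count_pos) with ints."""
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader, None)  # skip the header row
    return [
        (row[0], int(row[1]), int(row[2]), int(row[3]))
        for row in reader
        if row  # ignore blank lines
    ]


sample = io.StringIO(
    "product;position_min;position_max;count_pos\n"
    "A.16;167804;167870;20\n"
    "A.18;167804;167838;15\n"
)
products = load_products(sample)

# With a real file, the same function can be fed an open file object:
# with open('product.csv', newline='') as csv_in:
#     products = load_products(csv_in)
```

Converting once while loading also avoids calling int() repeatedly inside the comparison loops.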

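【Solution 3】:

Using a sqlite3 in-memory database, the search can be moved to B-tree indexes, which are better optimized than the suggested approaches. The following approach works about 30 times faster than the original one. For a generated file of 2M rows, computing the result for every item takes 44 hours (about 1,200 hours for the original approach).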

import csv
import sqlite3

with sqlite3.connect(':memory:') as con:
    cursor = con.cursor()
    cursor.execute('CREATE TABLE products (id integer PRIMARY KEY, product text, position_min int, position_max int, count_pos int)')
    cursor.execute('CREATE INDEX idx_products_main ON products(position_max, position_min, count_pos)')

    with open('product.csv', 'r') as products_file:
        reader = csv.reader(products_file, delimiter=';')
        # Omit parsing first row in file
        next(reader)

        for row in reader:
            row_id = row[0][len('A.'):] if row[0].startswith('A.') else row[0]
            cursor.execute('INSERT INTO products VALUES (?, ?, ?, ?, ?)', [row_id] + row)

    con.commit()

    with open('product_p.csv', 'wb') as write_file:
        with open('product.csv', 'r') as products_file:
            reader = csv.reader(products_file, delimiter=';')
            # Omit parsing first row in file
            next(reader)

            for row in reader:
                row_product_id, row_position_min, row_position_max, count_pos = row
                result_row = cursor.execute(
                    'SELECT product, count_pos FROM products WHERE position_min <= ? AND position_max >= ? ORDER BY count_pos, id LIMIT 1',
                    (row_position_min, row_position_max)
                ).fetchone()

                if (result_row and result_row[0] == row_product_id):
                    result_row = cursor.execute(
                        'SELECT product, count_pos FROM products WHERE product != ? AND position_min <= ? AND position_max >= ? ORDER BY count_pos, id LIMIT 1',
                        (row_product_id, row_position_min, row_position_max)
                    ).fetchone()

                if (result_row):
                    write_file.write(f'{row_product_id};{result_row[0]};{result_row[1]}\n'.encode())
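
To see the query at work on the five sample rows from the question, a reduced sketch could look like this. It is not the code above: it hard-codes the sample instead of reading product.csv, drops the explicit id column in favour of SQLite's implicit rowid, and always excludes the reference product in a single query rather than in two steps.

```python
import sqlite3

# Sample rows from the question: (product, position_min, position_max, count_pos)
SAMPLE = [
    ("A.16", 167804, 167870, 20),
    ("A.18", 167804, 167838, 15),
    ("A.15", 167896, 167768, 18),
    ("A.20", 238359, 238361, 33),
    ("A.35", 167835, 167837, 8),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE products "
    "(product TEXT, position_min INT, position_max INT, count_pos INT)"
)
con.execute(
    "CREATE INDEX idx_products_main "
    "ON products(position_max, position_min, count_pos)"
)
con.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", SAMPLE)

# Best containing product: smallest count_pos, insertion order breaks ties.
QUERY = """
    SELECT product FROM products
    WHERE product != ? AND position_min <= ? AND position_max >= ?
    ORDER BY count_pos, rowid
    LIMIT 1
"""

matches = {}
for product, position_min, position_max, _count_pos in SAMPLE:
    hit = con.execute(QUERY, (product, position_min, position_max)).fetchone()
    if hit:
        matches[product] = hit[0]
con.close()

for product, match in matches.items():
    print(f"{product};{match}")
```

On the sample this produces the same three pairs as Solution 1.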

If needed, further optimization is possible using threads; with 10 threads, for example, the whole process can be brought down to 4-5 hours.

【Comments】: