【Question title】:File comparison in two directories
【Posted】:2021-10-28 06:52:28
【Question】:

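I am comparing all files in two directories. If the similarity ratio is greater than 90%, I continue with the next iteration of the outer loop, and I want to delete the matched file in the second directory so that the next file in the first directory is not compared with a file that has already been matched.

Here is what I tried: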

for i for i in sorted_files:
    for j in sorted_github_files:
            #pdb.set_trace()
            with open(f'./files/{i}') as f1:
                try:
                    text1 = f1.read()
                except:
                    pass
            with open(f'./github_files/{j}') as f2:
                try:
                    text2 = f2.read()
                except:
                    pass
            m = SequenceMatcher(None, text1, text2)
            print("file1:", i, "file2:", j)
            if m.ratio() > 0.90:
                 os.remove(f'./github_files/{j}')
                 break

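I know that I cannot change what is being iterated over once the iteration has started, which is why I get a file-not-found error. I don't want to use a try/except block. Any ideas are appreciated.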

【Comments】:

  • Please provide enough code so that others can better understand or reproduce the problem.

Tags: python-3.x loops comparison


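【Solution 1】:

A few things to point out:

  • Always provide a minimal reproducible example
  • Your first for loop does not work, because you wrote `for i for i ..`
  • If you want to iterate over the files in list1 (sorted_files) first, then read each of those files outside the second loop
  • I would add the files with a match ratio above 0.90 to a new list and delete them afterwards, so that your items do not change during the iteration
  • You can find the test data I created and used here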
import os
from difflib import SequenceMatcher

# define your two folders, full paths
first_path = os.path.abspath(r"C:\Users\XYZ\Desktop\testfolder\a")
second_path = os.path.abspath(r"C:\Users\XYZ\Desktop\testfolder\b")

# get files from folder
first_path_files = os.listdir(first_path)
second_path_files = os.listdir(second_path)

# join path and filenames
first_folder = [os.path.join(first_path, f) for f in first_path_files]
second_folder = [os.path.join(second_path, f) for f in second_path_files]

# empty list for matching results
matched_files = []

# iterate over the files in the first folder
for file_one in first_folder:
    # read file content
    with open(file_one, "r") as f:
        file_one_text = f.read()

    # iterate over the files in the second folder
    for file_two in second_folder:
        # read file content
        with open(file_two, "r") as f:
            file_two_text = f.read()

        # match the two file contents
        match = SequenceMatcher(None, file_one_text, file_two_text)
        if match.ratio() > 0.90:
            print(f"Match found ({match.ratio()}): '{file_one}' | '{file_two}'")
            # TODO: decide whether you would rather remove files from the first or the second folder
            matched_files.append(file_two)  # here, files from the second folder are deleted

# remove duplicates from the resulted list
matched_files = list(set(matched_files))

# remove the files
for f in matched_files:
    print(f"Removing file: {f}")
    os.remove(f)
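
Note that the code above still compares every file in the first folder with every file in the second folder, including files that have already matched. If the goal is to skip already matched files, as the question asks, one possible variant is to keep a shrinking list of candidates and delete the files only after all comparisons are done. The following is a minimal sketch, assuming the files are UTF-8 text; the helper names `list_files`, `read_text`, `find_matches`, and `remove_matched` are made up for illustration and are not part of the original answer.

```python
import os
from difflib import SequenceMatcher


def list_files(directory):
    """Return the sorted names of regular files in `directory` (no subfolders)."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def read_text(path):
    """Return the file's text, or None if it cannot be decoded as UTF-8."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        return None


def find_matches(first_dir, second_dir, threshold=0.90):
    """Pair each file in first_dir with at most one similar file in second_dir.

    Once a file from second_dir has matched, it is dropped from the
    candidate list, so later files from first_dir are not compared with it.
    """
    remaining = list_files(second_dir)  # candidates still available
    pairs = []

    for name_one in list_files(first_dir):
        # Read the outer file once, outside the inner loop.
        text_one = read_text(os.path.join(first_dir, name_one))
        if text_one is None:
            continue
        # Iterate over a snapshot, so changing `remaining` cannot disturb the loop.
        for name_two in list(remaining):
            text_two = read_text(os.path.join(second_dir, name_two))
            if text_two is None:
                continue
            if SequenceMatcher(None, text_one, text_two).ratio() > threshold:
                pairs.append((name_one, name_two))
                remaining.remove(name_two)
                break  # continue with the next file from first_dir
    return pairs


def remove_matched(second_dir, pairs):
    """Delete the matched files from second_dir after all comparisons are done."""
    for _, name_two in pairs:
        os.remove(os.path.join(second_dir, name_two))
```

A call such as `remove_matched(second_path, find_matches(first_path, second_path))` would then delete the matched files from the second folder. Because the deletion happens after the loops have finished, no file is opened after it has been removed, so no try/except around a missing file is needed.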

【Comments】:
