【Question title】: Compare each element of CSV file to every element of a different CSV file, and find the most similar elements
【Posted】: 2020-12-30 08:19:25
【Question description】:

I have two CSV files that I need to compare. The first is called SAP.csv and the second is called SAPH.csv.

SAP.csv has these cells:

Notification    Description
5000000001      Detailed Inspection of Masts (2100mm) (3
5000000002      Ceremonial Awnings-Survey and Load Test
5000000003      HPA-Carry out 4000 hour service routine
5000000004      UxE 8 in Number Temperature Probs for C
5000000005      Overhaul valves

...and SAPH.csv has these cells:

Notification   Description
4000000015     Detailed Inspection of Masts (2100mm) (3
4000000016     Ceremonial Awnings-Survey and Load Test
4000000017     HPA-Carry out 8000 hour service routine
4000000018     UxE 8 in Number Temperature Probs for C
4000000019     Represerve valves
4000000020     STW System

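They are similar, but some rows, for example the fourth line of each file counting the header row (HPA-Carry out 4000 hour service routine vs. HPA-Carry out 8000 hour service routine), differ slightly.

I want to compare each value of SAP.csv to every value of SAPH.csv and find the most similar row using cosine similarity, so that the output looks something like this (the similarity percentages here are only examples, not what they would actually be):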

Description
Detailed Inspection of Masts (2100mm) (3 - 100%
Ceremonial Awnings-Survey and Load Test  - 100%
HPA-Carry out 4000 hour service routine  - 85%
UxE 8 in Number Temperature Probs for C  - 90%
Overhaul valves                          - 0%

Edit after the answer was posted

runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')

Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 31, in <module>
    similarity_score = similar(job, description) # Get their similarity

  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 14, in similar
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b)

  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 173, in distance
    return self.maximum(*sequences) - self.similarity(*sequences)

  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 176, in similarity
    return self(*sequences)

  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\token_based.py", line 175, in __call__
    return intersection / pow(prod, 1.0 / len(sequences))

ZeroDivisionError: float division by zero

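Second edit, following the solution above

So the original request had only two outputs: the description and the similarity score.

The description comes from SAP, and the similarity comes from the textdistance calculation.

Could the solution be modified to give the following?

Notification (the 10-digit number from the SAP file), Description (as it is now), Similarity (as it is now), Notification (this number comes from the SAPH file and is the one that produced the similarity score)

So an example output row would look like this:

80000115360 Additional material FWD rope guard 86.24% 7123456789

This would go across columns A, B, C and D.

A and B come from SAP, C is calculated, and D comes from SAPH.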

Edit 3

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in <module>
    SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'})

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in __init__
    self._make_engine(self.engine)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)

  File "pandas/_libs/parsers.pyx", line 490, in pandas._libs.parsers.TextReader.__cinit__

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))

TypeError: data type "string" not understood

Edit 4 - 25/10/20

Hi, so I ran into the same error, as I thought I would.


  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in <module>
    SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read
    data = parser.read(nrows)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read
    ret = self._engine.read(nrows)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2421, in read
    data = self._convert_data(data)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2487, in _convert_data
    clean_conv, clean_dtypes)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1705, in _convert_to_ndarrays
    cvals = self._cast_types(cvals, cast_type, c)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1808, in _cast_types
    copy=True, skipna=True)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 623, in astype_nansafe
    dtype = pandas_dtype(dtype)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))

TypeError: data type "string" not understood

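I took your point about the delimiter, so I uploaded one of the csv files to repl.it, and it looks as though "," is the delimiter.

So I modified the code to suit. When I did this on repl.it, it worked.

Here is the code I am using: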

import textdistance
import pandas as pd

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
  similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
  return similarity * 100

# Read the CSVs
SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")
SAPH = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP_History.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")

# Create a pandas dataframe to store the output. The column 'Description' is populated with the values of SAP['Description']
scores = pd.DataFrame(SAP['Description'], columns = ['Notification (SAP)','Description', 'Similarity', 'Notification (SAPH)'])

# Temporary variables to store the highest similarity score
highest_score = 0
desc = 0

# Iterate through SAP['Description']
for job in SAP['Description']:
  highest_score = 0 # Reset highest_score at each iteration
  for description in SAPH['Description']: # Iterate through SAPH['Description']
    similarity_score = similar(job, description) # Get their similarity

    if(similarity_score > highest_score): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
      highest_score = similarity_score
      desc = str(description)
    if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
      break
  # Update the dataframe 'scores' with the highest score and the other values
  print(SAPH['Description'][SAPH['Description'] == desc])
  scores['Notification (SAP)'][scores['Description'] == job] = SAP['Notification'][SAP['Description'] == job]
  scores['Similarity'][scores['Description'] == job] = f'{highest_score}%'
  scores['Notification (SAPH)'][scores['Description'] == job] = SAPH['Notification'][SAPH['Description'] == desc]

print(scores)

# Output it to Scores.csv without the index column
with open('./Scores.csv', 'w') as file:
  file.write(scores.__repr__())

Running on Spyder (Python 3.7)

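【Question discussion】:

Please provide sample data from each csv so that we can test. Thanks.
Hi Mike, how do I add the csv files to the post?
Just update your post and paste in 10 sample rows from each file (including the header)
Apologies Mike, when I go to paste it in, it adds a picture, which the server rejects. Obviously I'm missing something very simple
@Mike67 ugh, why are you encouraging the OP to have someone else do the formatting instead of doing it themselves, rather than giving the OP [link]

Tags: python-3.x csv cosine-similarity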

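【Solution 1】:

@George_Pipas's answer to this question demonstrates an example using the textdistance library (I am paraphrasing part of his answer here):

A solution is to use the textdistance library. I will provide an example with Cosine Similarity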
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')
We get:
0.5

So, we can create a similarity-finding function:

def similar(a, b):
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b)     
    return similarity

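The more alike a and b are, the closer the output is to 1; the less alike they are, the closer it is to 0. If a == b, the output is 1. If a != b, the output is normally less than 1, although two different strings made up of exactly the same bigrams also score 1, because with qval=2 only two-character slices are compared.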
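To make the number concrete, here is a minimal pure-Python sketch of what a bigram (qval=2) cosine score computes. It follows the formula visible in the question's traceback (`intersection / pow(prod, 1.0 / len(sequences))`), but it is an illustration under that assumption, not textdistance's actual code; the helper names `bigrams` and `bigram_cosine` are made up for this example.

```python
from collections import Counter
from math import sqrt

def bigrams(text):
    """Count the overlapping two-character slices of text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def bigram_cosine(a, b):
    """Shared bigrams divided by the geometric mean of the two bigram totals."""
    count_a, count_b = bigrams(a), bigrams(b)
    shared = sum((count_a & count_b).values())  # multiset intersection
    total_a = sum(count_a.values())
    total_b = sum(count_b.values())
    return shared / sqrt(total_a * total_b)

print(bigram_cosine('Apple', 'Appel'))  # 0.5
print(bigram_cosine('abca', 'bcab'))    # 1.0, although the strings differ
```

'Apple' and 'Appel' each have four bigrams and share two of them ('Ap' and 'pp'), which gives 2 / 4 = 0.5.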

To get a percentage, just multiply the output by 100. Like this:

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b) 
    return similarity * 100

CSV files can be read easily with pandas:

import pandas as pd

# Read the CSVs
SAP = pd.read_csv('SAP.csv')
SAPH = pd.read_csv('SAPH.csv')

We create another pandas dataframe to store the results we will compute:

# Create a pandas dataframe to store the output. The column 'SAP' is populated with the values of SAP['Description']
scores = pd.DataFrame({'SAP': SAP['Description']}, columns = ['SAP', 'SAPH', 'Similarity'])

Now, we iterate through SAP['Description'] and SAPH['Description'], compare each element with every other element, compute their similarity, and save the highest value to scores.

# Temporary variable to store both the highest similarity score, and the 'SAPH' value the score was computed with
highest_score = {"score": 0, "description": ""}

# Iterate through SAP['Description']
for job in SAP['Description']:
  highest_score = {"score": 0, "description": ""} # Reset highest_score at each iteration
  for description in SAPH['Description']: # Iterate through SAPH['Description']
    similarity_score = similar(job, description) # Get their similarity

    if(similarity_score > highest_score['score']): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
      highest_score['score'] = similarity_score
      highest_score['description'] = description
    if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
      break
  # Update the dataframe 'scores' with highest_score (.loc avoids chained assignment)
  scores.loc[scores['SAP'] == job, 'SAPH'] = highest_score['description']
  scores.loc[scores['SAP'] == job, 'Similarity'] = highest_score['score']

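Here is a breakdown:

A temporary variable highest_score is created to store the highest score computed so far.
Now we iterate through SAP['Description'], and inside that loop through SAPH['Description']. This lets us compare each value of SAP['Description'] (job) with every value of SAPH['Description'] (description).
While iterating through SAPH['Description'], we:
1. Compute the similarity score of job and description
2. If it is higher than the score saved in highest_score, update highest_score accordingly; otherwise we move on
3. If similarity_score equals 100, we know it is a perfect match and there is no need to keep looking. In that case, we break out of the loop.
Outside the SAPH['Description'] loop, now that we have compared job with every element of SAPH['Description'] (or found a perfect match), we save the values to scores.

This is repeated for each element of SAP['Description'].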
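The inner loop described above can also be written as a small reusable helper. This is only a sketch: `best_match`, its `score` parameter, and `toy_score` are names invented for this example, and `toy_score` is a deliberately crude stand-in so the snippet runs without textdistance. In the answer's code, `similar` would be passed as `score`.

```python
def best_match(job, candidates, score):
    """Return (best_candidate, best_score) for job, stopping early on a perfect 100."""
    best_candidate, best_score = "", 0
    for candidate in candidates:
        current = score(job, candidate)
        if current > best_score:
            best_candidate, best_score = candidate, current
        if current == 100:  # perfect match, no need to keep searching
            break
    return best_candidate, best_score

# Toy scorer for demonstration only: percentage of positions holding equal characters.
def toy_score(a, b):
    same = sum(x == y for x, y in zip(a, b))
    return 100 * same / max(len(a), len(b), 1)

candidates = ["STW System", "Overhaul valves", "Represerve valves"]
print(best_match("Overhaul valves", candidates, toy_score))
# ('Overhaul valves', 100.0)
```

Passing the scoring function in keeps the search logic separate from the choice of similarity measure.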

This is what scores looks like when it is done:

                                        SAP                                      SAPH Similarity
0  Detailed Inspection of Masts (2100mm) (3  Detailed Inspection of Masts (2100mm) (3        100
1   Ceremonial Awnings-Survey and Load Test   Ceremonial Awnings-Survey and Load Test        100
2   HPA-Carry out 4000 hour service routine   HPA-Carry out 8000 hour service routine    94.7368
3   UxE 8 in Number Temperature Probs for C   UxE 8 in Number Temperature Probs for C        100
4                           Overhaul valves                         Represerve valves    53.4522

After outputting it to a CSV file:

# Output it to Scores.csv without the index column (0, 1, 2, 3... far left in scores above). Remove index=False if you want to keep the index column.
scores.to_csv('Scores.csv', index=False)

...Scores.csv looks like this:

SAP,SAPH,Similarity
Detailed Inspection of Masts (2100mm) (3,Detailed Inspection of Masts (2100mm) (3,100
Ceremonial Awnings-Survey and Load Test,Ceremonial Awnings-Survey and Load Test,100
HPA-Carry out 4000 hour service routine,HPA-Carry out 8000 hour service routine,94.73684210526315
UxE 8 in Number Temperature Probs for C,UxE 8 in Number Temperature Probs for C,100
Overhaul valves,Represerve valves,53.45224838248488

View the full code, and run and edit it online

Note that textdistance and pandas are the libraries required for this. If you haven't installed them yet, use:

pip install textdistance pandas

Notes:

You can round the percentage by replacing f'{highest_score}%' with: f'{round(highest_score, NUMBER_OF_PLACES_TO_ROUND_TO)}%'
Here's a formatted version and here's the code
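As a quick illustration of the rounding note, the sketch below applies it to one of the scores from the Scores.csv output shown earlier. The format-spec variant (`:.2f`) is an alternative not mentioned in the answer; it always prints the same number of decimal places, whereas round() drops them for whole numbers.

```python
highest_score = 94.73684210526315  # value taken from the Scores.csv output

print(f'{round(highest_score, 2)}%')  # 94.74%
print(f'{highest_score:.2f}%')        # 94.74%
print(f'{round(100, 2)}%')            # 100%
print(f'{100:.2f}%')                  # 100.00%
```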

Edit: (for the problem mentioned in the comments)

Here is an error-catching version of the similar function:

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
  try: 
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b) 
    return similarity * 100
  except ZeroDivisionError:
    print('There was an error. Here are the values of a and b that were passed')
    print(f'a: {repr(a)}')
    print(f'b: {repr(b)}')
    exit()
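One plausible cause of the ZeroDivisionError, assuming the formula shown in the question's traceback (`intersection / pow(prod, 1.0 / len(sequences))`): a description with fewer than two characters yields no bigrams, so the product of the bigram counts is zero. The sketch below is a hypothetical guard that returns 0 for such inputs instead of stopping the program. `guarded_similar`, its `score` parameter, and `exact` are names invented here; the scorer is passed in so the example runs without textdistance, and in the answer's code it could be something like `lambda a, b: 1 - textdistance.Cosine(qval=2).distance(a, b)`.

```python
def guarded_similar(a, b, score, qval=2):
    """Return a similarity percentage, or 0.0 when an input cannot form a q-gram."""
    for value in (a, b):
        # Empty cells may arrive as NaN (a float) rather than a string.
        if not isinstance(value, str) or len(value) < qval:
            return 0.0
    return score(a, b) * 100

# Stand-in scorer for demonstration only: 1.0 for identical strings, else 0.0.
def exact(a, b):
    return 1.0 if a == b else 0.0

print(guarded_similar('Overhaul valves', 'Overhaul valves', exact))  # 100.0
print(guarded_similar('Overhaul valves', '', exact))                 # 0.0
print(guarded_similar(float('nan'), 'STW System', exact))            # 0.0
```

Whether 0 is the right score for an empty description is a judgment call; skipping such rows entirely is another option.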

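【Discussion】:

Hi marsnebula, is there a way to make the output CSV appear across two columns? Description and Similarity? Thanks
The last link I provided does exactly that.
Do you want it comma-separated rather than formatted? Because this returns a formatted version.
@Andy_Stillwell - like this? Code
Hi, when it is exported to csv, all the data ends up in column A. Can column A hold the description and column B the similarity score?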