如果你安装fuzzywuzzy,你仍然会遇到一个问题,如何选择合适的启发式来选择正确的产品并切割那些选择不正确的产品(下面解释)
安装fuzzywuzzy:
pip install fuzzywuzzy
fuzzywuzzy 有多种比率计算方法 (examples on github)。您面临的问题是:如何选择最好的?我在你的数据上试过了,但都失败了。
代码:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
# df1 = ...
# df2 = ...
def get_top_by_ratio(x, df2):
product_values = df2.Product.values
# compare two strings by characters
ratio = np.array([fuzz.partial_ratio(x, val) for val in product_values])
argmax = np.argmax(ratio)
rating = ratio[argmax]
linked_product = product_values[argmax]
return rating, linked_product
将此函数应用于您的数据:
partial_ratio = (df1.Brand_var.apply(lambda x: get_top_by_ratio(x, df2))
.apply(pd.Series) # convert returned Series of tuples into pd.DataFrame
.rename(columns={0: 'ratio', 1: 'Product'})) # just rename columns
print(partial_ratio)
Out:
0 65 1960 Altmeister 330ML CAN METAL # Altmeister Bitter
1 50 test # Altos Las Hormigas Argentinian Wine
2 33 test
3 50 test
4 50 test
这样不好。 fuzz.ratio、fuzz.token_sort_ratio 等其他比率方法也失败了。
所以我猜想扩展启发式来比较单词不仅字符可能会有所帮助。定义一个函数,该函数将从您的数据中创建词汇表,对所有句子进行编码并使用更复杂的启发式查找单词:
def create_vocab(df1, df2):
# Leave 0 index free for unknow words
all_words = set((df1.Brand_var.str.cat(sep=' ') + df2.Product.str.cat(sep=' ')).split())
vocab = dict([(i + 1, w) for i, w in enumerate(all_words)])
return vocab
def encode(string, vocab):
"""This function encodes a sting with vocabulary"""
return [vocab[w] if w in vocab else 0 for w in string.split()]
定义新的启发式:
def get_top_with_heuristic(x, df2, vocab):
product_values = df2.Product.values
# compare two strings by characters
ratio_per_char = np.array([fuzz.partial_ratio(x, val) for val in product_values])
# compare two string by words
ratio_per_word = np.array([fuzz.partial_ratio(x, encode(val, vocab)) for val in product_values])
ratio = ratio_per_char + ratio_per_word
argmax = np.argmax(ratio)
rating = ratio[argmax]
linked_product = product_values[argmax]
return rating, linked_product
创建词汇表,将复杂的启发式应用于数据:
vocab = create_vocab(df1, df2)
heuristic_rating = (df1.Brand_var.apply(lambda x: get_top_with_heuristic(x, df2, vocab))
.apply(pd.Series)
.rename(columns={0: 'ratio', 1: 'Product'}))
print(heuristic_rating)
Out:
ratio Product
0 73 1960 Altmeister 330ML CAN METAL # Altmeister Bitter
1 61 Hormi 12 Yr Bottle # Altos Las Hormigas Argentinian Wine
2 45 Hormi 12 Yr Bottle
3 50 test
4 50 test
看来是对的!将此数据框连接到 df1,更改索引:
result_heuristic = pd.concat((df1, heuristic_rating), axis=1).set_index('Brand_var')
print(result_heuristic)
Out:
ratio Product
Brand_var
Altmeister Bitter 73 1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine 61 Hormi 12 Yr Bottle
Amadeus Contri Sparkling Wine 45 Hormi 12 Yr Bottle
Amadeus Cream Liqueur 50 test
Amadeus Sparkling Sparkling Wine 50 test
现在您应该选择一些经验法则来删除不正确的数据。对于此示例,ratio <= 50 效果很好,但您可能需要一些研究来定义最佳启发式和正确阈值。无论如何,你也会得到一些错误。选择可接受的错误率,即 2%、5% ......并改进您的算法,直到达到它(此任务类似于机器学习分类算法的验证)。
删除不正确的“预测”:
result = result_heuristic[result_heuristic.ratio > 50][['Product']]
print(result)
Out: Product
Brand_var
Altmeister Bitter 1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine Hormi 12 Yr Bottle
希望对你有帮助!
P.S.当然,这个算法非常非常慢,当你优化它时你应该做一些优化,例如缓存差异等。