【Posted】:2021-07-14 11:11:30
【Question】:
I have a pandas Series, description, and I used sklearn to compute the similarity between its sentences.
0 0114600043776001 loan payment receipt
1 ogsg s u b e b june 2018 salar
2 sal admin charge
3 sms alert charge outstanding
4 vat onverve*issuance fee*506108*********1112
5 verve*issuance fee*506108*********1112
6 visa credit card repayment jul 2018
7 trsf 0043776013 12140114fcmb ijebu ode1
8 maint fee recovery jun 2018
9 vat maint fee recovery jun 2018
10 0114600043776001 loan payment receipt
11 0114600043776001 loan payment receipt
12 ogsg subeb july sal
13 sms alert charge outstanding
14 trsf 0043776013 12141363fcmb ijebu ode2
15 maint fee recovery jul 2018
16 vat maint fee recovery jul 2018
17 recry card maintenance charge july 2018
18 ogsg subeb aug sal
19 433090995 wd 10322883 15 ibadan rd
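from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_sim(description):
    # Build TF-IDF vectors from word 1- to 3-grams, dropping English stop words
    vectorizer = TfidfVectorizer(min_df=1, analyzer='word', ngram_range=(1, 3), stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(description)
    # Pairwise cosine similarity between every pair of sentences
    matches = cosine_similarity(tfidf_matrix, tfidf_matrix)
    return matches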
The cosine_sim function returns a matrix (a 2-D array) with values between 0 and 1. Now I want to match the sentences whose similarity is between 0.2 and 0.99999.
similarities = cosine_sim(description)
nums = similarities[(0.2 < similarities) & (similarities <= 0.99999)]
I want to use this to return a list of lists, but I simply can't figure out a way to do it.
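Indexing a 2-D NumPy array with a boolean mask of the same shape returns a flat 1-D array of the selected values, so the link between each score and its pair of sentences is lost. A minimal sketch of one way to keep that link is to walk the matrix row by row and collect the sentences whose score falls in the range. The function name group_matches, its parameters, and the small demo matrix are hypothetical, made up for illustration; the loop uses plain Python so it accepts either the array returned by cosine_sim or a list of lists.

```python
def group_matches(similarities, sentences, low=0.2, high=0.99999):
    """Return one list per sentence: the sentence itself, then its matches.

    A match is any other sentence whose similarity s satisfies low < s <= high.
    """
    sentences = list(sentences)  # also works for a pandas Series
    groups = []
    for i, row in enumerate(similarities):
        group = [sentences[i]]
        for j, score in enumerate(row):
            # Skip the diagonal explicitly; self-similarity is not a match
            if j != i and low < score <= high:
                group.append(sentences[j])
        groups.append(group)
    return groups


# Hypothetical 3x3 similarity matrix, invented for this example
demo_sims = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
demo_sentences = ["maint fee jun", "vat maint fee jun", "sal admin charge"]
print(group_matches(demo_sims, demo_sentences))
```

With the real data this would be called as group_matches(similarities, description). Note that every sentence gets its own list, so a matching pair shows up twice, once from each side.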
My expected output should look like this:
[['0114600043776001 loan payment receipt',
'0421209017073500 loan payment receipt'],
['ogsg s u b e b june 2018 salar'],
['sal admin charge'],
['sms alert charge outstanding'],
['vat onverve*issuance fee*506108*********1112'],
['verve*issuance fee*506108*********1112'],
['visa credit card repayment jul 2018',
'visa credit card repayment sep 2018',
'visa credit card repayment oct 2018',
'visa credit card repayment nov 2018',
'visa credit card repayment aug 2018'],
['trsf 0043776013 12140114fcmb ijebu ode1',
'trsf 0043776013 12141363fcmb ijebu ode2'],
['maint fee recovery jun 2018',
'vat maint fee recovery jun 2018',
'maint fee recovery jul 2018',
'vat maint fee recovery jul 2018',
'maint fee recovery aug 2018',
'vat maint fee recovery aug 2018',
'maint fee recovery oct 2018',
'vat maint fee recovery oct 2018',
'maint fee recovery nov 2018',
'vat maint fee recovery nov 2018',
'maint fee recovery may 2018',
'vat maint fee recovery may 2018',
'maint fee 29 jun 2018 30 jul 2018',
'vat maint fee 29 jun 2018 30 jul 2018',
'maint fee 31 jul 2018 30 aug 2018']]
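In the expected output each sentence appears in only one group, which suggests (this is an assumption about the intent, not something the question states) that overlapping matches should be merged into clusters. A rough sketch of that idea treats every in-range pair as a link and merges linked sentences with a small union-find. The name cluster_matches and the demo data are hypothetical. Exact duplicates score about 1.0 and are therefore never linked here, so it may help to drop them first, for example with description.drop_duplicates().

```python
def cluster_matches(similarities, sentences, low=0.2, high=0.99999):
    """Merge sentences into clusters; i and j are linked if low < sim <= high."""
    sentences = list(sentences)
    parent = list(range(len(sentences)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, row in enumerate(similarities):
        for j, score in enumerate(row):
            if j > i and low < score <= high:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so clusters stay in input order
                    parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i, text in enumerate(sentences):
        clusters.setdefault(find(i), []).append(text)
    return list(clusters.values())


# Hypothetical 4x4 similarity matrix, invented for this example
demo_matrix = [
    [1.0, 0.6, 0.0, 0.5],
    [0.6, 1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.1, 0.0, 1.0],
]
demo_texts = ["maint fee jun", "vat maint fee jun", "sal admin charge", "maint fee jul"]
print(cluster_matches(demo_matrix, demo_texts))
```

Because links are followed transitively, two sentences can end up in the same cluster even when their own score is below the threshold, as the second and fourth demo sentences do. Whether that is acceptable depends on the data; a stricter grouping would need a different rule.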
【Discussion】:
Tags: python pandas scikit-learn scipy