python: complex string algorithm

【Question title】: python: complex string algorithm
【Posted】: 2010-08-25 23:15:39
【Question】:

I have a list:

listcdtitles = 

["""    Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka)   """,
""" Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro)  """,
""" Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)  """,
""" Duffy, John: 'Heritage: Civilization and the Jews'- Fanfare & Chorale, Symphonic Dances + Orchestral Suite. Bernstein, 'On the Town' Dance Episodes. (Royal Phil./R.Williams)   """,
""" Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)    """,
""" Hindemith, Trauermusik. Purcell, 'Fairy Queen' Suite. Rossini, String Sonata #6. Petrov, 'Creation of the World' Ballet Suite. Bartok, Romanian Folkdances Sz 56. Tartini, Flute Concerto in G {w.A.Maiorov} (Leningrad Orch.for Ancient & Modern Music/ Serov) """,
""" Bizet, Verdi, Massenet, Puccini: Arias from Carmen, Rigoletto, Werther, Manon Lescaut, Tosca, Turandot + Songs by Lara, Di Capua et al. (Peter Dvorsky, tenor w.Bratislava Orch./Lenard {Also performing 'Carmen' Overt.& 'Thais' Meditation}. Rec.Live, 10/87) """,
""" Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)    """,
""" Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting) """,
""" Gluck, Mozart, Beethoven, Weber, Verdi, Wagner, Ponchielli, Mascagni, Puccini: Arias from Alceste, Don Giovanni, Fidelio, Oberon, Ballo, Tristan, Walkure, Siegfried, Gotterdammerung, Gioconda, Cavalleria, Tosca. (Helene Wildbrunn. Rec.1919-24) """,
""" Stanley, Wesley, Stubley, Boyce, Handel, Heron, Russell, Hook: '18th Century Organ Music on Period Instruments' (Same instruments and artist as above)  """,
""" Reimann, 'Unrevealed' for Baritone & String Quartet to Texts by Lord Byron {R.Salter w.Kreuzberger Quartet}; Variations for Piano (David Levine)    """,
""" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) """,
""" Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) """,
..............................]

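There are about 14,000 elements in this list.

I would like to bunch together the strings that contain similar words.

Any ideas on how to do this? I don't think there is a right or wrong way.

Thanks so much for your advice.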

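【Comments】:

Please define "bunch up" a little more clearly.
I would like them to be concatenated. It doesn't really matter; if you prefer, you could just return the positions instead.
Similar as in Levenshtein/Soundex? If so, you will likely have to build a distance matrix between every pair of strings. If "similar" means something that sorting can handle, then read the strings into a list and use the sorted() function.
Still a bit confused: so you want to concatenate list entries that have similar words. That could take multiple passes, so you may end up with a few (or even one) long entries containing disjoint word sets, because this kind of data contains a lot of common words (aria, chorale, from, at, etc.).
Yes, that is fine. I would first filter out the words I don't want, such as "as, the, then, for".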
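
The filtering step mentioned in the last comment could be sketched roughly like this. It is only an illustration: `STOP_WORDS` and `tokenize` are hypothetical names that do not appear in the question, and the word list is a guess that would need tuning against the real catalogue.

```python
import re

# Hypothetical stop-word list (an assumption); extend it for the real data.
STOP_WORDS = {"as", "the", "then", "for", "from", "at", "by", "of", "in", "on", "to", "and"}

def tokenize(title):
    """Lower-case a title and return its distinct words,
    dropping stop words and single-character tokens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if len(w) > 1 and w not in STOP_WORDS}

print(tokenize(" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) "))
```

Returning a set rather than a list means a word repeated inside one title counts only once in any later comparison.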

Tags: python string

【Solution 1】:

I am new to Python, but I wrote some sample code that computes a similarity score between the entries in that list.

The code is below.

import re

listcdtitles = ["""    Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka)   """,
""" Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro)  """,
""" Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)  """,
""" Duffy, John: 'Heritage: Civilization and the Jews'- Fanfare & Chorale, Symphonic Dances + Orchestral Suite. Bernstein, 'On the Town' Dance Episodes. (Royal Phil./R.Williams)   """,
""" Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)    """,
""" Hindemith, Trauermusik. Purcell, 'Fairy Queen' Suite. Rossini, String Sonata #6. Petrov, 'Creation of the World' Ballet Suite. Bartok, Romanian Folkdances Sz 56. Tartini, Flute Concerto in G {w.A.Maiorov} (Leningrad Orch.for Ancient & Modern Music/ Serov) """,
""" Bizet, Verdi, Massenet, Puccini: Arias from Carmen, Rigoletto, Werther, Manon Lescaut, Tosca, Turandot + Songs by Lara, Di Capua et al. (Peter Dvorsky, tenor w.Bratislava Orch./Lenard {Also performing 'Carmen' Overt.& 'Thais' Meditation}. Rec.Live, 10/87) """,
""" Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)    """,
""" Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting) """,
""" Gluck, Mozart, Beethoven, Weber, Verdi, Wagner, Ponchielli, Mascagni, Puccini: Arias from Alceste, Don Giovanni, Fidelio, Oberon, Ballo, Tristan, Walkure, Siegfried, Gotterdammerung, Gioconda, Cavalleria, Tosca. (Helene Wildbrunn. Rec.1919-24) """,
""" Stanley, Wesley, Stubley, Boyce, Handel, Heron, Russell, Hook: '18th Century Organ Music on Period Instruments' (Same instruments and artist as above)  """,
""" Reimann, 'Unrevealed' for Baritone & String Quartet to Texts by Lord Byron {R.Salter w.Kreuzberger Quartet}; Variations for Piano (David Levine)    """,
""" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) """,
""" Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) """]

entryDictionary = {}
i=0
for entry in listcdtitles:
    #remove unnecessary characters from the string
    entry=re.sub(r'[^\w ]', '', entry.lower(), flags=re.IGNORECASE)
    #split the entry into words and store it in the dictionary
    entryDictionary[i]=entry.split(" ")
    i=i+1
# print the dictionary
print("Entries")
print(entryDictionary)

# define a score matrix, compare the words in each entry and if
# a word is same in both entries, that is one point
scoreMatrix = []
for k in range(i):
    scoreMatrix.append([])
    for j in range (i):
        if j>k:
            scoreMatrix[k].append(0)
        else:
            scoreMatrix[k].append("-")
k=0
j=0

for k in range(i-1):
    entry1 = entryDictionary[k]
    for j in range(k+1,i):
        entry2 = entryDictionary[j]
        for kk in range(len(entry1)):
            for jj in range(len(entry2)):
                if entry1[kk] != "" and entry1[kk] == entry2[jj]:
                    scoreMatrix[k][j] = scoreMatrix[k][j] + 1

print("Score Matrix (Higher numbers denote higher similarity between two entries)")

# header row
print(repr("").rjust(10), end=" ")
for k in range(i-1):
    print(repr("Entry " + str(k)).rjust(10), end=" ")
print(repr("Entry " + str(i-1)).rjust(10))

# one row per entry; "-" marks the cells that are never computed (j <= k)
for k in range(i):
    print(repr("Entry " + str(k)).rjust(10), end=" ")
    for j in range(i-1):
        print(repr(scoreMatrix[k][j]).rjust(10), end=" ")
    print(repr(scoreMatrix[k][i-1]).rjust(10))

The result is as follows:

Score Matrix (Higher numbers denote higher similarity between two entries)

        ''  'Entry 0'  'Entry 1'  'Entry 2'  'Entry 3'  'Entry 4'  'Entry 5'  'Entry 6'  'Entry 7'  'Entry 8'  'Entry 9' 'Entry 10' 'Entry 11' 'Entry 12' 'Entry 13'
 'Entry 0'        '-'          2          3          2          0          1          1          0          1          1          0          0          0          0
 'Entry 1'        '-'        '-'          0          0          0          0         11          0          2          5          0          0          0          0
 'Entry 2'        '-'        '-'        '-'          3          0          1          0          1          0          0          0          0          0          0
 'Entry 3'        '-'        '-'        '-'        '-'          0          4          0          2          0          0          2          0          0          0
 'Entry 4'        '-'        '-'        '-'        '-'        '-'          0          1          0          0          0          0          1          0          0
 'Entry 5'        '-'        '-'        '-'        '-'        '-'        '-'          0          3          1          0          1          1          0          0
 'Entry 6'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          2          5          0          1          0          0
 'Entry 7'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0          0          0          0          0
 'Entry 8'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          2          0          0          0          0
 'Entry 9'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0          0          0
'Entry 10'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0          0
'Entry 11'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0
'Entry 12'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          2
'Entry 13'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'

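【Comments】:

Hey zafer, that's really great of you. But I don't really understand what this does. Could you explain?
Hi, I updated the code. There were some errors in it. Now you can see the matrix, which shows the similarity score between each pair of entries, i.e. how similar they are. Scores are calculated as follows: entry 1: 'a',B,D entry 2: A,c,X. The score is 1 because of 'a' and A. It is a simple approach. I hope it helps.
I clean up each entry with the following line: entry=re.sub(r'[^\w ]', '', entry.lower(), flags=re.IGNORECASE) so that the algorithm uses only words for the calculation. Excluded words such as "and" could be added to improve the quality.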
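
For what it's worth, the same shared-word score with an exclusion list can be written more compactly with sets. This is a sketch rather than the answer's code: `EXCLUDED`, `words_of` and `shared_word_scores` are made-up names, the exclusion list is an assumption, and because sets are used a word repeated within one entry is counted once (the nested loops in the answer count every matching pair).

```python
import re
from itertools import combinations

# Assumed exclusion list; adjust it to the data.
EXCLUDED = {"and", "the", "from", "for", "of", "in", "by", "at", "to", "on"}

def words_of(entry):
    """Same cleanup as the answer's regex, then drop the excluded words."""
    cleaned = re.sub(r"[^\w ]", "", entry.lower())
    return set(cleaned.split()) - EXCLUDED

def shared_word_scores(entries):
    """Map each index pair (a, b) with a < b to the number of distinct shared words."""
    word_sets = [words_of(e) for e in entries]
    return {(a, b): len(word_sets[a] & word_sets[b])
            for a, b in combinations(range(len(entries)), 2)}

sample = [
    " Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) ",
    " Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) ",
    " Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)  ",
]
for pair, score in shared_word_scores(sample).items():
    print(pair, score)
```

The two Bruckner entries score 2 here ("bruckner" and "symphony"), matching the value in the matrix above. With 14,000 entries there are roughly 98 million pairs, so in practice one would probably keep only the non-zero scores instead of a full dictionary.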

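【Solution 2】:

First, parse all of the strings and associate each token with its frequency. High-frequency tokens must be blacklisted.

Then you have to compare the strings: iterate over them and associate each pair with a distance score. Based on that score you either join them or you don't.

That would be a simple approach.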
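
A rough sketch of those steps follows, under several assumptions that are not in the answer: the score is the Jaccard similarity of the token sets, "high frequency" means a token occurs in more than `max_doc_share` of the titles, and titles are joined transitively with a small union-find. The names `tokens`, `group_titles`, `max_doc_share` and `min_similarity` are all hypothetical, and the thresholds are arbitrary starting points.

```python
import re
from collections import Counter
from itertools import combinations

def tokens(title):
    """Distinct lower-case alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def group_titles(titles, max_doc_share=0.5, min_similarity=0.2):
    token_sets = [tokens(t) for t in titles]

    # Step 1: count in how many titles each token occurs, blacklist the frequent ones.
    doc_freq = Counter(tok for ts in token_sets for tok in ts)
    limit = max_doc_share * len(titles)
    blacklist = {tok for tok, n in doc_freq.items() if n > limit}
    token_sets = [ts - blacklist for ts in token_sets]

    # Union-find over title indices, used to merge similar titles into groups.
    parent = list(range(len(titles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 2: score every pair and join the pairs that are similar enough.
    for a, b in combinations(range(len(titles)), 2):
        union = token_sets[a] | token_sets[b]
        if not union:
            continue
        similarity = len(token_sets[a] & token_sets[b]) / len(union)
        if similarity >= min_similarity:
            parent[find(a)] = find(b)

    groups = {}
    for i in range(len(titles)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

sample = [
    " Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) ",
    " Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) ",
    " Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)    ",
    " Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)  ",
]
print(group_titles(sample))  # [[0, 1], [2], [3]]
```

Because joining is transitive, a chain of pairwise-similar titles can merge into one long group, which is the effect one of the comments on the question warns about; raising `min_similarity` or lowering `max_doc_share` limits it. The pairwise loop is quadratic, so for 14,000 titles it may be worth comparing only titles that share at least one non-blacklisted token.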

【Comments】: