【Question title】: Find all common contiguous substrings of two strings in python [duplicate]
【Posted】: 2017-08-28 03:01:31
【Question description】:

I have two strings, and I want to find all the common words. For example,

s1 = 'Today is a good day, it is a good idea to have a walk.'

s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'

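Consider matching s1 against s2.

'Today is' matches 'today is', but 'Today is a' does not match anything in s2. So 'today is' is one of the common contiguous substrings. Similarly, we have 'a good day', 'is', 'a good' and 'have a walk'. So the common words are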

common = ['today is', 'a good day', 'is', 'a good', 'have a walk']

Can we use regular expressions to do this?

Thank you very much.

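【Question discussion】:

  • Are you looking for common words or common phrases? Are you trying to avoid double-counting matches? A phrase such as 'a good day' can be broken down into 'a good', which would then be evaluated again.
  • Your criteria need tightening: for example, 'Today' in s1 and 'Yesterday' in s2 have 'day' in common.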
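
To illustrate the point raised in the comments, here is a minimal sketch of one way to tighten the criteria: split both strings into lowercase words with a regular expression, so that matching happens on whole words rather than on characters. The helper name `words` is made up for this example and is not part of the question.

```python
import re

def words(text):
    """Lowercase the text and return its words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

s1 = 'Today is a good day, it is a good idea to have a walk.'
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'

w1, w2 = words(s1), words(s2)

# Character level: 'day' is hidden inside both 'today' and 'yesterday'.
print('day' in 'today' and 'day' in 'yesterday')

# Word level: 'today' and 'yesterday' are different tokens, so they do not match.
shared_words = sorted(set(w1) & set(w2))
print(shared_words)
```

With the two example strings this prints `True` and then `['a', 'day', 'good', 'have', 'is', 'today', 'walk']`. A regular expression is only used for tokenizing here; finding the common runs of words still needs a separate step.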

Tags: python regex string


【Solution 1】:
import string

s1 = 'Today is a good day, it is a good idea to have a walk.'
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'
z = []
# Remove punctuation (Python 3 form of str.translate)
table = str.maketrans('', '', string.punctuation)
s1 = s1.translate(table)
s2 = s2.translate(table)
print(s1)
print(s2)
sw1 = s1.lower().split()   # split into lowercase words
sw2 = s2.lower().split()
print(sw1, sw2)
i = 0
while i < len(sw1):   # a while loop is used because i is also advanced inside the body
    x = 0    # number of words in the current run
    r = ""   # the current run of matching words
    d = i    # start position in sw1, restored after each run
    for j in range(len(sw2)):
        # the bounds check avoids an IndexError when a run reaches the end of sw1
        if i < len(sw1) and sw1[i] == sw2[j]:
            r = r + ' ' + sw2[j]   # extend the run while the words keep matching
            x += 1
            i += 1
        elif x > 0:   # the run has ended: store it in z and restart from d
            z.append(r)
            i = d
            r = ""
            x = 0
    if x > 0:   # flush a run that reaches the end of sw2
        z.append(r)
        i = d
    i += 1
print(list(set(z)))   # each stored run keeps its leading space

#O(n^3)
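
As an alternative to the nested loops above, here is a rough sketch of the same idea written as a dynamic-programming table over the two word lists. It is not the answerer's method, just one possible way to do it; the function name `common_word_runs` is made up for this example. It assumes a "common run" means a run of consecutive words that cannot be extended on either side at that position.

```python
import re

def common_word_runs(words1, words2):
    """Return every maximal run of consecutive words found in both lists."""
    n, m = len(words1), len(words2)
    # run[i][j] = length of the common run ending at words1[i-1] and words2[j-1]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if words1[i - 1] == words2[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
    found = set()
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            length = run[i][j]
            # keep a run only if the next pair of words does not extend it
            extendable = i < n and j < m and words1[i] == words2[j]
            if length and not extendable:
                found.add(' '.join(words1[i - length:i]))
    return found

s1 = 'Today is a good day, it is a good idea to have a walk.'
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'
w1 = re.findall(r"\w+", s1.lower())
w2 = re.findall(r"\w+", s2.lower())

result = sorted(common_word_runs(w1, w2))
print(result)
```

For the example strings this prints `['a', 'a good', 'a good day', 'good', 'have a walk', 'is', 'today is']`, which is the list from the question plus the single words 'a' and 'good'. The table takes time and memory proportional to len(w1) * len(w2).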

【Discussion】:

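    【Solution 2】:

    Adapted from Find common substring between two strings

    I modified a few lines and added a few more: if no substring is found, the function returns answer = "NULL" by default.

    Also added: keep searching until you get NULL, and store each result in a list.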

    def longestSubstringFinder(string1, string2):
        # "NULL" is the not-found sentinel. A candidate must be longer than the
        # current answer, so only matches of 5 or more characters are returned.
        answer = "NULL"
        len1, len2 = len(string1), len(string2)
        for i in range(len1):
            match = ""
            for j in range(len2):
                if (i + j < len1 and string1[i + j] == string2[j]):
                    match += string2[j]
                else:
                    if (len(match) > len(answer)): answer = match
                    match = ""
        return answer


    mylist = []

    def call():
        s1 = 'Today is a good day, it is a good idea to have a walk.'
        s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'
        s1 = s1.lower()
        s2 = s2.lower()
        # Keep extracting the longest match until nothing is found.
        while True:
            x = longestSubstringFinder(s2, s1)
            if x == "NULL":
                break
            print(x)
            mylist.append(x)
            s2 = s2.replace(x, ' ')  # blank out the match so it is not found again

    call()
    print('[%s]' % ','.join(map(str, mylist)))
    

    Output

    [ a good day, , have a walk,today is , good]
    

    This differs from the expected output

    common = ['today is', 'a good day', 'is', 'a good', 'have a walk']
    

    Your expectation of a second "is" is wrong: as you can see, s2 contains only one "is".
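
The "take the longest match, remove it, repeat" loop from this answer can also be sketched on word lists with the standard library's `difflib.SequenceMatcher`. This is only an illustration under a few assumptions, not the answerer's code: the function name `repeated_longest_runs` is made up, matching is done on whole words instead of characters, and matched words in the second list are replaced by `None` so they cannot be matched again.

```python
import re
from difflib import SequenceMatcher

def repeated_longest_runs(words1, words2):
    """Greedily take the longest common word run, blank it out of words2, repeat."""
    remaining = list(words2)
    runs = []
    while True:
        matcher = SequenceMatcher(None, words1, remaining, autojunk=False)
        m = matcher.find_longest_match(0, len(words1), 0, len(remaining))
        if m.size == 0:   # nothing left in common: the "not found" case
            break
        runs.append(' '.join(words1[m.a:m.a + m.size]))
        # Replace the matched words with None so they cannot match again
        # and the neighbouring words are not joined into a new run.
        remaining[m.b:m.b + m.size] = [None] * m.size
    return runs

s1 = 'Today is a good day, it is a good idea to have a walk.'
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'
w1 = re.findall(r"\w+", s1.lower())
w2 = re.findall(r"\w+", s2.lower())

runs = repeated_longest_runs(w1, w2)
print(runs)
```

For the example strings this prints `['a good day', 'have a walk', 'today is', 'good']`, the same four matches as the output above but without the stray spaces and comma. Here an empty match (size 0) plays the role of the "NULL" return value, so short words are not silently skipped.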

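    【Discussion】:

    • Thank you, Hariom Singh, you are right.
    • The program does not work for the input mentioned: s1 = 'Today is a good day, it is a good idea to have a walk.', s2 = 'Yesterday was not a good day, but today is a good day, shall we have a walk?'
    • @Poonam It works fine. Did you run the call() function?
    • @stackoverflow.com/users/7590993/hariom-singh - Yes. I had assumed that repeated occurrences were not allowed: once I have 'today is a good day' as the longest string, 'good' should not be repeated. But as far as the question goes, your logic works fine.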