String searching in text file and dict values combinations: answers

【Question title】: String searching in text file and dict values combinations
【Posted】: 2021-04-29 02:37:53
【Question description】:

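I'm a beginner in Python. I'm studying it at university, and the professor gave us some exercises to do before the exam. I've been stuck on this program for almost 2 weeks now, and the rule is that we can't use any libraries. Basically, I have a dictionary with several possible translations from an ancient language to English, a dictionary from English to Italian (only 1 key - 1 value pairs), a text file in the ancient language, and another text file in Italian. What I've done so far is scan the ancient-language file and look up the corresponding strings in the dictionary (using the .strip(".,:;?!") method), and I now save the matching strings that contain at least 2 words in a list of strings. Now the hard part: I need to try all possible combinations of translations (the values from the ancient language to English), carry those translations from English to Italian with the other dictionary, and check whether the resulting string exists in the Italian file. If it does, I save the result and the paragraph where it was found (results found in a different paragraph don't count; it has to be the same paragraph, which I count with a small piece of code I wrote). I'm running into problems here for the following reasons:

In the strings I've found, how should I replace the words while keeping the punctuation? The returned result must contain all the punctuation, otherwise the output will be wrong.
If a string is spread over 2 different lines of the text, how should I proceed to make it work? For example, I have a 5-word string: I find the first 2 words at the end of one line, but the remaining 3 words are the first 3 words of the next line.
As mentioned, the dictionary from the ancient language to English is large, and each key (ancient language) can have up to 7 values (translations). Is there an efficient way to try all the combinations while checking whether the string exists in a text file? This is probably the hardest part. Perhaps the best way to handle it is to scan word by word each time, and if the sequence is broken, somehow reset it and keep scanning the text file... Any ideas?

Here is the code I've written so far, with comments: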

k = 2       # Arbitrary value; the whole program will become a function and "k" will be different each time

file = [ line.strip().split(';') for line in open('lexicon-GR-EN.csv', encoding="utf8").readlines() ]       #Opening CSV file with possible translations from ancient Greek to English
gr_en = { words[0]: tuple(words[1:]) for words in file }                                                    #Creating a dictionary with the several translations (values)    



file = open('lexicon-EN-IT.csv', encoding="utf8")     # Opening 2nd CSV file
en_it = {}                                            # Initializing dictionary
for row in file:                                      # Scanning each row of the CSV file (From English to Italian)
    L = row.rstrip("\n").split(';')                   # Clearing newline char and splitting the words
    x = L[0]
    t1 = L[1]
    en_it[x] = t1                                     # Every entry in this CSV file is 1 - 1, so no length check is necessary (len(L) is always 2)
                              
    
file = open('odyssey.txt', encoding="utf8")           # Opening text file
result = ()                                           # Empty tuple
spacechecker = 0                                      # Tracks whether the current line is odd or even: odd lines are scanned normally, on even lines the word order and the words themselves are reversed
wordcount = 0                                         # Counter of how many words have been found  
paragraph = 0                                         # Paragraph counter, starts at 0
paragraphspace = 0                                    # Another paragraph variable, i need this to prevent double-space to count as paragraph  
string = ""                                           # Empty string to store corresponding sequences 
foundwords = []                                       # Empty list to store words that have been found 
completed_sequences = []                              # Empty list, here will be stored all completed sequences of words      
completed_paragraphs = []                             # Empty list, stores the paragraph in which each sequence of completed_sequences was found

for index, line in enumerate(file.readlines()):       # Starting line by line scan of the txt file  
        words = line.split()                          # Splitting words
        if not line.isspace() and index == 0:         # I know nothing about the "secret tests" that will be run on this program, so this check handles the start of the first paragraph: if the first line is not blank
            paragraph += 1                            # Add +1 to paragraph counter  
            spacechecker += 1                         # Add +1 to spacechecker
            
        elif not line.isspace() and paragraphspace == 1:     # Checking if the previous line was space and the current is not                   
            paragraphspace = 0                               # Resetting paragraphspace (precedent line was space) value  
            spacechecker += 1                                # Increasing the spacechecker +1
            paragraph +=1                                    # This means we're on a new paragraph so +1 to paragraph
            
        elif line.isspace() and paragraphspace == 1:         # Checking if the current line is space and the precedent line was space too.
            continue                                         # Do nothing and cycle again
            
        elif line.isspace():                                 # Checking if the current line is space  
            paragraphspace += 1                              # Increase paragraphspace (precedent line was space variable) +1
            continue
        else:
            spacechecker += 1                                # Any other case increase spacechecker +1    
            
            
        if spacechecker % 2 == 1:                                           # Check if spacechecker is odd
        
            for i in range(len(words)):                                     # If yes scan the words in normal order
            
                word = words[i].strip(",.!?:;-")                                               # Current word without surrounding punctuation
                if word in gr_en and "[unavailable]" not in gr_en[word]:                       # If the stripped word is in the dictionary and has a translation
                    currword = words[i]                                                        # If yes, we will call it "currword"
                    foundwords.append(currword)                                                # Add currword to the foundwords list
                    wordcount += 1                                                             # Increase wordcount +1

                elif wordcount >= k:     # Otherwise the sequence is broken: keep it if it is at least k words long
                     string = " ".join(foundwords)                                      # We will put the foundwords list in a string
                     completed_sequences.append(string)                                 # And add this string to the list of strings of completed_sequences
                     completed_paragraphs.append(paragraph)                             # Then add the paragraph of that string to the list of completed_paragraphs
                     result = list(zip(completed_sequences, completed_paragraphs))      # This the output format required, a tuple with the string and the paragraph of that string
                     wordcount = 0
                     
                     foundwords.clear()                                                 # Clearing the foundwords list
                   
                else:                                                     # If none of the above happened (word is not in dictionary and wordcounter still isn't >= k)
                    wordcount = 0                                         # Reset wordcount to 0  
                    foundwords.clear()                                    # Clear foundwords list      
                    continue                                              # Do nothing and cycle again 
                    
                    
        else:                                                             # The case of spacechecker being not odd,
            words = words[::-1]                                           # Reverse the word order
            
            for i in range(len(words)):                                        # Scanning the row of words
                currword = words[i][::-1]                                      # Currword in this case will be reversed since the words in even lines are written in reverse.
                word = currword.strip(",.!?:;-")                               # Currword without surrounding punctuation
                if word in gr_en and "[unavailable]" not in gr_en[word]:       # If the stripped word is in the dictionary and has a translation
                    foundwords.append(currword)                                # Append it to the foundwords list
                    wordcount += 1                                             # Increase wordcount +1

                elif wordcount >= k:     # Otherwise the sequence is broken: keep it if it is at least k words long
                     string = " ".join(foundwords)                                  # Add the words that has been found to the string  
                     completed_sequences.append(string)                             # Append the string to completed_sequences list      
                     completed_paragraphs.append(paragraph)                         # Append the paragraph of the strings to the completed_paragraphs list  
                     result = list(zip(completed_sequences, completed_paragraphs))  # Adding to the result the tuple combination of strings and corresponding paragraphs
                     wordcount = 0                                                  # Reset wordcount
                     
                     foundwords.clear()                                             # Clear foundwords list
                    
                else:                                                     # In case none of above happened     
                    wordcount = 0                                         # Reset wordcount to 0  
                    foundwords.clear()                                    # Clear foundwords list  
                    continue                                              # Do nothing and cycle again

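【Discussion】:

Can you post some of your code?
@horcrux In case you're wondering why there is all that checking of whether a line is odd or even: the ancient Greek text is written with one line in normal order and the next line reversed (both the words themselves and their order are reversed).
I've finished commenting the code, so it should be easier to read now: pastebin.com/7eQEN5PG

Tags: python file dictionary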

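【Solution 1】:

I would probably approach the problem like this:

Try to merge the 2 word dictionaries into one (ancient_italian below), removing English from the equation. E.g. if Ancient->English has {"canus": ["dog", "puppy", "wolf"]} and English->Italian has {"dog": "cane"}, then you can create a new dictionary {"canus": "cane"}. (Of course, if the English->Italian dict contains all 3 English words, you need to pick one, or show something like cane|cucciolo|lupo in the output.)
Come up with a regular expression that can distinguish words from separators (punctuation) and output them in order into a list (word_list below), i.e. something like ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']
Iterate over this list, generating a new list. Something like: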

translation = []
for item in word_list:
    if item.isalpha():
        # It's a word - translate it and add to the list
        translation.append(ancient_italian[item])
    else:
        # It's a separator - add to the list as-is
        translation.append(item)

Finally, join the list back together: ''.join(translation)
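Since the question rules out libraries, the split into words and separators can also be done by hand instead of with a regular expression. The following is only a minimal sketch, assuming that a word is any run of alphabetic characters and that every other character is a separator. The names split_words_and_separators and translate_keep_punctuation, as well as the sample dictionary, are hypothetical and not taken from the original post; unknown words are left untranslated here, which is also an assumption.

```python
def split_words_and_separators(text):
    """Split text into word and separator tokens, preserving their order."""
    tokens = []
    current = ""
    for ch in text:
        if ch.isalpha():
            current += ch              # still inside a word
        else:
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)          # each non-letter character is its own token
    if current:
        tokens.append(current)         # flush a word that ends the text
    return tokens


def translate_keep_punctuation(text, dictionary):
    """Replace known words; keep unknown words and separators unchanged."""
    out = []
    for token in split_words_and_separators(text):
        if token.isalpha():
            out.append(dictionary.get(token, token))
        else:
            out.append(token)
    return "".join(out)


# Hypothetical sample data, not taken from the real lexicon files
ancient_italian = {"magnus": "grande", "canus": "cane"}

print(split_words_and_separators("ecce! magnus canus esurit."))
# ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']
print(translate_keep_punctuation("ecce! magnus canus esurit.", ancient_italian))
# ecce! grande cane esurit.
```

Because spaces and punctuation are kept as tokens of their own, joining with an empty string reproduces the original spacing and punctuation exactly.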

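【Discussion】:

First of all, thank you for helping me! 1) As far as I know, apart from the words marked [unavailable] in the ancient Greek-English dictionary (which don't matter anyway), every English translation has an Italian translation. What would be working code to merge each value of GR_EN with the corresponding EN_IT entry? 2) This looks good, but I'm terrible at handling strings: is there a way to split words and punctuation? Would the .split() method be useful here? 3) Wouldn't this put words and punctuation at different indexes of the list?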

【Solution 2】:

I can't reply to your comment on the other answer, but this might help:

First, it's not the most elegant approach, but it should work:

GR_IT = {}
for greek, eng in GR_EN.items():
    for word in eng:
        try:
            GR_IT[greek] = EN_IT[word]
        except KeyError:
            # No Italian translation for this English word - skip it
            pass

If a word has no translation, it is ignored.

To get a list with words and punctuation split apart, try the following:

def repl_punc(s):
    punct = ['.',',',':',';','?','!']
    for p in punct:
        s=s.replace(p,' '+p+' ')
    return s
repl_punc(s).split()
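As a quick check of what this returns, the function can be run on a short sentence. The sentence below is made up for illustration; the function body is the one from the answer.

```python
def repl_punc(s):
    # Surround each punctuation mark with spaces so split() separates it
    punct = ['.', ',', ':', ';', '?', '!']
    for p in punct:
        s = s.replace(p, ' ' + p + ' ')
    return s


tokens = repl_punc("ecce! magnus canus, esurit.").split()
print(tokens)
# ['ecce', '!', 'magnus', 'canus', ',', 'esurit', '.']
```

Words and punctuation marks do end up at separate indexes of the list. Note that split() discards the whitespace, so rebuilding the sentence with " ".join(tokens) would put a space before every punctuation mark.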

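【Discussion】:

This dictionary works, but it looks like it only keeps 1 Italian translation per Greek word. Instead, I need all the possible translations from the gr_en tuples to make it 100% accurate. As for the function, it looks like it works well, and combined with a few checks it will make checking for existing words easier. Thank you very much!
Try adding something like translations = [] right after the start of the first for loop, then in the try block change the assignment to translations.append(EN_IT[word]); GR_IT[greek] = translations. If that doesn't work, could you provide a snippet of the translation dictionaries? And if this was useful, please upvote!
It looks like it works, but I don't understand why the console's variable view says 4375 elements for GR_EN and 4084 elements for EN_IT, when they should have exactly the same number. Now it's time to come up with an algorithm that makes the whole thing work. Anyway, thank you very much! You've got my upvote!
OK, never mind, I just realized this works even better than I expected: all the "unavailable" words, which therefore aren't in the EN_IT dictionary, didn't go in there at all, so it's clean and lists only the words I have an actual translation for. Looks perfect.
Glad to help! :)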
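The change suggested in the comments above, together with the "try all combinations" step from the question, could look roughly like the sketch below. It is not the poster's final code. The helper names build_gr_it and all_combinations and the sample words are hypothetical, and skipping duplicate Italian translations is an added assumption. No library is used, in line with the rule stated in the question.

```python
def build_gr_it(gr_en, en_it):
    """Map each Greek word to the list of all Italian translations found."""
    gr_it = {}
    for greek, english_words in gr_en.items():
        translations = []
        for word in english_words:
            # Assumption: the same Italian word is stored only once per Greek word
            if word in en_it and en_it[word] not in translations:
                translations.append(en_it[word])
        if translations:               # Greek words without any translation are left out
            gr_it[greek] = translations
    return gr_it


def all_combinations(options):
    """Return every way to pick one item from each list in options."""
    combos = [[]]
    for choices in options:
        combos = [combo + [choice] for combo in combos for choice in choices]
    return combos


# Hypothetical sample data, not taken from the real lexicon files
gr_en = {"kyon": ("dog", "hound"), "megas": ("big",), "xxx": ("[unavailable]",)}
en_it = {"dog": "cane", "hound": "segugio", "big": "grande"}

gr_it = build_gr_it(gr_en, en_it)
candidates = [" ".join(c) for c in all_combinations([gr_it["megas"], gr_it["kyon"]])]
print(gr_it)        # {'kyon': ['cane', 'segugio'], 'megas': ['grande']}
print(candidates)   # ['grande cane', 'grande segugio']
```

The number of candidates is the product of the list lengths, so with up to 7 translations per word it grows quickly for long sequences. Checking the Italian text word by word and abandoning a sequence as soon as one word fails to match, as the question itself suggests, avoids building every combination up front.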