【Question title】: Finding Alliterative Word Sequences with Python
【Posted】: 2017-02-08 22:45:53
【Question】:

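I'm using Python 3.6 with NLTK 3.2.

I'm trying to write a program that takes raw text as input and outputs any (maximal) series of consecutive words that begin with the same letter (i.e. alliterative sequences).

When searching for sequences, I want to ignore certain words and punctuation (e.g. 'it', 'that', 'into', ''s', ',', and '.') but still include them in the output.

For example, the input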

"The door was ajar. So it seems that Sam snuck into Sally's subaru."

should yield

["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"]

I'm new to programming, and the best I could come up with is:

import nltk
from nltk import word_tokenize

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."

tokened_text = word_tokenize(raw)                   #word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text]    #make it lowercase

for w in tokened_text:                              #for each word of the text
    letter = w[0]                                   #consider its first letter
    allit_str = []
    allit_str.append(w)                             #add that word to a list
    pos = tokened_text.index(w)                     #let "pos" be the position of the word being considered
    for i in range(1,len(tokened_text)-pos):        #consider the next word
        if tokened_text[pos+i] in {"the","a","an","that","in","on","into","it",".",",","'s"}:   #if it's one of these
            allit_str.append(tokened_text[pos+i])   #add it to the list
            i=+1                                    #and move on to the next word
        elif tokened_text[pos+i][0] == letter:      #or else, if the first letter is the same
            allit_str.append(tokened_text[pos+i])   #add the word to the list
            i=+1                                    #and move on to the next word
        else:                                       #or else, if the letter is different
            break                                   #break the for loop
    if len(allit_str)>=2:                           #if the list has two or more members
        print(allit_str)                            #print it

which outputs

['ajar', '.']
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['snuck', 'into', 'sally', "'s", 'subaru', '.']
['sally', "'s", 'subaru', '.']
['subaru', '.']

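This is close to what I want, except that I don't know how to restrict the program to printing only the maximal sequences.

So my questions are:

  1. How can I modify this code to print only the maximal sequence ['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']?
  2. Is there an easier way to do this in Python, perhaps with regular expressions or more elegant code?

Here are similar questions asked elsewhere, but they did not help me modify my code:

(I also think it would be good to have this question answered on this site.)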

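【Comments】:

  • To avoid duplicates, scan the string only once. Get rid of the for loop and scan the string using an index. Keep track of the index of the last non-ignored word and its first letter. When you find a word that starts with a different letter, check whether you have a long enough sequence to print.
  • Also, your current code has a bug: if a word appears twice in a sentence, tokened_text.index() will always find the first position.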
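
The single-pass idea from the comments above could be sketched roughly as follows. This is only one possible reading of that advice, not code from the thread: it iterates over the tokens directly rather than by index, `tokenize` is a simplified regex stand-in for NLTK's `word_tokenize`, and `maximal_alliterations` is a hypothetical helper name. It also assumes that trailing ignored tokens (such as the final '.') should be trimmed, which matches the first expected output in the question but not the list in question 1.

```python
import re

# Words and punctuation skipped when comparing initials but kept in the output
# (the same set used in the question's code).
IGNORED = {"the", "a", "an", "that", "in", "on", "into", "it", ".", ",", "'s"}

def tokenize(text):
    # Simplified stand-in for nltk.word_tokenize: words, "'s", and punctuation.
    return re.findall(r"'s\b|\w+|[^\w\s]", text.lower())

def maximal_alliterations(tokens, ignored=IGNORED, min_words=2):
    runs = []
    run = []          # tokens of the current run, ignored tokens included
    content = 0       # number of non-ignored words in the current run
    last_content = 0  # length of the run up to its last non-ignored word
    letter = None     # first letter shared by the current run
    for tok in tokens:
        if tok in ignored:
            if run:   # an ignored token never starts a run
                run.append(tok)
            continue
        if tok[0] == letter:
            run.append(tok)
        else:
            # The run ended: keep it only if it is long enough,
            # trimming any ignored tokens collected at its end.
            if content >= min_words:
                runs.append(run[:last_content])
            run, content, letter = [tok], 0, tok[0]
        content += 1
        last_content = len(run)
    if content >= min_words:
        runs.append(run[:last_content])
    return runs

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."
print(maximal_alliterations(tokenize(raw)))
```

Because each token is visited exactly once and a run is only emitted when it ends, no sub-sequence is printed twice, and repeated words cause no trouble since `index()` is never called.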

Tags: regex string python-3.x nltk


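【Solution 1】:

Interesting task. Personally, I would loop through without using indexes, keeping track of the previous word so it can be compared with the current one.

Also, comparing letters alone is not enough; you have to take into account that 's' and 'sh', etc., don't alliterate. Here is my attempt: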

import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords
import string
from collections import defaultdict, OrderedDict
import operator

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru. She seems shy sometimes. Someone save Simon."

# Get the English alphabet as a list of letters
letters = [letter for letter in string.ascii_lowercase] 

# Here we add some extra phonemes that are distinguishable in text.
# ('sailboat' and 'shark' don't alliterate, for instance)
# Digraphs go first as we need to try matching these before the individual letters,
# and break out if found.
sounds = ["ch", "ph", "sh", "th"] + letters 

# Use NLTK's built in stopwords and add "'s" to them
stopwords = stopwords.words('english') + ["'s"] # add extra stopwords here
stopwords = set(stopwords) # sets are MUCH faster to process

sents = sent_tokenize(raw)

alliterating_sents = defaultdict(list)
for sent in sents:
    tokenized_sent = word_tokenize(sent)

    # Create list of alliterating word sequences
    alliterating_words = []
    previous_initial_sound = ""
    for word in tokenized_sent:
        for sound in sounds:
            if word.lower().startswith(sound): # only lowercasing when comparing retains original case
                initial_sound = sound
                if initial_sound == previous_initial_sound:
                    if len(alliterating_words) > 0:
                        if previous_word == alliterating_words[-1]: # prevents duplication in chains of more than 2 alliterations, but assumes repetition is not alliteration)
                            alliterating_words.append(word)
                        else:
                            alliterating_words.append(previous_word)
                            alliterating_words.append(word)
                    else:
                        alliterating_words.append(previous_word)
                        alliterating_words.append(word)                
                break # Allows us to treat sh/s distinctly

        # This needs to be at the end of the loop
        # It sets us up for the next iteration
        if word.lower() not in stopwords: # ignores stopwords (NLTK's list is lowercase) for the purpose of determining alliteration
            previous_initial_sound = initial_sound
            previous_word = word

    alliterating_sents[len(alliterating_words)].append(sent)

sorted_alliterating_sents = OrderedDict(sorted(alliterating_sents.items(), key=operator.itemgetter(0), reverse=True))

# OUTPUT
print ("A sorted ordered dict of sentences by number of alliterations:")
print (sorted_alliterating_sents)
print ("-" * 15)
max_key = max([k for k in sorted_alliterating_sents]) # to get sent with max alliteration 
print ("Sentence(s) with most alliteration:", sorted_alliterating_sents[max_key])

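This produces a sorted, ordered dict of sentences, keyed by their number of alliterations. The max_key variable holds the highest alliteration count, which can be used to access the corresponding sentence or sentences.

【Comments】: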
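
To make the shape of that result easier to see, here is a minimal sketch of the grouping step in isolation. The sentences and counts are hypothetical placeholders, not output of the code above, and `group_by_count` is a name introduced only for this illustration. It assumes Python 3.7+, where a plain dict preserves insertion order, so `OrderedDict` is not strictly required.

```python
from collections import defaultdict

def group_by_count(scored):
    # Collect sentences under their alliteration count.
    groups = defaultdict(list)
    for sentence, count in scored:
        groups[count].append(sentence)
    # Highest count first; the keys are unique, so only keys are compared.
    return dict(sorted(groups.items(), reverse=True))

# Hypothetical (sentence, count) pairs, for illustration only.
scored = [("sentence A", 0), ("sentence B", 3), ("sentence C", 3), ("sentence D", 2)]
groups = group_by_count(scored)
max_key = max(groups)
print(groups[max_key])  # ['sentence B', 'sentence C']
```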

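    【Solution 2】:

    The accepted answer is very thorough, but I would suggest using Carnegie Mellon University's pronouncing dictionary (CMUdict). This is partly because it makes life easier, and partly because syllables that sound the same also count as alliteration even when they are not spelled with the same letters. An example I found online (https://examples.yourdictionary.com/alliteration-examples.html) is "Finn fell for Phoebe."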
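
As a rough illustration of why a phoneme lookup helps, the sketch below compares the first phoneme of two words. To stay runnable without downloading any data, it uses a small hand-written table of ARPABET entries as a stand-in for `nltk.corpus.cmudict.dict()`; the table contents and the helper names `first_phoneme` and `alliterate` are assumptions made for this example only.

```python
# Hand-written ARPABET entries standing in for nltk.corpus.cmudict.dict(),
# which maps each lowercase word to a list of pronunciations.
TOY_PRONUNCIATIONS = {
    "finn": [["F", "IH1", "N"]],
    "fell": [["F", "EH1", "L"]],
    "phoebe": [["F", "IY1", "B", "IY0"]],
    "sam": [["S", "AE1", "M"]],
    "shy": [["SH", "AY1"]],
}

def first_phoneme(word, pronunciations=TOY_PRONUNCIATIONS):
    # Look the word up in lowercase; None means "not in the dictionary".
    entries = pronunciations.get(word.lower())
    return entries[0][0] if entries else None

def alliterate(a, b):
    # Two words alliterate here if both are known and share a first phoneme.
    pa, pb = first_phoneme(a), first_phoneme(b)
    return pa is not None and pa == pb

print(alliterate("Finn", "Phoebe"))  # True: 'f' and 'ph' are both F
print(alliterate("Sam", "shy"))      # False: S versus SH
```

Returning None for unknown words, rather than a shared placeholder string, keeps two out-of-dictionary words from being counted as alliterating with each other.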

    import re
    import nltk

    # nltk.download('cmudict') ## download CMUdict for phoneme set
    # The phoneme dictionary consists of ARPABET which encode
    # vowels, consonants, and a representative stress-level (wiki/ARPABET)
    phoneme_dictionary = nltk.corpus.cmudict.dict()
    stress_symbols = ['0', '1', '2', '3...', '-', '!', '+', '/',
                          '#', ':', ':1', '.', ':2', '?', ':3']
    
    # nltk.download('stopwords') ## download stopwords (the, a, of, ...)
    # Get stopwords that will be discarded in comparison
    stopwords = nltk.corpus.stopwords.words("english")
    # Function for removing all punctuation marks (. , ! * etc.)
    no_punct = lambda x: re.sub(r'[^\w\s]', '', x)
    
    def get_phonemes(word):
        if word in phoneme_dictionary:
            return phoneme_dictionary[word][0] # return first entry by convention
        else:
            return ["NONE"] # no entries found for input word
    
    def get_alliteration_level(text): # alliteration based on sound, not only letter!
        count, total_words = 0, 0
        proximity = 2 # max phonemes to compare to for consideration of alliteration
        i = 0 # index for placing phonemes into current_phonemes
        lines = text.split(sep="\n")
        for line in lines:
            current_phonemes = [None] * proximity
            for word in line.split(sep=" "):
                word = no_punct(word).lower() # remove punctuation and lowercase (CMUdict keys and NLTK stopwords are lowercase)
                total_words += 1
                if word not in stopwords:
                    if (get_phonemes(word)[0] in current_phonemes): # alliteration occurred
                        count += 1
                    current_phonemes[i] = get_phonemes(word)[0] # update new comparison phoneme
                    i = 0 if i == 1 else 1 # update storage index
    
        alliteration_score = count / total_words
        return alliteration_score
    

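    The above is the suggested script. The variable proximity is introduced so that alliteration is still detected between sounds that are separated by other words. The stress_symbols variable reflects the stress levels indicated in the CMU dictionary, and it could easily be incorporated into the function.

    【Comments】: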
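
One way the proximity window and the stress markers might be handled is sketched below. It is not the answer's code: `strip_stress` and `count_alliterations` are hypothetical helpers, the window is a `collections.deque` so that proximity can be any positive integer, and the input is assumed to be the list of initial phonemes of the non-stopword words, already looked up.

```python
from collections import deque

def strip_stress(phoneme):
    # ARPABET vowels in CMUdict end in a stress digit (0, 1 or 2), e.g. "IY1".
    return phoneme.rstrip("012")

def count_alliterations(initial_phonemes, proximity=2):
    # Remember only the last `proximity` initial phonemes.
    recent = deque(maxlen=proximity)
    count = 0
    for phoneme in initial_phonemes:
        phoneme = strip_stress(phoneme)
        if phoneme in recent:   # same sound seen within the window
            count += 1
        recent.append(phoneme)  # the oldest entry drops out automatically
    return count

print(count_alliterations(["F", "S", "F", "IY1", "IY0"], proximity=2))  # 2
print(count_alliterations(["F", "S", "F", "IY1", "IY0"], proximity=1))  # 1
```

With proximity=2 the second "F" still matches the first one across the intervening "S"; with proximity=1 only directly adjacent sounds count.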
