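【Posted】: 2017-02-08 22:45:53
【Question】:
I am using NLTK 3.2 with Python 3.6.
I am trying to write a program that takes raw text as input and outputs every (maximal) series of consecutive words beginning with the same letter (i.e., alliterative sequences).
When searching for sequences, I want to ignore certain words and punctuation (e.g., 'it', 'that', 'into', ''s', ',', and '.'), but still include them in the output.
For example, the input
"The door was ajar. So it seems that Sam snuck into Sally's subaru."
should yield
["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"]
I am new to programming, and the best I could come up with is: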
import nltk
from nltk import word_tokenize

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."
tokened_text = word_tokenize(raw)                   #word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text]    #make it lowercase
for w in tokened_text:                              #for each word of the text
    letter = w[0]                                   #consider its first letter
    allit_str = []
    allit_str.append(w)                             #add that word to a list
    pos = tokened_text.index(w)                     #let "pos" be the position of the word being considered
    for i in range(1,len(tokened_text)-pos):        #consider the next word
        if tokened_text[pos+i] in {"the","a","an","that","in","on","into","it",".",",","'s"}:   #if it's one of these
            allit_str.append(tokened_text[pos+i])   #add it to the list
            i=+1                                    #and move on to the next word
        elif tokened_text[pos+i][0] == letter:      #or else, if the first letter is the same
            allit_str.append(tokened_text[pos+i])   #add the word to the list
            i=+1                                    #and move on to the next word
        else:                                       #or else, if the letter is different
            break                                   #break the for loop
    if len(allit_str)>=2:                           #if the list has two or more members
        print(allit_str)                            #print it
which outputs
['ajar', '.']
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['snuck', 'into', 'sally', "'s", 'subaru', '.']
['sally', "'s", 'subaru', '.']
['subaru', '.']
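This is close to what I want, except that I don't know how to restrict the program to printing only the maximal sequences.
So my questions are:
- How can I modify this code to print only the maximal sequence
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']?
- Is there an easier way to do this in Python, perhaps with regular expressions or more elegant code?
The following similar questions have been asked elsewhere, but they did not help me modify my code: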
- How do you effectively use regular expressions to find alliterative expressions?
- A reddit challenge asking for a similar program
- 4chan question regarding counting instances of alliteration
- Blog about finding most common alliterative strings in a corpus
(I also thought it would be good to have this question answered on this site.)
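【Comments】:
- To avoid the repeats, scan the string only once. Get rid of the for loop and use an index to scan the string. Keep track of the index of the last non-ignored word and of its first letter. When you find a word with a different first letter, check whether the sequence is long enough to print.
- Also, your current code has a problem: if a word occurs twice in a sentence, tokened_text.index() will always find the first position.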
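The single-pass scan suggested in the comments could look roughly like the sketch below. It is only one possible reading of that advice, not code from the thread: the names `maximal_alliterations`, `IGNORED`, and `min_words` are made up for illustration, the token list is hard-coded so the example runs without NLTK, and two choices are assumptions (a run needs at least two non-ignored words, and trailing ignored tokens such as the final '.' are trimmed). Because the loop carries its own index via `enumerate()`, it never calls `tokened_text.index()`, so repeated words are handled correctly.

```python
# Hypothetical sketch: the function, constant, and parameter names are illustrative.
IGNORED = {"the", "a", "an", "that", "in", "on", "into", "it", ".", ",", "'s"}


def maximal_alliterations(tokens, ignored=IGNORED, min_words=2):
    """Return maximal alliterative runs, scanning the token list once."""
    tokens = [t.lower() for t in tokens]
    runs = []
    start = last = None  # span of the current run (indices into tokens)
    letter = None        # first letter shared by the current run
    count = 0            # non-ignored words in the current run

    for i, tok in enumerate(tokens):
        if not tok or tok in ignored:
            continue  # ignored tokens never start, extend, or end a run
        if tok[0] == letter:
            last = i
            count += 1
            continue
        # A different first letter closes the current run.
        if count >= min_words:
            runs.append(tokens[start:last + 1])
        start, last, letter, count = i, i, tok[0], 1

    if count >= min_words:  # close the run that reaches the end of the text
        runs.append(tokens[start:last + 1])
    return runs


# Tokens as word_tokenize() produced them in the question (hard-coded here
# so the sketch runs without NLTK installed).
tokens = ["The", "door", "was", "ajar", ".", "So", "it", "seems", "that",
          "Sam", "snuck", "into", "Sally", "'s", "subaru", "."]
print(maximal_alliterations(tokens))
```

For the example sentence this prints a single run, `[['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru']]`. Ignored tokens that fall between two alliterating words are kept because each run is taken as a slice from its first to its last non-ignored word. To keep the trailing '.' as in the question's second listing, the slice end would have to be extended over any ignored tokens that follow.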
Tags: regex string python-3.x nltk