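Conditional Frequency Distribution using Browns Corpus NLTK Python: Answers

【Question title】: Conditional Frequency Distribution using Browns Corpus NLTK Python
【Posted】: 2020-06-26 22:47:57
【Question】:

I am trying to identify the words that end in "ing" or "ed". Compute a conditional frequency distribution in which the conditions are ['government', 'hobbies'] and the events are 'ing' or 'ed'. Store the conditional frequency distribution in the variable inged_cfd.

Here is my code: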

from nltk.corpus import brown
import nltk

genre_word = [ (genre, word.lower())
              for genre in ['government', 'hobbies']
              for word in brown.words(categories = genre) if (word.endswith('ing') or word.endswith('ed')) ]
            
genre_word_list = [list(x) for x in genre_word]

for wd in genre_word_list:
    if wd[1].endswith('ing'):
      wd[1] = 'ing'
    elif wd[1].endswith('ed'):
      wd[1] = 'ed'
      
inged_cfd = nltk.ConditionalFreqDist(genre_word_list)
        
inged_cfd.tabulate(conditions = ['government', 'hobbies'], samples = ['ed','ing'])

I want the output in tabular form. With the code above, the output I get is:

            ed  ing 
government 2507 1605 
   hobbies 2561 2262

whereas the expected output is:

            ed  ing 
government 2507 1474 
   hobbies 2561 2169

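Please help me resolve this and get the correct output.

【Question comments】:

Tags: python-3.x nltk corpus

【Solution 1】:

You need to exclude stop words. Also, convert each word to lowercase before checking its ending. Working code is below: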

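from nltk.corpus import brown
from nltk.corpus import stopwords
import nltk  # needed for nltk.ConditionalFreqDist below

stop_words = set(stopwords.words('english'))
cfdconditions = ['government', 'hobbies']  # conditions taken from the question

# Lowercase each word before testing its ending
genre_word = [ (genre, word.lower())
               for genre in cfdconditions
               for word in brown.words(categories=genre)
               if (word.lower().endswith('ing') or word.lower().endswith('ed')) ]
genre_word_list = [list(x) for x in genre_word]

# Relabel every non-stop word with its suffix. Stop words keep their own
# spelling, so they never land in the 'ed' or 'ing' columns.
for wd in genre_word_list:
    if wd[1].endswith('ing') and wd[1] not in stop_words:
        wd[1] = 'ing'
    elif wd[1].endswith('ed') and wd[1] not in stop_words:
        wd[1] = 'ed'

inged_cfd = nltk.ConditionalFreqDist(genre_word_list)
inged_cfd.tabulate(conditions = cfdconditions, samples = ['ed','ing'])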
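
To see the effect of the two fixes (lowercasing and stop-word exclusion) without downloading the Brown corpus, here is a minimal standard-library sketch. The tiny word lists, the three-word stop list, and the helper names `suffix_event` and `count_suffixes` are all made up for illustration. The real code would read from `brown.words(categories=genre)` and `stopwords.words('english')`, and `nltk.ConditionalFreqDist` is built from the same kind of (condition, event) pairs that are counted here.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for the NLTK stop list and the Brown corpus words.
STOP_WORDS = {"being", "during", "having"}
SAMPLE_WORDS = {
    "government": ["Being", "elected", "voting", "during", "Funded", "bed"],
    "hobbies": ["Painting", "having", "glued", "string", "the"],
}


def suffix_event(word, stop_words):
    """Return 'ing' or 'ed' for a countable word, else None."""
    lowered = word.lower()  # lowercase before testing the ending
    if lowered in stop_words:  # stop words are not counted
        return None
    if lowered.endswith("ing"):
        return "ing"
    if lowered.endswith("ed"):
        return "ed"
    return None


def count_suffixes(words_by_genre, stop_words):
    """Build a genre -> Counter mapping of 'ing'/'ed' events."""
    counts = defaultdict(Counter)
    for genre, words in words_by_genre.items():
        for word in words:
            event = suffix_event(word, stop_words)
            if event is not None:
                counts[genre][event] += 1
    return counts


counts = count_suffixes(SAMPLE_WORDS, STOP_WORDS)
for genre in ("government", "hobbies"):
    print(genre, counts[genre]["ed"], counts[genre]["ing"])
```

Note that the test is purely string-based, so words such as "bed" and "string" are counted as well. The corpus-based solutions in this thread behave the same way.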

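【Comments】:

【Solution 2】:

Using the same cfdconditions variable in both places causes the problem. In Python, everything is passed around as an object reference, so the first time you pass cfdconditions to cdev_cfd.tabulate it may be modified, and the next call then receives the modified object. It is better to initialize a second list and pass that to the second call.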
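
Whether `tabulate` actually modifies the list it receives is not confirmed in this thread. The general Python behavior behind the concern can be sketched as follows. The function `sort_in_place` is hypothetical and only stands in for any callee that mutates its argument.

```python
def sort_in_place(items):
    """Hypothetical callee that mutates the list it is given."""
    items.sort()
    return items


conditions = ["hobbies", "government"]

# Passing a copy: the callee sorts the copy, the caller's list is untouched.
sort_in_place(list(conditions))
after_copy = list(conditions)

# Passing the list itself: the caller sees the reordering.
sort_in_place(conditions)
after_direct = list(conditions)

print(after_copy)
print(after_direct)
```

The line `at=[i for i in cfdconditions]` in the code that follows makes this kind of defensive copy before the second `tabulate` call.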

Here is my modified version:

import nltk  # needed for nltk.ConditionalFreqDist below
from nltk.corpus import brown

from nltk.corpus import stopwords

def calculateCFD(cfdconditions, cfdevents):
    stop_words= stopwords.words('english')
    at=[i for i in cfdconditions]
    nt = [(genre, word.lower())
          for genre in cfdconditions
          for word in brown.words(categories=genre) if word not in stop_words and word.isalpha()]

    cdv_cfd = nltk.ConditionalFreqDist(nt)
    cdv_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)
    nt1 = [(genre, word.lower())
          for genre in cfdconditions
          for word in brown.words(categories=genre) ]
    
    temp =[]
    for we in nt1:
        wd = we[1]
        if wd[-3:] == 'ing' and wd not in stop_words:
            temp.append((we[0] ,'ing'))

        if wd[-2:] == 'ed':
            temp.append((we[0] ,'ed'))
        

    inged_cfd = nltk.ConditionalFreqDist(temp)
    a=['ed','ing']
    inged_cfd.tabulate(conditions=at, samples=a)

Hope this helps!

【Comments】:

【Solution 3】:

The expected output is:

                 many years 
        fiction    29    44 
      adventure    24    32 
science_fiction    11    16 

                  ed  ing 
        fiction 2943 1767 
      adventure 3281 1844 
science_fiction  574  293

and

                  good    bad better 
      adventure     39      9     30 
        fiction     60     17     27 
science_fiction     14      1      4 
        mystery     45     13     29 

                  ed  ing 
      adventure 3281 1844 
        fiction 2943 1767 
science_fiction  574  293 
        mystery 2382 1374

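【Comments】:

ishan Kankane shared the code above, and it works well. The differences I noticed are: 1) the use of isalpha() (although the question does not mention it); try adding it. 2) When building the list of 'ing' and 'ed' events, I usually see a list of tuples, but our code uses a list of lists; try converting it as well. 3) In the if condition that generates the (genre, word) pairs, he did not use `word.lower() not in stop_words`; he only used `word not in stop_words`. Try that too.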
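
The practical effect of points 1) and 3) can be checked on a handful of tokens. This is only a sketch: the stop list and tokens below are invented for illustration, and it assumes a lowercase stop list (which is how NLTK's English list is written).

```python
# Hypothetical mini stop list and tokens, for illustration only.
stop_words = {"the", "during"}
tokens = ["The", "During", "painted", ",", "1960", "voting"]

# Raw token compared with the lowercase stop list: capitalized
# stop words such as "The" slip through.
kept_raw = [t for t in tokens if t not in stop_words]

# Lowercased token compared with the stop list: "The" and "During" are dropped.
kept_lower = [t for t in tokens if t.lower() not in stop_words]

# Additionally require purely alphabetic tokens: punctuation and numbers go too.
kept_alpha = [t for t in kept_lower if t.isalpha()]

print(kept_raw)
print(kept_lower)
print(kept_alpha)
```

Because the three filters keep different sets of tokens, switching between them changes the resulting counts, which is one plausible reason why two otherwise similar solutions print different tables.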

【Solution 4】:

I used this approach; it takes fewer lines of code and runs faster:

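from nltk.corpus import brown
from nltk.corpus import stopwords
import nltk  # needed for nltk.ConditionalFreqDist

# The original snippet was an indented function body without a header;
# this signature is assumed from the parameter names it uses.
def calculateCFD(cfdconditions, cfdevents):
    stop_words = set(stopwords.words('english'))
    # Word counts per genre, stop words excluded
    cdev_cfd = nltk.ConditionalFreqDist([(genre, word.lower()) for genre in cfdconditions
          for word in brown.words(categories=genre) if word.lower() not in stop_words])

    # Suffix counts per genre: each matching word is replaced by 'ing' or 'ed'
    inged_cfd = nltk.ConditionalFreqDist([(genre, word[-3:].lower() if word.lower().endswith('ing') else word[-2:].lower())
                                          for genre in cfdconditions for word in brown.words(categories=genre)
                                          if word.lower() not in stop_words and (word.lower().endswith('ing') or word.lower().endswith('ed'))])

    cdev_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)

    inged_cfd.tabulate(conditions=cfdconditions, samples=['ed','ing'])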

【Comments】:

【Solution 5】:

import nltk  # needed for nltk.ConditionalFreqDist
from nltk.corpus import stopwords,brown
def calculateCFD(cfdconditions, cfdevents):
    # Write your code here
    stop_words=set(stopwords.words("english"))
    list1=[(genre,word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stop_words]
    cfd1=nltk.ConditionalFreqDist(list1)
    cfd1_tabulate=cfd1.tabulate(conditions=cfdconditions,samples=cfdevents)
    #print(cfd1_tabulate)
    
    list2=[[genre,word.lower()] for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stop_words if (word.lower().endswith("ed") or word.lower().endswith("ing"))]
    for elem in list2:
        if elem[1].endswith("ed"):
            elem[1]="ed"
        else:
            elem[1]="ing"
            
    cfd2=nltk.ConditionalFreqDist(list2)
    cfd2_tabulate=cfd2.tabulate(conditions=cfdconditions,samples=["ed","ing"])
    #print(cfd2_tabulate)
    
    return cfd1_tabulate,cfd2_tabulate

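【Comments】:

Hello, and welcome to the SO community! We always encourage you to add some text explaining what your code does, rather than just pasting the code on its own.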