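【Question title】: Conditional Frequency Distribution using Brown Corpus NLTK Python
【Posted】: 2020-06-26 22:47:57
【Question】:

I am trying to identify words ending in 'ing' or 'ed'. Compute a conditional frequency distribution where the conditions are ['government', 'hobbies'] and the events are 'ing' or 'ed'. Store the conditional frequency distribution in the variable inged_cfd.

Below is my code: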

from nltk.corpus import brown
import nltk

genre_word = [ (genre, word.lower())
              for genre in ['government', 'hobbies']
              for word in brown.words(categories = genre) if (word.endswith('ing') or word.endswith('ed')) ]
            
genre_word_list = [list(x) for x in genre_word]

for wd in genre_word_list:
    if wd[1].endswith('ing'):
      wd[1] = 'ing'
    elif wd[1].endswith('ed'):
      wd[1] = 'ed'
      
inged_cfd = nltk.ConditionalFreqDist(genre_word_list)
        
inged_cfd.tabulate(conditions = ['government', 'hobbies'], samples = ['ed','ing'])

I want the output in tabular form. With the code above, the output I get is:

            ed  ing 
government 2507 1605 
   hobbies 2561 2262

whereas the expected output is:

            ed  ing 
government 2507 1474 
   hobbies 2561 2169

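Please help me fix this and get the correct output.

【Comments】:

    Tags: python-3.x nltk corpus


    【Solution 1】:

    Stop words need to be excluded. Also, convert each word to lowercase before checking its ending. Working code below: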

    import nltk
    from nltk.corpus import brown
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))
    # Conditions taken from the question.
    cfdconditions = ['government', 'hobbies']

    # Lowercase each word before testing its suffix.
    genre_word = [(genre, word.lower())
                  for genre in cfdconditions
                  for word in brown.words(categories=genre)
                  if word.lower().endswith(('ing', 'ed'))]
    genre_word_list = [list(x) for x in genre_word]

    # Relabel non-stop-words by suffix; stop words keep their own spelling
    # and therefore never land in the 'ing' or 'ed' columns.
    for wd in genre_word_list:
        if wd[1].endswith('ing') and wd[1] not in stop_words:
            wd[1] = 'ing'
        elif wd[1].endswith('ed') and wd[1] not in stop_words:
            wd[1] = 'ed'

    inged_cfd = nltk.ConditionalFreqDist(genre_word_list)
    inged_cfd.tabulate(conditions=cfdconditions, samples=['ed', 'ing'])
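
To see why lowercasing and stop-word removal change the counts, here is a minimal sketch in plain Python that does not need NLTK or the Brown corpus. The token list and the tiny stop-word set are made up for illustration, and the nested Counter only stands in for nltk.ConditionalFreqDist; it is not meant to reproduce the numbers in the question.

```python
from collections import Counter, defaultdict

# Hypothetical miniature stop-word set; the real list would come from
# nltk.corpus.stopwords.words('english').
STOP_WORDS = {"during", "being", "the", "was"}

# Hypothetical (genre, word) pairs standing in for brown.words(categories=genre).
TOKENS = [
    ("government", "Planning"),
    ("government", "During"),
    ("government", "approved"),
    ("government", "being"),
    ("hobbies", "FISHING"),
    ("hobbies", "painted"),
    ("hobbies", "the"),
]


def suffix_counts(tokens, stop_words, lowercase=True, drop_stop_words=True):
    """Count 'ing' and 'ed' endings per genre, like a small conditional frequency table."""
    counts = defaultdict(Counter)
    for genre, word in tokens:
        candidate = word.lower() if lowercase else word
        if drop_stop_words and candidate in stop_words:
            continue  # stop words are excluded from both columns
        if candidate.endswith("ing"):
            counts[genre]["ing"] += 1
        elif candidate.endswith("ed"):
            counts[genre]["ed"] += 1
    return counts


# Without lowercasing or stop-word removal, "During" and "being" inflate 'ing'
# and the upper-case "FISHING" is missed entirely.
naive = suffix_counts(TOKENS, STOP_WORDS, lowercase=False, drop_stop_words=False)
fixed = suffix_counts(TOKENS, STOP_WORDS)

for genre in ("government", "hobbies"):
    print(genre, dict(naive[genre]), dict(fixed[genre]))
```

In this toy data only the 'ing' column changes between the two runs while 'ed' stays the same, which is the same shape of discrepancy the question reports.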
    

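    【Comments】:

      【Solution 2】:

      Using the same cfdconditions variable in both places causes the problem. In Python, arguments are passed as object references, so when you first pass cfdconditions to cdev_cfd.tabulate it may be modified, and the next time you pass it, the modified object is what gets passed. It is better to initialize a second list and pass that to the second call.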
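
The aliasing behaviour described here can be sketched in plain Python. The function tabulate_like below is hypothetical and deliberately sorts its argument in place; whether NLTK's own tabulate modifies the list it receives is not confirmed by this thread, so treat this only as an illustration of why a defensive copy is harmless insurance.

```python
def tabulate_like(conditions):
    """Hypothetical callee that sorts its argument in place (a side effect)."""
    conditions.sort()
    return conditions


# Passing the same object: the caller's list is changed by the callee.
original = ["hobbies", "government"]
shared = original              # another name for the same list, not a copy
tabulate_like(shared)

# Passing a copy: the caller's list keeps its order.
protected = ["hobbies", "government"]
snapshot = list(protected)     # independent shallow copy
tabulate_like(snapshot)

print(original)   # order changed by the callee
print(protected)  # order preserved
```

Copying with list(...) is the same idea as the `at=[i for i in cfdconditions]` line in the code that follows.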

      Here is my modified version:

      import nltk
      from nltk.corpus import brown
      from nltk.corpus import stopwords

      def calculateCFD(cfdconditions, cfdevents):
          stop_words = stopwords.words('english')
          # Independent copy of the conditions for the second tabulate call.
          at = [i for i in cfdconditions]
          # (genre, word) pairs, skipping stop words and non-alphabetic tokens.
          nt = [(genre, word.lower())
                for genre in cfdconditions
                for word in brown.words(categories=genre)
                if word not in stop_words and word.isalpha()]

          cdv_cfd = nltk.ConditionalFreqDist(nt)
          cdv_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)

          # All lowercased (genre, word) pairs, unfiltered.
          nt1 = [(genre, word.lower())
                 for genre in cfdconditions
                 for word in brown.words(categories=genre)]

          # Replace each matching word with its suffix label.
          temp = []
          for we in nt1:
              wd = we[1]
              if wd[-3:] == 'ing' and wd not in stop_words:
                  temp.append((we[0], 'ing'))

              if wd[-2:] == 'ed':
                  temp.append((we[0], 'ed'))

          inged_cfd = nltk.ConditionalFreqDist(temp)
          a = ['ed', 'ing']
          inged_cfd.tabulate(conditions=at, samples=a)
      

      Hope this helps!

      【Comments】:

        【Solution 3】:

        The expected output is:

                         many years
                fiction    29    44
              adventure    24    32
        science_fiction    11    16

                          ed  ing
                fiction 2943 1767
              adventure 3281 1844
        science_fiction  574  293

                          good    bad better
              adventure     39      9     30
                fiction     60     17     27
        science_fiction     14      1      4
                mystery     45     13     29

                          ed  ing
              adventure 3281 1844
                fiction 2943 1767
        science_fiction  574  293
                mystery 2382 1374

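        【Comments】:

        • ishan Kankane shared the code above and it works well. The differences I noticed are: 1) the use of isalpha() (although the question does not mention it), so try adding it; 2) when building the list of 'ing' and 'ed' entries, I usually see a list of tuples, whereas our code uses a list of lists, so try converting that as well; 3) in the if condition that generates the (genre, word) pairs, he does not use `word.lower() not in stop_words`, only `word not in stop_words`, so try that too.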
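
        The third point, the difference between `word not in stop_words` and `word.lower() not in stop_words`, can be checked on a few made-up tokens. This is only a sketch; the stop-word set is a hypothetical subset of the NLTK English list and the word list is invented.

```python
# Hypothetical subset of stopwords.words('english'); the real list is all lowercase.
STOP_WORDS = {"during", "having", "the"}

# Made-up tokens; capitalised forms mimic sentence-initial words in a corpus.
words = ["During", "having", "Running", "the", "The"]

# Case-sensitive test: capitalised stop words slip through the filter.
kept_case_sensitive = [w.lower() for w in words if w not in STOP_WORDS]

# Case-insensitive test: every stop word is removed regardless of capitalisation.
kept_case_insensitive = [w.lower() for w in words if w.lower() not in STOP_WORDS]

print(kept_case_sensitive)
print(kept_case_insensitive)
```

        Because the two filters keep different words, they can give different counts on the same corpus, so which one matches an expected table depends on how that table was produced.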
        【Solution 4】:

        I used this approach; it needs fewer lines of code and runs faster.

        import nltk
        from nltk.corpus import brown
        from nltk.corpus import stopwords

        def calculateCFD(cfdconditions, cfdevents):
            # cfdconditions: Brown genres to tabulate; cfdevents: words to count.
            stop_words = set(stopwords.words('english'))
            cdev_cfd = nltk.ConditionalFreqDist(
                [(genre, word.lower()) for genre in cfdconditions
                 for word in brown.words(categories=genre)
                 if word.lower() not in stop_words])

            # Map every non-stop-word ending in 'ing' or 'ed' to its suffix.
            inged_cfd = nltk.ConditionalFreqDist(
                [(genre, 'ing' if word.lower().endswith('ing') else 'ed')
                 for genre in cfdconditions
                 for word in brown.words(categories=genre)
                 if word.lower() not in stop_words
                 and word.lower().endswith(('ing', 'ed'))])

            cdev_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)
            inged_cfd.tabulate(conditions=cfdconditions, samples=['ed', 'ing'])
        

        【Comments】:

          【Solution 5】:
          import nltk
          from nltk.corpus import stopwords,brown
          def calculateCFD(cfdconditions, cfdevents):
              # Write your code here
              stop_words=set(stopwords.words("english"))
              list1=[(genre,word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stop_words]
              cfd1=nltk.ConditionalFreqDist(list1)
              cfd1_tabulate=cfd1.tabulate(conditions=cfdconditions,samples=cfdevents)
              #print(cfd1_tabulate)
              
              list2=[[genre,word.lower()] for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stop_words if (word.lower().endswith("ed") or word.lower().endswith("ing"))]
              for elem in list2:
                  if elem[1].endswith("ed"):
                      elem[1]="ed"
                  else:
                      elem[1]="ing"
                      
              cfd2=nltk.ConditionalFreqDist(list2)
              cfd2_tabulate=cfd2.tabulate(conditions=cfdconditions,samples=["ed","ing"])
              #print(cfd2_tabulate)
              
              return cfd1_tabulate,cfd2_tabulate
          

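          【Comments】:

          • Hello, and welcome to the SO community! We always encourage you to add some text explaining what your code does, rather than just pasting the code on its own.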