Chinese Word Frequency Statistics - 170何强

Download a long Chinese article.

Read the text to be analyzed from a file.

news = open('gzccnews.txt', 'r', encoding='utf-8')
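The reading step can also be written with a `with` block, which closes the file even if reading fails. This is a minimal sketch that assumes the file is UTF-8 encoded; the helper name `read_text` is hypothetical and not part of the assignment.

```python
def read_text(path, encoding="utf-8"):
    """Return the whole file as one string (hypothetical helper)."""
    # The with-block closes the file as soon as the read finishes.
    with open(path, "r", encoding=encoding) as f:
        return f.read()
```

Calling `read_text('gzccnews.txt')` would then return the article as a single string, ready to be segmented.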

Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

jieba.lcut(news.read())

Generate the word frequency counts.
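One way to do the counting is with `collections.Counter` from the standard library, which tallies every token in a single pass. The token list below is made-up sample data standing in for the output of the segmentation step.

```python
from collections import Counter

def count_words(tokens):
    """Map each distinct token to the number of times it occurs."""
    return Counter(tokens)

# Made-up sample tokens, standing in for the segmented article.
tokens = ["我", "爱", "北京", "我", "爱", "天安门"]
freq = count_words(tokens)
print(freq["我"])
```

A `Counter` behaves like a dictionary, so it can be filtered and sorted in the same way as a plain `dict`.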

Sort the words by frequency.
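A small sketch of the sorting step, assuming the counts are held in a dictionary of word to count. The function name `sort_by_count` and the sample counts are illustrative only.

```python
def sort_by_count(freq):
    """Return (word, count) pairs, most frequent first."""
    # sorted() is stable, so words with equal counts keep their original order.
    return sorted(freq.items(), key=lambda pair: pair[1], reverse=True)

# Illustrative counts, not real results.
freq = {"路明非": 3, "学院": 5, "校长": 1}
ranked = sort_by_count(freq)
print(ranked)
```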

Exclude grammatical words: pronouns, articles, and conjunctions.
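The exclusion step might look like the following sketch. The stop list here is a deliberately tiny, hypothetical sample; a real one would be much longer. Whitespace-only tokens such as newlines are dropped as well, since the segmenter passes them through as tokens.

```python
# Hypothetical, deliberately small stop list for illustration.
STOP_WORDS = {"我", "的", "和", "，", "。"}

def remove_stop_words(freq, stop_words=STOP_WORDS):
    """Return a new dict without stop words or whitespace-only tokens."""
    return {
        word: count
        for word, count in freq.items()
        if word not in stop_words and word.strip()
    }

# Illustrative counts, not real results.
freq = {"我": 4, "的": 9, "学院": 5, "\n": 7, "，": 12}
filtered = remove_stop_words(freq)
print(filtered)
```

Building a new dictionary avoids deleting keys from the one being iterated over, and leaves the original counts available for comparison.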

Output the 20 most frequent words.
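For the top-20 output, `Counter.most_common(n)` returns the `n` highest counts and simply returns fewer pairs when the text has fewer distinct words, so no index check is needed. The helper name `top_words` and the sample counts are assumptions for illustration.

```python
from collections import Counter

def top_words(freq, n=20):
    """Return up to n (word, count) pairs, most frequent first."""
    return Counter(freq).most_common(n)

# Illustrative counts, not real results.
freq = {"学院": 5, "路明非": 3, "校长": 1}
for word, count in top_words(freq, n=2):
    print(word, count)
```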

Post the code and screenshots of the results on the blog.

# -*- coding: UTF-8 -*-
# -*- author: onexiaofeng -*-
import jieba

# Register the character name so jieba keeps it as a single token.
jieba.add_word('路明非')

# Read the whole novel; the with-block closes the file afterwards.
with open('longzu.txt', 'r', encoding='utf-8') as news:
    notes = news.read()
notelist = jieba.lcut(notes)  # lcut already returns a list

# Count how often each distinct token occurs.
Word = {}
for i in set(notelist):
    Word[i] = notelist.count(i)

# Punctuation, whitespace, and function words to exclude from the ranking.
delete_word = {
    '我', ' ', '得', '；', '你', '的', '他', '她', '它', '着', '呀', '，', '。', '：', '“', '”', '也', '吗', '?', '被', '说',
    '是', '使', '与', '不', '、', '而', '又', '！', '\n', '…', '？', '了', '有', '在', '来', '嗯', '去', '于', '人', '中', '想', '却',
    '到', '此', '叫', '便', '把', '但', '若', '以', '龙', '已', '可', '出', '都', '就', '和', '上', '地', '里', '们', '那', '一个', '还', '很', '么', '就是',
}

for i in delete_word:
    if i in Word:
        del Word[i]

# Sort by count, descending, and print the 20 most frequent tokens.
# Slicing avoids an IndexError when fewer than 20 tokens remain.
sort_word = sorted(Word.items(), key=lambda d: d[1], reverse=True)
for item in sort_word[:20]:
    print(item)

Screenshot: