1.git clone https://github.com/grangier/python-goose.git

2.cd python-goose

3.sudo pip install -r requirements.txt
此时会报一个安装nltk的错误,执行下面命令单独安装:

sudo apt-get install python-nltk 

4.sudo python setup.py install

 

至此安装完毕!!!!!!!

---------------------------------------------------------

下面付简单的使用demo:

def goose_extraction(response):
    try:

import traceback

        import chardet
        from goose import Goose
        from goose.text import StopWordsChinese
        charset = chardet.detect(response.content)
        coding = charset.get('encoding').lower()  # 网页编码类别:gbk,gb2312,utf-8等
        if coding and coding.startswith(u'gb'):
            codeHtml = response.content.decode("GB18030").encode('utf-8')
        elif coding.startswith(u'utf'):
            codeHtml = response.content
        else:
            codeHtml = response.content.decode(coding, 'ignore')
        g = Goose({'stopwords_class': StopWordsChinese})  # 中文
        article = g.extract(raw_html=codeHtml)
        content = article.cleaned_text
        html = '<div>' + ''.join(['<p>'+con+'</p>\n' for con in content.split('\n\n')]) + '</div>'
        return content, html
    except Exception as e:
        traceback.print_exc(e)

 

相关文章:

  • 2021-11-24
  • 2022-12-23
  • 2021-10-29
  • 2021-08-23
  • 2021-11-10
  • 2022-12-23
  • 2021-12-26
  • 2021-08-08
猜你喜欢
  • 2022-01-07
  • 2021-11-13
  • 2021-05-10
  • 2021-06-08
  • 2022-12-23
  • 2022-12-23
相关资源
相似解决方案