【Question title】: Python TextBlob translate issue
【Posted】: 2019-03-14 17:41:26
【Question】:

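I'm building a quick sentiment-analysis console application with Python, TextBlob, and NLTK.

At the moment I'm using a link to a Spanish wiki article, so I don't need to translate it and I can use the NLTK Spanish stopword list. But what if I want this code to work with links in different languages?

If I add the line textFinal = textFinal.translate(to="es") below textFinal = TextBlob(texto) (code below), I get an error, because it cannot translate Spanish into Spanish.

Can I prevent this with a try/except? And is there a way to make the code translate to a different language (and use a different stopword list) depending on the language of the link I feed to the application?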

import nltk
nltk.download('stopwords')
from nltk import  word_tokenize
from nltk.corpus import stopwords
import string
from textblob import TextBlob, Word
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('https://es.wikipedia.org/wiki/Valencia')
html = response.read()

soup = BeautifulSoup(html,'html5lib')
text = soup.get_text(strip = True)


tokens = word_tokenize(text)
tokens = [w.lower() for w in tokens]

table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
words = [word for word in stripped if word.isalpha()]

stop_words = set(stopwords.words('spanish'))

words = [w for w in words if not w in stop_words]

with open('palabras.txt', 'w') as f:
    for word in words:
        f.write(" " + word)

with open('palabras.txt', 'r') as myfile:
    texto=myfile.read().replace('\n', '')


textFinal=TextBlob(texto)

print (textFinal.sentiment)

freq = nltk.FreqDist(words)

freq.plot(20, cumulative=False)

【Comments】:

    Tags: python nltk sentiment-analysis textblob


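    【Solution 1】:

    Take a look at the langdetect package. You can check the language of the page you feed in and skip the translation when the page language matches the target language. Something like the following: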

    import string
    import urllib.request
    
    import nltk
    from bs4 import BeautifulSoup
    from langdetect import detect
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from textblob import TextBlob, Word
    
    nltk.download("stopwords")
    # word_tokenize needs the Punkt tokenizer models; uncomment on first run.
    # nltk.download("punkt")
    
    response = urllib.request.urlopen("https://es.wikipedia.org/wiki/Valencia")
    html = response.read()
    
    soup = BeautifulSoup(html, "html5lib")
    text = soup.get_text(strip=True)
    lang = detect(text)
    
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens]
    
    table = str.maketrans("", "", string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    words = [word for word in stripped if word.isalpha()]
    
    stop_words = set(stopwords.words("spanish"))
    
    words = [w for w in words if w not in stop_words]
    
    with open("palabras.txt", "w", encoding="utf-8") as f:
        for word in words:
            f.write(" " + word)
    
    with open("palabras.txt", "r", encoding="utf-8") as myfile:
        texto = myfile.read().replace("\n", "")
    
    
    textFinal = TextBlob(texto)
    
    translate_to = "es"
    if lang != translate_to:
        textFinal = textFinal.translate(to=translate_to)
    
    print(textFinal.sentiment)
    
    freq = nltk.FreqDist(words)
    
    freq.plot(20, cumulative=False)
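
The code above still hardcodes the Spanish stopword list. A minimal sketch of picking the list from the detected language follows. The names `NLTK_STOPWORD_LANGS` and `stopword_corpus_name` are hypothetical helpers, not part of NLTK or langdetect, and the mapping covers only a few languages as an illustration; it assumes the detector returns ISO 639-1 style codes such as "es" or "en".

```python
# Hypothetical helper: map a detected language code to the name of an
# NLTK stopword list. Only a handful of languages are listed here.
NLTK_STOPWORD_LANGS = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "de": "german",
    "it": "italian",
    "pt": "portuguese",
}


def stopword_corpus_name(lang_code, default="english"):
    """Return the stopword list name for lang_code, or default if unmapped."""
    # Drop any region suffix (e.g. "pt-br" -> "pt") before the lookup.
    base_code = lang_code.lower().split("-")[0]
    return NLTK_STOPWORD_LANGS.get(base_code, default)
```

With this in place, the hardcoded line could become `stop_words = set(stopwords.words(stopword_corpus_name(lang)))`, where `lang` is the value returned by `detect(text)`.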
    

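    【Comments】:

    • This is useful. I guess I could also add a condition to change the stopword list language. Thanks!
    • No problem, it's an interesting project. You mentioned in the original question that you'd like to wrap the translation line in a try/except block, which would also work, but since you mentioned other languages, langdetect seemed like it would help.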
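
The try/except alternative mentioned in the comments could be sketched as below. `translate_or_keep` is a hypothetical helper that works with any object exposing a `translate(to=...)` method; the stand-in class only simulates a failed translation so the example runs without network access. In TextBlob versions that still ship `translate()`, the same-language case is believed to raise `textblob.exceptions.NotTranslated`, which could be passed as `errors=(NotTranslated,)`; treat that as an assumption to verify against your installed version, since newer releases have dropped `translate()`.

```python
def translate_or_keep(blob, target, errors=(Exception,)):
    """Return blob translated to target, or blob unchanged if translation fails."""
    try:
        return blob.translate(to=target)
    except errors:
        # Translation failed (e.g. text already in the target language).
        return blob


class SameLanguageBlob:
    """Stand-in for a blob whose text is already in the target language."""

    def translate(self, to):
        raise ValueError("input returned unchanged")


blob = SameLanguageBlob()
result = translate_or_keep(blob, "es", errors=(ValueError,))
# result is the original blob, because the simulated translation failed
```

Catching a narrow exception type rather than the default `Exception` keeps unrelated failures, such as network errors, visible.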