Huggingface giving pytorch index error on sentiment analysis task

【Question title】: Huggingface giving pytorch index error on sentiment analysis task
【Posted】: 2021-12-16 04:17:11
【Question】:

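I'm trying to run sentiment analysis on a dataset of millions of tweets stored on a server. I'm calling an API prediction function that takes a list of 100 tweets, iterates over the text of each tweet to get the Hugging Face sentiment value, and writes that sentiment to a Solr database. However, after a few hundred tweets have been processed, I get the error below. Any suggestions?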

API code:

from transformers import pipeline   

model = pipeline(task = 'sentiment-analysis',model="finiteautomata/bertweet-base-sentiment-analysis")

# Hugging Face sentiment analyser; preprocess() is a helper defined elsewhere (not shown)
def huggingface_sent(sentence):
    text=preprocess(sentence)
    if (len(text)>0):
        predicted_dic = {'NEG': 'Negative','NEU':'Neutral', 'POS':'Positive'}
        return predicted_dic[model(text)[0]['label']]
    else:
        return 'Neutral'


def predict_list(tweets):
    print('Data Processing\n')
    predictions={}
    for t_id in tweets.keys():
        if(tweets[t_id]['language']=='en'):
            predictions[t_id] = huggingface_sent(str(tweets[t_id]['full_text']))
        else:
            predictions[t_id]='NoneEnglish'
            
    print('processed ', len(tweets))
    # t_id still holds the last key from the loop, so this is the last prediction
    if predictions:
        print('\n last element is ', predictions[t_id])
    return predictions




print('Running analyser ....\n')

Error log:

Token indices sequence length is longer than the specified maximum sequence length for this model (211 > 128). Running this sequence through the model will result in indexing errors
[2021-11-01 12:24:20,649] ERROR in app: Exception on /api/predict [POST]
Traceback (most recent call last):
  File "/myusername/anaconda3/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/myusername/anaconda3/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/myusername/anaconda3/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/myusername/anaconda3/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/myusername/anaconda3/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/myusername/anaconda3/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/mnt/raid1/diil/sentiment_api/analyser_main.py", line 11, in api_predict_list
    predictions = predict_list(tweets)
  File "/mnt/raid1/diil/sentiment_api/analyser_core.py", line 84, in predict_list
    predictions[t_id] = huggingface_sent(str(tweets[t_id]['full_text']))
  File "/mnt/raid1/diil/sentiment_api/analyser_core.py", line 70, in huggingface_sent
    if model(text):
  File "/myusername/anaconda3/lib/python3.8/site-packages/transformers/pipelines/text_classification.py", line 126, in __call__
    return super().__call__(*args, **kwargs)
  File "/myusername/anaconda3/lib/python3.8/site-packages/transformers/pipelines/base.py", line 915, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/myusername/anaconda3/lib/python3.8/site-packages/transformers/pipelines/text_classification.py", line 172, in run_single
    return [super().run_single(inputs, preprocess_params, forward_params, postprocess_params)]
  File "/myusername/anaconda3/lib/python3.8/site-packages/transformers/pipelines/base.py", line 922, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/myusername/anaconda3/lib/python3.8/site-packages/transformers/pipelines/base.py", line 871, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/myusername/anaconda3/lib/python3.8/site-packages/transformers/pipelines/text_classification.py", line 133, in _forward
    return self.model(**model_inputs)
  File "/myusername/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/myusername/anaconda3/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 1198, in forward
    outputs = self.roberta(
  File "/myusername/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/myusername/anaconda3/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 841, in forward
    embedding_output = self.embeddings(
  File "/myusername/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/myusername/anaconda3/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 136, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/myusername/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/myusername/anaconda3/lib/python3.8/site-packages/tousername/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2043, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

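【Discussion】:

"Token indices sequence length is longer than the specified maximum sequence length for this model" probably means the sentence/text is too long?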
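
One way to check this guess is to count the tokens each tweet produces and compare the count with the 128-token limit reported in the warning. A minimal sketch, assuming `tokenizer` is a Hugging Face tokenizer (or anything callable that returns a mapping with an `input_ids` list); `find_overlong` is a hypothetical helper name, not part of transformers:

```python
def find_overlong(tokenizer, texts, max_length=128):
    """Return (index, token_count) for every text longer than max_length tokens."""
    overlong = []
    for i, text in enumerate(texts):
        # Calling a Hugging Face tokenizer on a string returns a mapping whose
        # "input_ids" entry is the list of token ids for that string.
        n_tokens = len(tokenizer(text)["input_ids"])
        if n_tokens > max_length:
            overlong.append((i, n_tokens))
    return overlong
```

With the model from the question, the tokenizer could be loaded with `AutoTokenizer.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")` and passed in as `tokenizer`. Any tweet this helper reports is a candidate for the `IndexError` above.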

Tags: python pytorch sentiment-analysis huggingface-transformers bert-language-model

【Solution 1】:

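As @Quang Hoang mentioned in the comments, the problem seems to be caused by the length of your input tweets. Fortunately, you can control how the tokenizer behaves inside the pipeline class and explicitly truncate the longer tweets. You can also set any other parameters for the pipeline's components in the same way.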

from transformers import pipeline
MODEL_CHECKPOINT = "finiteautomata/bertweet-base-sentiment-analysis"
# Tokenizer options are passed as a (name, kwargs) tuple; here the maximum length is set to 128 tokens.
sentiment_pipeline = pipeline(task="sentiment-analysis", tokenizer=(MODEL_CHECKPOINT, {'model_max_length': 128}), model=MODEL_CHECKPOINT)
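
If setting `model_max_length` alone does not stop the error on your transformers version, another option is to request truncation at call time. A minimal sketch, assuming a transformers release whose text-classification pipeline forwards extra keyword arguments such as `truncation` and `max_length` to the tokenizer (recent releases do, but check yours); `classify_truncated` is a hypothetical helper, and `classifier` stands for the pipeline object:

```python
PREDICTED = {"NEG": "Negative", "NEU": "Neutral", "POS": "Positive"}

def classify_truncated(classifier, text, max_length=128):
    """Classify one text, truncating it to max_length tokens first."""
    if not text:
        return "Neutral"  # same fallback as the question's code
    # Assumption: the installed transformers version passes these keyword
    # arguments through to the tokenizer.
    result = classifier(text, truncation=True, max_length=max_length)
    return PREDICTED[result[0]["label"]]
```

Truncation discards everything past the limit, which is usually acceptable for tweets because the overlong ones are rare.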

As a side note, I'd suggest using the approach proposed in this answer to speed up the whole process.
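
The linked answer is not reproduced in this thread, so the following is not necessarily what it proposes. One common way to cut per-call overhead is to pass a whole list of texts to the pipeline in a single call instead of looping tweet by tweet. A rough sketch under that assumption; `predict_batch` is a hypothetical helper, and `classifier` stands for the pipeline object:

```python
LABEL_NAMES = {"NEG": "Negative", "NEU": "Neutral", "POS": "Positive"}

def predict_batch(classifier, texts, max_length=128):
    """Classify a list of texts in one pipeline call; empty texts map to 'Neutral'."""
    labels = ["Neutral"] * len(texts)
    # Remember the original position of every non-empty text.
    indexed = [(i, t) for i, t in enumerate(texts) if t]
    if indexed:
        # Given a list of strings, a text-classification pipeline returns
        # one result dict per input, in the same order.
        results = classifier([t for _, t in indexed],
                             truncation=True, max_length=max_length)
        for (i, _), res in zip(indexed, results):
            labels[i] = LABEL_NAMES[res["label"]]
    return labels
```

In the question's setup, this would be called once per list of 100 tweets.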

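【Comments】:

Thanks Meti, do you know how to use a GPU with the huggingface sentiment analyser to speed up sentiment inference?
Note that I'm running it with gunicorn.
@Youcef I serve this with uWSGI + a Docker container + GPU support, passing device=0 when the pipeline is instantiated.
Thanks Meti. Could you provide the line of code?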
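
The commenter's code is not shown in the thread. As a rough sketch of what passing `device=0` at instantiation could look like: `device` is a real `pipeline` argument (0 selects the first CUDA GPU, -1 the CPU), while `pipeline_kwargs` and `build_pipeline` are hypothetical helper names used only for illustration:

```python
MODEL_CHECKPOINT = "finiteautomata/bertweet-base-sentiment-analysis"

def pipeline_kwargs(use_gpu):
    """Keyword arguments for transformers.pipeline; device=0 is the first GPU, -1 the CPU."""
    return {
        "task": "sentiment-analysis",
        "model": MODEL_CHECKPOINT,
        "device": 0 if use_gpu else -1,
    }

def build_pipeline(use_gpu=True):
    # Imported lazily so this module can be loaded where transformers is absent.
    from transformers import pipeline
    return pipeline(**pipeline_kwargs(use_gpu))
```

Under gunicorn or uWSGI, each worker process that calls `build_pipeline` holds its own copy of the model, so GPU memory use grows with the number of workers.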