Using a Hugging-face transformer with arguments in a pipeline

【Question title】: Using Hugging-face transformer with arguments in pipeline
【Posted】: 2021-11-10 18:31:22
【Question】:

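I am using transformers.pipeline to get BERT embeddings for my input. Without the pipeline I can get output of a constant size, but with the pipeline I cannot, because I am unable to pass arguments to it.

How can I pass transformer-related arguments to my pipeline?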

from transformers import AutoTokenizer, AutoModel, pipeline

# These are the BERT model and tokenizer definitions
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = ['hello world']

# Normally I would do something like this to initialize the tokenizer and get the result with constant output
tokens = tokenizer(inputs,padding='max_length', truncation=True, max_length = 500, return_tensors="pt")
model(**tokens)[0].detach().numpy().shape


# using the pipeline 
pipeline("feature-extraction", model=model, tokenizer=tokenizer, device=0)

# or other option
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",padding='max_length', truncation=True, max_length = 500, return_tensors="pt")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

nlp=pipeline("feature-extraction", model=model, tokenizer=tokenizer, device=0)

# to call the pipeline
nlp("hello world")

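I have tried several approaches, such as the options listed above, but could not get results with a constant output size. A constant output size can be achieved by setting the tokenizer arguments, but I do not know how to pass those arguments to the pipeline.

Any ideas?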

【Comments】:

Can you add an example of inputs? What do you mean by constant output?
Updated the question.

Tags: pytorch huggingface-transformers bert-language-model transformer huggingface-tokenizers

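【Solution 1】:

The max_length tokenization parameter is not supported by default (i.e., no padding to max_length is applied), but you can create your own class and override this behavior: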

from transformers import AutoTokenizer, AutoModel
from transformers import FeatureExtractionPipeline
from transformers.tokenization_utils import TruncationStrategy

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = ['hello world']

class MyFeatureExtractionPipeline(FeatureExtractionPipeline):
    def _parse_and_tokenize(
        self, inputs, max_length, padding=True, add_special_tokens=True, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs
    ):
        """
        Parse arguments and tokenize
        """
        # Parse arguments
        if getattr(self.tokenizer, "pad_token", None) is None:
            padding = False
        inputs = self.tokenizer(
            inputs,
            add_special_tokens=add_special_tokens,
            return_tensors=self.framework,
            padding=padding,
            truncation=truncation,
            max_length=max_length
        )
        return inputs

mynlp = MyFeatureExtractionPipeline(model=model, tokenizer=tokenizer)
o = mynlp("hello world", max_length = 500, padding='max_length', truncation=True)

Let's look at the size of the output:

print(len(o))
print(len(o[0]))
print(len(o[0][0]))

Output:

1
500
768
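
The three numbers are the batch size (one input string), the sequence length (padded to max_length = 500) and the hidden size reported by the model (768). As a rough illustration of why the middle dimension becomes constant, the sketch below mimics padding='max_length' combined with truncation=True on plain lists of token ids. The helper name pad_or_truncate, the pad id of 0 and the token ids are assumptions made up for illustration; this is not the tokenizer's actual implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return a copy of token_ids that is exactly max_length long.

    Simplified: a real tokenizer also keeps special tokens such as [SEP]
    when truncating and returns an attention mask; both are omitted here.
    """
    clipped = list(token_ids[:max_length])  # corresponds to truncation=True
    # corresponds to padding='max_length'
    return clipped + [pad_id] * (max_length - len(clipped))


# Hypothetical token ids: one short input and one overly long input.
short = pad_or_truncate([5, 17, 42, 6], max_length=500)
long = pad_or_truncate(list(range(1, 801)), max_length=500)

print(len(short), len(long))
```

Both inputs end up with the same length, which is why the pipeline output has a fixed second dimension regardless of how long the input text is.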

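Please note: this only works with Transformers 4.10.X and earlier versions. The team is currently refactoring the pipeline classes, and future releases will require different adjustments (i.e., this will stop working as soon as the refactored pipelines are released).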
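
For readers on a release where the override above no longer applies: as far as I know, newer versions of the feature-extraction pipeline accept a tokenize_kwargs dictionary that is forwarded to the tokenizer, which would make the subclass unnecessary. The following is a minimal sketch under that assumption, so check it against the documentation of the installed version. The helper names fixed_length_kwargs, embed_fixed_length and demo are hypothetical, and the pipeline is passed in as an argument so the helper does not depend on a particular model.

```python
def fixed_length_kwargs(max_length):
    """Tokenizer arguments that pad and truncate every input to max_length."""
    return {"padding": "max_length", "truncation": True, "max_length": max_length}


def embed_fixed_length(feature_pipeline, text, max_length=500):
    """Call a feature-extraction pipeline with fixed-length tokenization.

    Assumption: feature_pipeline accepts a tokenize_kwargs argument at call time.
    """
    return feature_pipeline(text, tokenize_kwargs=fixed_length_kwargs(max_length))


def demo():
    """Not called automatically: running it downloads the model from the question."""
    from transformers import pipeline

    nlp = pipeline("feature-extraction", model="emilyalsentzer/Bio_ClinicalBERT")
    out = embed_fixed_length(nlp, "hello world")
    # Print the same three sizes that the answer above compares.
    print(len(out), len(out[0]), len(out[0][0]))
```

Calling demo() should print the three sizes in the same order as above. If the installed version rejects tokenize_kwargs, the subclass approach from this answer (on 4.10.X and earlier) remains the fallback.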
