【Question title】: How can I load a torchtext dataset for a machine translation task in PyTorch?
【Posted】: 2021-09-24 14:56:26
【Question】:

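I'm new to torchtext and have been using the Multi30k dataset to learn the basics. While learning, I'd like to try other datasets such as IWSLT2017. I read the documentation, which shows how to load the data.

This is how I load the Multi30k dataset: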

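# Assumption: torchtext 0.9-0.11, where Field and the older dataset classes live
# in torchtext.legacy (on torchtext <= 0.8: from torchtext import data, datasets).
from torchtext.legacy import data, datasets

# tokenize_de / tokenize_en are assumed to be defined elsewhere as callables
# that map a string to a list of tokens (e.g. spaCy-based tokenizers).

# Create the fields
SRC = data.Field(
    tokenize=tokenize_de,
    lower=True,
    init_token="<sos>",
    eos_token="<eos>",
)
TRG = data.Field(
    tokenize=tokenize_en,
    lower=True,
    init_token="<sos>",
    eos_token="<eos>",
)

# Split the dataset into train/validation/test sets
train_data, valid_data, test_data = datasets.Multi30k.splits(
    exts=(".de", ".en"),
    fields=(SRC, TRG),
)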

When I run this:

print(vars(train_data.examples[0]))

I get:

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}

My question is how to load IWSLT2017 so that calling print(vars(train_data.examples[0])) gives a similar result.

This is what I tried:

from torchtext.datasets import IWSLT2017
train_iter, valid_iter, test_iter = IWSLT2017(
    root='.data', split=('train', 'valid', 'test'), language_pair=('it', 'en')
)
src_sentence, tgt_sentence = next(train_iter)

It returns a tuple that looks like this:

('Sono impressionato da questa conferenza, e voglio ringraziare tutti voi per i tanti, lusinghieri commenti, anche perché... Ne ho bisogno!!!\n',
 'I have been blown away by this conference, and I want to thank all of you for the many nice comments\n')

My question is how to get from this step to something like this:

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}

Any help or input would be appreciated.

    Tags: python pytorch torch machine-translation torchtext


    【Solution 1】:

    For this you can use, for example, spaCy's processing pipeline. An example looks like this:

    import spacy
    from torchtext.datasets import IWSLT2017

    # Load the Italian-English splits; each item is a (source, target) pair of raw strings.
    train_iter, valid_iter, test_iter = IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('it', 'en'))

    # Take the first sentence pair from the training split.
    src_sentence, tgt_sentence = next(train_iter)
    print(src_sentence, tgt_sentence)

    # Tokenize the Italian source sentence with spaCy's Italian pipeline.
    nlp = spacy.load("it_core_news_sm")
    for doc in nlp.pipe([src_sentence]):
        print([token.text for token in doc])

    # Tokenize the English target sentence with spaCy's English pipeline.
    nlp = spacy.load("en_core_web_sm")
    for doc in nlp.pipe([tgt_sentence]):
        print([token.text for token in doc])
    

    Output for the first example sentence pair:

    Grazie mille, Chris. E’ veramente un grande onore venire su questo palco due volte. Vi sono estremamente grato.
    Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.
    

    Output for the tokenized sentences:

    ['Grazie', 'mille', ',', 'Chris', '.', 'E', '’', 'veramente', 'un', 'grande', 'onore', 'venire', 'su', 'questo', 'palco', 'due', 'volte', '.', 'Vi', 'sono', 'estremamente', 'grato', '.', '\n']
    ['Thank', 'you', 'so', 'much', ',', 'Chris', '.', 'And', 'it', "'s", 'truly', 'a', 'great', 'honor', 'to', 'have', 'the', 'opportunity', 'to', 'come', 'to', 'this', 'stage', 'twice', ';', 'I', "'m", 'extremely', 'grateful', '.', '\n']
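
To get from tokenized sentences to a dict shaped like the Multi30k example in the question, the two token lists can be stored under 'src' and 'trg' keys. The following is only a minimal sketch: `simple_tokenize`, `to_example` and `to_examples` are hypothetical helpers (not part of torchtext or spaCy), and the regex tokenizer is a stand-in so the snippet runs without downloading any models. Lowercasing mirrors `lower=True` in the question's fields, and stripping drops the trailing '\n' token seen above.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Tokenizer = Callable[[str], List[str]]


def simple_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): word runs or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def to_example(
    pair: Tuple[str, str],
    src_tokenize: Tokenizer = simple_tokenize,
    trg_tokenize: Tokenizer = simple_tokenize,
    lower: bool = True,
) -> Dict[str, List[str]]:
    """Turn one (source, target) pair of raw strings into a {'src', 'trg'} dict."""
    src, trg = pair
    # Remove the trailing newline that each raw sentence carries.
    src, trg = src.strip(), trg.strip()
    if lower:
        src, trg = src.lower(), trg.lower()
    return {"src": src_tokenize(src), "trg": trg_tokenize(trg)}


def to_examples(
    pairs: Iterable[Tuple[str, str]], **kwargs
) -> Iterator[Dict[str, List[str]]]:
    """Lazily convert every sentence pair of a dataset split."""
    for pair in pairs:
        yield to_example(pair, **kwargs)


# A shortened sentence pair in the same format the dataset yields.
pair = ("Grazie mille, Chris.\n", "Thank you so much, Chris.\n")
example = to_example(pair)
print(example)
```

To use spaCy instead of the regex stand-in, a callable such as `lambda s: [t.text for t in nlp_it(s)]` (with `nlp_it` being a loaded spaCy pipeline) could be passed as `src_tokenize`, and likewise for `trg_tokenize`; `to_examples(train_iter)` would then yield one dict per sentence pair.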
    
