【Question Title】:How can I load a dataset into a PyTorch torchtext.data.TabularDataset from a JSON file containing a list of dicts?
【Posted】:2020-12-14 15:18:38
【Question】:

I have a list of dicts like the following:

[{'text': ['The', 'Fulton', 'County', 'Grand', ...], 'tags': ['AT', 'NP-TL', 'NN-TL', 'JJ-TL', ...]},
 {'text': ['The', 'jury', 'further', 'said', ...], 'tags': ['AT', 'NN', 'RBR', 'VBD', ...]},
 ...]

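Each value in each dict is the list of words or tags for one sentence. The data comes straight from the Brown corpus in the NLTK datasets, loaded as follows: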

from nltk.corpus import brown
data = brown.tagged_sents()
data = {'text': [[word for word, tag in sent] for sent in data], 'tags': [[tag for word, tag in sent] for sent in data]}

import pandas as pd
# Build a DataFrame with one row per sentence from the dict of lists above
df = pd.DataFrame(data, columns=["text", "tags"])

from sklearn.model_selection import train_test_split
train, val = train_test_split(df, test_size=0.2)
train.to_json("train.json", orient='records')
val.to_json("val.json", orient='records')

I want to load this JSON into a torchtext.data.TabularDataset:

TEXT = data.Field(lower=True)
TAGS = data.Field(unk_token=None)

data_fields = [('text', TEXT), ('tags', TAGS)]
train, val = data.TabularDataset.splits(path='./', train='train.json', validation='val.json', format='json', fields=data_fields)

But it gives me this error:

/usr/local/lib/python3.6/dist-packages/torchtext/data/example.py in fromdict(cls, data, fields)
     17     def fromdict(cls, data, fields):
     18         ex = cls()
---> 19         for key, vals in fields.items():
     20             if key not in data:
     21                 raise ValueError("Specified key {} was not found in "

AttributeError: 'list' object has no attribute 'items'

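Note that I don't want TabularDataset to tokenize the sentences for me, since they have already been tokenized by NLTK. How should I handle this? (I can't switch to a corpus that can be loaded directly from torchtext.datasets; I have to use the Brown corpus.)

【Comments】:

    Tags: python pytorch nltk torchtext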


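    【Solution 1】:

    For anyone looking at this question now, note that it uses the legacy version of torchtext. You can still use this functionality, but you need to go through the legacy namespace, for example: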

    from torchtext import data
    from torchtext import datasets
    from torchtext import legacy
    
    TEXT = legacy.data.Field()
    TAGS = legacy.data.Field()
    

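    I would then suggest formatting data_fields as a dict like this, since the JSON loader calls fields.items() (as the traceback in the question shows). Each key must match a key in the JSON records, here 'text' and 'tags':

    fields = {'text': ('text', TEXT), 'tags': ('tags', TAGS)}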
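
To illustrate why the dict form matters: the traceback in the question shows `fromdict` iterating over `fields.items()` and checking each key against the record. The sketch below is a simplified, hypothetical stand-in for that logic (`fromdict_sketch` is not a torchtext function, and the placeholder strings stand in for the `Field` objects); it only mirrors the lines visible in the traceback. It also assumes, as far as I can tell from the legacy loader, that `format='json'` reads one JSON object per line, so the demo file is written as JSON Lines rather than as a single array. With pandas, that would presumably mean `to_json(..., orient='records', lines=True)`.

```python
import json
import os
import tempfile

def fromdict_sketch(record, fields):
    """Hypothetical stand-in mirroring the fromdict lines shown in the traceback."""
    example = {}
    for key, (attr_name, field) in fields.items():  # a list has no .items()
        if key not in record:
            raise ValueError(f"Specified key {key} was not found in the input data")
        example[attr_name] = record[key]  # pre-tokenized lists are kept as they are
    return example

records = [
    {"text": ["The", "jury", "said"], "tags": ["AT", "NN", "VBD"]},
    {"text": ["The", "Fulton", "County"], "tags": ["AT", "NP-TL", "NN-TL"]},
]

# Write one JSON object per line (JSON Lines), the layout assumed above.
path = os.path.join(tempfile.mkdtemp(), "train.json")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Placeholders for the Field objects; the keys match the keys in the records.
dict_fields = {"text": ("text", "TEXT"), "tags": ("tags", "TAGS")}
list_fields = [("text", "TEXT"), ("tags", "TAGS")]

# Dict-style fields: each line parses into one example.
with open(path) as f:
    examples = [fromdict_sketch(json.loads(line), dict_fields) for line in f]

# List-style fields: reproduces the AttributeError from the question.
try:
    fromdict_sketch(records[0], list_fields)
    list_error = None
except AttributeError as err:
    list_error = str(err)
```

A key that does not appear in the records (for example 'tag' instead of 'tags') would hit the `ValueError` branch instead, which is why the keys of the dict have to match the JSON keys exactly.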
    

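    This should solve the problem. For anyone using the newer torchtext functionality, the way to do it is:

    To create an iterable dataset, you can use _RawTextIterableDataset. Here is an example that loads from a JSON file: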

    import json

    def _create_data_from_json(data_path):
        # Yield one (tags, text) tuple per record in the JSON file
        with open(data_path) as json_file:
            raw_json_data = json.load(json_file)
            for item in raw_json_data:
                _label, _paragraph = item['tags'], item['text']
                yield (_label, _paragraph)
    
    
    #Load torchtext utilities needed to convert (label, paragraph) tuple into iterable dataset               
    from torchtext.data.datasets_utils import (
        _RawTextIterableDataset,
        _wrap_split_argument,
        _add_docstring_header,
        _create_dataset_directory,
    )
    
    #Dictionary of data sources. The train and test data JSON files have items consisting of paragraphs and labels
    DATA_SOURCE = {
        'train': 'data/train_data.json',
        'test': 'data/test_data.json'
    }
    
    #This is the number of lines/items in each data set
    NUM_LINES = {
        'train': 200,
        'test': 100,
    }
    
    #Naming the dataset
    DATASET_NAME = "BAR"
    
    #This function returns the iterable dataset for whichever split is passed in
    @_add_docstring_header(num_lines=NUM_LINES, num_classes=2)
    @_create_dataset_directory(dataset_name=DATASET_NAME)
    @_wrap_split_argument(('train', 'test'))
    def FOO(root, split):
        return _RawTextIterableDataset(DATASET_NAME, NUM_LINES[split],
                                     _create_data_from_json(DATA_SOURCE[split]))
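
The counts in NUM_LINES above are specific to that example's files. Below is a small sketch of deriving them from the JSON files instead of hardcoding them, assuming each file holds a single JSON array of records, as written by `to_json(orient='records')` in the question. `count_records` is a hypothetical helper, and the demo files are written to a temporary directory only to keep the sketch self-contained.

```python
import json
import os
import tempfile

def count_records(data_path):
    """Return the number of records in a JSON file that holds one array."""
    with open(data_path) as json_file:
        return len(json.load(json_file))

# Demo files standing in for the real train/test JSON files (toy content).
tmp_dir = tempfile.mkdtemp()
demo_sources = {
    "train": os.path.join(tmp_dir, "train.json"),
    "test": os.path.join(tmp_dir, "test.json"),
}
toy = {"text": ["The", "jury"], "tags": ["AT", "NN"]}
for split, n_records in (("train", 3), ("test", 1)):
    with open(demo_sources[split], "w") as f:
        json.dump([toy] * n_records, f)

# Build the split -> size mapping from the files themselves.
num_lines = {split: count_records(path) for split, path in demo_sources.items()}
print(num_lines)
```

This reads each file once up front, which is fine for small files but may be worth caching for large corpora.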
    

    You can then call this function to get your iterable datasets:

    #Get iterable for train and test data sets
    train_iter, test_iter = FOO(split=('train', 'test'))
    

    _create_data_from_json can be replaced by any function that yields tuples from your data source.

    【Comments】:
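
As one possible replacement, here is a minimal sketch of a generator that yields the same kind of (tags, words) tuples directly from tagged sentences, skipping the JSON file. It assumes the input has the shape returned by `brown.tagged_sents()` in the question, that is, a list of sentences made of (word, tag) pairs. `create_data_from_tagged_sents` and the toy data are hypothetical names introduced for illustration only.

```python
def create_data_from_tagged_sents(tagged_sents):
    """Yield one (tags, words) tuple per sentence of (word, tag) pairs."""
    for sent in tagged_sents:
        words = [word for word, _ in sent]
        tags = [tag for _, tag in sent]
        yield (tags, words)

# Toy stand-in for brown.tagged_sents(): sentences of (word, tag) pairs.
toy_tagged_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("The", "AT"), ("Fulton", "NP-TL")],
]

pairs = list(create_data_from_tagged_sents(toy_tagged_sents))
```

Because the words are already tokenized, nothing here re-tokenizes them. Note that a generator can only be consumed once, so it needs to be recreated for each pass over the data.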
