【Question title】: Tokenizing text with features in a specific format
【Posted】: 2021-11-03 07:10:54
【Question】:

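Hello, I am trying to split the following sample text into tokens, attach some features to each token, and arrange the result in a specific JSON format: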

words = ['The study of aviation safety report in the aviation industry usually relies', 
         'The experimental results show that compared with traditional',
         'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']
{"sentence": [
           {
             "indexSentence": 0,
             "tokens": [{
                       "indexWord": 1,
                        "word": "The",
                         "len": 3
                      },
                      { "indexWord": 2,
                        "word": "study",
                         "len": 5},
                      {"indexWord": 3,
                        "word": "of",
                         "len": 2
                       },
                       {"indexWord": 4,
                        "word": "aviation",
                         "len": 8},
                        ...
                        ]
           },
           {
            "indexSentence" : 1,
            "tokens" : [{
                        ...
                        }]
           },
           ....
         ]}

I tried the following code, without success:

t_d = {len(i):i for i in words}

[{'Lon' : len(t_d[i]),
  'tex' : t_d[i], 
  'Sub' : [{'index' : j,
            'token': [{
                      'word':['word: ' + j for i,j in enumerate(str(t_d[i]).split(' '))] 
                      }],
            'lenTo' : len(str(t_d[i]).split(' '))
           }
          ],
  'Sub1':[{'index' : j}]
 } for j,i in enumerate(t_d)]

【Comments】:

    Tags: python json dictionary nlp token


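    【Solution 1】:

    The solution below assumes that tokenization is done with str.split, which splits each sentence on whitespace. It should work just as well with any other tokenization function.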

    from collections import defaultdict
    
    words = ['The study of aviation safety report in the aviation industry usually relies', 
             'The experimental results show that compared with traditional',
             'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']
    
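    # defaultdict(list) creates the 'sentence' list the first time it is accessed
    sentence = defaultdict(list)
    
    for idx, text in enumerate(words):
        # One entry per sentence; each token records its position, text, and length
        struct = {"indexSentence": idx,
                  "tokens": [{"indexWord": idx_w,
                              "word": w,
                              "len": len(w)} for idx_w, w in enumerate(text.split())]}
        sentence['sentence'].append(struct)
        
    # Convert back to a plain dict to mimic the JSON structure
    dict(sentence)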
    
    >>
    {'sentence': [{'indexSentence': 0,
       'tokens': [{'indexWord': 0, 'word': 'The', 'len': 3},
        {'indexWord': 1, 'word': 'study', 'len': 5},
        {'indexWord': 2, 'word': 'of', 'len': 2},
        {'indexWord': 3, 'word': 'aviation', 'len': 8},
        {'indexWord': 4, 'word': 'safety', 'len': 6},
        {'indexWord': 5, 'word': 'report', 'len': 6},
        {'indexWord': 6, 'word': 'in', 'len': 2},
        {'indexWord': 7, 'word': 'the', 'len': 3},
        {'indexWord': 8, 'word': 'aviation', 'len': 8},
        {'indexWord': 9, 'word': 'industry', 'len': 8},
        {'indexWord': 10, 'word': 'usually', 'len': 7},
        {'indexWord': 11, 'word': 'relies', 'len': 6}]},
      {'indexSentence': 1,
       'tokens': [{'indexWord': 0, 'word': 'The', 'len': 3},
    ...
    }
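
    The answer says the approach is not tied to str.split. One way to make that concrete is to pass the tokenizer in as a parameter. The sketch below is illustrative only: build_structure and regex_tokenize are hypothetical names that do not appear in the original answer, and the regular expression is just one possible choice that drops punctuation such as the colon in "Cases:".

```python
import re

def build_structure(sentences, tokenize=str.split):
    """Build the nested sentence/token structure using any tokenizer callable."""
    return {
        "sentence": [
            {
                "indexSentence": idx,
                "tokens": [
                    {"indexWord": idx_w, "word": w, "len": len(w)}
                    for idx_w, w in enumerate(tokenize(text))
                ],
            }
            for idx, text in enumerate(sentences)
        ]
    }

def regex_tokenize(text):
    # Hypothetical tokenizer: keeps word characters, apostrophes and hyphens,
    # and drops other punctuation
    return re.findall(r"[\w'-]+", text)

words = ['Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']

by_space = build_structure(words)                           # default: whitespace split
by_regex = build_structure(words, tokenize=regex_tokenize)  # punctuation stripped

print(by_space["sentence"][0]["tokens"][3]["word"])  # Cases:
print(by_regex["sentence"][0]["tokens"][3]["word"])  # Cases
```

    Because the tokenizer is an ordinary callable, a tokenizer from an NLP library could be passed in the same way, as long as it takes a string and returns a list of strings.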
    

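    You can use a defaultdict to create the list first and then append the required structures to it. To mimic the JSON structure, you can convert the result back to a dict.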
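
    If the goal is actual JSON text rather than a Python dict, the standard json module can serialize the structure. The sketch below is a minimal variation under two assumptions that go beyond the original answer: it uses a plain dict instead of defaultdict, and it numbers words from 1 via enumerate(..., start=1), because the sample in the question starts indexWord at 1 while the output above starts at 0.

```python
import json

words = ['The study of aviation safety report in the aviation industry usually relies',
         'The experimental results show that compared with traditional']

result = {"sentence": []}
for idx, text in enumerate(words):
    # start=1 gives 1-based word indices, matching the sample in the question
    tokens = [{"indexWord": idx_w, "word": w, "len": len(w)}
              for idx_w, w in enumerate(text.split(), start=1)]
    result["sentence"].append({"indexSentence": idx, "tokens": tokens})

# Serialize to a JSON string; indent=2 is only for readability
json_text = json.dumps(result, indent=2)
print(json_text)
```

    To write the result to a file instead, json.dump(result, file_object) takes the same structure.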

    【Discussion】:

    • Thanks @BernardL, this is the solution I was trying to find. I will use it with several files and let you know how it goes.