【Question title】: How to extract data from an unstructured text file into a JSON object
【Posted】: 2019-09-26 23:18:15
【Question description】:

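I need some advice here. I have a text file containing information that needs to be extracted and saved as a JSON file. The file is unstructured and organized in blocks. Please see below:

How can I do this? I just don't know how to start. My idea is to look for "Type : Router", but how do I iterate over each block and pick only the P-2-P link details? Thanks for your advice.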

  Type      : Router
  Ls id     : 1.1.1.2
  Adv rtr   : 1.1.1.2  
  Ls age    : 201 
  Len       : 84   
  Link count: 5
   * Link ID: 1.1.1.2    
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium
   * Link ID: 1.1.1.4    
     Data   : 192.168.100.34  
     Link Type: P-2-P        
     Metric : 1
   * Link ID: 192.168.100.33  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium
   * Link ID: 1.1.1.1    
     Data   : 192.168.100.53  
     Link Type: P-2-P        
     Metric : 1
   * Link ID: 192.168.100.54  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium

  Type      : Router
  Ls id     : 1.1.1.1
  Adv rtr   : 1.1.1.1  
  Ls age    : 1699 
  Len       : 96 
  Options   :  ASBR  E  
  seq#      : 80008d72 
  chksum    : 0x16fc
  Link count: 6
   * Link ID: 1.1.1.1    
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium
   * Link ID: 1.1.1.1    
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 12 
     Priority : Medium
   * Link ID: 1.1.1.3    
     Data   : 192.168.100.26  
     Link Type: P-2-P        
     Metric : 10
   * Link ID: 192.168.100.25  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 10 
     Priority : Medium
   * Link ID: 1.1.1.2    
     Data   : 192.168.100.54  
     Link Type: P-2-P        
     Metric : 10
   * Link ID: 192.168.100.53  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 10 
     Priority : Medium

Extract only the blocks with Type: Router. The information to capture in each block is:

(1)Ls id  : 1.1.1.2
and under Link count, capture only the entries that have Link Type: P-2-P:
(a)Link ID: 1.1.1.4
(b)Data   : 192.168.100.34
(c)Link Type: P-2-P
(d)Metric : 1

(a)Link ID: 1.1.1.1
(b)Data   : 192.168.100.53
(c)Link Type: P-2-P
(d)Metric : 1

Then, for the other Type: Router block, capture:
(2)Ls id  : 1.1.1.1
and under Link count, capture only the entries that have Link Type: P-2-P:
(a)Link ID: 1.1.1.3   
(b)Data   : 192.168.100.26 
(c)Link Type: P-2-P 
(d)Metric : 10

(a)Link ID: 1.1.1.2    
(b)Data   : 192.168.100.54  
(c)Link Type: P-2-P    
(d)Metric : 10

**There is another Link Type (StubNet), but I am only interested in capturing the entries that have Link Type: P-2-P.**

The desired JSON format is:

{
  "oppf": [
    {
      "Sid": "1.1.1.2",
      "Did": "1.1.1.4",
      "Sport": "192.168.100.34",
      "Netype": "P-2-P",
      "Metric": "1"
    },
    {
      "Sid": "1.1.1.2",
      "Did": "1.1.1.1",
      "Sport": "192.168.100.53",
      "Netype": "P-2-P",
      "Metric": "1"
    },
    {
      "Sid": "1.1.1.1",
      "Did": "1.1.1.3",
      "Sport": "192.168.100.26",
      "Netype": "P-2-P",
      "Metric": "10"
    },
    {
      "Sid": "1.1.1.1",
      "Did": "1.1.1.2",
      "Sport": "192.168.100.54",
      "Netype": "P-2-P",
      "Metric": "10"
    }
  ]
}

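【Question discussion】:

  • It is actually well structured: different indentation levels identify the sub-lists, "*" marks the start of a new dictionary, and blank lines mark the start of a new Router block. You can also split each line on ":" to get the key and the value.

Tags: python json string text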


【Solution 1】:

To get only the P-2-P entries:

import json

data = "..."

result = {}
links = []
# Blocks are separated by blank lines.
for block in data.split("\n\n"):
    if not block:
        continue
    # The first part is the block header; every later part is one "* Link ID" entry.
    parts = block.split("*")
    for line in parts[0].split("\n"):
        if "Ls id" in line:
            ls_id, ip = line.split(": ")
            ls_id = ls_id.strip()
            ip = ip.strip()
    for part in parts[1:]:
        if "P-2-P" in part:
            # Start each entry with the "Ls id" of the enclosing block.
            temp = {ls_id: ip}
            for item in part.split("\n"):
                try:
                    key, value = item.split(": ")
                    temp[key.strip()] = value.strip()
                except ValueError:
                    # Skip lines that are not "key: value" pairs.
                    pass
            links.append(temp)
result["oppf"] = links
print(json.dumps(result, indent=2))
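The loop above keeps the field names from the dump ("Ls id", "Link ID", "Data", "Link Type", "Metric"), while the question asks for "Sid", "Did", "Sport", "Netype" and "Metric". A minimal sketch of the renaming step follows, assuming each entry has the shape built in that loop. KEY_MAP and rename_entry are hypothetical names introduced here for illustration; they are not part of the original answer.

```python
import json

# Hypothetical mapping from the field names in the dump to the names in the desired JSON.
KEY_MAP = {"Link ID": "Did", "Data": "Sport", "Link Type": "Netype", "Metric": "Metric"}


def rename_entry(entry):
    """Convert one parsed P-2-P entry to the layout asked for in the question."""
    renamed = {"Sid": entry["Ls id"]}
    for old_key, new_key in KEY_MAP.items():
        renamed[new_key] = entry[old_key]
    return renamed


# One entry in the shape produced by the loop above.
sample = {
    "Ls id": "1.1.1.2",
    "Link ID": "1.1.1.4",
    "Data": "192.168.100.34",
    "Link Type": "P-2-P",
    "Metric": "1",
}
print(json.dumps({"oppf": [rename_entry(sample)]}, indent=2))
```

Applying rename_entry to every element of the collected list before assigning it to result["oppf"] would give the key names requested in the question.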

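【Discussion】:

    【Solution 2】:

    To me it is well structured. It has different indentation levels to identify sub-items, "*" to mark the start of a new dictionary, and blank lines to mark the start of a new Router block. It also has ":" to split each line into a key and a value.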

    data = '''  Type      : Router
      Ls id     : 1.1.1.2
      Adv rtr   : 1.1.1.2  
      Ls age    : 201 
      Len       : 84   
      Link count: 5
       * Link ID: 1.1.1.2    
         Data   : 255.255.255.255 
         Link Type: StubNet      
         Metric : 1 
         Priority : Medium
       * Link ID: 1.1.1.4    
         Data   : 192.168.100.34  
         Link Type: P-2-P        
         Metric : 1
       * Link ID: 192.168.100.33  
         Data   : 255.255.255.255 
         Link Type: StubNet      
         Metric : 1 
         Priority : Medium
       * Link ID: 1.1.1.1    
         Data   : 192.168.100.53  
         Link Type: P-2-P        
         Metric : 1
       * Link ID: 192.168.100.54  
         Data   : 255.255.255.255 
         Link Type: StubNet      
         Metric : 1 
         Priority : Medium
    
      Type      : Router
      Ls id     : 1.1.1.1
      Adv rtr   : 1.1.1.1  
      Ls age    : 1699 
      Len       : 96 
      Options   :  ASBR  E  
      seq#      : 80008d72 
      chksum    : 0x16fc
      Link count: 6
       * Link ID: 1.1.1.1    
         Data   : 255.255.255.255 
         Link Type: StubNet      
         Metric : 1 
         Priority : Medium
       * Link ID: 1.1.1.1    
         Data   : 255.255.255.255 
         Link Type: StubNet      
         Metric : 12 
         Priority : Medium
       * Link ID: 1.1.1.3    
         Data   : 192.168.100.26  
         Link Type: P-2-P        
         Metric : 10
       * Link ID: 192.168.100.25  
         Data   : 255.255.255.255 
         Link Type: StubNet      
         Metric : 10 
         Priority : Medium
       * Link ID: 1.1.1.2    
         Data   : 192.168.100.54  
         Link Type: P-2-P        
         Metric : 10
       * Link ID: 192.168.100.53  
         Data   : 255.255.255.255 
         Link Type: StubNet      
         Metric : 10 
         Priority : Medium'''
    
    results = []
    group = {}
    group['items'] = []
    subgroup = None
    
    for line in data.split('\n'):
        if not line.strip():
            # A blank line ends the current Router block; keep its last link entry.
            if subgroup:
                group['items'].append(subgroup)
            results.append(group)
            group = {}
            group['items'] = []
            subgroup = None
        elif not line.startswith('   '):
            # Header lines are indented by fewer than three spaces.
            key, val = line.split(':', 1)
            group[key.strip()] = val.strip()
        else:
            if '*' in line:
                # "*" marks the start of a new link entry.
                if subgroup:
                    group['items'].append(subgroup)
                subgroup = {}
            key, val = line.split(':', 1)
            subgroup[key.replace('*', '').strip()] = val.strip()

    # Flush the last link entry and the last block.
    if subgroup:
        group['items'].append(subgroup)
    results.append(group)

    print(results)


    To display it nicely:

    import json
    print(json.dumps(results, indent=2))


    Result:

    [
      {
        "items": [
          {
            "Link ID": "1.1.1.2",
            "Data": "255.255.255.255",
            "Link Type": "StubNet",
            "Metric": "1",
            "Priority": "Medium"
          },
          {
            "Link ID": "1.1.1.4",
            "Data": "192.168.100.34",
            "Link Type": "P-2-P",
            "Metric": "1"
          },
          {
            "Link ID": "192.168.100.33",
            "Data": "255.255.255.255",
            "Link Type": "StubNet",
            "Metric": "1",
            "Priority": "Medium"
          },
          {
            "Link ID": "1.1.1.1",
            "Data": "192.168.100.53",
            "Link Type": "P-2-P",
            "Metric": "1"
          },
          {
            "Link ID": "192.168.100.54",
            "Data": "255.255.255.255",
            "Link Type": "StubNet",
            "Metric": "1",
            "Priority": "Medium"
          }
        ],
        "Type": "Router",
        "Ls id": "1.1.1.2",
        "Adv rtr": "1.1.1.2",
        "Ls age": "201",
        "Len": "84",
        "Link count": "5"
      },
      {
        "items": [
          {
            "Link ID": "1.1.1.1",
            "Data": "255.255.255.255",
            "Link Type": "StubNet",
            "Metric": "1",
            "Priority": "Medium"
          },
          {
            "Link ID": "1.1.1.1",
            "Data": "255.255.255.255",
            "Link Type": "StubNet",
            "Metric": "12",
            "Priority": "Medium"
          },
          {
            "Link ID": "1.1.1.3",
            "Data": "192.168.100.26",
            "Link Type": "P-2-P",
            "Metric": "10"
          },
          {
            "Link ID": "192.168.100.25",
            "Data": "255.255.255.255",
            "Link Type": "StubNet",
            "Metric": "10",
            "Priority": "Medium"
          },
          {
            "Link ID": "1.1.1.2",
            "Data": "192.168.100.54",
            "Link Type": "P-2-P",
            "Metric": "10"
          },
          {
            "Link ID": "192.168.100.53",
            "Data": "255.255.255.255",
            "Link Type": "StubNet",
            "Metric": "10",
            "Priority": "Medium"
          }
        ],
        "Type": "Router",
        "Ls id": "1.1.1.1",
        "Adv rtr": "1.1.1.1",
        "Ls age": "1699",
        "Len": "96",
        "Options": "ASBR  E",
        "seq#": "80008d72",
        "chksum": "0x16fc",
        "Link count": "6"
      }
    ]
    

    So now you have a Python structure, and you can pull whatever you need out of it.

    【Discussion】:
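One way to get from this structure to the JSON layout asked for in the question is to keep only the items whose "Link Type" is "P-2-P" and rename the fields. The following is a sketch, not part of the original answer: to_oppf is a hypothetical helper name, and the small results list is a shortened stand-in for the parsed data.

```python
import json


def to_oppf(results):
    """Flatten parsed Router blocks into {"oppf": [...]}, keeping P-2-P links only."""
    links = []
    for group in results:
        for item in group.get("items", []):
            if item.get("Link Type") == "P-2-P":
                links.append({
                    "Sid": group["Ls id"],
                    "Did": item["Link ID"],
                    "Sport": item["Data"],
                    "Netype": item["Link Type"],
                    "Metric": item["Metric"],
                })
    return {"oppf": links}


# A shortened stand-in for the parsed list of Router blocks.
results = [
    {
        "Ls id": "1.1.1.2",
        "items": [
            {"Link ID": "1.1.1.2", "Data": "255.255.255.255",
             "Link Type": "StubNet", "Metric": "1", "Priority": "Medium"},
            {"Link ID": "1.1.1.4", "Data": "192.168.100.34",
             "Link Type": "P-2-P", "Metric": "1"},
        ],
    },
]

oppf = to_oppf(results)
print(json.dumps(oppf, indent=2))
```

To save the result as a file instead of printing it, the same dictionary can be passed to json.dump together with a file object opened for writing.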
