【Question title】:Splitting a unique string - Python [closed]
【Posted】:2020-02-15 04:55:01
【Question description】:

I am trying to find the best way to parse strings like this one:

Operating Status: NOT AUTHORIZED Out of Service Date: None

I need output like this:

['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']

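Is there a simple way to do this? I am parsing hundreds of strings like this. There is no fixed text, but the strings always follow the format above.

Other example strings: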

MC/MX/FF Number(s): None  DUNS Number: -- 
Power Units: 1  Drivers: 1 

Expected output:

['MC/MX/FF Number(s): None', 'DUNS Number: --']
['Power Units: 1', 'Drivers: 1']

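【Comments】:

  • Just one approach: try keeping a list of all the keys yourself, and go from there
  • This question doesn't appear to include any attempt at all to solve the problem. Please edit the question to show what you've tried, and use a Minimal, Complete, and Verifiable example to show the specific roadblock you ran into. For more information, see How to Ask
  • Sorry bro, but I can't find any pattern among the possible strings. So my opinion is that this can't be solved without knowing all the possible strings.
  • You can split on :, but the problem is that there is no way to know whether it should be ['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None'] or ['Operating Status: NOT AUTHORIZED Out of Service', 'Date: None']

Tags: python python-3.x string parsing text-parsing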


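【Solution 1】:

There are two approaches. Both are super hacky and break on even very small variations in the original strings. However, you can modify the code to make it more flexible.

Both options depend on the lines meeting these characteristics... The grouping in question must...

  1. Start with a letter or a slash, possibly capitalized
  2. Have the title of interest followed by a colon (":")
  3. Take only the first word after the colon.

Method one, regex. This one can only grab two chunks of data. The second group is "everything else", because I couldn't get the search pattern to repeat properly :P

Code: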

import re

l = [ 'MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ' ]

# Raw strings keep the backslashes intact for the regex engine
pattern = ''.join([
    r"(",          # Start capturing group 1
    r"\s*[A-Z/]",  # Optional whitespace, then the first capital letter or forward slash
    r".+?:",       # Any characters (non-greedy) up to and including the colon
    r"\s*",        # Zero or more whitespace characters
    r"\w+\s*",     # One or more word characters, i.e. [a-zA-Z0-9_], plus trailing whitespace
    r")",          # End capturing group 1
    r"(.*)",       # Group 2: everything else
])

for s in l:
    m = re.search(pattern, s)
    print("----------------")
    if m:  # re.search returns None when nothing matches
        print(m.group(1))
        print(m.group(2))

Output:

----------------
MC/MX/FF Number(s): None 
DUNS Number: -- 
----------------
Power Units: 1 
Drivers: 1 
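
The repetition that method one could not achieve can be sketched with re.findall, which returns every non-overlapping match. This is only an illustration, not the answerer's code: the pattern and the split_fields name are assumptions, and \S+ replaces \w+ so that a value such as "--" is captured.

```python
import re

# Assumed pattern (not from the original answer): a header starting with a
# capital letter or slash, everything up to the colon, then one non-space token.
FIELD = re.compile(r"[A-Z/][^:]*:\s*\S+")

def split_fields(s):
    """Return every 'Header: value' chunk found in s (hypothetical helper)."""
    return FIELD.findall(s)

for s in ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']:
    print(split_fields(s))
# ['MC/MX/FF Number(s): None', 'DUNS Number: --']
# ['Power Units: 1', 'Drivers: 1']
```

Like the original pattern, this only takes the first word after each colon, so 'Operating Status: NOT AUTHORIZED Out of Service Date: None' comes back as ['Operating Status: NOT', 'AUTHORIZED Out of Service Date: None'].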

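Method two, parse the string word by word. This method relies on the same basic characteristics as the regex, but it can handle more than two chunks of interest. How it works...

  1. Start parsing each string word by word, loading the words into newstring
  2. When a colon is encountered, set a flag.
  3. On the next loop iteration, add the first word to newstring. If needed, you can change this to 1-2, 1-3, or 1-n words. You could also keep adding words after colonflag is set until some condition is met, such as a word with a capital letter... although that may break on words like "None". You could keep going until you hit an all-caps word, but a title that is not all caps would break it.
  4. Append newstring to newlist, reset the flag, and continue parsing words.

Code: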

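l = [ 'MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ' ]

for s in l:
    newlist = []
    newstring = ""
    colonflag = False
    for w in s.split():
        newstring += " " + w  # Note: every chunk starts with a leading space
        if colonflag:
            # This is the first word after a colon: close the current chunk
            newlist.append(newstring)
            newstring = ""
            colonflag = False

        if ":" in w:
            # A header just ended; the next word is its value
            colonflag = True
    print(newlist)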

Output:

[' MC/MX/FF Number(s): None', ' DUNS Number: --']
[' Power Units: 1', ' Drivers: 1']
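
Step 3 mentions taking more than one word after the colon. A minimal sketch of that variation follows; the split_words function and its n_words parameter are hypothetical names, not part of the original answer. It joins the words with " ".join, so the chunks have no leading space.

```python
def split_words(s, n_words=1):
    """Word-by-word split that keeps n_words words after each header colon."""
    chunks, current, remaining = [], [], 0
    for w in s.split():
        current.append(w)
        if remaining:              # we are inside a value
            remaining -= 1
            if remaining == 0:     # value complete: close the chunk
                chunks.append(" ".join(current))
                current = []
        elif w.endswith(":"):      # a header just ended
            remaining = n_words
    if current:                    # flush a value shorter than n_words
        chunks.append(" ".join(current))
    return chunks

print(split_words('MC/MX/FF Number(s): None DUNS Number: -- '))
# ['MC/MX/FF Number(s): None', 'DUNS Number: --']
print(split_words('Operating Status: NOT AUTHORIZED Out of Service Date: None', n_words=2))
# ['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']
```

The caller still has to know how many words each value has, so this only helps when the value length is the same for every field in a string, or when the shorter value comes last.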

Third option: Create a list of all expected headers, e.g. header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):", "DUNS Number:", "Power Units:", "Drivers:", ], and split/parse based on those headers.
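
A rough sketch of the header-list idea, assuming every header that can occur is known up front. The split_on_headers helper is a hypothetical name; the approach is to locate each known header with a regex and cut the string at the header positions.

```python
import re

header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):",
               "DUNS Number:", "Power Units:", "Drivers:"]

# Escape each header (parentheses and slashes are literal here) and try longer
# headers first, so a long header wins over a shorter one it happens to contain.
header_re = re.compile(
    "|".join(re.escape(h) for h in sorted(header_list, key=len, reverse=True))
)

def split_on_headers(s):
    """Cut s at every known header; each chunk runs up to the next header."""
    matches = list(header_re.finditer(s))
    parts = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(s)
        parts.append(s[m.start():end].strip())
    return parts

print(split_on_headers('Operating Status: NOT AUTHORIZED Out of Service Date: None'))
# ['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']
print(split_on_headers('MC/MX/FF Number(s): None  DUNS Number: -- '))
# ['MC/MX/FF Number(s): None', 'DUNS Number: --']
```

Unlike the two methods above, this handles multi-word values such as "NOT AUTHORIZED", at the cost of maintaining the header list by hand.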

Fourth option

Use Natural Language Processing and machine learning to actually figure out where the logical sentences are ;)

【Comments】:

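    【Solution 2】:

    Take a look at pyparsing. It seems to be the most "natural" way to express combinations of words, detect the relationships between them (in a grammatical way), and produce a structured response... There are plenty of tutorials and documentation online.

    You can install pyparsing with `pip install pyparsing`

    To parse: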

    Operating Status: NOT AUTHORIZED Out of Service Date: None
    

    you need something like:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    #
    #  test_pyparsing2.py
    #
    #  Copyright 2019 John Coppens <john@jcoppens.com>
    #
    #  This program is free software; you can redistribute it and/or modify
    #  it under the terms of the GNU General Public License as published by
    #  the Free Software Foundation; either version 2 of the License, or
    #  (at your option) any later version.
    #
    #  This program is distributed in the hope that it will be useful,
    #  but WITHOUT ANY WARRANTY; without even the implied warranty of
    #  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    #  GNU General Public License for more details.
    #
    #  You should have received a copy of the GNU General Public License
    #  along with this program; if not, write to the Free Software
    #  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
    #  MA 02110-1301, USA.
    #
    #
    
    import pyparsing as pp
    
    def create_parser():
        opstatus = pp.Keyword("Operating Status:")
        auth     = pp.Combine(pp.Optional(pp.Keyword("NOT"))) + pp.Keyword("AUTHORIZED")
        status   = pp.Keyword("Out of Service Date:")
        date     = pp.Keyword("None")
    
        part1    = pp.Group(opstatus + auth)
        part2    = pp.Group(status + date)
    
        return part1 + part2
    
    
    
    def main(args):
        parser = create_parser()
    
        msg = "Operating Status: NOT AUTHORIZED Out of Service Date: None"
        print(parser.parseString(msg))
    
        msg = "Operating Status: AUTHORIZED Out of Service Date: None"
        print(parser.parseString(msg))
    
        return 0
    
    if __name__ == '__main__':
        import sys
        sys.exit(main(sys.argv))
    

    Running the program:

    [['Operating Status:', 'NOT', 'AUTHORIZED'], ['Out of Service Date:', 'None']]
    [['Operating Status:', '', 'AUTHORIZED'], ['Out of Service Date:', 'None']]
    

    Using Combine and Group, you can change how the output is organized.

    【Comments】:
