【Question title】: Extract user id numbers from a .csv file
【Posted】: 2019-04-08 08:22:45
【Question】:

I have a csv file containing user information. A sample of the file is shown below.

 "userType": "NORMAL",   "accountID": "J123456789"
 "userType": "NORMAL",   "accountID": "J987654321"
 "userType": "NORMAL",   "accountID": "C123456789"
 "userType": "NORMAL",   "accountID": "R987654321"

I want to get the ID numbers using a regular expression in Python 3.

The regex I am using is ("accountID": ")\w+, and it produces the following results.

"accountID": "J123456789
"accountID": "J987654321
"accountID": "C123456789
"accountID": "R987654321

The desired output should look like this:

J987654321
J987654321
C123456789
R987654321

【Comments】:

  • Use "accountID": "(\w+) with re.findall
  • Out of interest: why does this task need re?
  • If you have "accountID": "J123456789", why would you expect J987654321?
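
The first comment's suggestion can be sketched as follows. This is only a minimal illustration: the variable names (`sample`, `whole_matches`, `account_ids`) are made up for the example, and the data is the sample from the question held in a string rather than read from the file. Moving the parentheses so that they wrap `\w+` makes the ID the captured group, and `re.findall` returns the captured group instead of the whole match when the pattern contains exactly one group.

```python
import re

# Sample data copied from the question; `sample` is an illustrative name.
sample = '''\
 "userType": "NORMAL",   "accountID": "J123456789"
 "userType": "NORMAL",   "accountID": "J987654321"
 "userType": "NORMAL",   "accountID": "C123456789"
 "userType": "NORMAL",   "accountID": "R987654321"
'''

# Pattern from the question: the group wraps the prefix, so the whole
# match (group 0) still contains the `"accountID": "` part.
whole_matches = [m.group(0) for m in re.finditer(r'("accountID": ")\w+', sample)]

# Pattern from the comment: the group wraps the ID itself, and findall
# returns only what the group captured.
account_ids = re.findall(r'"accountID": "(\w+)', sample)

print(whole_matches[0])
print(account_ids)
```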

Tags: python regex python-3.x


【Solution 1】:

IMHO this needs no imports at all:

with open('test.csv') as f:
    for line in f:
        print(line.strip()[-11:-1])

Or, if the account IDs really do vary in length, use:

        print(line.split('"')[-2])

inside the loop.
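
Put together, the split-based variant might look like the sketch below. It assumes that `accountID` is always the last quoted value on a line. The helper name `extract_ids` is hypothetical, and `io.StringIO` stands in for `open('test.csv')` only so that the example runs without a file on disk.

```python
import io

def extract_ids(lines):
    """Return the last double-quoted token of each non-blank line."""
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The line ends with a closing quote, so after splitting on '"'
        # the final quoted value is the second-to-last piece.
        ids.append(line.split('"')[-2])
    return ids

# Stand-in for the real file; the contents are taken from the question.
fake_file = io.StringIO(
    ' "userType": "NORMAL",   "accountID": "J123456789"\n'
    ' "userType": "NORMAL",   "accountID": "C123456789"\n'
)
print(extract_ids(fake_file))
```

With a real file, the same helper could be called as `extract_ids(f)` inside the `with open('test.csv') as f:` block.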

【Comments】:

    【Solution 2】:

    If the file format is fixed, consider detecting the dialect automatically:

    import csv
    
    with open('test.csv') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024))
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)
        accounts = [row[2] for row in reader]
    

    This code produces the following list:

    accounts
    ['J000025574', 'J000025620', 'C000025623', 'R000025624']
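
For the sample shown in the question, a variation on the `csv` approach could look like the sketch below. It is not the answer's code: it swaps `csv.Sniffer` for the default comma dialect and then splits each field into a key and a value. The helper `parse_row` is hypothetical, `io.StringIO` stands in for the file, and the sketch assumes that values never contain commas or colons.

```python
import csv
import io

def parse_row(fields):
    """Turn fields such as ' "key": "value"' into a {key: value} dict."""
    record = {}
    for field in fields:
        key, sep, value = field.partition(':')
        if not sep:
            continue  # not a key/value field
        # Drop surrounding whitespace first, then the double quotes.
        record[key.strip().strip('"')] = value.strip().strip('"')
    return record

# Stand-in for open('test.csv'); the contents are the question's sample.
data = io.StringIO(
    ' "userType": "NORMAL",   "accountID": "J123456789"\n'
    ' "userType": "NORMAL",   "accountID": "J987654321"\n'
    ' "userType": "NORMAL",   "accountID": "C123456789"\n'
    ' "userType": "NORMAL",   "accountID": "R987654321"\n'
)

accounts = []
for row in csv.reader(data):  # default dialect: comma-delimited
    record = parse_row(row)
    if 'accountID' in record:
        accounts.append(record['accountID'])

print(accounts)
```

Looking the value up by key rather than by column position keeps the result stable if the fields are ever reordered.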
    

    【Comments】:

      【Solution 3】:

      You can use the regex (?:"accountID": ")(\S+)" to capture only the ID and ignore the rest:

      import re
      
      s = """"userType": "NORMAL",   "accountID": "J123456789"
       "userType": "NORMAL",   "accountID": "J987654321"
       "userType": "NORMAL",   "accountID": "C123456789"
       "userType": "NORMAL",   "accountID": "R987654321" """
      
      # A raw string avoids the invalid "\S" escape warning that a plain string literal triggers.
      print(re.findall(r'(?:"accountID": ")(\S+)"', s))
      

      Result:

      ['J123456789', 'J987654321', 'C123456789', 'R987654321']
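
A lookbehind is another way to get the same list. It is not used in any of the answers and is shown here only as a hedged alternative: the prefix is asserted but not consumed, so the whole match is the ID and no capture group is needed. Python's `re` requires a lookbehind to have a fixed width, which the literal prefix `"accountID": "` satisfies. The variable names are illustrative.

```python
import re

# Sample data copied from the question.
s = '''\
 "userType": "NORMAL",   "accountID": "J123456789"
 "userType": "NORMAL",   "accountID": "J987654321"
 "userType": "NORMAL",   "accountID": "C123456789"
 "userType": "NORMAL",   "accountID": "R987654321"
'''

# (?<=...) checks that the prefix precedes the match without including it,
# so findall returns the bare IDs as whole matches.
ids = re.findall(r'(?<="accountID": ")\w+', s)
print(ids)
```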
      

      【Comments】:

        【Solution 4】:

        You could write yourself a parser (though that may be overkill):

        from parsimonious.grammar import Grammar
        from parsimonious.nodes import NodeVisitor
        
        text = """
         "userType": "NORMAL",   "accountID": "J123456789"
         "userType": "NORMAL",   "accountID": "J987654321"
         "userType": "NORMAL",   "accountID": "C123456789"
         "userType": "NORMAL",   "accountID": "R987654321"
        """
        
        grammar = Grammar(
            r"""
            file        = entry+
        
            entry       = garbage? (pair)+ newline
            pair        = ws? key equal value comma?
        
            key         = quotes word quotes
            value       = quotes word quotes
            quotes      = '"'
            word        = ~"\w+"
            equal       = ws? ":" ws?
            comma       = ws? "," ws?
        
            ws          = ~"[\t ]+"
            newline     = ~"[\r\n]"
            garbage     = (ws / newline)+
            """
        )
        
        tree = grammar.parse(text)
        
        class Vistor(NodeVisitor):
            def __init__(self, needle):
                self.needle = needle
        
            def generic_visit(self, node, visited_children):
                return visited_children or node
        
            def visit_key(self, node, children):
                _, key, _ = children
                return key
        
            def visit_value(self, node, children):
                _, value, _ = children
                return value
        
            def visit_pair(self, node, children):
                _, key, _, value, _ = children
                return (key, value)
        
            def visit_entry(self, node, children):
                _, entry, _ = children
                return entry
        
            def visit_file(self, node, children):
                out = [value.text
                       for child in children if isinstance(child, list)
                       for key, value in child
                       if key.text == self.needle]
                return out
        
        v = Vistor("accountID")
        out = v.visit(tree)
        print(out)
        

        This yields:

        ['J123456789', 'J987654321', 'C123456789', 'R987654321']
        

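        【Comments】:

        • ...if it is not recommended, then why propose it as an answer?
        • @SpghttCd: I am a bit torn. On the one hand, this is the more accurate way to handle structured information; on the other hand, it seems like overkill for a one-off task.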