Extracting user ID numbers from a .csv file: answers

【Question title】: Extract user id numbers from a .csv file
【Posted】: 2019-04-08 08:22:45
【Question description】:

I have a csv file containing user information. A sample of the file is shown below.

 "userType": "NORMAL",   "accountID": "J123456789"
 "userType": "NORMAL",   "accountID": "J987654321"
 "userType": "NORMAL",   "accountID": "C123456789"
 "userType": "NORMAL",   "accountID": "R987654321"

I want to get the ID numbers using a regular expression in Python 3.

The regular expression I used is ("accountID": ")\w+, and it produces the following result.

"accountID": "J123456789
"accountID": "J987654321
"accountID": "C123456789
"accountID": "R987654321

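The desired output should look like this:

【Question comments】:

Use "accountID": "(\w+) together with re.findall
Out of interest: why does this task need re?
If you have "accountID": "J123456789", how can you expect J987654321?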
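
The first comment's suggestion can be sketched as follows. This is a minimal illustration, assuming the input looks like the sample in the question; the variable names `sample`, `prefix_grouped` and `ids` are made up for the example. It shows why the position of the parentheses matters when the pattern is passed to `re.findall`.

```python
import re

# Two sample lines copied from the question.
sample = '''
 "userType": "NORMAL",   "accountID": "J123456789"
 "userType": "NORMAL",   "accountID": "J987654321"
'''

# The question's pattern puts the parentheses around the prefix. When a
# pattern has exactly one group, re.findall returns that group, so this
# yields the prefix rather than the ID.
prefix_grouped = re.findall(r'("accountID": ")\w+', sample)

# Moving the parentheses around \w+ makes findall return only the IDs.
ids = re.findall(r'"accountID": "(\w+)', sample)

print(prefix_grouped)
print(ids)
```
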

Tags: python regex python-3.x

【Solution 1】:

IMHO, this does not need any imports at all:

with open('test.csv') as f:
    for line in f:
        print(line.strip()[-11:-1])

Or, if the account IDs actually vary in length, use:

        print(line.split('"')[-2])

inside the loop.
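
To see the `split` variant in isolation, here is a small sketch that runs on in-memory lines instead of a file. The helper name `account_id` and the short ID `C1` are hypothetical, added only to show that the approach does not depend on a fixed ID length; it does assume the ID is the last quoted value on each line.

```python
def account_id(line):
    """Return the text between the last pair of double quotes on the line."""
    # Splitting on '"' leaves the last quoted value as the second-to-last piece;
    # the final piece is whatever follows the closing quote (here the newline).
    return line.split('"')[-2]


lines = [
    ' "userType": "NORMAL",   "accountID": "J123456789"\n',
    ' "userType": "NORMAL",   "accountID": "C1"\n',  # hypothetical shorter ID
]

ids = [account_id(line) for line in lines]
print(ids)
```
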

【Comments】:

【Solution 2】:

If the file format is consistent, consider detecting the dialect automatically:

import csv

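with open('test.csv') as csvfile:
    # Let the Sniffer guess the dialect from the first 1024 characters.
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    # Rewind so the reader starts at the beginning of the file.
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # Keep the third column of every row.
    accounts = [row[2] for row in reader]

This code produces the following list: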

accounts
['J000025574', 'J000025620', 'C000025623', 'R000025624']
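
Which column holds the ID depends on the dialect the Sniffer settles on. As an alternative that avoids sniffing, here is a hedged sketch with an explicitly configured reader. It is not the answer's method: it assumes every line looks exactly like the sample in the question (comma-separated `"key": "value"` fields with no commas or colons inside the values), and `parse_row` is a hypothetical helper.

```python
import csv
import io

# Two sample lines copied from the question, held in memory instead of a file.
sample = (
    ' "userType": "NORMAL",   "accountID": "J123456789"\n'
    ' "userType": "NORMAL",   "accountID": "R987654321"\n'
)


def parse_row(row):
    """Turn fields of the form '"key": "value"' into a {key: value} dict."""
    record = {}
    for field in row:
        key, _, value = field.partition(':')
        # Drop surrounding whitespace first, then the surrounding double quotes.
        record[key.strip().strip('"')] = value.strip().strip('"')
    return record


# QUOTE_NONE treats double quotes as ordinary characters, so the reader
# splits on commas only and leaves the quotes for parse_row to remove.
reader = csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONE)
accounts = [parse_row(row)['accountID'] for row in reader if row]
print(accounts)
```

Looking the ID up by key rather than by column index keeps the code working if the order of the fields changes.
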

【Comments】:

【Solution 3】:

You can use the following regular expression, "(?:\"accountID\": \")(\S+)\", to capture only the ID and ignore the rest:

import re

s = """"userType": "NORMAL",   "accountID": "J123456789"
 "userType": "NORMAL",   "accountID": "J987654321"
 "userType": "NORMAL",   "accountID": "C123456789"
 "userType": "NORMAL",   "accountID": "R987654321" """

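# A raw, single-quoted string avoids the invalid "\S" escape warning
# and removes the need to backslash-escape the double quotes.
print(re.findall(r'(?:"accountID": ")(\S+)"', s))

Result: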

['J123456789', 'J987654321', 'C123456789', 'R987654321']

【Comments】:

【Solution 4】:

You could write yourself a parser (though it is probably overkill):

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

text = """
 "userType": "NORMAL",   "accountID": "J123456789"
 "userType": "NORMAL",   "accountID": "J987654321"
 "userType": "NORMAL",   "accountID": "C123456789"
 "userType": "NORMAL",   "accountID": "R987654321"
"""

grammar = Grammar(
    r"""
    file        = entry+

    entry       = garbage? (pair)+ newline
    pair        = ws? key equal value comma?

    key         = quotes word quotes
    value       = quotes word quotes
    quotes      = '"'
    word        = ~"\w+"
    equal       = ws? ":" ws?
    comma       = ws? "," ws?

    ws          = ~"[\t ]+"
    newline     = ~"[\r\n]"
    garbage     = (ws / newline)+
    """
)

tree = grammar.parse(text)

class Visitor(NodeVisitor):
    def __init__(self, needle):
        # The key whose values should be collected, e.g. "accountID".
        self.needle = needle

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_key(self, node, children):
        # Drop the surrounding quotes and keep the word node.
        _, key, _ = children
        return key

    def visit_value(self, node, children):
        _, value, _ = children
        return value

    def visit_pair(self, node, children):
        _, key, _, value, _ = children
        return (key, value)

    def visit_entry(self, node, children):
        # Keep only the list of (key, value) pairs of the line.
        _, entry, _ = children
        return entry

    def visit_file(self, node, children):
        out = [value.text
               for child in children if isinstance(child, list)
               for key, value in child
               if key.text == self.needle]
        return out

v = Visitor("accountID")
out = v.visit(tree)
print(out)

This yields

['J123456789', 'J987654321', 'C123456789', 'R987654321']

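【Comments】:

...if it is not recommended, why propose it as an answer?
@SpghttCd: I am a bit torn. On the one hand, this is the more accurate way to handle structured information; on the other hand, it seems like overkill for a one-off task.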
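
A possible middle ground between a one-line regex and a full grammar, offered only as a sketch: each sample line happens to be valid as the body of a JSON object, so it can be wrapped in braces and handed to the standard `json` module. That the whole file follows this shape is an assumption the question does not confirm, and `parse_line` is a hypothetical helper name.

```python
import json

# Sample lines copied from the question.
lines = [
    ' "userType": "NORMAL",   "accountID": "J123456789"\n',
    ' "userType": "NORMAL",   "accountID": "C123456789"\n',
    '\n',  # blank lines are skipped below
]


def parse_line(line):
    """Parse one line as the body of a JSON object and return a dict."""
    # Assumption: the line is a comma-separated list of "key": "value" pairs.
    return json.loads('{' + line + '}')


ids = [parse_line(line)['accountID'] for line in lines if line.strip()]
print(ids)
```

A line that does not fit this shape raises `json.JSONDecodeError`, which makes malformed input visible instead of silently skipping it.
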