Python regex compile for multiple string: answers

【Question title】: Python regex compile for multiple string
【Posted】: 2017-02-23 17:29:44
【Question description】:

I am trying to match a pattern against the data below.

-----------------------------------------------
| COLUMN_NAME          | DATA_TYPE            |
-----------------------------------------------
| C460                 | VARCHAR2             |
| C459                 | CLOB                 |
| C458                 | VARCHAR2             |
| C8                   | BLOB                 |
| C60901               | INT                  |

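I was able to create a pattern that selects the COLUMN_NAME values whose data type is CLOB or BLOB, but I also want the COLUMN_NAME values whose data type is INT. In this case I should get C459, C8, C60901.

With the code below I only get C60901. I used |, which is just OR, but I want the COLUMN_NAME values for CLOB and BLOB as well as INT: C459, C8, C60901.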

#current code
COl_Re=re.compile('(?m)(C\d+ )(?=.+ NUMBER | [C]LOB)')
columns=COl_Re.findall(proc.stdout.read())

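I looked at the regex documentation for help, but I could not find a reliable answer.

【Comments on the question】:

I really don't understand the question. Could you proofread it again?

Tags: python regex python-3.x

【Solution 1】:

I assume you only want the values in COLUMN_NAME where DATA_TYPE is CLOB or INT. This will give you the list: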

>>> text="""-----------------------------------------------
| COLUMN_NAME          | DATA_TYPE            |
-----------------------------------------------
| C460                 | VARCHAR2             |
| C459                 | CLOB                 |
| C458                 | VARCHAR2             |
| C8                   | BLOB                 |
| C60901               | INT                  |"""
>>> import re
>>> re.findall(r"\| (\S+)\s*\| (?:CLOB|INT).*", text)
['C459', 'C60901']

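This works for me on Python 3.5.2.

【Comments】:

Your code outputs 0. But my code COl_Re=re.compile('(?m)(C\d+ )(?=.+[I]NT| [CB]LOB)') only gives me the INT column, not the BLOB or CLOB results.
Yes, if the data type meets my requirement. In this case I only need the column names for CLOB and INT.
Sorry, I can't reproduce your problem. I verified my code again and updated my answer. Which Python version are you using?
I am using Python 3.6.
On Python 2.7, re.findall(re.compile("\|\s*(\S+)\s*\|\s*(?:[CB]LOB|INT).*"),text) matches ['C459', 'C8', 'C60901'].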
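
A note on why the lookahead pattern quoted in the comments returns only the INT row: `|` has the lowest precedence in a regular expression, so in `(?=.+[I]NT| [CB]LOB)` the `.+` belongs only to the first alternative, and ` [CB]LOB` would have to appear immediately after the column name. The sketch below assumes the table from the question is held in a string; the names `text`, `ungrouped` and `grouped` are illustrative and not from the thread.

```python
import re

text = """-----------------------------------------------
| COLUMN_NAME          | DATA_TYPE            |
-----------------------------------------------
| C460                 | VARCHAR2             |
| C459                 | CLOB                 |
| C458                 | VARCHAR2             |
| C8                   | BLOB                 |
| C60901               | INT                  |"""

# Pattern from the comment: ".+" applies only to the "[I]NT" alternative.
ungrouped = re.compile(r'(?m)(C\d+ )(?=.+[I]NT| [CB]LOB)')

# Grouping the alternatives lets ".+" precede every data type.
grouped = re.compile(r'(C\d+) (?=.+(?:INT|[CB]LOB))')

print(ungrouped.findall(text))  # ['C60901 ']
print(grouped.findall(text))    # ['C459', 'C8', 'C60901']
```

The trailing space in the first result comes from the space inside the capturing group of the original pattern; the grouped version moves it outside the group.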

【Solution 2】:

I really like that Python's re module has a VERBOSE option. The code should be self-explanatory (checked under 3.6):

import re

data = """
-----------------------------------------------
| COLUMN_NAME          | DATA_TYPE            |
-----------------------------------------------
| C460                 | VARCHAR2             |
| C459                 | CLOB                 |
| C458                 | VARCHAR2             |
| C8                   | BLOB                 |
| C60901               | INT                  |
"""

pattern = r"""
(C\d+)             # Match a capital C followed by at least one digit
(?:\s*\|\s)        # Non-capturing group: optional whitespace, a pipe, one whitespace character
(?=INT|CLOB|BLOB)  # Positive lookahead: the data type must start with INT, CLOB or BLOB
"""
match_column = re.compile(pattern, re.VERBOSE)
columns = match_column.findall(data)  # findall already returns a list
print(columns)

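This should give you ['C459', 'C8', 'C60901'], which is what you are after. Once you have understood it, you can write: r'(C\d+)(?:.*(?:INT|CLOB|BLOB))'. However, there is something to be said for being verbose and matching specifically (the whitespace and the pipe character), because overusing . often leads to regular expressions that match things beyond my wildest dreams.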
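
To illustrate the point about `.`, here is a minimal sketch comparing the two patterns from this answer. The row with type POINT is hypothetical and not part of the question's data; it is there only because POINT contains the letters INT. The names `rows`, `loose` and `strict` are likewise made up for the example.

```python
import re

# Hypothetical rows: POINT is not a wanted type, but it contains "INT".
rows = """
| C7                   | POINT                |
| C8                   | BLOB                 |
"""

# Short form: ".*" skips over anything before INT, CLOB or BLOB.
loose = re.compile(r'(C\d+)(?:.*(?:INT|CLOB|BLOB))')

# Explicit form: the type must follow the pipe and one whitespace directly.
strict = re.compile(r'(C\d+)\s*\|\s(?=INT|CLOB|BLOB)')

print(loose.findall(rows))   # ['C7', 'C8']
print(strict.findall(rows))  # ['C8']
```

Even the explicit form accepts any type that merely starts with INT, such as INTERVAL; writing the lookahead as `(?=(?:INT|CLOB|BLOB)\b)` would be one way to tighten it.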

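You really shouldn't do any of the above, though! The great hacker Jamie Zawinski once said:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

If you are able to process the input line by line, I would do it like this: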

result = []
interesting_columns = ('INT', 'CLOB', 'BLOB')
for line in data.splitlines():  # iterating over data directly would yield single characters
    fields = line.split()
    if any(col in fields for col in interesting_columns):
        result.append(fields[1])
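
The same line-by-line idea can be sketched as a small function that splits each row on the pipe character and compares the DATA_TYPE cell exactly. This is only one possible variant: the function name `columns_of_type` and its `wanted` parameter are hypothetical, and it assumes every data row has exactly two cells between the border pipes.

```python
def columns_of_type(table_text, wanted=("INT", "CLOB", "BLOB")):
    """Return column names whose DATA_TYPE cell is exactly one of `wanted`."""
    result = []
    for line in table_text.splitlines():
        # Split on the pipe and drop the pieces outside the border pipes.
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        # Separator lines and blank lines produce no cells and are skipped.
        if len(cells) == 2 and cells[1] in wanted:
            result.append(cells[0])
    return result


data = """
-----------------------------------------------
| COLUMN_NAME          | DATA_TYPE            |
-----------------------------------------------
| C460                 | VARCHAR2             |
| C459                 | CLOB                 |
| C458                 | VARCHAR2             |
| C8                   | BLOB                 |
| C60901               | INT                  |
"""

print(columns_of_type(data))  # ['C459', 'C8', 'C60901']
```

Comparing whole cells avoids partial matches on type names, at the cost of depending on the exact two-column layout.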

【Comments】: