Matching and unmatched lines from two large data sets (answers)

[Question title]: Matching and unmatched lines from two large data sets
[Posted]: 2013-03-15 08:46:29
[Question]:

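I'm using Python 2.4 and I'm fairly new to Python, both to general programming and to regular expressions. I have a large module that currently outputs lines as two separate streams (or data sets/files), Stream A and Stream B. I'm trying to compare Stream A against Stream B to see whether any string from Stream B can be matched in any line of Stream A. I want to return everything that matches and everything that doesn't match as two separate objects. My problem is described below. Does anyone know how I can overcome it, or have best-practice suggestions?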

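So far I've used this code to turn Stream B ("realtimes") into a list ("regexes") and that list into a group of regular expressions ("combined").

Note that I haven't included all of the module's code, only the part I'm stuck on: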

regex = re.compile(r'.*\[(\d{2}:\d{2}:\d{2}\.\d{6})\].*')
optsymbx = re.compile(r'\[(\d{2}:\d{2}:\d{2}\.\d{6})\][\s]+(trade),(S|B),(\d{1,}),(\w+)[\s]+([0-9A-Z]+),(\d+\.\d+)')
regexes = []

def realtimes():
    for x in realtrades():
        x = str(x)
        m = re.match(regex,x)
        if m:
            #regexes.append(str(m.groups()))
            yield str(m.groups())

#make contents of realtimes into group of regular expressions     
f = open(logfile,'r')
for x in realtimes():
    regexes.append(x)
combined = "(" + ")|(".join(regexes) + ")"

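I then go through Stream A (the lines in f) and check each line against "combined" and one additional regular expression criterion ("optsymbx") to see whether there is a match, like this:

# checking if any lines in the logfile match "optsymbx" and any regular expressions within "combined"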
f = open(logfile,'r')
for line in f:
    m = re.match(combined,line)
    mopt = re.match(optsymbx,line)
    if not m:
        if mopt:
            print line

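The problem is that Streams A and B are very large. Stream A contains more than 100,000 lines and Stream B has several thousand. So when I turn the contents of Stream B into a group of regular expressions ("combined"), it exceeds the limit of 100 named groups and I get the error below. I've also tested this and know that it works when I reduce the contents of Stream B to fewer than 100 named groups.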

Traceback (most recent call last):
  File "badtrades.py", line 121, in ?
    m = re.match(combined,line)
  File "/usr/lib64/python2.4/sre.py", line 129, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib64/python2.4/sre.py", line 225, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib64/python2.4/sre_compile.py", line 506, in compile
    raise AssertionError(
AssertionError: sorry, but this version only supports 100 named groups

Sample data from combined (from Stream B):

    ["('09:50:31.458370',)", **"('09:50:31.458370',)"**, "('09:50:48.343785',)", "('09:50:48.449219',)", "('09:50:48.449219',)", "('09:50:48.449219',)", "('09:50:48.449219',)", "('09:51:01.986971',)", "('09:51:01.986971',)", "('09:51:01.986971',)", "('09:51:34.543147',)", "('09:52:14.688349',)", "('09:52:14.688349',)", "('09:52:14.688349',)", "('09:52:14.688349',)", "('09:52:19.700134',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:54:39.295261',)", "('09:54:39.295261',)", "('09:54:44.883143',)", "('09:54:44.883143',)", "('09:54:44.883143',)", "('09:54:44.883143',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:19.767099',)", "('09:55:26.750094',)", "('09:55:26.750094',)", "('09:55:29.195194',)", "('09:55:29.195194',)", "('09:55:29.195194',)", "('09:55:29.195194',)", "('09:55:29.195194',)", "('09:55:29.722747',)", "('09:56:38.809658',)", "('09:56:38.809658',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:58:37.573746',)", "('09:58:37.573746',)", "('09:58:37.573746',)", "('09:59:02.185210',)", "('09:59:09.245981',)", "('09:59:33.619633',)", "('09:59:33.619633',)", "('09:59:33.619633',)", "('09:59:33.619633',)"]

Sample data from the log file (Stream A):

[09:49:52.515951] T,AAPL  130518C00450000,1,32.05
[09:49:53.568816] T,AAPL  130328P00455000,30,1.09
[09:49:53.811441] trade,S,2,AAPL  130328C00470000,4.75
[09:49:53.811447] trade,B,95,AAPL,468.69
--
[09:50:31.241441] T,AAPL  130328P00430000,3,0.08
[09:50:31.385327] T,AAPL  130328P00455000,5,1.10
[09:50:31.385911] T,AAPL  130328P00455000,5,1.10
[09:50:31.458370] trade,B,2,AAPL  130328C00475000,2.80
[09:50:31.458373] trade,S,68,AAPL,468.46
--
[09:50:48.339322] T,AAPL  130328C00485000,8,0.92
[09:50:48.339341] T,AAPL  130328C00485000,1,0.92
[09:50:48.339357] T,AAPL  130328C00485000,9,0.92
[09:50:48.343785] trade,B,2,AAPL  130328C00465000,7.05
[09:50:48.343789] trade,S,118,AAPL,468.19

A match would be:

data A:  [09:50:31.458370] trade,B,2,AAPL  130328C00475000,2.80
data B:  [09:50:31.458370]

A non-match would be:

data A:  [09:49:53.811441] trade,S,2,AAPL  130328C00470000,4.75
data B:  #there is no timestamp from B which matches A

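[Comments on the question]:

Could you show some sample data and indicate what should match? Another note: maybe you should use a database for this.
Just a suggestion: Python 2.4 is very old (2004!); have you considered upgrading to a newer version?
Can you make any guarantees about the data? Is it sorted, formatted, etc.?
Janne, I've provided some sample data from each file. Also, I'm not very familiar with databases. I'm quite new to programming, but my job requires me to learn Python.
A. Rodas, I asked my employer to upgrade their Python version, but at the moment the operating system on our servers only supports 2.4.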

Tags: python regex stream match python-2.4

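[Solution 1]:

Forget the regular expressions. Read through each stream carefully, advancing through both and sorting lines into matches and non-matches as you go. This relies on both streams being in timestamp order, as the sample data appears to be. Some pseudocode: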

Using Streams A and B, and Objects Match and NoMatch
FETCH next Line of A into LineA
FETCH next Line of B into LineB
WHILE Neither stream is at end of file:
    WHILE TimeStamp in LineA < TimeStamp in LineB:
        ADD LineA to NoMatch
        FETCH next Line of A into LineA
    WHILE TimeStamp in LineA = TimeStamp in LineB:
        ADD (LineA, LineB) to Match
        FETCH next Line of A into LineA
    FETCH next Line of B into LineB
WHILE A is not at End of File
    ADD LineA to NoMatch
    FETCH next Line of A into LineA
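
As a rough illustration, the pseudocode above could be written in Python along the following lines. This is a sketch, not a tested port to Python 2.4. The names `TS`, `timestamp`, `fetch` and `merge_streams` are made up for the example. It assumes every line carries a timestamp in the `HH:MM:SS.ffffff` form seen in the sample data, that all timestamps fall on the same day (so comparing them as strings gives time order), and that both streams are already sorted.

```python
import re

# Assumed timestamp format, taken from the sample data: HH:MM:SS.ffffff
TS = re.compile(r'\d{2}:\d{2}:\d{2}\.\d{6}')

def timestamp(line):
    """Return the first timestamp found in the line, or None."""
    m = TS.search(line)
    if m:
        return m.group(0)
    return None

def fetch(it):
    """Return the next item of the iterator, or None once it is exhausted."""
    for item in it:
        return item
    return None

def merge_streams(lines_a, lines_b):
    """Walk two time-sorted streams once; return (match, no_match) for A."""
    match, no_match = [], []
    iter_a, iter_b = iter(lines_a), iter(lines_b)
    line_a, line_b = fetch(iter_a), fetch(iter_b)
    while line_a is not None and line_b is not None:
        ts_a, ts_b = timestamp(line_a), timestamp(line_b)
        if ts_a < ts_b:
            # A is behind B: no B line can match this A line any more.
            no_match.append(line_a)
            line_a = fetch(iter_a)
        elif ts_a == ts_b:
            match.append((line_a, line_b))
            line_a = fetch(iter_a)
        else:
            # A is ahead of B: move on to the next B line.
            line_b = fetch(iter_b)
    # B ran out first: whatever is left in A has nothing to match.
    while line_a is not None:
        no_match.append(line_a)
        line_a = fetch(iter_a)
    return match, no_match
```

Either argument can be any iterable of lines, such as an open file or a generator. Like the pseudocode, it pairs each A line only with the first B line that carries the same timestamp.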

This handles duplicates in A, but not duplicates in B. To handle duplicates in B, you have to keep a memory of past lines:

Using Streams A and B, and Objects Match, NoMatch and Temp
FETCH next Line of A into LineA
FETCH next Line of B into LineB
WHILE Neither stream is at end of file:
    CLEAR Temp
    WHILE TimeStamp in LineA < TimeStamp in LineB:
        ADD LineA to NoMatch
        FETCH next Line of A into LineA
    WHILE TimeStamp in LineA = TimeStamp in LineB:
        ADD (LineA, LineB) to Match
        ADD LineA to Temp
        SET Temp Timestamp to LineA Timestamp
        FETCH next Line of A into LineA
    FETCH next Line of B into LineB
    WHILE TimeStamp in LineB = Temp TimeStamp:
        FOR Line IN Temp:
            ADD (Line, LineB) TO Match
        FETCH next Line of B into LineB 
WHILE A is not at End of File
    ADD LineA to NoMatch
    FETCH next Line of A into LineA

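Edit: I was vague about how to detect EOF. Let's assume that reading past EOF returns an empty string (as it does in Python). The implementation then looks more like this: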

Using Streams A and B, and Objects Match, NoMatch and Temp
FETCH next Line of A into LineA
FETCH next Line of B into LineB
WHILE Neither LineA nor LineB is an Empty String:
    CLEAR Temp
    WHILE LineA is not an Empty String, AND TimeStamp in LineA < TimeStamp in LineB:
        ADD LineA to NoMatch
        FETCH next Line of A into LineA
    WHILE LineA is not an Empty String, AND TimeStamp in LineA = TimeStamp in LineB:
        ADD (LineA, LineB) to Match
        ADD LineA to Temp
        SET Temp Timestamp to LineA Timestamp
        FETCH next Line of A into LineA
    FETCH next Line of B into LineB
    WHILE LineB is not an Empty String, AND TimeStamp in LineB = Temp TimeStamp:
        FOR Line IN Temp:
            ADD (Line, LineB) TO Match
        FETCH next Line of B into LineB 

//At this point, either LineA is empty (meaning there are no more lines to match),
//or LineB is empty (meaning there are no more matches to find). If the first is true,
//this loop will be skipped. Otherwise, this loop will put what's left of A into the
//NoMatch object.
WHILE  LineA is not an Empty String:
    ADD LineA to NoMatch
    FETCH next Line of A into LineA
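
One possible Python rendering of this final pseudocode is sketched below. It is an illustration under assumptions, not a drop-in replacement for the asker's module. `TS`, `timestamp`, `fetch` and `match_streams` are hypothetical names. Every line is assumed to contain a timestamp like `09:50:31.458370`, and both streams are assumed to be sorted and from the same day, so string comparison follows time order. The `fetch` helper returns an empty string at the end of a stream, mirroring the EOF convention described above.

```python
import re

# Assumed timestamp format, taken from the sample data: HH:MM:SS.ffffff
TS = re.compile(r'\d{2}:\d{2}:\d{2}\.\d{6}')

def timestamp(line):
    """Return the first timestamp found in the line, or None."""
    m = TS.search(line)
    if m:
        return m.group(0)
    return None

def fetch(it):
    """Return the next line, or '' at end of stream (like file.readline())."""
    for item in it:
        return item
    return ''

def match_streams(lines_a, lines_b):
    """Return (match, no_match); duplicates in A and in B are both paired up."""
    match, no_match = [], []
    iter_a, iter_b = iter(lines_a), iter(lines_b)
    line_a, line_b = fetch(iter_a), fetch(iter_b)
    while line_a != '' and line_b != '':
        temp = []       # A lines that matched the current B timestamp
        temp_ts = None  # timestamp shared by the lines held in temp
        ts_b = timestamp(line_b)
        while line_a != '' and timestamp(line_a) < ts_b:
            no_match.append(line_a)
            line_a = fetch(iter_a)
        while line_a != '' and timestamp(line_a) == ts_b:
            match.append((line_a, line_b))
            temp.append(line_a)
            temp_ts = ts_b
            line_a = fetch(iter_a)
        line_b = fetch(iter_b)
        # Further B lines with the same timestamp match the remembered A lines.
        while line_b != '' and timestamp(line_b) == temp_ts:
            for line in temp:
                match.append((line, line_b))
            line_b = fetch(iter_b)
    # B is exhausted: everything left in A is unmatched.
    while line_a != '':
        no_match.append(line_a)
        line_a = fetch(iter_a)
    return match, no_match
```

Because the result pairs every A line with every B line sharing its timestamp, heavily duplicated timestamps in both streams will produce many pairs. If only the A lines matter, appending just the A line to the match list would be a simpler variant.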

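[Comments]:

Thank you very much, Frankie! I'll have to try your solution tomorrow and will give you feedback. Thanks again.
If this works for you, please accept the answer. One caveat, though: this implementation doesn't handle blank lines in the file well.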
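
One way to work around the blank-line caveat is to filter each stream before comparing, so that only lines with a timestamp reach the comparison. The sketch below is an assumption about how that might look. `TS` and `timestamped_lines` are hypothetical names, and the timestamp pattern is taken from the sample data. It also drops the `--` separator lines that appear in the Stream A sample.

```python
import re

# Assumed timestamp format, taken from the sample data: HH:MM:SS.ffffff
TS = re.compile(r'\d{2}:\d{2}:\d{2}\.\d{6}')

def timestamped_lines(lines):
    """Yield only the lines that contain a timestamp, without trailing newline."""
    for line in lines:
        line = line.rstrip('\n')
        if TS.search(line):
            yield line
```

Since it is a generator, it reads the underlying file lazily and can wrap an open file directly, for example `timestamped_lines(open(logfile, 'r'))`.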