Parse items from text file: answers

【Question title】: Parse items from text file
【Posted】: 2010-06-14 19:07:45
【Question】:

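I have a text file that contains data inside {[]} tags. What is the suggested way to parse that data so that I can use just the data inside the tags?

An example of the text file looks like this:

'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'

I would like to end up with 'really', 'way', 'get', 'from' in a list. I guess I could use split to do it, but it seems like there might be a better way. I have seen a ton of parsing libraries; is there one that is a good fit for what I want to do?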

【Question comments】:

Tags: python string text-processing

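【Solution 1】:

I would use a regular expression. This answer assumes that none of the tag characters {}[] appear inside other tag characters, i.e. tags are never nested.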

import re
text = 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'

# .*? is non-greedy, so each match stops at the first closing ]}
for s in re.findall(r'\{\[(.*?)\]\}', text):
    print(s)
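The assumption stated above matters in practice. The following is a minimal sketch of what the same pattern returns when tags are nested; the `nested` string is a made-up input for illustration, not data from the question.

```python
import re

PATTERN = r'\{\[(.*?)\]\}'

# Well-formed input: every tag is returned intact.
flat = 'not {[really]} useful in any {[way]}'
print(re.findall(PATTERN, flat))    # ['really', 'way']

# Hypothetical nested input, which violates the stated assumption:
# the non-greedy match stops at the first closing ']}'.
nested = 'x {[outer {[inner]} rest]} y'
print(re.findall(PATTERN, nested))  # ['outer {[inner']
```

If nesting can occur in the real data, a plain regular expression of this form is probably not enough, and a small hand-written scanner or a parsing library would be a safer choice.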

The same pattern using verbose mode in Python regular expressions (a raw string keeps the backslashes from being treated as string escapes):

re.findall(r'''
    \{   # opening curly brace
    \[   # followed by an opening square bracket
    (    # capture the next pattern
    .*?  # followed by shortest possible sequence of anything
    )    # end of capture
    \]   # followed by closing square bracket
    \}   # followed by a closing curly brace
    ''', text, re.VERBOSE)
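Since the question is about a text file rather than a string literal, here is a rough sketch of applying the same pattern to a file's contents. The function name `extract_tags_from_file` and the UTF-8 default are assumptions made for illustration; they do not come from the answer.

```python
import re

# Compile once; this is the same non-greedy pattern used above.
TAG_RE = re.compile(r'\{\[(.*?)\]\}')

def extract_tags_from_file(path, encoding='utf-8'):
    """Return the contents of every {[...]} tag in the file at `path`.

    Hypothetical helper: reads the whole file into memory, which is
    fine for small files but not for very large ones.
    """
    with open(path, encoding=encoding) as handle:
        return TAG_RE.findall(handle.read())
```

Note that `.` does not match a newline by default, so a tag that is split across two lines would be skipped; passing `re.DOTALL` to `re.compile` would change that if multi-line tags are expected.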

【Comments】:

【Solution 2】:

This is a job for regular expressions:

>>> import re
>>> text = 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'
>>> re.findall(r'\{\[(\w+)\]\}', text)
['really', 'way', 'get', 'from']

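【Comments】:

Wow, that was fast... and perfect. Thanks!
@chris: beware of one thing: \w+ only captures word characters (letters, digits, and underscores) between the delimiters. If your data contains other kinds of characters, those tags will not be extracted.
To elaborate on Bryan's comment with specific cases: hyphens, {[anti-war]}; compound words with spaces, {[New England]}; place or personal names with punctuation and spaces, {[Boston, MA]}, {[George W. Bush]}.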
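A short sketch of the difference the comments describe, comparing the `\w+` pattern from this solution with the non-greedy `.*?` pattern from Solution 1. The sample string is assembled from the cases listed in the comment above.

```python
import re

sample = '{[really]} {[anti-war]} {[New England]} {[Boston, MA]} {[George W. Bush]}'

# \w+ accepts only letters, digits, and underscores between the delimiters,
# so tags containing hyphens, spaces, commas, or periods are skipped.
word_only = re.findall(r'\{\[(\w+)\]\}', sample)
print(word_only)  # ['really']

# .*? accepts any characters (except newlines) up to the first closing ]}.
any_chars = re.findall(r'\{\[(.*?)\]\}', sample)
print(any_chars)  # ['really', 'anti-war', 'New England', 'Boston, MA', 'George W. Bush']
```

Which pattern is appropriate depends on the data: `\w+` is stricter and rejects anything unexpected, while `.*?` is more permissive.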

【Solution 3】:

Slower, bigger, and without regular expressions.

The old-school way :P

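def f(s):
    stack = []    # currently open delimiters, innermost last
    result = []
    tmp = ''      # characters collected for the current tag
    for c in s:
        if c in '{[':
            stack.append(c)
        elif c in ']}':
            stack.pop()
            if c == ']':
                # closing square bracket ends the current tag
                result.append(tmp)
                tmp = ''
        elif stack and stack[-1] == '[':
            # only collect characters while directly inside [ ... ]
            tmp += c
    return result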

>>> s
'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'
>>> f(s)
['really', 'way', 'get', 'from']

【Comments】:

【Solution 4】:

Another way:

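def between_strings(source, start='{[', end=']}'):
    words = []
    while True:
        start_index = source.find(start)
        if start_index == -1:
            break
        # look for the closing marker only after the opening one
        end_index = source.find(end, start_index + len(start))
        if end_index == -1:
            break  # unterminated tag: stop instead of slicing with -1
        words.append(source[start_index+len(start):end_index])
        # continue scanning after the tag just consumed
        source = source[end_index+len(end):]
    return words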


text = "this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it."
assert between_strings(text) == ['really', 'way', 'get', 'from']

【Comments】: