【Question title】: Python regex ignore empty lines
【Posted】: 2020-02-15 06:02:37
【Question】:

I have data with the following structure:

[TimingPoints]
21082,410.958904109589,4,3,1,60,1,0
21082,-250,4,3,1,100,0,0
22725,-142.857142857143,4,3,1,100,0,0
23547,-166.666666666667,4,3,1,100,0,0

24369,-333.333333333335,4,3,1,100,0,0
27657,-200.000000000001,4,3,1,100,0,0
29301,-142.857142857143,4,3,1,100,0,0
30123,-166.666666666667,4,3,1,100,0,0
30945,-250,4,3,1,100,0,0

32588,-166.666666666667,4,3,1,100,0,0
34232,-250,4,3,1,100,0,0
35876,-142.857142857143,4,3,1,100,0,0
36698,-166.666666666667,4,3,1,100,0,0
37520,-250,4,3,1,100,0,0
42451,-142.857142857143,4,3,1,100,0,0


[HitObjects]
256,192,17794,12,0,20876,0:0:0:0:
159,96,21082,6,0,B|204:120|204:120|254:103|254:103|305:130|355:102,1,210
409,27,22725,2,0,P|446:96|405:179,1,171.499994766236
269,284,23547,2,0,B|317:250|324:193|324:193|328:220|350:236,1,146.999995513916

I want to read all lines under [TimingPoints], up to [HitObjects], into a list. Empty lines should be ignored. So the final list should contain:

21082,410.958904109589,4,3,1,60,1,0
21082,-250,4,3,1,100,0,0
22725,-142.857142857143,4,3,1,100,0,0
23547,-166.666666666667,4,3,1,100,0,0
24369,-333.333333333335,4,3,1,100,0,0
27657,-200.000000000001,4,3,1,100,0,0
29301,-142.857142857143,4,3,1,100,0,0
30123,-166.666666666667,4,3,1,100,0,0
30945,-250,4,3,1,100,0,0
32588,-166.666666666667,4,3,1,100,0,0
34232,-250,4,3,1,100,0,0
35876,-142.857142857143,4,3,1,100,0,0
36698,-166.666666666667,4,3,1,100,0,0
37520,-250,4,3,1,100,0,0
42451,-142.857142857143,4,3,1,100,0,0

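I tried the following regex pattern: \[TimingPoints\]((.|\n)*)\[HitObjects] but it does not ignore the empty lines. How can I match the lines to get the result above? Also, how do I load all the matching lines into a list with Python?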

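【Comments】:

  • Is this from a CSV/plain-text file? Perhaps use with open('myfile.csv', 'r') as f: text = f.readlines(). You can then use a conditional to drop every line in the list from HitObjects onward. Or filter with Pandas...
  • @S3DEV It is a plain text file. Do I really have to remove all the empty lines by hand? Isn't there a simple regex one-liner that filters accordingly and stores the lines in a list?
  • Get the text first, then remove the empty lines. An empty line usually shows up as \n\n, which you can replace with \n: text = text.replace('\n\n', '\n')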
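
For reference, the regex route asked about above could be sketched roughly as follows. This is only one possible approach, not something proposed in the thread: the SAMPLE text (including its [General] section) and the helper name timing_points are made up for illustration. The idea is to capture the block between the two headers with re.DOTALL instead of (.|\n)*, then split it into lines and keep the non-empty ones. Filtering line by line also copes with runs of several blank lines, which a single replace('\n\n', '\n') pass does not fully collapse.

```python
import re

# Hypothetical sample mirroring the structure in the question.
SAMPLE = """[General]
Mode: 0

[TimingPoints]
21082,410.958904109589,4,3,1,60,1,0
21082,-250,4,3,1,100,0,0

24369,-333.333333333335,4,3,1,100,0,0


[HitObjects]
256,192,17794,12,0,20876,0:0:0:0:
"""

# re.DOTALL lets "." match newlines, which replaces the (.|\n)* construct;
# the lazy ".*?" stops at the first [HitObjects] header.
SECTION_RE = re.compile(r"\[TimingPoints\](.*?)\[HitObjects\]", re.DOTALL)


def timing_points(text):
    """Return the non-empty lines between the two headers (hypothetical helper)."""
    match = SECTION_RE.search(text)
    if match is None:
        return []
    # Split the captured block and drop lines that are empty after stripping.
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]


print(timing_points(SAMPLE))
```

The regex only isolates the section; the blank-line filtering is still done by the list comprehension, which keeps the pattern short and readable.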

Tags: python regex filtering


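【Solution 1】:

Don't get me wrong, I am a big fan of regex and use it every day. But it is a bit heavy for this task.

1) Read the file into a list, stripping all whitespace (including newlines) and dropping any line that ends up empty
2) Find the index of "[HitObjects]" and trim the list there, along with the headers
3) Done

Example code: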

path = './timing.txt'

with open(path, 'r') as f:
    text = [i.strip() for i in f if i.strip()]

# Keep only rows between the headers of interest.
result = text[text.index('[TimingPoints]')+1:text.index('[HitObjects]')]

Output:

['21082,410.958904109589,4,3,1,60,1,0',
 '21082,-250,4,3,1,100,0,0',
 '22725,-142.857142857143,4,3,1,100,0,0',
 '23547,-166.666666666667,4,3,1,100,0,0',
 '24369,-333.333333333335,4,3,1,100,0,0',
 '27657,-200.000000000001,4,3,1,100,0,0',
 '29301,-142.857142857143,4,3,1,100,0,0',
 '30123,-166.666666666667,4,3,1,100,0,0',
 '30945,-250,4,3,1,100,0,0',
 '32588,-166.666666666667,4,3,1,100,0,0',
 '34232,-250,4,3,1,100,0,0',
 '35876,-142.857142857143,4,3,1,100,0,0',
 '36698,-166.666666666667,4,3,1,100,0,0',
 '37520,-250,4,3,1,100,0,0',
 '42451,-142.857142857143,4,3,1,100,0,0']

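【Discussion】:

  • But doesn't this assume there is no data before it? Keep in mind that there may be other data ahead of [TimingPoints]. So do we also need to work out the index of that line before extracting the data?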
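
As a small illustration of the point raised in this comment: slicing from the index of the start header already skips anything that comes before it. The sketch below wraps the same idea in a helper; the name rows_between and the [General] section in the input are hypothetical, added only to show data preceding [TimingPoints].

```python
def rows_between(lines, start='[TimingPoints]', end='[HitObjects]'):
    """Return non-empty rows between two header lines (hypothetical helper)."""
    cleaned = [line.strip() for line in lines if line.strip()]
    begin = cleaned.index(start) + 1   # raises ValueError if the header is missing
    stop = cleaned.index(end, begin)   # search only after the start header
    return cleaned[begin:stop]


# Hypothetical input with an unrelated section before [TimingPoints].
lines = [
    '[General]\n',
    'Mode: 0\n',
    '\n',
    '[TimingPoints]\n',
    '21082,-250,4,3,1,100,0,0\n',
    '\n',
    '22725,-142.857142857143,4,3,1,100,0,0\n',
    '[HitObjects]\n',
    '256,192,17794,12,0,20876,0:0:0:0:\n',
]
print(rows_between(lines))
```

Passing begin as the second argument to list.index means the end header is only looked for after the start header, so a stray earlier occurrence would not produce an empty slice.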
【Solution 2】:

I'm not the biggest regex fan you've ever met. Here is a simple way to do it without regex:

#!/usr/local/cpython-3.1/bin/python3

# Works on CPython 3.1 through 3.9

"""Keep lines between [TimingPoints] and [HitObjects], eliding empty lines."""


def main():
    """Read input and write actual-output."""
    display = False
    with open('input', 'r') as infile, open('actual-output', 'w') as outfile:
        for line in infile:
            line_sans_n = line.rstrip('\n')
            if not line_sans_n:
                # Skip blank lines
                continue
            if line_sans_n == '[HitObjects]':
                # End of the section of interest: stop copying lines
                display = False
            if display:
                print(line_sans_n, file=outfile)
            if line_sans_n == '[TimingPoints]':
                # Start copying from the line after this header
                display = True


if __name__ == '__main__':
    main()

HTH
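
Since the question asks for the rows in a list rather than in a file, the same flag-based loop could also be sketched as a generator. This is a variation for illustration, not the answer's own code: the name lines_between and the in-memory sample are assumptions, and strip() is used instead of rstrip('\n') so that whitespace-only lines are skipped too.

```python
import io


def lines_between(lines, start='[TimingPoints]', end='[HitObjects]'):
    """Yield non-blank lines after `start` and before `end` (hypothetical helper)."""
    inside = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue          # skip blank lines
        if line == end:
            inside = False    # leaving the section
        elif line == start:
            inside = True     # entering the section
        elif inside:
            yield line


# Hypothetical in-memory stand-in for the input file.
sample = io.StringIO(
    "[TimingPoints]\n"
    "21082,-250,4,3,1,100,0,0\n"
    "\n"
    "22725,-142.857142857143,4,3,1,100,0,0\n"
    "\n\n"
    "[HitObjects]\n"
    "256,192,17794,12,0,20876,0:0:0:0:\n"
)
result = list(lines_between(sample))
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the StringIO sample.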
