Python: Read a file, ignore comments and match text

【Question title】: Python Read a file, ignore comments and match text
【Posted】: 2015-12-22 11:49:52
【Question】:

My directory:

    path = r"C:\Users\alopes\afolder"

Files ending in .proc:

    infile = glob.glob(os.path.join(path, '*.proc'))

Updated code:

    import re
    import os
    import glob
    import numpy as np
    from itertools import dropwhile

    pklist = []
    # regex for packets
    regTel = re.compile(r'[A-Z_]+[.][A-Z0-9_]+')

    # raw string, so the backslashes are not read as escape sequences
    path = r"C:\Users\alopes\afolder\procs"
    infile = glob.glob(os.path.join(path, '*.proc'))
    for j in infile:
        with open(j, "r") as fobj:
            # skip the leading lines that start with ;(C)
            dp = dropwhile(lambda x: x.startswith(";(C)"), fobj)
            for line in dp:
                m = regTel.search(line)
                if m:
                    print(m.group())

I tried to put m into another list. The goal is to collect all the matches from every file in a single list so that I can use them elsewhere:

               for n in m:
                   pklist.append(n)

【Comments】:

Have a look at str.startswith.
Add ; to your expression; something like r';(\s?[\w+._\s?]+)'

Tags: python regex findall

【Solution 1】:

You can use itertools.dropwhile to skip the leading lines that start with ;(C), then search each remaining line:

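import re
from itertools import dropwhile

# raw string, so the backslashes are not read as escape sequences
infile = r"C:\Users\alopes\afolder\doc_name.ext"
with open(infile) as f:
    regTel = re.compile(r'[A-Z_]+[.][A-Z0-9_]+')
    # drop the leading ;(C) lines, then search everything after them
    for line in dropwhile(lambda x: x.lstrip().startswith(";(C)"), f):
        m = regTel.search(line)
        if m:
            print(m.group())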

Output:

HELLO_WORLD.THIS_IS_1_TEST
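
One thing worth checking is that dropwhile only discards the leading run of lines for which the predicate is true. Once a line fails the test, every later line is passed through, including later comment lines. A minimal sketch on in-memory lines (the sample data and the names `sample_lines` and `is_comment` are made up for illustration, not taken from the asker's files):

```python
import re
from itertools import dropwhile

reg_tel = re.compile(r'[A-Z_]+[.][A-Z0-9_]+')

# Hypothetical sample content: two header comments, one data line,
# and one comment that appears after the data.
sample_lines = [
    ";(C) HEADER.COMMENT_1\n",
    ";(C) HEADER.COMMENT_2\n",
    "call HELLO_WORLD.THIS_IS_1_TEST\n",
    ";(C) LATE.COMMENT_3\n",
]

def is_comment(line):
    """True when the line, ignoring leading whitespace, starts with ;(C)."""
    return line.lstrip().startswith(";(C)")

# Only the first two lines are dropped; the late comment survives.
after_header = list(dropwhile(is_comment, sample_lines))
matches = [m.group() for m in map(reg_tel.search, after_header) if m]
print(matches)
```

Here the late comment line still produces a match, so dropwhile is only appropriate when all comments sit at the top of the file.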

If you want to run it over multiple files and get all the matching lines:

import re
from itertools import dropwhile

def yield_matches(fles, ign):
    regTel = re.compile(r'[A-Z_]+[.][A-Z0-9_]+')
    for fl in fles:
        with open(fl) as f:
            # skip the leading lines that start with the ign prefix
            for line in dropwhile(lambda x: x.lstrip().startswith(ign), f):
                m = regTel.search(line)
                if m:
                    yield m.group()

If the comments can appear anywhere, just iterate over the lines and test each one with str.startswith, using fileinput.input to read every file:

import fileinput
import re
def yield_matches(fles,ign):
    regTel = re.compile(r'[A-Z_]+[.][A-Z0-9_]+')
    for line in fileinput.input(fles):
        if not line.lstrip().startswith(ign):
            m = regTel.search(line)
            if m:
                yield m.group()

Just call the function, passing it the list of filenames and the string to hand to startswith:

l = some_list_of_files
for i in yield_matches(l, ";(C)"):
    print(i)
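
To get the single list the question asks for, the matches can be accumulated instead of printed. The following is a sketch under some assumptions: `collect_matches` is a hypothetical helper, it uses findall so that several matches on one line are all kept (search returns only the first), and the demo writes throwaway .proc files to a temporary directory rather than reading the asker's folder.

```python
import glob
import os
import re
import tempfile

reg_tel = re.compile(r'[A-Z_]+[.][A-Z0-9_]+')

def collect_matches(paths, ign=";(C)"):
    """Return every match from every non-comment line of the given files."""
    found = []
    for p in paths:
        with open(p) as f:
            for line in f:
                # Skip comment lines wherever they appear in the file.
                if not line.lstrip().startswith(ign):
                    found.extend(reg_tel.findall(line))
    return found

# Demo on made-up files; the contents are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.proc"), "w") as f:
        f.write(";(C) SKIP.ME\nuse PKT_A.FIELD_1 and PKT_B.FIELD_2\n")
    with open(os.path.join(tmp, "b.proc"), "w") as f:
        f.write("  ;(C) SKIP.TOO\nsend PKT_C.FIELD_3\n")
    # sorted() makes the file order, and so the result order, predictable
    pklist = collect_matches(sorted(glob.glob(os.path.join(tmp, "*.proc"))))

print(pklist)
```

With the generator version, `list(yield_matches(l, ";(C)"))` would build the list in the same way, with at most one match per line.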

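【Discussion】:

No worries. If all the comments are at the start, dropwhile will work; using if line.startswith... will work in either case.
itertools.dropwhile is a good idea.
I got it working with line.strip().startswith(';(C)')
Is there whitespace at the start of the string? If so, line.lstrip().startswith(';(C)') is enough; you only need to strip from the start.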
def skip_cmets(infile):
    with open(infile, "r") as fobj:
        for line in fobj:
            if not line.strip().startswith(';(C)'):
                yield line

for line in skip_cmets(infile):
    packet = regTel.findall(line)
    for i in packet:
        print(i)
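
The whitespace point from the comments above can be checked directly. This small sketch uses a made-up line; the second prefix `"#"` is a hypothetical extra comment marker, included only to show that str.startswith also accepts a tuple of prefixes.

```python
# Made-up comment line with leading whitespace.
line = "   ;(C) Copyright notice\n"

# Without stripping, the leading spaces make the test fail.
raw = line.startswith(";(C)")

# lstrip() removes only leading whitespace, which is all startswith needs.
stripped = line.lstrip().startswith(";(C)")

# startswith accepts a tuple, so several markers can be tested at once.
prefixes = (";(C)", "#")  # "#" is a hypothetical second marker
multi = line.lstrip().startswith(prefixes)

print(raw, stripped, multi)
```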

【Solution 2】:

You can do it with a single regular expression:

re.compile(r'^(?!\s*;[(]C[)]).*?([A-Z_]+[.][A-Z0-9_]+)', re.MULTILINE)

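^ anchors the match to the start of a line (because of re.MULTILINE)
(?!\s*;[(]C[)]) is a negative lookahead: "not followed by optional whitespace and ;(C)"
.*? lazily consumes the characters that follow, up to
your pattern ([A-Z_]+[.][A-Z0-9_]+), wrapped in parentheses to create a group so that findall() returns that value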
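
As a quick check of how this pattern behaves, here is a sketch that runs it with findall over a made-up multi-line string (the sample text is an assumption, not the asker's data). Because the pattern is anchored at the start of a line, it yields at most one match per line, the first one.

```python
import re

pattern = re.compile(
    r'^(?!\s*;[(]C[)]).*?([A-Z_]+[.][A-Z0-9_]+)', re.MULTILINE
)

# Hypothetical file content: comments at the top and in the middle,
# one of them indented, plus a line holding two candidate matches.
text = (
    ";(C) HEADER.COMMENT_1\n"
    "call HELLO_WORLD.THIS_IS_1_TEST\n"
    "   ;(C) LATE.COMMENT_2\n"
    "send PKT_A.FIELD_1 then PKT_B.FIELD_2\n"
)

# findall returns the contents of the single capture group.
matches = pattern.findall(text)
print(matches)
```

Both comment lines are skipped, and only the first of the two names on the last line is returned. If every match on a line is needed, filtering the lines first and calling findall with the plain pattern on each one is a simpler route.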


【Discussion】: