【Question title】: I need to filter or remove some lines from a file
【Posted】: 2019-04-19 18:07:23
【Question】:

This is the input file; its structure is already correct:

Name:  mr. Apple
class:  class 1
sub:  subject 1
ContactNo: 11111
Name:  mr. ball
class:  class  2
sub:  subject  2
ContactNo: 2222
Name:  mr. cat
class:  class 3
sub:  subject 3
ContactNo: 33333
class:  class 4
sub:  subject 4
ContactNo:44444
class:  class 5
sub:  subject 5
ContactNo: 55555
Name:  mr. tom
class:  class 9
sub:  subject 9
ContactNo: 99999

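As you can see, some records have no name.

For example: class: class 4, sub: subject 4, ContactNo:44444

I need to remove those and keep only the details of the people who have a name.

Expected output: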

Name:  mr. Apple
class:  class 1
sub:  subject 1
ContactNo: 11111
Name:  mr. ball
class:  class  2
sub:  subject  2
ContactNo: 2222
Name:  mr. cat
class:  class 3
sub:  subject 3
ContactNo: 33333
Name:  mr. tom
class:  class 9
sub:  subject 9
ContactNo: 99999

I tried:

errors = []                       # The list where we will store results.
linenum = 0
substr = "Name:".lower()          # Substring to search for.
substr1 = "class:".lower()
substr2 = "sub:".lower()
substr3 = "ContactNo:".lower()

with open ('scrap.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append(line)
        elif  line.lower().find(substr1) != -1:        
            errors.append(line)
        elif  line.lower().find(substr2) != -1:     
            errors.append(line)
        elif  line.lower().find(substr3) != -1:      
            errors.append(line)

for err in errors:
    fp = open("rawextract.txt","a")
    fp.write(err)
    fp.close()
    print(err)

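But I don't know how to discard the incomplete records.

【Comments】:

  • Is the order Name, class, sub, ContactNo guaranteed, or can partial sequences such as Name, sub, Name, ContactNo, ... also occur?
  • @PatrickArtner The order is always the same.

Tags: python python-3.x text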


【Solution 1】:

You can use re.findall with a regex pattern that matches the expected sequence of headers of a correctly structured record:

import re
with open('scrap.txt') as myfile:
    for m in re.findall('Name:.*\nclass:.*\nsub:.*\nContactNo:.*', myfile.read()):
        print(m)

This outputs:

Name:  mr. Apple
class:  class 1
sub:  subject 1
ContactNo: 11111
Name:  mr. ball
class:  class  2
sub:  subject  2
ContactNo: 2222
Name:  mr. cat
class:  class 3
sub:  subject 3
ContactNo: 33333
Name:  mr. tom
class:  class 9
sub:  subject 9
ContactNo: 99999
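
The pattern above matches the keys anywhere in a line. A slightly stricter variant is sketched below, assuming every key starts at the beginning of its line; it anchors each key with `^` under `re.MULTILINE`. The names `SAMPLE`, `RECORD`, and `find_complete_records` are illustrative and not part of the original answer, and the sample text is a shortened copy of the question's input.

```python
import re

# Shortened copy of the question's input, including one record without a Name: line.
SAMPLE = """Name:  mr. cat
class:  class 3
sub:  subject 3
ContactNo: 33333
class:  class 4
sub:  subject 4
ContactNo:44444
Name:  mr. tom
class:  class 9
sub:  subject 9
ContactNo: 99999
"""

# With re.MULTILINE, ^ matches at the start of every line and $ at its end,
# so each key must begin its own line.
RECORD = re.compile(r"^Name:.*\n^class:.*\n^sub:.*\n^ContactNo:.*$", re.MULTILINE)


def find_complete_records(text):
    """Return every four-line block that begins with a Name: line."""
    return RECORD.findall(text)


records = find_complete_records(SAMPLE)
print("\n".join(records))
```

To produce a file instead of printing, the joined string could be written with an ordinary `open(..., "w")` call.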

【Comments】:

    【Solution 2】:

    You can create an endless iterable

    ['name:', 'class:', 'sub:', 'contactno:', 'name:', 'class:', ...]
    

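    using itertools.cycle

    and then check whether each line contains the next expected value; if it does, write the line to the result, otherwise skip it:

    Create the data file: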

    with open("f.txt","w") as f:
        f.write("""
    Name:  mr. Apple
    class:  class 1
    sub:  subject 1
    ContactNo: 11111
    Name:  mr. ball
    class:  class  2
    sub:  subject  2
    ContactNo: 2222
    Name:  mr. cat
    class:  class 3
    sub:  subject 3
    ContactNo: 33333
    Name:  mr. tom
    class:  class 9
    sub:  subject 9
    ContactNo: 99999
    """)
    

    Program:

    from itertools import cycle
    order = ["name:","class:","sub:","contactno:"]
    t = cycle(order)
    
    nxt = next(t) # name: 
    with open("f.txt") as f, open("mod.txt","w") as writer:
        for line in f:
            if nxt in line.lower():
                writer.write(line)
                nxt = next(t)       # advance to the next thing to be read
    
    print(open("mod.txt").read())
    

    Output:

    Name:  mr. Apple
    class:  class 1
    sub:  subject 1
    ContactNo: 11111
    Name:  mr. ball
    class:  class  2
    sub:  subject  2
    ContactNo: 2222
    Name:  mr. cat
    class:  class 3
    sub:  subject 3
    ContactNo: 33333
    Name:  mr. tom
    class:  class 9
    sub:  subject 9
    ContactNo: 99999
    

    This will fail if a record that starts out valid is missing one of the parts it should contain:

    Name:  mr. tom    # taken
    class:  class 9   # taken
    sub:  subject 9   # taken, no contact number follows
    Name:  mr. tom    # skipped
    class:  class 9   # skipped
    sub:  subject 9   # skipped
    ContactNo: 0000   # then this will be taken
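
The failure case above can be reproduced in memory. This is a minimal sketch of the same cycle-based filter applied to a list of lines instead of a file; `filter_in_order` and `broken` are illustrative names, not part of the original answer.

```python
from itertools import cycle


def filter_in_order(lines, order=("name:", "class:", "sub:", "contactno:")):
    """Keep a line only when it contains the next expected key."""
    expected = cycle(order)
    nxt = next(expected)
    kept = []
    for line in lines:
        if nxt in line.lower():
            kept.append(line)
            nxt = next(expected)   # advance to the next expected key
    return kept


# First record has no ContactNo: line, second record is complete.
broken = [
    "Name:  mr. tom",
    "class:  class 9",
    "sub:  subject 9",
    "Name:  mr. tom",
    "class:  class 9",
    "sub:  subject 9",
    "ContactNo: 0000",
]

kept = filter_in_order(broken)
print("\n".join(kept))
```

The result combines the first three lines of the incomplete record with the contact number of the second one, which is the mix-up described above.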
    

    You can make it more robust with:

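    from itertools import cycle

    order = ["name:", "class:", "sub:", "contactno:"]
    starter = order[0]      # the key that begins every record
    t = cycle(order)
    nxt = next(t)           # name:

    with open("f.txt") as f, open("mod.txt", "w") as writer:
        for line in f:
            if nxt in line.lower():
                writer.write(line)
                nxt = next(t)       # advance to the next thing to be read
            elif starter in line.lower():
                # A new record begins before the previous one was complete.
                # Lines of the incomplete record already written stay in mod.txt.
                print("Incomplete set - beginning next one")
                while True:         # spin the cycle forward until it is back at "name:"
                    nxt = next(t)
                    if nxt == starter:
                        break
                nxt = next(t)       # "class:" is expected after this Name: line
                writer.write(line)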
    

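    If a record is incomplete and the loop is now standing on a new Name: ... line, this starts over from that line, so the new record is not missed...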
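
If the lines of an incomplete record should not reach the output at all, one option is to buffer each record and emit it only once all four keys have been seen in order. The following is a minimal sketch under the assumption that every key starts its line; `ORDER`, `buffered_records`, and `sample` are illustrative names that do not come from either answer.

```python
ORDER = ("name:", "class:", "sub:", "contactno:")


def buffered_records(lines):
    """Yield lists of four lines whose keys appear exactly in ORDER."""
    buffer = []
    for line in lines:
        key = line.strip().lower()
        if key.startswith(ORDER[0]):
            buffer = [line]          # a Name: line always starts a fresh record
        elif buffer and key.startswith(ORDER[len(buffer)]):
            buffer.append(line)      # the next expected key, keep collecting
        else:
            buffer = []              # nameless or out-of-order line, drop the partial record
        if len(buffer) == len(ORDER):
            yield buffer             # record is complete, hand it out
            buffer = []


sample = [
    "Name:  mr. tom",     # incomplete record, no ContactNo: follows
    "class:  class 9",
    "sub:  subject 9",
    "Name:  mr. tom",     # complete record
    "class:  class 9",
    "sub:  subject 9",
    "ContactNo: 0000",
    "class:  class 4",    # record without a Name: line
    "sub:  subject 4",
    "ContactNo:44444",
]

result = list(buffered_records(sample))
for record in result:
    print("\n".join(record))
```

Because nothing is written until a record is complete, both the record that lacks a contact number and the record that lacks a name are discarded as a whole.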

    【Comments】:
