【Title】: How to extract characters and numbers from every line of a file?
【Posted】: 2014-07-23 13:32:34
【Question】:

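I am trying to extract the first character, the second-position number, and the third character from every entry on each line of a file, and store them in three variables called FirstChar, SecondNum, and ThirdChar.

Input file (MultiPointMutation.txt):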

P1T,C11F,E13T
L7A
E2W

Expected output:

FirstChar="PCELE"
SecondNum="1 11 13 7 2"
ThirdChar="TFTAW"

My code:

 import re 
 import itertools
 ns=map(lambda x:x.strip(),open('MultiplePointMutation.txt','r').readlines())#reading  file
 for line in ns:
         second="".join(re.findall(r'\d+',line))#extract second position numbers
         print second # print second nums
         char="".join(re.findall(r'[a-zA-Z]',line))#Extract all characters
         c=str(char.rstrip())
         First=0
         Third=1
         for index in range(len(c)):
                 if index==First:
                         FC=c[index]#here i got all first characters
                         print FC
                         First=First+2
                 if index==Third:
                         TC=c[index]
                         print TC
                         Third=Third+2#here i got all third characters

Output: here I get FirstChar and ThirdChar exactly right.

FirstChar:
          P
          C
          E
          L
          E
ThirdChar:
          T
          F
          T
          A
          W

But the problem is with getting SecondNum:

           SecondNum:
           11113
           7
           2

I want to extract the numbers like this:

          1
          11
          13
          7
          2

Note: I do not just want to print them one by one. I want to be able to read the SecondNum values one at a time for later use.

【Comments】:

    Tags: python regex string file-io extraction


    【Solution 1】:

    For secondNum, you can simply change the line:

    second="".join(re.findall(r'\d+',line))#extract second position numbers

    to:

    second="\n".join(re.findall(r'\d+',line))#extract second position numbers
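
Since the goal is to read the numbers one at a time later rather than print them, one option is to skip the join altogether, because `re.findall` already returns a list. A minimal sketch in Python 3 syntax (the question's code is Python 2), using the sample lines from the question as an in-memory list; the name `second_nums` is made up for illustration:

```python
import re

# Why the original output was "11113": joining with "" glues the matches together.
digits = re.findall(r"\d+", "P1T,C11F,E13T")
print("".join(digits))  # 11113
print(digits)           # ['1', '11', '13']

# Sample lines from the question, held in memory instead of read from the file.
lines = ["P1T,C11F,E13T", "L7A", "E2W"]

second_nums = []
for line in lines:
    # Keep each run of digits as its own list element, converted to int.
    second_nums.extend(int(n) for n in re.findall(r"\d+", line))

print(second_nums)  # [1, 11, 13, 7, 2]

# The values can now be consumed one at a time.
for n in second_nums:
    print(n)
```

Converting to `int` is optional; drop it if the numbers are only needed as strings.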
    

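    However, I don't think your first and third characters work correctly. Judging by the output you want to get, you should have something like this: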

     import re

     x = """P1T,C11F,E13T
     L7A
     E2W"""

     secondNum = []
     firstChar = []
     thirdChar = []
     for line in x.split('\n'):

          # every run of digits on the line
          secondNum.extend(re.findall(r'\d+', line))

          firstChar.extend(re.findall(r'(?:^|,)([a-zA-Z])', line))
          # re.findall returns a list of the captured letters, which is added to
          # the firstChar list. The regex searches for the start of the string (^)
          # or a comma (,) inside a non-capturing group (it starts with (?:), so
          # that part is not included in the returned result, and then captures
          # one letter [a-zA-Z] after the comma or the start: the first character.

          thirdChar.extend(re.findall(r'(?:^\w\d+|,\w\d+)([a-zA-Z])', line))
          # The third character works much the same way: the non-capturing group
          # matches the start of the string or a comma, followed by one character
          # and at least one digit (\d+). The letter right after that number is
          # the third character, which is again the captured group.

     print("firstChar=\"" + str(firstChar) + "\"")
     print("secondNum=\"" + str(secondNum) + "\"")
     print("thirdChar=\"" + str(thirdChar) + "\"")
    

    Note, however, that your "third character" is the third character in L7A (where you want the A), but it is the fourth character in P1TQ (where you want the Q).
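
Because the position of the trailing letter shifts with the number of characters before it, an alternative is to match each entry as a whole and let capture groups do the splitting. The following is a sketch of that idea, not the code from this answer. It assumes every entry has the shape letter, digits, letter, and the function name `parse_mutations` is hypothetical. It is written in Python 3.

```python
import re

# One entry = letter, one or more digits, letter (e.g. "C11F").
MUTATION = re.compile(r"([A-Za-z])(\d+)([A-Za-z])")

def parse_mutations(lines):
    """Return three parallel lists: first letters, numbers, last letters."""
    first, nums, third = [], [], []
    for line in lines:
        # With three groups, findall yields one (letter, digits, letter) tuple per entry.
        for a, n, b in MUTATION.findall(line):
            first.append(a)
            nums.append(int(n))
            third.append(b)
    return first, nums, third

lines = ["P1T,C11F,E13T", "L7A", "E2W"]
first_chars, second_nums, third_chars = parse_mutations(lines)

print('FirstChar="%s"' % "".join(first_chars))             # FirstChar="PCELE"
print('SecondNum="%s"' % " ".join(map(str, second_nums)))  # SecondNum="1 11 13 7 2"
print('ThirdChar="%s"' % "".join(third_chars))             # ThirdChar="TFTAW"
```

To read from the question's input file instead, an open file object can be passed as `lines`, since iterating over a file yields one line at a time.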

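    【Discussion】:

    • Actually I already did that: "printing the SecondNum variable with newlines only prints it, but I want to read the SecondNum values one after another for later use". I can already read the FC and TC values one by one, but not the second one.
    • Thanks for the quick reply, gaw, and for the helpful information. There is a small correction in the input.
    • I edited the code to build arrays of the elements you want, so you can process the elements one by one.
    • OK, I will check and let you know.
    • Could you explain the logic and the regular expressions you use to extract firstChar and thirdChar?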
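
Regarding the last comment, the two patterns from the answer can be spelled out piece by piece with `re.VERBOSE`, which allows whitespace and comments inside a pattern. This is only an annotated restatement of the answer's regexes in Python 3; the names `FIRST_CHAR` and `THIRD_CHAR` are made up here.

```python
import re

FIRST_CHAR = re.compile(r"""
    (?:^|,)      # non-capturing: start of the line or a comma
    ([a-zA-Z])   # captured: the single letter that follows
""", re.VERBOSE)

THIRD_CHAR = re.compile(r"""
    (?:^\w\d+|,\w\d+)  # non-capturing: start or comma, one word character, then digits
    ([a-zA-Z])         # captured: the letter right after the digits
""", re.VERBOSE)

line = "P1T,C11F,E13T"
print(FIRST_CHAR.findall(line))  # ['P', 'C', 'E']
print(THIRD_CHAR.findall(line))  # ['T', 'F', 'T']
```

When a pattern contains exactly one capturing group, `findall` returns only what that group matched, which is why the comma and the digits never show up in the results.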