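【Posted】: 2018-10-31 15:25:20
【Question】:
I have data in a file named text.txt, shown below, along with the code I have tried. I want to extract the count that follows each sequence line and build a table from those values. I would also like to know whether there is a better way to do this. Thanks.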
text.txt:
Counting********************File: bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
73764
Counting********************File: bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
78640
Counting********************File: bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
26267
The result I want:
   File Name                                         Seq_132582_1  Seq_483974_49238
0  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001    0             73764
1  bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001    0             78640
2  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq  0             26267
The code I tried:
import sys
if sys.version_info[0] < 3:
    raise Exception("Python 3 or a more recent version is required.")
import re
import pandas as pd
text = open("text.txt",'r').read()
print(type(text))
results = re.findall(r'(bbduk_trimmed.*.fastq)\nSeq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n(\d)\nSeq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n(\d*)',text)
df=pd.DataFrame(results)
# df.columns=['FileName','Seq_132582_1','Seq_483974_49238'] #This doesn't work
print(df)
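A note on why the attempt above probably returns too few rows: the pattern requires `.fastq` in the file name, which only the third header has, and `(\d)` captures exactly one digit. Depending on the whitespace after the trailing colons, `re.findall` may even return an empty list, in which case assigning three column names to the empty DataFrame fails with a length mismatch. One possible alternative is to walk the file line by line instead of matching the whole record with a single regex. The sketch below is only that, a sketch: the names `parse_counts` and `SAMPLE` are made up for illustration, and it assumes every header line looks like `Counting***...File: <name>` and every count sits on the line directly after its sequence line.

```python
import re

# Hypothetical sample mirroring the layout of text.txt shown in the question.
SAMPLE = """\
Counting********************File: bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
73764
Counting********************File: bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
78640
Counting********************File: bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
26267
"""


def parse_counts(text):
    """Return one dict per 'File:' header, mapping sequence IDs to counts."""
    rows = []
    current = None   # dict for the file currently being read
    pending = None   # sequence ID whose count is expected on the next line
    for raw in text.splitlines():
        line = raw.strip()           # tolerate stray spaces around each line
        if not line:
            continue
        header = re.match(r"Counting\*+File:\s*(.+)", line)
        if header:
            # A new file section starts: open a fresh row.
            current = {"File Name": header.group(1)}
            rows.append(current)
            pending = None
        elif current is not None and pending is not None and line.isdigit():
            # The count that belongs to the sequence line just seen.
            current[pending] = int(line)
            pending = None
        elif current is not None and ":" in line:
            # Sequence line: keep only the ID before the first colon.
            pending = line.split(":", 1)[0]
    return rows


rows = parse_counts(SAMPLE)
for row in rows:
    print(row)
```

Since each row is a plain dict with the keys `File Name`, `Seq_132582_1` and `Seq_483974_49238`, passing the list to `pd.DataFrame(rows)` should give a table with those column names without setting `df.columns` by hand. To read the real file, something like `parse_counts(open("text.txt").read())` would replace the `SAMPLE` string.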
【Comments】:
Tags: python text-parsing