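【Posted】: 2021-06-04 11:30:58
【Question】:
This is a follow-up to my previous post (regex match not working on simple string with Pyteomics parser).
From 20,000 strings I generated a DataFrame (pep_df) of more than 50,000 strings. What I want to do now is parse each str into individual characters and return those lists in a dictionary (pep_dict). I keep getting an error message.
Here is a sample of the DataFrame: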
{'sequence': {0: 'VISYGGCVAQLFIFLALGSTECLLLAVMCFDR',
1: 'TEQGDSAAYLR',
2: 'DSLQVSK',
3: 'GSDALSETSSVSHIEDLEK',
4: 'QTDPQSSSAK',
5: 'IFGFQAGLTSLDCSGSYCLPVPVIPSFSTALYGK',
6: 'YQTEAVEMMDQIVHWVQEDASGLGRPQLQGAPAAEPMAVPMMLLNLVEQLGEADEELAGK',
7: 'QLEFAAQYPPTFDR',
8: 'EWPGDLYNNSVIVQAVR',
9: 'QIWHPNQTCDAAR',
10: 'YEHAFESSQK',
11: 'FISQWCGGLPSTSFSFQ',
12: 'QPSAFIVTQHPLPNTVK',
13: 'EVASNSELVQSSR',
14: 'ISHVSGYNYGIPHSCIR',
15: 'NALQYIHDGSSTR',
16: 'LDGGSGSTSSSGCHPGGAR',
17: 'LSLDQALVK',
18: 'ASAELLR',
19: 'SDSGPYPLTAR',
20: 'MGYFLPDDYK',
21: 'TEIQTLFK',
22: 'VFQPSVPATK',
23: 'ETVPSMETGDLCADTAPTPK',
24: 'GDDCLMK',
25: 'EGHCLAQDVEEQAR',
26: 'FEEITGVINPALDK',
27: 'STSTPTSPGPR',
28: 'ELPLHGR',
29: 'YIIIGDMGVGK',
30: 'TTCCCPSCCVSSCCRPQCCQSVCCQPTCCRPSCCISSCCHPSCCESSCCRPCCCVRPVCGR',
31: 'VTTFEHQYVSAIK',
32: 'VVSHPSGVLELHMK',
33: 'NITFDACLIQMFLIHFFSMMESGILLAMSFDR',
34: 'YPVPEESQEGTFVGNVAQDFLLDTDSLSAR',
35: 'HVVMELK',
36: 'VQWGLVMCFLSYFGTFAVEFR',
37: 'FEGGAEGR',
38: 'NLSALSDWYSVYTSAIAFTVYMNAVWHGWAIPLFLFLAILR',
39: 'TFSYGSSLIQHR',
40: 'TNTTAVGISKPANIHVK',
41: 'VPQLGPR',
42: 'TDDCHPWVLPVVK',
43: 'LCCAGHDR',
44: 'QQYLCQPLLDAVLANIR',
45: 'IYEQLPEVQK',
46: 'LYLTQAAGLEVPPEEMSLELPETHIEEK',
47: 'TIEDFESMNTYLQTSPSSVFTSK',
48: 'AGSVFGEISLLAAGGGNR',
49: 'QTIIGQPMSVTITTK',
50: 'ASDQCLK',
51: 'KPPGELLVSLEELEK',
52: 'ALSQPSSYSPSCTSSK',
53: 'LCPYFFANQEFYSLDSQLPIWGVR',
54: 'QFYEEELINSVVISQLSHIPEDK',
55: 'IEPMLETLENLSSR',
56: 'DALQLEMSLVQAR',
57: 'CHCGEPEHEETPENR',
58: 'SVSNAATR',
59: 'GFSQQEVQFEPELFHNTIVCEKPNNHLNK',
60: 'GAHIMNSTCAAMPK',
61: 'SDLGPSYGGWQVLDATPQER',
62: 'NEDACPVGTVSAAPWGSSSILPISWAYIK',
63: 'VLLEPLRPWACPR',
64: 'DSMTTENGK',
65: 'SSSYADPWTPPR',
66: 'VEDSHQILSQTSHDLNECSWSLNILAINKPQNK',
67: 'FLASVLPACGDLSFQQDQMTQTFGFR',
68: 'LHLQQHVSMEFLK',
69: 'MSNTQAER',
70: 'SALIVHQR',
71: 'TPELHLSGK',
72: 'EAFLSDR',
73: 'LYILFAAPPEK',
74: 'EKPFACTECGK',
75: 'AHCGPAELCEFYSR',
76: 'STDTSCQMAGLR',
77: 'GMLEPVQRPDVVLMGAGYR',
78: 'STLFLIPLFGIHYIIFNFLPDNAGLGIR',
79: 'NASGHTGDR',
80: 'NLTVSVHVSPVEGLCLAGGGGLAQQVLVPAGSARPVAFSVVPTAAAAVSLK',
81: 'DLHFDPSNAVVHVGGVLCVEITMYSQMPVPVHVEQIVVNVHFSIEK',
82: 'ETGLCADFHPSGAVVAVGLNTGR',
83: 'LIQPHVQASNNCWEEAISQVDK',
84: 'HLNSILVLDLR',
85: 'CQEQAQTTDWR',
86: 'HGYMIVGDPMGGK',
87: 'LNVPQVLLPFGR',
88: 'SASACSTPTHTPQDSLTGVGGDVQEAFAQSSR',
89: 'MHFFNVPEPDGHIISPLLAGFYMFWTMIILLQVLIPISLYVSIEIVK',
90: 'LFGPGFANSSWSWVAPEGAGCR',
91: 'GGSAPGPDPSCWFDPNNICGGGLEPGLVFLGFLLVVGMGLAGAFLAHYVR',
92: 'AGYEGDGTLCSEMDPCTGLTPGGCSR',
93: 'FMPLDQWLYFDALDCLPEDGELLPSPEDCALR',
94: 'VCFNLGR',
95: 'NLSPTPASPNQGPPPQVPVSPGPPK',
96: 'SAHALLLPDDPPCHDLGCHPVLTVSWVLGCTLALVVSAFFVLNHLWLWAQACCSHR',
97: 'QGVLAVIDAYNTSNK',
98: 'DLEMFAR',
99: 'GFCMSTLR',
100: 'YGVIYSTPLPEK'}
}
Here is the custom function that does not work:
def ButcherShop(df, target, rule, min_length=7,exception=None,max_legnth=100, pH=2.0):
    raw = df[target]
    string_catcher=re.compile(r'^([A-Z]+)$')
    unique_peptides = set()
    for peptide in raw:
        new_peptides = parser.cleave(peptide, rule=rule,min_length=min_length,exception=exception)
        unique_peptides.update(new_peptides)
    print(f'Done,{len(unique_peptides)} sequences of >= 7 amino acids!')
    pep_dic = [{'sequence': i} for i in unique_peptides]
    pep_df = pd.DataFrame.from_dict(pep_dic)
    for i, row in pep_df.iterrows():
        unique_id = i
        peptides = row['sequence']
        pep_dic['parsed_sequence'] = re.findall(string_catcher,peptides)
        pep_dic['xlength'] = len(peptides)
        pep_dic['charge'] = int(round(electrochem.charge(peptides, pH=pH)))
        pep_dic['mass']=int(round(Peptide_mass(peptides)))
    pep_dic = [peptide for peptide in pep_dic if peptide['length'] <= int(max_length)]
    return unique_peptides,pep_dic, pep_df
I would appreciate any help with this.
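The error text is not included in the post, so the following is a guess at the cause rather than a confirmed diagnosis. In the function above, `pep_dic` is a list of dicts, yet the loop assigns to `pep_dic['parsed_sequence']`, and indexing a list with a string key raises a `TypeError`. Two further mismatches would surface afterwards: the signature spells the parameter `max_legnth` while the body reads `max_length`, and the loop writes the key `'xlength'` while the filter reads `'length'`. The sketch below reproduces only the list-indexing problem with plain Python; the three sequences are taken from the sample data and the value of `max_length` is chosen purely for illustration.

```python
# pep_dic is a list of dicts, built the same way as in the function above.
unique_peptides = {"DSLQVSK", "ASAELLR", "TEIQTLFK"}  # sample values from the post
pep_dic = [{"sequence": s} for s in sorted(unique_peptides)]

# Indexing the list itself with a string key fails.
error_name = None
try:
    pep_dic["length"] = 7
except TypeError as exc:
    error_name = type(exc).__name__
    print(f"{error_name}: {exc}")

# Writing into each row dict works, and the length filter then finds its key.
for row in pep_dic:
    row["length"] = len(row["sequence"])

max_length = 7  # illustrative value, not the default from the post
short_enough = [row for row in pep_dic if row["length"] <= max_length]
print(short_enough)
```

Iterating over the list and updating each `row` keeps the annotations attached to the right peptide, which `pep_df.iterrows()` cannot do because it yields copies of the DataFrame rows rather than the dicts in `pep_dic`.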
Updated solution:
import re

import pandas as pd
from pyteomics import electrochem, parser


def ButcherShop(df, target, rule, min_length=7, exception=None, max_length=100, pH=2.0):
    raw = df[target]
    string_catcher = re.compile(r'^([A-Z]+)$')
    # Cleave every input sequence and collect the unique peptides.
    unique_peptides = set()
    for peptide in raw:
        new_peptides = parser.cleave(peptide, rule=rule, min_length=min_length, exception=exception)
        unique_peptides.update(new_peptides)
    print(f'Done,{len(unique_peptides)} sequences of >= 7 amino acids!')
    pep_dic = [{'sequence': i} for i in unique_peptides]
    # Annotate each row dict in place; pep_dic itself is a list.
    for row in pep_dic:
        peptides = row['sequence']
        row['parsed_sequence'] = re.findall(string_catcher, peptides)
        row['length'] = len(peptides)
        row['charge'] = int(round(electrochem.charge(peptides, pH=pH)))
        # Peptide_Mass is the poster's own helper; it is not defined in this post.
        row['mass'] = int(round(Peptide_Mass(peptides)))
    # Drop peptides longer than max_length, then build the DataFrame.
    pep_dic = [peptide for peptide in pep_dic if peptide['length'] <= int(max_length)]
    pep_df = pd.DataFrame.from_dict(pep_dic)
    return unique_peptides, pep_dic, pep_df
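One detail worth checking against the stated goal of splitting each string into single characters: `re.findall` with the anchored pattern `^([A-Z]+)$` returns a one-element list holding the whole sequence, not a list of letters. A minimal sketch of one alternative, assuming the pattern is kept only as a validity check and `list()` does the splitting; the name `pep_dict` follows the question, and the two sequences come from the sample data.

```python
import re

string_catcher = re.compile(r"^([A-Z]+)$")  # pattern from the post
sequences = ["DSLQVSK", "ASAELLR"]  # sample values from the post

# findall with an anchored single-group pattern yields the whole string once.
whole = string_catcher.findall("DSLQVSK")
print(whole)  # ['DSLQVSK']

# Validate with the pattern, then split into characters with list().
pep_dict = {
    seq: list(seq)
    for seq in sequences
    if string_catcher.fullmatch(seq)
}
print(pep_dict["DSLQVSK"])  # ['D', 'S', 'L', 'Q', 'V', 'S', 'K']
```

Inside the function, the same idea would amount to setting `row['parsed_sequence'] = list(peptides)`; whether that is the intended result depends on how the parsed sequence is used downstream, which the post does not say.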
【Discussion】:
Tags: python regex dataframe loops parsing