[Posted]: 2020-01-15 04:16:08
[Question]:
Pandas has a very fast and convenient string method, extract(). It works perfectly with regular expressions:
strict_pattern = r"^(?P<pre_spacer>ACGAG)(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT)"
test_df
R1
21 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG
22 ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG
23 ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT
24 ACGAGAATAACGTTTGGTGGAGTCTACCAC
25 ACGAGGGGAATAAATATTGGAGTCTCCTCC
26 ACGAGATTGGGTATGCTGGAGTCTCTGTTC
27 ACGAGGTACCCGCGCCATGGAGTCTCTCTG
28 ACGAGTGGTTTTTGTCGTGGAGTCTCACCA
29 ACGAGACGTGTCCACCATGGAGTCTTGTCT
test_df.R1.str.extract(strict_pattern)
pre_spacer UMI post_spacer
21 ACGAG TTTTCGTATTTT TGGAGTCT
22 ACGAG TAGGGAGGGGGG TGGAGTCT
23 ACGAG GGGGGGGAGGC TGGAGTCT
24 ACGAG AATAACGTTTGG TGGAGTCT
25 ACGAG GGGAATAAATAT TGGAGTCT
26 ACGAG ATTGGGTATGC TGGAGTCT
27 ACGAG GTACCCGCGCCA TGGAGTCT
28 ACGAG TGGTTTTTGTCG TGGAGTCT
29 ACGAG ACGTGTCCACCA TGGAGTCT
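However, since it uses re rather than the regex package (if I remember correctly), it does not support regular expressions that allow mismatches, such as this one: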
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
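This regular expression allows one substitution in each of the pre_spacer and post_spacer sequences.
As the following example shows, the regex package accepts this kind of pattern: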
import regex  # third-party package, not the standard-library re module

seq = 'ACGAGCGCCCACCCGCCTGGAGTCTACCAACGGTAACAGCTG'
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
m = regex.match(lax_pattern, seq)
m.groupdict()
{'pre_spacer': 'ACGAG', 'UMI': 'CGCCCACCCGCC', 'post_spacer': 'TGGAGTCT'}
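For fixed-length spacers like these, the effect of {s<=1} can also be approximated without a regex engine by counting mismatches directly. The following is only a minimal sketch under that assumption: the names hamming and extract_lax are made up for illustration, and the choice of trying the longest UMI first is my guess at mirroring the greedy .{9,13}, so on ambiguous reads it may not pick the same split as the regex package.

```python
from typing import Optional, Tuple

PRE_SPACER = "ACGAG"
POST_SPACER = "TGGAGTCT"


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def extract_lax(seq: str, max_subs: int = 1) -> Optional[Tuple[str, str, str]]:
    """Return (pre_spacer, UMI, post_spacer), or None if no split fits.

    Hypothetical helper: each spacer may differ from its reference by at
    most max_subs substitutions; insertions and deletions are not handled.
    """
    pre = seq[:len(PRE_SPACER)]
    if len(pre) < len(PRE_SPACER) or hamming(pre, PRE_SPACER) > max_subs:
        return None
    # Try UMI lengths 13 down to 9 (assumption: longest first, like a greedy quantifier).
    for umi_len in range(13, 8, -1):
        start = len(PRE_SPACER) + umi_len
        post = seq[start:start + len(POST_SPACER)]
        if len(post) == len(POST_SPACER) and hamming(post, POST_SPACER) <= max_subs:
            return pre, seq[len(PRE_SPACER):start], post
    return None


result = extract_lax("ACGAGCGCCCACCCGCCTGGAGTCTACCAACGGTAACAGCTG")
```

Because this only slices strings and compares characters, it could be applied to a column with a plain list comprehension; whether it is actually faster than the regex package on real data would need to be measured.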
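I would like to make extract() compatible with this kind of regular expression, or to find any fast workaround.
I have already done the following, but it is 12 times slower than extract(), and I work with very large DataFrames.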
import numpy as np
import regex

def extract_regex(pattern, seq):
    # Return the three named groups, or NaNs when the read does not match.
    m = regex.match(pattern, seq)
    if m is None:
        return [np.nan] * 3
    return list(m.groupdict().values())

test_df["pre_spacer"], test_df["UMI"], test_df["post_spacer"] = zip(
    *test_df.apply(lambda row: extract_regex(lax_pattern, row.R1), axis=1)
)
test_df
R1 pre_spacer UMI post_spacer
21 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG ACGAG TTTTCGTATTTT TGGAGTCT
22 ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG ACGAG TAGGGAGGGGGG TGGAGTCT
23 ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT ACGAG GGGGGGGAGGC TGGAGTCT
24 ACGAGAATAACGTTTGGTGGAGTCTACCAC ACGAG AATAACGTTTGG TGGAGTCT
25 ACGAGGGGAATAAATATTGGAGTCTCCTCC ACGAG GGGAATAAATAT TGGAGTCT
26 ACGAGATTGGGTATGCTGGAGTCTCTGTTC ACGAG ATTGGGTATGC TGGAGTCT
27 ACGAGGTACCCGCGCCATGGAGTCTCTCTG ACGAG GTACCCGCGCCA TGGAGTCT
28 ACGAGTGGTTTTTGTCGTGGAGTCTCACCA ACGAG TGGTTTTTGTCG TGGAGTCT
29 ACGAGACGTGTCCACCATGGAGTCTTGTCT ACGAG ACGTGTCCACCA TGGAGTCT
Any ideas on how to adapt the pandas extract() method, or how to get the same functionality at a similar speed?
Thanks in advance!
Paul.
[Comments]:
- In fact, .str access is not always faster than a plain for loop. I bet something like pd.DataFrame([extract_regex(lax_pattern,row.R1) for row in df.T]) would be faster than .str.extract().
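The loop-based idea in this comment could look roughly like the sketch below. To keep it runnable without third-party packages it uses the standard-library re module and the strict pattern as a stand-in; the assumption is that swapping in regex.compile(lax_pattern) would give the fuzzy behaviour. The helper name extract_rows and the third read are made up for illustration.

```python
import re

# Compile once, outside the loop, so the pattern is not re-parsed per read.
strict_pattern = re.compile(
    r"^(?P<pre_spacer>ACGAG)(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT)"
)
GROUPS = ("pre_spacer", "UMI", "post_spacer")


def extract_rows(seqs):
    """Return one (pre_spacer, UMI, post_spacer) tuple per read.

    Reads that do not match yield a tuple of None values.
    """
    rows = []
    for seq in seqs:
        m = strict_pattern.match(seq)
        rows.append(m.group(*GROUPS) if m else (None,) * len(GROUPS))
    return rows


reads = [
    "ACGAGTTTTCGTATTTTTGGAGTCTTGTGG",
    "ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT",
    "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT",  # made-up read with no spacers
]
rows = extract_rows(reads)
# Assumed next step with pandas, not run here:
# pd.DataFrame(rows, columns=GROUPS, index=test_df.index)
```

Iterating over the column values directly (for example test_df.R1) avoids building a row object for every read, which is part of what makes apply(..., axis=1) slow. Any speed-up should still be timed on the real data.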
Tags: regex python-3.x pandas extract fuzzy-search