Pandas Dataframe: How to parse integers into string of 0s and 1s? - Answers

【Question title】: Pandas Dataframe: How to parse integers into string of 0s and 1s?
【Posted】: 2016-11-29 00:11:23
【Question description】:

I have the following pandas DataFrame.

import pandas as pd
df = pd.read_csv('filename.csv')

print(df)

      sample      column_A         
0     sample1        6/6    
1     sample2        0/4
2     sample3        2/6    
3     sample4       12/14   
4     sample5       15/21   
5     sample6       12/12   
..    ....

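The values in column_A are not fractions. The data must be processed so that each value becomes a string of 0s and 1s (this is not a conversion of the integers into their binary equivalents).

The "numerator" above gives the total number of 1s, while the "denominator" gives the total number of 0s and 1s combined.

So the table should actually be in the following format: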

      sample      column_A         
0     sample1     111111    
1     sample2     0000
2     sample3     110000    
3     sample4     11111111111100    
4     sample5     111111111111111000000 
5     sample6     111111111111  
..    ....

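I have never parsed an integer like this to output a string of 0s and 1s. How can this be done? Is there a "pandas method" to use with lambda expressions? Pythonic string parsing, or a regex?

【Comments on the question】:

I would say string parsing, something like a,b = map(int, field.split('/')); result = '1'*a + '0'*(b-a).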
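The string-parsing idea in the comment above can be sketched as a small standalone function. The name `fraction_to_bits` and the range check are assumptions added for illustration; the original comment only gives the two expressions.

```python
def fraction_to_bits(field):
    """Turn 'a/b' into a string of a ones followed by (b - a) zeros."""
    ones, total = map(int, field.split('/'))
    # Assumption: a numerator outside 0..denominator is treated as invalid input.
    if not 0 <= ones <= total:
        raise ValueError(f"expected 0 <= numerator <= denominator, got {field!r}")
    return '1' * ones + '0' * (total - ones)


print(fraction_to_bits('2/6'))  # 110000
print(fraction_to_bits('0/4'))  # 0000
```

Without the check, a value such as '7/6' would silently produce seven 1s, because multiplying a string by a negative number yields an empty string.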

Tags: python regex parsing pandas

【Solution 1】:

First, suppose you write a function:

def to_binary(s):
    n_d = s.split('/')
    n, d = int(n_d[0]), int(n_d[1])
    return '1' * n + '0' * (d - n)

so that

>>> to_binary('4/5')
'11110'

Now you just need to use pandas.Series.apply:

 df.column_A.apply(to_binary)
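Note that apply() returns a new Series and leaves the DataFrame unchanged, so the result has to be assigned back if the column should be replaced. A minimal end-to-end sketch, assuming a small DataFrame built inline in place of the asker's 'filename.csv':

```python
import pandas as pd


def to_binary(s):
    # 'n/d' -> n ones followed by (d - n) zeros
    n, d = (int(part) for part in s.split('/'))
    return '1' * n + '0' * (d - n)


# Assumption: inline sample data standing in for the CSV file.
df = pd.DataFrame({
    'sample': ['sample1', 'sample2', 'sample3'],
    'column_A': ['6/6', '0/4', '2/6'],
})

# Assign the returned Series back to overwrite the column.
df['column_A'] = df['column_A'].apply(to_binary)
print(df)
```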

【Discussion】:

【Solution 2】:

Another option:

df2 = df['column_A'].str.split('/', expand=True).astype(int)\
                    .assign(ones='1').assign(zeros='0')

df2
Out: 
    0   1 ones zeros
0   6   6    1     0
1   0   4    1     0
2   2   6    1     0
3  12  14    1     0
4  15  21    1     0
5  12  12    1     0

(df2[0] * df2['ones']).str.cat((df2[1]-df2[0])*df2['zeros'])
Out: 
0                   111111
1                     0000
2                   110000
3           11111111111100
4    111111111111111000000
5             111111111111
dtype: object

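Note: I was actually trying to find a faster alternative, thinking that apply would be slow, but this turned out to be slower.

【Discussion】:

I like this solution, but @AmiTavory had a nice answer before this one. I thought it might also be faster, but I haven't checked. I wish I could accept both answers!
@ShanZhengYang Thank you, but you marked this one as correct. I think you meant to mark Ami Tavory's answer (which would be my choice as well).
This is a very interesting question, and I like both answers. Here is my attempt at a one-liner: df.column_A.str.extract(r'(?P<one>\d+)/(?P<len>\d+)', expand=True).astype(int).apply(lambda x: ['1'] * x.one + ['0'] * (x.len-x.one), axis=1).apply(''.join) - it will be slower, I just wanted to have a one-liner... ;)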

【Solution 3】:

Here are some alternative solutions using the .str.extract() and .str.repeat() methods:

In [187]: x = df.column_A.str.extract(r'(?P<ones>\d+)/(?P<len>\d+)', expand=True).astype(int).assign(o='1', z='0')

In [188]: x
Out[188]:
   ones  len  o  z
0     6    6  1  0
1     0    4  1  0
2     2    6  1  0
3    12   14  1  0
4    15   21  1  0
5    12   12  1  0

In [189]: x.o.str.repeat(x.ones) + x.z.str.repeat(x.len-x.ones)
Out[189]:
0                   111111
1                     0000
2                   110000
3           11111111111100
4    111111111111111000000
5             111111111111
dtype: object
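The same idea can be written without the helper columns o and z. This is a sketch under assumptions: the DataFrame is built inline, the second group is named total instead of len, and the names parts, one and zero are made up for illustration.

```python
import pandas as pd

# Assumption: inline sample data standing in for the asker's DataFrame.
df = pd.DataFrame({'column_A': ['6/6', '0/4', '2/6', '12/14']})

# Named groups become the column names of the extracted frame.
parts = (df['column_A']
         .str.extract(r'(?P<ones>\d+)/(?P<total>\d+)', expand=True)
         .astype(int))

# Constant Series of '1' and '0' sharing the index of parts.
one = pd.Series('1', index=parts.index)
zero = pd.Series('0', index=parts.index)

# str.repeat accepts a sequence of per-row repeat counts.
result = one.str.repeat(parts['ones']) + zero.str.repeat(parts['total'] - parts['ones'])
print(result.tolist())
```

Naming the group total rather than len avoids any confusion with the built-in len() when reading the attribute-style access used above.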

Or a slow one-liner (two apply() calls):

In [190]: %paste
(df.column_A.str.extract(r'(?P<one>\d+)/(?P<len>\d+)', expand=True)
   .astype(int)
   .apply(lambda x: ['1'] * x.one + ['0'] * (x.len-x.one), axis=1)
   .apply(''.join)
)
## -- End pasted text --
Out[190]:
0                   111111
1                     0000
2                   110000
3           11111111111100
4    111111111111111000000
5             111111111111
dtype: object

【Discussion】: