How to count the number of times a item/value from a particular column is repeated in another/other column of a pandas dataframe?

[Question title]: How to count the number of times a item/value from a particular column is repeated in another/other column of a pandas dataframe?
[Posted]: 2017-07-13 05:24:32
[Question description]:

I have pandas data like the following:

MA1      MA2          MA3          Sp3              Sp4      Sp6              F1_x  F1_y
TgT,TgT  TgT,TgT      TgT,TgT,TgT  TgT,TgC          TgT,CgC  TgT,TgC,CgT,CgC  CgC   TgT
CgT      CgT,CgT,CgT  CgT,CgT      CgT,CgC,GgT,GgC  CgT,GgC  GgT,GgC,CgT      GgC   CgT
TgC      TgG,TgC      TgC          TgC,CgG          CgG,CgG  TgG,TgC          CgG   TgC

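Question 01:

For each row, I want to read the string values in F1_x and F1_y and count how many times each of them occurs in every other column.
The F1_x count is written first, then the F1_y count, separated by a pipe (|).

Output: the first row would be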

MA1  MA2  MA3  Sp3  Sp4  Sp6  F1_x  F1_y
0|2  0|2  0|3  0|1  1|1  1|1  CgC   TgT

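Question 02: I also want to create another dataframe in which the counts are totalled separately over the M-type columns and the S-type columns.

Output: the first row would be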

        like_M      like_S
        x   y       x   y
         0|7         1|3

Or,

    like_M      like_S     F1_x    F1_y
    0|7         1|3        CgC     TgT

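I tried an approach using for loops, but it is too expensive because my data is large. I also tried to adapt the approach @piRSquared gave in this question: How to read two lines from a file and create dynamics keys in a for-loop using python? However, I could not get it to work.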

[Question discussion]:

Tags: python string pandas dataframe count

[Solution 1]:

Consider the numpy-based helper function count_in

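import numpy as np
import pandas as pd

def count_in(clst, cols):
    # cols: one row of target strings (F1_x, F1_y) for each row of clst
    cols = np.asarray(cols).astype(str)
    # split every comma-separated cell into a list of tokens
    c1 = np.char.split(np.asarray(clst).astype(str), ',')
    l = np.array([len(i) for i in c1])  # number of tokens in each row
    s = np.concatenate(c1)              # all tokens flattened into one array
    r = np.arange(len(cols))
    # compare each token with the targets of its own row, then take running totals
    c = (s[:, None] == cols[r.repeat(l)]).cumsum(0)

    z = np.zeros(cols.shape[1], dtype=int)
    # differences between the running totals at each row's last token are the per-row counts
    counts = np.diff(np.vstack([z, c[l.cumsum() - 1]]), axis=0).astype(str)
    return pd.Series(counts.tolist(), clst.index).str.join('|')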
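
The cumulative-sum-then-difference step in count_in may be easier to follow on a tiny input. The following is only an illustrative sketch, not part of the original answer: it uses the first two rows of the Sp3 column, and the names tokens, targets, row_of_token and at_row_end are made up for this walk-through.

```python
import numpy as np

# First two rows of Sp3, already split on commas (illustrative input).
tokens = [["TgT", "TgC"], ["CgT", "CgC", "GgT", "GgC"]]
# Matching F1_x / F1_y values for those two rows.
targets = np.array([["CgC", "TgT"], ["GgC", "CgT"]])

lengths = np.array([len(t) for t in tokens])            # [2, 4]
flat = np.concatenate(tokens)                           # all 6 tokens in one array
row_of_token = np.arange(len(tokens)).repeat(lengths)   # [0, 0, 1, 1, 1, 1]

# Compare every token with the two targets of the row it came from.
hits = flat[:, None] == targets[row_of_token]           # boolean, shape (6, 2)

# Running totals down the flattened tokens; pick the total at each row's last token.
running = hits.cumsum(axis=0)
at_row_end = running[lengths.cumsum() - 1]

# Differencing consecutive row-end totals recovers the count for each row alone.
zeros = np.zeros((1, targets.shape[1]), dtype=int)
counts = np.diff(np.vstack([zeros, at_row_end]), axis=0)
print(counts)
```

Each row of counts holds the F1_x count followed by the F1_y count, which count_in then joins with '|'.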

Then apply it to every column except F1_x and F1_y

cols = ['F1_x', 'F1_y']
d1 = df.drop(columns=cols).apply(count_in, cols=df[cols])
d1.join(df[cols])

   MA1  MA2  MA3  Sp3  Sp4  Sp6 F1_x F1_y
0  0|2  0|2  0|3  0|1  1|1  1|1  CgC  TgT
1  0|1  0|3  0|2  1|1  1|1  1|1  GgC  CgT
2  0|1  0|1  0|1  1|1  2|0  0|1  CgG  TgC
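
As a way to check the table above on a small input, the same counts can be produced with plain Python string methods. This is a slow, row-by-row sketch and not the vectorized method of the answer; the helper name count_row and the variable check are hypothetical, and the dataframe is rebuilt here from the sample data in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "MA1": ["TgT,TgT", "CgT", "TgC"],
    "MA2": ["TgT,TgT", "CgT,CgT,CgT", "TgG,TgC"],
    "MA3": ["TgT,TgT,TgT", "CgT,CgT", "TgC"],
    "Sp3": ["TgT,TgC", "CgT,CgC,GgT,GgC", "TgC,CgG"],
    "Sp4": ["TgT,CgC", "CgT,GgC", "CgG,CgG"],
    "Sp6": ["TgT,TgC,CgT,CgC", "GgT,GgC,CgT", "TgG,TgC"],
    "F1_x": ["CgC", "GgC", "CgG"],
    "F1_y": ["TgT", "CgT", "TgC"],
})

def count_row(row, targets):
    """Count, for one row, how often each target value occurs in every other column."""
    out = {}
    for col in row.index.drop(targets):
        tokens = row[col].split(",")
        # F1_x count first, then F1_y, joined by a pipe.
        out[col] = "|".join(str(tokens.count(row[t])) for t in targets)
    return pd.Series(out)

cols = ["F1_x", "F1_y"]
check = df.apply(count_row, axis=1, targets=cols)
print(check.join(df[cols]))
```

On large data this loop will be much slower than count_in, so it is mainly useful for validating results on a few rows.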

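Then, for the second question, split the counts back into integers and sum them by the first letter of the column name (M or S)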

d2 = d1.stack().str.split('|', expand=True).astype(int)
d3 = d2.groupby(
    [d2.index.get_level_values(0), d2.index.get_level_values(1).str[0]]
).sum()
pd.Series(
    d3.astype(str).values.tolist(), d3.index
).str.join('|').unstack().rename(columns='like_{}'.format).join(df[cols])

  like_M like_S F1_x F1_y
0    0|7    2|3  CgC  TgT
1    0|6    3|3  GgC  CgT
2    0|3    3|2  CgG  TgC
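
The grouping step can also be written more explicitly, which may help when reading the chained version above. This is a sketch under stated assumptions rather than the answer's own code: it starts from the per-column counts shown earlier, typed in by hand, and the names parts, like and summary are invented for the illustration.

```python
import pandas as pd

# Per-column "x|y" counts, copied from the output of the first step.
d1 = pd.DataFrame({
    "MA1": ["0|2", "0|1", "0|1"],
    "MA2": ["0|2", "0|3", "0|1"],
    "MA3": ["0|3", "0|2", "0|1"],
    "Sp3": ["0|1", "1|1", "1|1"],
    "Sp4": ["1|1", "1|1", "2|0"],
    "Sp6": ["1|1", "1|1", "0|1"],
})

# Split every "x|y" string into two integer columns (0 = x count, 1 = y count).
parts = {c: d1[c].str.split("|", expand=True).astype(int) for c in d1.columns}

like = {}
for prefix in ("M", "S"):
    members = [parts[c] for c in d1.columns if c.startswith(prefix)]
    total = sum(members)  # element-wise sum of the two-column frames
    like["like_" + prefix] = total[0].astype(str) + "|" + total[1].astype(str)

summary = pd.DataFrame(like)
print(summary)
```

Grouping by the first letter of the column name is what ties MA1, MA2 and MA3 to like_M and Sp3, Sp4 and Sp6 to like_S; a different naming scheme would need a different grouping key.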

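[Discussion]:

Thanks for the answer. If you have time, could you please add some explanation? I am reading the pandas docs alongside your script to understand what each piece of code is doing. Any information would help.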