【Question title】:Remove lines from a tab file according to conditions in another tab file in python
【Posted】:2020-03-21 03:03:34
【Question description】:

I have two tab-delimited files, for example: file1.txt

Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq2(-) SeqA
Cluster1 Seq3(+) SeqB
Cluster1 Seq300(+) SeqB
Cluster1 Seq90(+) SeqL
Cluster1 Seq90(+) SeqO
Cluster1 Seq2(-) SeqC
Cluster2 Seq8(-) SeqY
Cluster2 Seq8(-) SeqH
Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY
Cluster3 Seq10(+) SeqK
Cluster3 Seq10(+) SeqS
Cluster3 Seq10(+) SeqT
Cluster4 Seq300(+) SeqB

file2.txt

Clustername Names
Cluster1    SeqA
Cluster1    Seq1(+)
Cluster1    SeqC
Cluster1    Seq2(-)
Cluster1    SeqO
Cluster1    Seq3(+)
Cluster1    Seq90(+)
Cluster1    SeqB
Cluster1    SeqG
Cluster2    Seq8(-)
Cluster2    SeqY
Cluster2    SeqH
Cluster3    Seq10(+)
Cluster3    SeqK
Cluster4    SeqB
Cluster4    Seq300(+)

As you can see in file2.txt, SeqL is not present in Cluster1, so I want to remove the line Cluster1 Seq90(+) SeqL from file1.txt.

Seq300(+) is not present in Cluster1 either, so I remove the line:

Cluster1 Seq300(+) SeqB

from file1.txt.

The same goes for:

Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY

In file2.txt there is no SeqP in Cluster2, and no Seq79(-) in Cluster2 either, so I remove the lines:

Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY

from file1.txt.

The same goes for:

Cluster3 Seq10(+) SeqS
Cluster3 Seq10(+) SeqT

Because SeqS and SeqT are not in Cluster3 in file2.txt, I remove the following two lines from file1.txt:

Cluster3 Seq10(+) SeqS
Cluster3 Seq10(+) SeqT

In the end I should get a new file1.txt such as:

Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq2(-) SeqA
Cluster1 Seq3(+) SeqB
Cluster1 Seq90(+) SeqO
Cluster1 Seq2(-) SeqC
Cluster2 Seq8(-) SeqY
Cluster2 Seq8(-) SeqH
Cluster3 Seq10(+) SeqK
Cluster4 Seq300(+) SeqB

【Question discussion】:

    Tags: python python-3.x pandas dataframe merge


    【Solution 1】:

    Use DataFrame.merge + DataFrame.reindex to get the original columns back:

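    # Inner-merge twice: first require (Clustername, Seqname1) to be listed in df2,
    # then (Clustername, Seqname2); reindex restores df1's original columns.
    new_df = (df1.merge(df2, left_on=['Clustername', 'Seqname1'], right_on=['Clustername', 'Names'])
                 .merge(df2, left_on=['Clustername', 'Seqname2'], right_on=['Clustername', 'Names'])
                 .reindex(columns=df1.columns))
    print(new_df)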
    

    Output

      Clustername   Seqname1 Seqname2
    0    Cluster1    Seq1(+)     SeqA
    1    Cluster1    Seq2(-)     SeqA
    2    Cluster1    Seq2(-)     SeqC
    3    Cluster1    Seq3(+)     SeqB
    4    Cluster1   Seq90(+)     SeqO
    5    Cluster2    Seq8(-)     SeqY
    6    Cluster2    Seq8(-)     SeqH
    7    Cluster3   Seq10(+)     SeqK
    8    Cluster4  Seq300(+)     SeqB
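
For readers who want to run the merge idea end to end, here is a minimal self-contained sketch. It loads the sample data from the question through `io.StringIO` and assumes whitespace-separated fields (`sep=r"\s+"`). The helper name `keep_rows_listed_in` is hypothetical, and renaming `Names` before each merge is a small variation on the answer's code rather than the answer itself. Row order after an inner merge can differ between pandas versions, so do not rely on it.

```python
import io

import pandas as pd

# Sample data copied from the question.
FILE1 = """Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq2(-) SeqA
Cluster1 Seq3(+) SeqB
Cluster1 Seq300(+) SeqB
Cluster1 Seq90(+) SeqL
Cluster1 Seq90(+) SeqO
Cluster1 Seq2(-) SeqC
Cluster2 Seq8(-) SeqY
Cluster2 Seq8(-) SeqH
Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY
Cluster3 Seq10(+) SeqK
Cluster3 Seq10(+) SeqS
Cluster3 Seq10(+) SeqT
Cluster4 Seq300(+) SeqB
"""

FILE2 = """Clustername Names
Cluster1 SeqA
Cluster1 Seq1(+)
Cluster1 SeqC
Cluster1 Seq2(-)
Cluster1 SeqO
Cluster1 Seq3(+)
Cluster1 Seq90(+)
Cluster1 SeqB
Cluster1 SeqG
Cluster2 Seq8(-)
Cluster2 SeqY
Cluster2 SeqH
Cluster3 Seq10(+)
Cluster3 SeqK
Cluster4 SeqB
Cluster4 Seq300(+)
"""

# Assumption: fields are separated by tabs or runs of spaces.
df1 = pd.read_csv(io.StringIO(FILE1), sep=r"\s+")
df2 = pd.read_csv(io.StringIO(FILE2), sep=r"\s+")


def keep_rows_listed_in(df1, df2):
    """Hypothetical helper: keep df1 rows whose two names both appear
    in df2 under the same Clustername."""
    pairs = df2.drop_duplicates()  # duplicates in df2 would duplicate df1 rows
    out = df1
    for col in ("Seqname1", "Seqname2"):
        # Inner merge on (Clustername, col) drops rows whose pair is absent from df2.
        out = out.merge(pairs.rename(columns={"Names": col}), on=["Clustername", col])
    return out[df1.columns].reset_index(drop=True)


new_df = keep_rows_listed_in(df1, df2)
print(new_df)
```

With real files, the two `read_csv` calls would take the file paths instead of the `StringIO` objects.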
    

    Solution for n seqname columns:

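    # Number the rows within each cluster so every original row keeps an identity after melting.
    df1['aux'] = df1.groupby('Clustername').cumcount()

    # melt: one row per (original row, seq column) value
    # merge: keep only values listed in df2 for the same cluster
    # filter: keep original rows where every seq column matched (columns minus Clustername and aux)
    new_df = (df1.melt(['Clustername', 'aux'], var_name='Seq')
                 .merge(df2, left_on=['Clustername', 'value'], right_on=['Clustername', 'Names'])
                 .groupby(['Clustername', 'aux'])
                 .filter(lambda x: x.value.size >= (len(df1.columns) - 2))
                 .pivot_table(index=['Clustername', 'aux'], columns='Seq', values='value', aggfunc=''.join)
                 .reset_index()
                 .drop('aux', axis=1)
                 .rename_axis(columns=None))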
    
    print(new_df)
    

    Output

      Clustername   Seqname1 Seqname2
    0    Cluster1    Seq1(+)     SeqA
    1    Cluster1    Seq2(-)     SeqA
    2    Cluster1    Seq3(+)     SeqB
    3    Cluster1   Seq90(+)     SeqO
    4    Cluster1    Seq2(-)     SeqC
    5    Cluster2    Seq8(-)     SeqY
    6    Cluster2    Seq8(-)     SeqH
    7    Cluster3   Seq10(+)     SeqK
    8    Cluster4  Seq300(+)     SeqB
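
The same rule for any number of name columns can also be sketched without pandas, which may be simpler if the files only need filtering line by line. This is an alternative technique, not the answer's method. The function name `filter_lines` is hypothetical, and splitting on arbitrary whitespace is an assumption that fits the sample data. The sample below is a small subset of the question's files.

```python
def filter_lines(file1_lines, file2_lines):
    """Hypothetical helper: return the header of file1 plus every line whose
    names (all columns after the first) are listed in file2 for the same cluster."""
    file2_iter = iter(file2_lines)
    next(file2_iter)  # skip the "Clustername Names" header
    valid = set()
    for line in file2_iter:
        fields = line.split()
        if len(fields) >= 2:
            valid.add((fields[0], fields[1]))  # (cluster, name) pair

    file1_iter = iter(file1_lines)
    kept = [next(file1_iter).rstrip("\n")]  # keep the header as-is
    for line in file1_iter:
        fields = line.split()
        # Keep the line only if every name column is listed under its cluster.
        if fields and all((fields[0], name) in valid for name in fields[1:]):
            kept.append(line.rstrip("\n"))
    return kept


file1 = """Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq300(+) SeqB
Cluster1 Seq90(+) SeqL
Cluster4 Seq300(+) SeqB
""".splitlines()

file2 = """Clustername Names
Cluster1 SeqA
Cluster1 Seq1(+)
Cluster1 Seq90(+)
Cluster1 SeqB
Cluster4 SeqB
Cluster4 Seq300(+)
""".splitlines()

kept = filter_lines(file1, file2)
print("\n".join(kept))
```

Because the function only iterates over lines, open file objects can be passed in place of the lists, and the original row order is preserved.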
    

    【Discussion】:

      【Solution 2】:

      Create columns containing all the necessary values. df1 is file1.txt and df2 is file2.txt.

      # Build "Clustername name" keys so that membership is checked per cluster.
      df1['cs1'] = df1['Clustername'] + ' ' + df1['Seqname1']
      df1['cs2'] = df1['Clustername'] + ' ' + df1['Seqname2']

      # seq2 keeps only the names without a "(" (no strand suffix); the others become NaN.
      df2['seq2'] = df2['Names'][~df2['Names'].str.contains('(', regex=False)]

      df2['cs1'] = df2['Clustername'] + ' ' + df2['Names']
      df2['cs2'] = df2['Clustername'] + ' ' + df2['seq2']

      result = df1[(df1['cs1'].isin(df2['cs1'])) & (df1['cs2'].isin(df2['cs2']))]
      

      Select the required columns: result[['Clustername', 'Seqname1', 'Seqname2']]

         Clustername   Seqname1 Seqname2
      0     Cluster1    Seq1(+)     SeqA
      1     Cluster1    Seq2(-)     SeqA
      2     Cluster1    Seq3(+)     SeqB
      5     Cluster1   Seq90(+)     SeqO
      6     Cluster1    Seq2(-)     SeqC
      7     Cluster2    Seq8(-)     SeqY
      8     Cluster2    Seq8(-)     SeqH
      11    Cluster3   Seq10(+)     SeqK
      14    Cluster4  Seq300(+)     SeqB
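
A possible variant of this idea compares (Clustername, name) tuples directly instead of concatenating strings, using `MultiIndex.isin`. The sketch below is an assumption-laden illustration, not the answer's code: the helper name `pair_is_listed` is hypothetical, the data is a small subset of the question's files, and both name columns are checked against every name in the cluster (the `seq2` parenthesis filter is left out).

```python
import pandas as pd

# Small subset of the question's data, built inline for illustration.
df1 = pd.DataFrame({
    "Clustername": ["Cluster1", "Cluster1", "Cluster4"],
    "Seqname1": ["Seq1(+)", "Seq300(+)", "Seq300(+)"],
    "Seqname2": ["SeqA", "SeqB", "SeqB"],
})
df2 = pd.DataFrame({
    "Clustername": ["Cluster1", "Cluster1", "Cluster1", "Cluster4", "Cluster4"],
    "Names": ["SeqA", "Seq1(+)", "SeqB", "SeqB", "Seq300(+)"],
})

# All (cluster, name) pairs that file2 allows.
valid = list(zip(df2["Clustername"], df2["Names"]))


def pair_is_listed(df, col):
    """Hypothetical helper: boolean array, True where (Clustername, col) is in `valid`."""
    pairs = pd.MultiIndex.from_frame(df[["Clustername", col]])
    return pairs.isin(valid)


mask = pair_is_listed(df1, "Seqname1") & pair_is_listed(df1, "Seqname2")
result = df1[mask]  # keeps df1's original index labels
print(result)
```

Here Seq300(+) is kept for Cluster4 but dropped for Cluster1, because the pair is looked up together with its cluster.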
      

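      【Discussion】:

      • I don't think this is correct, because it does not check within each Clustername. For example, Seq300 may be present in Cluster4, but you would not remove Seq300 from Cluster1, because you are checking against the whole file rather than within each cluster name.
      • Could you add this case to your example so that I can check my logic?
      • @ksooklall I added it to the example
      • What is df2['seq2']?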