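【Posted】: 2021-05-03 07:35:18
【Question】:
I am having a big problem joining two (large) files. I have tried every combination of the join command and of the AWK options I found in other users' questions, but the result is always the same: no output is produced (and I know the files have fields in common). To illustrate the problem, here is part of each file:
File 1: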
SiiA lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|CP052796.1_prot_QJV25805.1_4154 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP009559.1_prot_WP_000389229.1_1106 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP029897.1_prot_WP_000389235.1_4284 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE
SiiB lcl|NC_003197.2_prot_NP_463123.1_4112 100.000 100 MKYINHYRYLFVCFFLAILPFFALSFPGIREYVFDNFMVSAIYNGVIIAIYITGSLCALFTILKNISAKDILIAQDASRKNSILSNLNQVLFAGESKQCDFNLLMELDDNVSTARNQRLSFIMSCSNVSTLVGLLGTFAGLSITIGSIGNLLSSPSDVGGDNASNTLNMIVTMVASLSEPLKGMNTAFVSSIYGVVCAILLTSQSVFVRSSYSLVSTEIKKLKIISNRANNKQRSLRVESETLVEFKELFKAFFDNYLTVENLRTQDEEKKREMLSDSFVTLQNRLLDNSAKLEQISTLIDGYLVSSNENLKKLSDGVITITSRLSEGNILLADNNARLEAMSTIQNIIDKKNDSIMTSVDKCYQESLSHGKTINDIAAGSADISHTLDGLRKEMDEDMNNVHLALSDLSATDKKIIANTKEISAEMVSYRDTYMPLMEKITSMHQEIVKQRLLNKEEKNED
File 2:
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463122.1_4111
Salmonella_bongori >lcl|NZ_CP053416.1_prot_WP_000427862.1_2024
Salmonella_bongori >lcl|NZ_CP053416.1_prot_WP_079774928.1_2025
Salmonella_bongori >lcl|NZ_CP053416.1_prot_WP_000168315.1_2026
Salmonella_bongori >lcl|NZ_CP053416.1_prot_WP_079774927.1_2027
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463123.1_4112
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463124.1_4113
The expected output is:
SiiA Salmonella_enterica_subsp_enterica_Typhimurium_LT2 lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_bongori lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE
SiiB Salmonella_enterica_subsp_enterica_Typhimurium_LT2 lcl|NC_003197.2_prot_NP_463123.1_4112 100.000 100 MKYINHYRYLFVCFFLAILPFFALSFPGIREYVFDNFMVSAIYNGVIIAIYITGSLCALFTILKNISAKDILIAQDASRKNSILSNLNQVLFAGESKQCDFNLLMELDDNVSTARNQRLSFIMSCSNVSTLVGLLGTFAGLSITIGSIGNLLSSPSDVGGDNASNTLNMIVTMVASLSEPLKGMNTAFVSSIYGVVCAILLTSQSVFVRSSYSLVSTEIKKLKIISNRANNKQRSLRVESETLVEFKELFKAFFDNYLTVENLRTQDEEKKREMLSDSFVTLQNRLLDNSAKLEQISTLIDGYLVSSNENLKKLSDGVITITSRLSEGNILLADNNARLEAMSTIQNIIDKKNDSIMTSVDKCYQESLSHGKTINDIAAGSADISHTLDGLRKEMDEDMNNVHLALSDLSATDKKIIANTKEISAEMVSYRDTYMPLMEKITSMHQEIVKQRLLNKEEKNED
I tried the join command:
join -j2 -o1.1,2.1,1.2,1.3,1.4, 1.5 <(sort -k2 file1) <(sort -k2 file2) (this does not work because it says I am not using the command correctly)
join -2 1 -2 2 <(sort -k2 file1) <(sort -k2 file2)
And some AWK options:
awk '{if (NR==FNR) {a[$2]=$1; next} if ($2 in a) {print $1, a[$2] $2, $3, $4, $5}}' file1 file2
or
awk 'FNR==NR{a[$2]=$1;next} a[$2]==$2{print $0, a[$2]}' file1 file2
I have run out of things to try, and I don't know where to read up on this, since nothing seems to work. Thanks in advance for your time :)
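One difference visible in the samples above is that every ID in file 2 starts with `>`, while the IDs in file 1 do not, so a verbatim comparison of column 2 cannot match. The following is a minimal sketch under stated assumptions, not a definitive fix: it assumes the `>` prefix is the only mismatch, that fields are whitespace-separated, and that each ID appears at most once in file 2. The sample rows are copied from the question with the sequences truncated for brevity, and `joined.txt` is a hypothetical output name.

```shell
# Shortened copies of the sample rows (sequences truncated for brevity).
cat > file1 <<'EOF'
SiiA lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPS
SiiA lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPS
SiiA lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPS
SiiB lcl|NC_003197.2_prot_NP_463123.1_4112 100.000 100 MKYINHYRYL
EOF
cat > file2 <<'EOF'
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463122.1_4111
Salmonella_bongori >lcl|NZ_CP053416.1_prot_WP_079774927.1_2027
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463123.1_4112
EOF

# Pass 1 (file2, read first): map each ID, minus its leading ">", to the organism name.
# Pass 2 (file1): print only rows whose ID is in the map, with the organism
# inserted as the second column.
awk '
  NR == FNR { id = $2; sub(/^>/, "", id); org[id] = $1; next }
  ($2 in org) { print $1, org[$2], $2, $3, $4, $5 }
' file2 file1 > joined.txt

cat joined.txt
```

Note that file 2 is passed first here, since it is the lookup table; the attempts above read file 1 first and then compared against the unmodified `>`-prefixed IDs.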
【Comments】:
-
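Debug it, e.g. by divide and conquer. Remove half of the columns from each file. Do you still have the problem? If not, start over but remove the other half of the columns. Do you still have the problem? If so, remove half of the columns from the new files. Do you still have the problem? Repeat until you either no longer have the problem (you may work out the solution yourself) or have a minimal pair of files that you can use as a minimal reproducible example of your problem.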
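In the spirit of that reduction, one possible first step is to drop everything except the key column of each file and check whether any key matches verbatim. This is only a sketch: the file names `keys1.txt`, `keys2.txt`, `common.txt` and `common_fixed.txt` are hypothetical, and the sample rows are shortened copies of the ones in the question.

```shell
# Shortened copies of the sample rows (sequences truncated for brevity).
cat > file1 <<'EOF'
SiiA lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPS
SiiA lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPS
SiiA lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPS
SiiB lcl|NC_003197.2_prot_NP_463123.1_4112 100.000 100 MKYINHYRYL
EOF
cat > file2 <<'EOF'
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463122.1_4111
Salmonella_bongori >lcl|NZ_CP053416.1_prot_WP_079774927.1_2027
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463123.1_4112
EOF

# Keep only the key column (field 2) of each file, sorted and de-duplicated.
awk '{ print $2 }' file1 | sort -u > keys1.txt
awk '{ print $2 }' file2 | sort -u > keys2.txt

# comm -12 prints only the lines present in both sorted inputs.
comm -12 keys1.txt keys2.txt > common.txt
echo "verbatim matches: $(wc -l < common.txt)"

# Looking at one key from each side makes the difference easy to spot.
head -n 1 keys1.txt keys2.txt

# Re-check after stripping the leading ">" from the file2 keys.
sed 's/^>//' keys2.txt | sort -u | comm -12 keys1.txt - > common_fixed.txt
echo "matches after stripping '>': $(wc -l < common_fixed.txt)"
```

If the first count is zero while the second is not, the keys differ only by that prefix. On real data, other invisible differences (for example trailing carriage returns) would need a similar check.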