【发布时间】:2016-07-09 17:02:18
【问题描述】:
我想使用脚本来执行以下操作。我有两个文件。我想将 file2 中第 2 列(下例中的“A1”)和 3(下例中的“A2”)的内容替换为第 2 列(下例中的“A1_new”)和 3 (以下示例中的“A2_new”)在文件 1 中。有两个条件:
- 这必须基于第 1 列的两个文件中都存在的键变量(以下示例中的“SNP”)来完成。含义:脚本应检查文件 1 和文件 2 中是否存在键变量 X然后如上所述用 file1 替换 file2 的内容。如果没有匹配项,则不应执行任何操作。
- 两个文件都非常大:file2 的大小约为 38Gb(1,529 列和约 830 万行),file1 较小,因为它只有 6 列而不是 1,529 列。
- 我可以使 file1 只包含要实现的更改。我估计行数大约为 8,000。
- 我还应该注意,键的个数不一定相等,也不一定按相同的顺序排序!
下面我放了两个文件的头。因此,the_new_file2 的预期结果将如下所示:
snp a1 a2 humid00126-humid0126-a318-003-011890-upid01890-upid00306-upid00306-a318-0245-upid00629-upid00629-a318-0033 upid01934-upid01934-a318-0215 Upid00526- UPID00526-A318-0526
rs12565286 C G 0.0190125 0.0339966 0.0130005 0.00500488 0.0109863 0.0180054 0.00500488
rs11804171 A T 0.0249939 0.154999 0.0310059 0.0169983 0.0159912 0.0230103 0.0159912
rs2977670 C G 1.95801 1.745 1.31201 1.97299 1.97198 1.96301 1.97299
rs143225517 C T 0 0.93399 0.989014 0 0 0 0
rs3094315 A G 2 1 1 2 2 2 2
rs3131972 G A 1.99701 0.997009 0.997986 1.99701 1.99499 1.987 1.99701
rs3131971 C T 1.89301 1.07901 1.08902 1.98901 1.86801 1.92798 1.982
rs2073814 G C 1.98001 0.983002 0.993988 1.97601 1.98401 1.979 1.97601
rs2073813 A G 0.00299072 0.915039 0.899017 0.00299072 0.00299072 0.00201416 0.00299072
rs3131969 G A 1.996 1.03097 1.09598 1.99701 1.99701 1.99701 1.996
rs3131968 G A 1.996 1.02801 1.09396 1.996 1.99701 1.99701 1.99499
rs3131967 C T 1.931 1.01202 1.08701 1.98999 1.93298 1.97101 1.987
rs3115859 A G 1.996 1.02502 1.09497 1.99701 1.99399 1.98599 1.991
rs34927559 G A 1.91898 1.61298 1.845 1.91901 1.94299 1.77402 1.952
rs115173026的Tg 0.423004 0.420013 0.11499 1.09097 0.51001 0.618988 0.114013结果的 CHR1:956061:I I d 强> 0.356018 0.395996 0.10202 1.09903 0.530975 0.505005 0.0679932结果 rs116130865 T C 0.420014 0.420013 0.113983 1.09003 0.503968 0.535004 0.11203
rs111893801 T C 0.0239868 0.0180054 0.00601196 0.0209961 0.019989 0.0610046 0.00299072
您可以在上面的“预期结果”和下面的文件 1 和文件 2 的 head 中以粗体看到键“chr1:956061:I”的第 2 列和第 2 列发生了变化。从“AG G”到“ID”。
我在 bash 方面确实有一些经验。但我认为awk 会很慢,可能perl 或python 可能会快得多。我在这里看到了一些东西:How to search and replace text in a file using Python?。但这只是替换,而不是基于键变量。
如果有人能给我一些关于如何去做的指示,那就太棒了。
谢谢!
根据文件 1 和文件 2 的head 下方的流行请求。
头文件1
SNP A1_new A2_new
chr1:956061:I I D
chr1:3716325:I I D
chr1:4095845:I I D
chr1:4097079:I I D
头文件2
snp a1 a2 humid00126-humid0126-a318-003-011890-upid01890-upid00306-upid00306-a318-0245-upid00629-upid00629-a318-0033 upid01934-upid01934-a318-0215 Upid00526- UPID00526-A318-0526
rs12565286 C G 0.0190125 0.0339966 0.0130005 0.00500488 0.0109863 0.0180054 0.00500488
rs11804171 A T 0.0249939 0.154999 0.0310059 0.0169983 0.0159912 0.0230103 0.0159912
rs2977670 C G 1.95801 1.745 1.31201 1.97299 1.97198 1.96301 1.97299
rs143225517 C T 0 0.93399 0.989014 0 0 0 0
rs3094315 A G 2 1 1 2 2 2 2
rs3131972 G A 1.99701 0.997009 0.997986 1.99701 1.99499 1.987 1.99701
rs3131971 C T 1.89301 1.07901 1.08902 1.98901 1.86801 1.92798 1.982
rs2073814 G C 1.98001 0.983002 0.993988 1.97601 1.98401 1.979 1.97601
rs2073813 A G 0.00299072 0.915039 0.899017 0.00299072 0.00299072 0.00201416 0.00299072
rs3131969 G A 1.996 1.03097 1.09598 1.99701 1.99701 1.99701 1.996
rs3131968 G A 1.996 1.02801 1.09396 1.996 1.99701 1.99701 1.99499
rs3131967 C T 1.931 1.01202 1.08701 1.98999 1.93298 1.97101 1.987
rs3115859 A G 1.996 1.02502 1.09497 1.99701 1.99399 1.98599 1.991
rs34927559 G A 1.91898 1.61298 1.845 1.91901 1.94299 1.77402 1.952
rs115173026的Tg 0.423004 0.420013 0.11499 1.09097 0.51001 0.618988 0.114013结果的 CHR1:956061:我AG A 强> 0.356018 0.395996 0.10202 1.09903 0.530975 0.505005 0.0679932结果 rs116130865 T C 0.420014 0.420013 0.113983 1.09003 0.503968 0.535004 0.11203
rs111893801 T C 0.0239868 0.0180054 0.00601196 0.0209961 0.019989 0.0610046 0.00299072
rs9777931 T C 1.54102 1.27399 1.75003 0.863983 1.46597 1.33801 1.07199
rs3121553 C G 0.419007 0.364014 0.112976 1.08801 0.562989 0.546997 0.110016
rs809912 T C 1 1 0 1 1 0 1
rs196128 G T 1 1 0 1 1 0 1
rs262676 A T 1 1 0 1 1 0 1
rs196127 A T 1 1 0 1 1 0 1
rs149435595 C A 0.019989 0.0509949 0.478027 0.0580139 0.337006 0.0169983 0.705994
rs12082939 T C 1.021 1.17899 0.442963 1.22 1.008 0.00900269 1.056
rs262675 G A 0.998993 0.822998 0 1 0.997986 0 0.971985
rs262672 T G 1 1 0 1 1 0 1
rs262671 A C 1 1 0 1 0.997986 0 1
rs3107126 A C 0 0 0.0039978 0.0119934 0 0.00100708 0.0150146
rs11583093 A G 0.00799561 0.184998 0.0130005 0.00698853 0 0.0159912 0.00900269
rs262670 T C 1 1 0 1 1 0 1
rs262669 A G 1 1 0.0039978 1.01199 1 0 1.01501
rs9442619 T C 0 0 0.0039978 0.0109863 0 0 0.0150146
rs2257182 C T 1.021 1.17899 0.442963 1.22 1.008 0.00900269 1.056
rs262668 G A 1 1 0.0039978 1.01199 1 0 1.01501
rs262667 C T 1 1 0 1 1 0 1
rs34944978 A G 0 0 0 0 0 0 0
rs262665 G A 1 1 0 1 1 0 1
rs262664 C G 1 1 0.0039978 1.01199 1 0 1.01501
rs2503704 C T 1.43301 1.43503 1.28302 1.5 1.86999 1.444 1.85998
rs262663 C T 1 1 0.0039978 1.01199 1 0.00100708 1.01501
rs262662 G A 1 1 0.0039978 1.01199 1 0 1.01501
rs3107125 A G 0 0 0 0 0 0 0
rs2459984 G A 1.021 1.17899 0.442963 1.22 1.008 0.00799561 1.056
rs61776606 A G 0.0169983 0 0.00299072 0.0289917 0.00601196 0.676025 0
rs2459983 G A 1.39798 1.858 1.871 1.798 1.91199 1.54697 1.92499
rs262661 C T 1 1 0 1 1 0 1
rs262660 G A 1 1 0 1 1 0 1
chr1:2087094:I AATG A 0.997986 1.00101 0.00900269 1.04199 1 0.00201416 1.01599
rs3123592 C T 0 0 0.0039978 0.0109863 0 0 0.0159912
rs381664 G A 1 1 0 1 1 0 1
rs72629470 A G 1 1 0 1.00101 0 0 0
rs151048557 T C 0.0039978 0 0 0.00201416 0.0180054 0.00100708 0
rs72629471 G C 1 1 0 1 0 0 0
rs72629472 C G 1 0 0 1 0 0 0
rs116329948 T C 0 0.266998 0 0 0 0.0039978 0
rs16823335 A G 1 1 0 1 0 0 0
rs12028094 T C 1 1 0 1 0 0 0
rs141183701 T C 0 0 0.00201416 0 0.00100708 0.00299072 0.0150146
rs10797377 A G 1 1.004 0 1 0 0 0
rs72629473 A G 0.00201416 0 0 0.985992 0 0 0
rs12063869 A T 1 1.004 0 1 0 0 0
rs2889476 C T 1.08701 1.30301 0.019989 1.004 0.052002 0.00299072 0.0100098
rs946175 G A 1.08801 1.30301 0.019989 1.004 0.052002 0.00299072 0.00900269
rs946176 T C 1 1 0 1 0 0 0
rs946177 G A 1.08701 1.302 0.019989 1.004 0.052002 0.00299072 0.00900269
rs61763644 T C 0.0710144 0.0039978 0.0690002 0.0209961 0.00100708 0 0
rs2027262 C T 1 1 0 1 0 0 0
rs2027263 G A 1 1.004 0 1 0.00100708 0 0
rs2027264 A G 1 1 0 1 0 0 0
rs111412502 T C 1 1.00201 0 1 0 0 0
rs11802804 G T 1 1.00699 0 1 0.00100708 0 0
rs111668563 G T 1 0.944 0 1 0 0 0
rs112539394 A G 1 0.941986 0 1 0 0 0
rs12028439 G A 1.00601 1.01001 0 1.00201 0.00100708 0 0
rs12024879 A G 1 1.00903 0 1 0.00100708 0 0
rs12027818 T C 1.02399 1.013 0.00100708 1.03799 0.131989 0 0.00100708
rs6685625 C T 0.109985 0.0169983 0.00799561 0.131012 0.0390015 0.00299072 0.14801
rs12027883 T C 1 1.00998 0 1 0 0 0
rs7534897 T C 0.0130005 0.596985 0.562012 0.00500488 0.0159912 1.11197 0.317993
rs6681651 T C 1.09698 1.591 0.0390015 1.01999 0.0539856 0.00201416 0.00900269
rs6691208 A C 1.091 1.65201 0.019989 1.005 0.0539856 0.00299072 0.00799561
rs4357486 C G 1 1.04901 0.0700073 1 0 0 0.00299072
rs4486390 T C 1 1.04901 0.0700073 1 0 0 0.00299072
rs7536203 A G 0.437012 0 0.00100708 0.0119934 0.148987 0.00201416 0
rs59869348 C T 1.091 1.629 0.019989 1.004 0.0549927 0.00201416 0.00799561
rs59041920 T C 1.091 1.629 0.0190125 1.004 0.0549927 0.00201416 0.00799561
rs60366070 T C 0.0910034 0.604034 0.019989 0.00500488 0.0549927 0.00201416 0.00799561
rs56401121 A G 0.0910034 0.604034 0.019989 0.00500488 0.0549927 0.00201416 0.00799561
rs72629480 T C 1 1.061 0 1 0 0 0
rs58317195 G A 0.0910034 0.60498 0.019989 0.00500488 0.0549927 0.00201416 0.00799561
rs114933164 C A 0.645996 0.54602 0.0190125 0.0039978 0.0549927 0.00201416 0.00698853
rs6680039 G C 0.0920105 0.603027 0.0190125 0.00500488 0.0549927 0.00201416 0.00799561
rs114412925 A G 0.0270081 0 0 0.0450134 0.131012 0 0
rs6680573 G C 1.09201 1.67703 0.0190125 1.005 0.0559998 0.00201416 0.00799561
rs56200661 T C 1 0.0039978 0 0.919006 0 0 0
rs72629481 A G 1 0.0039978 0 0.919006 0 0 0
rs17389878 A G 0 0 0 0 0 0 0
rs116142632 T G 0.00500488 0.00201416 0.274994 0 0.00799561 0.567017 0.127991
rs16823357 C T 0 0.609985 0 0 0.00100708 0 0.00100708
rs12564456 G C 1 1.974 0 1 0.00100708 0.00100708 0.00201416
rs55949537 T A 1 1.978 0 1.00201 0.00100708 0 0.00201416
rs55784203 T G 1 1.98001 0 1.00201 0.00100708 0 0.00201416
rs56091202 G A 1 1.98001 0 1.00201 0.00100708 0 0.00201416
rs10909862 A G 1 0.0190125 2 0.989014 1.99799 1.99799 1.99701
rs10492936 A G 0 0.615021 0 0 0 0 0.00201416
rs67758087 G A 1 1.98401 0 1.00201 0.00100708 0 0.00201416
rs12033927 G A 1 1.37503 0 1.00101 0 0 0
rs12034794 G T 1 1.98999 0 1.00201 0.00100708 0 0.00201416
rs4576609 T C 1 1.99301 0 1 0 0 0
rs3795262 G T 1 1.98901 0 1 0 0 0
rs3795263 A G 0.0950012 0 0.453033 0.627014 1.10703 0.880981 0.627991
rs55873718 G A 1 1.996 0 1 0 0 0
rs61763646 A G 0.052002 0 0.0910034 0.0209961 0.0039978 0 0
rs56176404 G A 1 1.38098 0 1 0 0 0
rs7412983 C A 1 1.99701 0 1 0 0 0
rs58265521 A C 1 1.99701 0 1 0 0 0
rs117759489 A C 0.00201416 0 0.0039978 0.0119934 0.00201416 0 0
rs60287244 G A 1 1.99701 0 1 0 0 0
rs12034766 G A 1 1.99701 0 1 0 0 0
rs12033401 T C 1 1.35101 0 1 0 0 0
rs2142569 A G 0.105988 0 1.00403 0.628998 1.11801 1.992 0.930023
rs143965216 T C 0 0 0.00100708 0 0 0 0
rs57893585 A T 1 1.99799 0 1 0 0 0
rs189380075 G C 0 0.00299072 0.026001 0.00100708 0 0 0
rs6693975 A G 1 1.362 0 1 0 0 0
rs6693982 A G 0.878998 1.991 0 1 0 0 0
rs141204614 T G 0.00500488 0.00100708 0.276001 0 0.00799561 1.05502 0.130005
rs12038530 A T 1 1.99799 0 1 0 0 0
rs12037307 A C 1 1.35101 0 1 0 0 0
rs12039548 C T 1 1.99899 0 1 0 0 0
rs12034573 A G 1 1.99899 0 1 0 0 0
rs10909864 C T 1 2 0 1 0 0 0
rs1890338 T C 1 2 0 1 0 0 0
rs58311021 A G 0 0.621979 0 0 0 0 0
rs12039132 T C 1 1.97199 0 1 0 0 0
rs7514166 C G 1 2 0 1 0 0 0
rs34408665 C T 1 2 0 1 0 0 0
rs58052947 C T 1 2 0 1 0 0 0
rs56398543 A G 1 2 0 1 0 0 0
rs6661908 A G 1 2 0 1 0 0 0
rs41334845 T G 1 0.269012 0 1 0 0 0
rs6679678 C T 1 2 0 1 0 0 0
rs12042403 C T 0.45401 1.99799 0 1 0 0 0
rs12041583 G A 0.45401 1.99799 0 1 0 0 0
rs12041634 T A 1 2 0 1 0 0 0
rs12040299 T C 1 2 0 1 0 0 0
rs10489589 C T 1 2 0 1 0 0 0
rs72629493 A G 1 2 0 1 0 0 0
rs72629494 C T 1 2 0 1 0 0 0
rs12408197 T C 1 2 0 1 0 0 0
rs67025412 C T 1 2 0 1 0 0 0
rs10489588 A G 1 2 0 1 0 0 0
rs67630839 T C 1 2 0 1 0 0 0
rs78841195 G A 0 0 0 0 0 0 0
rs1005471 C T 1 2 0 1 0 0 0
rs9660819 T C 1 2 0 1 0 0 0
rs9661525 G T 1 2 0 1 0 0 0
rs72629496 A G 1 0.269012 0 1 0 0 0
rs16823392 G C 1 2 0 1 0 0 0
rs116556794 T C 0 0 0 0 0 0 0
rs74048148 G A 1 1.99701 0 1 0 0 0
rs77415878 G C 1 1.99701 0 1 0 0 0
rs113272167 TC 1 1.99701 0 1 0 0 0
rs112690788 G A 1 1.99701 0 1 0 0 0
rs72856433 A G 1 1.99701 0 1 0 0 0
rs72629497 C T 1 2 0 1 0 0 0
rs67872056 TC 1 2 0 1 0 0 0
rs72629498 T C 1 1.35199 0 1 0 0 0
rs67289919 A G 1 2 0 1 0 0 0
rs67919793 C T 1 2 0 1 0 0 0
rs67908307 C T 1 2 0 1 0 0 0
rs75400926 C T 0.877991 1.82599 0 0.928009 0 0 0
rs12036378 A G 1 2 0.00100708 1 0 0 0
rs55863899 G C 1 0.268005 0 1 0 0 0
rs74048152 T G 0 0.64801 0 0 0 0 0
rs66693767 G A 1 2 0 1 0 0 0
rs10909865 G A 1 2 0.00299072 1 0 0 0
rs72856440 G A 1 2 0 1 0 0 0
rs2993510 C T 1 2 0 1 0 0 0
rs113884577 A G 0.0180054 0 0.0039978 0.0239868 0.0039978 0 0
rs12567620 G T 1 2 0 1 0 0 0
rs12566487 A G 0 0.64801 0 0 0 0 0
rs12409277 C T 1 2 0 1 0 0 0
rs12044162 G A 1 1.992 0 1 0 0 0
rs12724379 A C 0.00100708 0 0.312988 0.0039978 0.00100708 0 0
rs6684233 A C 0 0.650024 0 0 0 0 0
rs6692292 C T 1 1.93701 0 1 0.00201416 0 0
rs56213325 A G 0.0039978 0 0 0.0190125 0.00100708 0.00100708 0.00201416
rs12045368 G A 1 1.28 0 1 0.00201416 0 0
chr1:2960196:I TCAAACAAA T 1 1.931 0 0.98999 0.00201416 0 0
rs10909866 T C 1 1.92801 0 1 0.00201416 0 0
rs12046233 G A 1 1.28 0 1 0.00201416 0 0
rs12044848 T C 1 1.92499 0 1 0.00201416 0 0
rs12120940 A G 0.0150146 0.0180054 0.0710144 0.0299988 0.0100098 0.0820007 0.00900269
rs2651900 T G 0.388 0.0039978 0.665985 0.230011 0.600006 0 0.21698
rs2075968 T C 0.442993 0 0.312988 0.756989 0.320007 0 0.888001
rs1627080 C T 1.00201 0.0969849 1.009 1.008 1 0.00799561 2
rs116609258 A C 0.0820007 0 0.556 0.00900269 0.0830078 0 0.019989
rs117719590 C T 0.0830078 0 0.0230103 0.0100098 0.0710144 0 0.871978
rs2817128 C T 2 2 2 2 1.99799 2 1.91199
rs2817129 C G 1.68399 1.923 1.327 1.867 1.48898 1.99399 1.84402
rs79551700 T C 0.07901 0.00100708 0.0039978 0.0429993 0.0570068 0.00201416 0.0320129
rs35019643 G C 0.07901 0.00100708 0.0039978 0.0429993 0.0570068 0.00201416 0.0320129
chr1:3089167:I CG C 0.07901 0.00100708 0.0039978 0.0429993 0.0570068 0.00201416 0.0320129
rs34503920 T C 0.07901 0.00201416 0.0039978 0.0429993 0.0570068 0.00201416 0.0320129
rs2455110 A G 0.132995 0.0329895 0.0870361 0.0930176 0.510986 0.137024 0.052002
rs56788035 A G 0.0700074 0.00799561 0.0629883 0.0209961 0.427032 0.0799866 0.0409851
rs2455109 TA 0.15799 0.0549927 0.0950317 0.557984 0.529023 0.150024 0.10199
rs11807062 C A 0.15799 0.332031 0.0940246 0.543 0.526001 0.150024 0.0969849
rs60951434 A G 0.0700074 0.00799561 0.0629883 0.0220032 0.485992 0.0799866 0.0409851
rs113002511 G A 0.00100708 0.00799561 0.00601196 0.0100098 0.0100098 0.0100098 0.00201416
rs4648372 G A 0.156983 0.32901 0.0950317 0.55603 0.473999 0.0809937 0.0960083
rs4648373 T C 0.156983 0.32901 0.0940246 0.548981 0.466003 0.0809937 0.0549927
rs10909907 G A 0.165009 0.32901 0.0950317 0.55603 0.843994 0.0849915 0.0950012
rs2249336 C T 0.132995 0.307984 0.0910339 0.107971 0.46698 0.0700073 0.0539856
rs2249319 T C 0.0710144 0.00799561 0.0650024 0.0239868 0.437988 0.0150146 0.0400085
rs2249318 T C 0.0710144 0.00799561 0.0650024 0.0239868 0.437988 0.0150146 0.0400085
rs2455112 G A 0.413971 0.323975 0.36502 0.163025 0.5 0.369018 1.13498
rs12755504 G A 0.0419922 0.00799561 0.0599976 0.026001 0.378021 0.0130005 0.0390015
rs12071366 G C 1 0 0.00100708 0 0 0 0
rs6424052 G A 0 0.998993 0.0509949 1.98901 0 0 0.993988
rs7543006 G C 1 0 0 0 0 0 0
rs56710549 G C 0 0 0.234985 0 0 0 0
rs140255410 A G 0 0.00601196 0.0130005 0 0.0190125 0.00201416 0.00299072
rs111484302 G A 0 0 0.467987 0 0 0 0
rs9424286 T G 0.0100098 0 0.0379944 0 0 0 0.00299072
rs112460330 C T 0 0 0 0 0 0 0
chr1:3716325:I GCCCTGCACCC G 0.00100708 0 0 0.0039978 0 0 0
rs9424299 G A 1.96399 1.72199 1.73801 2 0.850006 0.98703 1.85998
rs60615207 A G 0.0509949 0.00698853 0.237 0.0180054 0.0249939 0 0.0339966
rs77743432 A C 0.473999 0 0.00100708 0 0 0 0
【问题讨论】:
-
您应该扩展更多关于“键变量”行为应该如何工作的细节。暂时,我可能会使用 perl 和
split将列分解为数组,然后比较。但细节?模糊问题得到模糊答案 -
我已经改了。感谢您的反馈。
-
编辑您的问题以包含一些简洁、可测试的样本输入和预期输出,以消除一些猜测。 sed 和 awk 是两个完全不同的工具,没有
awk/sed就像没有perl/grep一样。这根本不是 sed 的工作,没有理由假设 awk 解决方案会比 perl 或 python 慢。 -
我已经编辑了帖子/问题。我已经包含了 file1 和 file2 的
head,以及预期的结果。我希望这是有益的,可以帮助我而不是搁置这个问题。我知道这个问题可能很宽泛,但是请尝试使用慈善原则:假设您是一名在脚本方面受过零教育的生物学家,试图通过向(更多)受过脚本教育的人寻求帮助来找到解决此问题的方法.我认为这将有很长的路要走。 @thissuitisblacknot @andras-deak @mseifert @prasun @mpromonet。