[Posted]: 2021-12-25 03:41:05
[Question]:
I have a file containing about 15 million records. Here is a sample of the data:
99001597,555555555555,3211,Njro_Kaniani,test,NORTH,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,IN2017,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001679,555555555555,1756,Bnju_HTT,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2012,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001680,555555555555,1108,Temoni_Kiara,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2028,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001683,555555555555,1604,Blue_Bay,Nzindo,,Y,COAST,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1820,Sgerea_Makuka,Salaam,,N,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1184,Makka,Salaam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1381,Leaders_Club,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1037,Mbez,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1041,Ngano,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1313,Kichangani,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001684,555555555555,4975,Nyugusu Campp2,Test,test,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001684,555555555555,1041,Ngano,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001684,555555555555,1420,Airport_Macro,Salaam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001684,555555555555,3147,Technical_Nzoti,test,ORTH,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001684,555555555555,4488,Lumala,Mwnza,,Y,Nyeka,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001684,555555555555,4975,Nyarugusu Campp2,Kigoma,,Y,Nyeka,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google
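I am using the script below to count the rows that match certain conditions. The problem is that the script is very slow: I get only about 200 output lines per day. Currently my program reads the 15-million-record file 36,000 times, which is very inefficient (slow!!). How can I modify my script so that it reads the very large file only once?
Desired output: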
1037,0,0,1,1,1,1,1,1,1,1,1,1
1041,0,0,2,2,2,2,2,2,2,2,2,2
1108,0,0,1,1,1,1,1,1,1,1,1,1
1184,0,0,1,1,1,1,1,1,1,1,1,1
1313,0,0,1,1,1,1,1,1,1,1,1,1
1381,0,0,1,1,1,1,1,1,1,1,1,1
1420,0,0,1,1,1,1,1,1,1,1,1,1
1604,0,0,1,1,1,1,1,1,1,1,1,1
1756,0,0,1,1,1,1,1,1,1,1,1,1
1820,0,0,1,1,1,1,1,1,1,1,1,1
3147,0,0,1,1,1,1,1,0,0,0,0,1
3211,0,0,1,1,1,1,1,0,0,0,0,1
4488,0,0,1,1,1,1,1,1,1,1,1,1
4975,0,0,2,2,2,2,2,1,1,0,0,1
The file IDs_file contains about 3000 records, each a 4-digit number.
while read i
do
twog=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if ((($10 == "Yes")||($10 == "No")) && ($3 == src) && ($9 == "No")&& ($11 == "No")) print $0;}'|wc -l)
threeg=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($3 == src) &&($9 == "Yes")&& ($11 == "No")) print $0;}'|wc -l)
fourg=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($11 == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte2100=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($13 == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte800=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($14 == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte700=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($15 == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte1800=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($16 == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte2600=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($17 == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte900=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($18 == "Yes") && ($3 == src)) print $0;}'|wc -l)
threeg2100=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($24 == "Yes") && ($3 == src)) print $0;}'|wc -l)
threeg900=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($25 == "Yes") && ($3 == src)) print $0;}'|wc -l)
volte=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($23 == "Yes") && ($3 == src)) print $0;}'|wc -l)
echo $i,$twog,$threeg,$fourg,$lte2100,$lte800,$lte700,$lte1800,$lte2600,$lte900,$threeg2100,$threeg900,$volte>>Raw_data_for_report.csv
done < IDs_file
[Comments]:
-
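"I have a file containing about 15 million records": do not use the shell for this task. Instead, explain what the goal is. What is the desired result/output? -
Is the 15-million-record file called IDSs_file or combined_marketing_thing? What is in the other file? -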
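Approach 1: import the data into SQLite or another database, add appropriate indexes, then query it. Approach 2: rewrite the script to run awk only once inside the loop instead of 12 times (and get rid of the useless uses of cat and wc, and move the output redirection outside the loop). In fact, you could probably process both files in a single awk invocation, with no shell loop needed.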
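Approach 2 could look roughly like the sketch below: one awk call per ID that keeps all twelve counters, no cat or wc, and a redirection applied once to the whole loop. The function names count_one_id and count_all_ids are made up for illustration; the field tests are copied from the script in the question. This still scans the large file once per ID (about 3000 passes), so treat it as an intermediate step rather than the final answer.

```shell
# Hypothetical helper (not from the question): print all twelve counts
# for one ID using a single awk pass over the report.
# usage: count_one_id ID report.csv
count_one_id() {
    awk -F, -v OFS=, -v src="$1" '
        $3 != src { next }          # ignore rows that belong to other IDs
        ($10 == "Yes" || $10 == "No") && $9 == "No" && $11 == "No" { twog++ }
        $9 == "Yes" && $11 == "No" { threeg++ }
        $11 == "Yes" { fourg++ }
        $13 == "Yes" { lte2100++ }
        $14 == "Yes" { lte800++ }
        $15 == "Yes" { lte700++ }
        $16 == "Yes" { lte1800++ }
        $17 == "Yes" { lte2600++ }
        $18 == "Yes" { lte900++ }
        $24 == "Yes" { threeg2100++ }
        $25 == "Yes" { threeg900++ }
        $23 == "Yes" { volte++ }
        # "+ 0" prints a never-incremented counter as 0 instead of "".
        END {
            print src, twog + 0, threeg + 0, fourg + 0, lte2100 + 0, lte800 + 0,
                  lte700 + 0, lte1800 + 0, lte2600 + 0, lte900 + 0,
                  threeg2100 + 0, threeg900 + 0, volte + 0
        }
    ' "$2"
}

# Hypothetical driver: one output line per ID, in the order of the ID file.
# usage: count_all_ids IDs_file report.csv > Raw_data_for_report.csv
count_all_ids() {
    while IFS= read -r id; do
        count_one_id "$id" "$2"
    done < "$1"
}
```

A call such as `count_all_ids IDs_file combined_marketing_sadm_report.csv > Raw_data_for_report.csv` would then open the output file once instead of appending to it on every iteration.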
-
combined_marketing_sadm_report.csv has 15 million records, and IDs_file contains about 3000 records, each a 4-digit number.
-
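An awk script: the first file (IDs_file) is loaded into an array; for each record in the second file (combined_marketing_sadm_report.csv), test $3 in array and, if it matches, increment a set of counter arrays according to the various field checks (e.g., if (...) twog[ID]++; if (...) four[ID]++; ...). This requires a single pass over each file; an END {...} block contains a for loop that prints the arrays to standard output.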
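The single-pass idea from the comment above might be sketched as follows. The wrapper name count_by_id and the array names want and order are assumptions made for this illustration; the field tests are taken from the script in the question. The sketch assumes plain comma-separated input with no quoted fields or embedded commas, and it prints each ID once even if the ID list contains duplicates.

```shell
# Hypothetical wrapper (not from the question): read the ID list once,
# read the report once, and print one line of counts per ID.
# usage: count_by_id IDs_file combined_marketing_sadm_report.csv
count_by_id() {
    awk -F, -v OFS=, '
        # First file operand (the ID list): remember each ID and its order.
        FILENAME == ARGV[1] {
            if (!($1 in want)) { want[$1]; order[++n] = $1 }
            next
        }

        # Second file (the report): skip rows whose 3rd field is not wanted.
        !($3 in want) { next }

        {
            id = $3
            if (($10 == "Yes" || $10 == "No") && $9 == "No" && $11 == "No") twog[id]++
            if ($9 == "Yes" && $11 == "No") threeg[id]++
            if ($11 == "Yes") fourg[id]++
            if ($13 == "Yes") lte2100[id]++
            if ($14 == "Yes") lte800[id]++
            if ($15 == "Yes") lte700[id]++
            if ($16 == "Yes") lte1800[id]++
            if ($17 == "Yes") lte2600[id]++
            if ($18 == "Yes") lte900[id]++
            if ($24 == "Yes") threeg2100[id]++
            if ($25 == "Yes") threeg900[id]++
            if ($23 == "Yes") volte[id]++
        }

        # Print in ID-file order; "+ 0" shows missing counters as 0.
        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                print id, twog[id] + 0, threeg[id] + 0, fourg[id] + 0,
                      lte2100[id] + 0, lte800[id] + 0, lte700[id] + 0,
                      lte1800[id] + 0, lte2600[id] + 0, lte900[id] + 0,
                      threeg2100[id] + 0, threeg900[id] + 0, volte[id] + 0
            }
        }
    ' "$1" "$2"
}
```

Running `count_by_id IDs_file combined_marketing_sadm_report.csv > Raw_data_for_report.csv` reads each file exactly once. Memory use grows with the number of IDs (about 3000 here), not with the 15 million report rows.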
Tags: arrays linux bash perl awk