It would help if you could post the code you currently have.
Method 1: Filter by null values
-- load comma-delimited values into columns
A = load './input.txt' using PigStorage(',') as (one:chararray, two:chararray, three:chararray, four:chararray);
dump A;
-- keep records where at least one column is not null (drops rows where every column is null)
B = FILTER A BY (one is not null) OR (two is not null) OR (three is not null) OR (four is not null);
dump B;
This assumes input.txt looks like the following.
a,b,c,d
g,b,v,n
n,h,l,o
,,,
,,,
,,,
,,,
Run the script with:
pig -x local clean.pig
Output of the first dump:
(a,b,c,d)
(g,b,v,n)
(n,h,l,o)
(,,,)
(,,,)
(,,,)
(,,,)
Output of the second dump:
(a,b,c,d)
(g,b,v,n)
(n,h,l,o)
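If you want the cleaned records written to a file rather than printed to the console, the final dump can be swapped for a STORE statement. This is only a minimal sketch: the output directory name ./output_clean is an assumption chosen for illustration, and Pig will normally refuse to run if that directory already exists.

```pig
-- load comma-delimited values into columns
A = LOAD './input.txt' USING PigStorage(',') AS (one:chararray, two:chararray, three:chararray, four:chararray);
-- keep records where at least one column is not null
B = FILTER A BY (one is not null) OR (two is not null) OR (three is not null) OR (four is not null);
-- write the result as comma-delimited text; './output_clean' is a hypothetical path
STORE B INTO './output_clean' USING PigStorage(',');
```

In local mode the result lands in part files inside that directory rather than in a single named file.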
Method 2: Filter by column count
-- load comma-delimited values into columns
A = load './input.txt' using PigStorage(',') as (one:chararray, two:chararray, three:chararray, four:chararray);
dump A;
-- prepend the number of non-null columns (COUNT ignores nulls)
B = FOREACH A GENERATE COUNT(TOBAG(*)),$0..;
dump B;
-- filter by column count
C = FILTER B BY $0 > 0;
dump C;
-- remove column count
D = FOREACH C GENERATE $1..;
dump D;
Output of dump A:
(a,b,c,d)
(g,b,v,n)
(n,h,l,o)
(,,,)
(,,,)
(,,,)
(,,,)
Output of dump B:
(4,a,b,c,d)
(4,g,b,v,n)
(4,n,h,l,o)
(0,,,,)
(0,,,,)
(0,,,,)
(0,,,,)
Output of dump C:
(4,a,b,c,d)
(4,g,b,v,n)
(4,n,h,l,o)
Output of dump D:
(a,b,c,d)
(g,b,v,n)
(n,h,l,o)
Note:
If your input file originally contains parentheses, you may need to handle them separately.
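One possible way to handle that, sketched under the assumption that every line is wrapped in a single pair of parentheses, such as (a,b,c,d) or (,,,): load the file as before, then strip the parentheses from the first and last columns with the built-in REPLACE function. The relation names and the assumed line layout are illustrative, not taken from your data.

```pig
-- assumption: each input line looks like (a,b,c,d), so '(' ends up in column one and ')' in column four
A = LOAD './input.txt' USING PigStorage(',') AS (one:chararray, two:chararray, three:chararray, four:chararray);
-- strip the parentheses; [(] and [)] are regex character classes matching the literal characters
B = FOREACH A GENERATE REPLACE(one, '[(]', '') AS one, two, three, REPLACE(four, '[)]', '') AS four;
-- an empty row now has empty strings (not nulls) in the outer columns, so test for both
C = FILTER B BY (one is not null AND one != '') OR (two is not null) OR (three is not null) OR (four is not null AND four != '');
DUMP C;
```

The extra comparison against '' matters because, for a line like (,,,), the outer columns hold "(" and ")" rather than null, and stripping them leaves empty strings that the plain null check would keep.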