【问题标题】:Apache Pig - Not able to read the bagApache Pig - 无法读取包
【发布时间】:2016-02-25 05:27:50
【问题描述】:

我正在尝试使用 PIG 读取逗号分隔的数据,如下所示:

grunt> cat script/pig/emp_tuple1.txt
1,kirti,250000,{(100),(200)}
2,kk,240000,{(100),(300)}
3,kumar,200000,{(200),(400)}
4,shinde,290000,{(200),(500),(300),(100)}
5,shinde k y,260000,{(100),(300),(200)}
6,amol,255000,{(300)}
grunt> emp_t1 = load 'script/pig/emp_tuple1.txt' using PigStorage(',') as (empno:int, ename:chararray, salary:int, dlist:bag{});
grunt> dump emp_t1;
2015-11-23 12:26:44,450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,kirti,250000,)   
(2,kk,240000,)
(3,kumar,200000,)
(4,shinde,290000,)
(5,shinde k y,260000,)
(6,amol,255000,{(300)})

中间显示警告为:

2015-11-23 12:26:44,173 [LocalJobRunner Map Task Executor #0] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.Utf8StorageConverter(FIELD_DISCARDED_TYPE_CONVERSION_FAILED): Unable to interpret value [123, 40, 49, 48, 48, 41] in field being converted to type bag, caught ParseException <Unexpect end of bag> field discarded

遇到包里的逗号(,)好像是在显示警告。

现在我所做的是:将逗号更改为制表符(或任何其他分隔符)并且它起作用了:

grunt> cat script/pig/emp_tuple2.txt;
1|kirti|250000|{(100),(200)}
2|kk|240000|{(100),(300)}
3|kumar|200000|{(200),(400)}
4|shinde|290000|{(200),(500),(300),(100)}
5|shinde k y|260000|{(100),(300),(200)}
6|amol|255000|{(300)}
grunt> emp_t2 = load 'script/pig/emp_tuple2.txt' using PigStorage('|') as (empno:int, ename:chararray, salary:int, dlist:bag{});
grunt> dump emp_t1;
2015-11-23 12:31:33,408 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,kirti,250000,{(100),(200)})
(2,kk,240000,{(100),(300)})
(3,kumar,200000,{(200),(400)})
(4,shinde,290000,{(200),(500),(300),(100)})
(5,shinde k y,260000,{(100),(300),(200)})
(6,amol,255000,{(300)})

所以我只是想知道您是否有逗号分隔的数据,袋子用逗号分隔,它不会工作吗?

【问题讨论】:

    标签: hadoop apache-pig bag


    【解决方案1】:
    Lets go into details, 
     1. Data is being read as TextInputFormat 
     2. Line Record Reader is being used to read lines
     3. , is being used to separate columns. 
    
    as "," occurs in the bag and is the delimeter across columns, bag is being split into multiple columns. 
    
    There are various way to overcome this. 
    
     1. pre-process the input and replace first three "," in each row by some other delimeter. 
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 2015-07-25
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2014-09-07
      • 1970-01-01
      相关资源
      最近更新 更多