[Question title]: Pig script to split a large txt file into parts based on a specified word
[Posted]: 2015-10-26 21:24:29
[Question description]:

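I am trying to build a Pig script that takes a textbook file, divides it into chapters, compares the words in each chapter, returns only the words that appear in every chapter, and counts them. The chapters are conveniently delimited by CHAPTER - X.

Here is what I have so far: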

lines = LOAD '../../Alice.txt' AS (line:chararray);
lineswithoutspecchars = FOREACH lines GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line; 
words = FOREACH lineswithoutspecchars GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

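Sorry, this question is probably too simple compared with what I usually ask on Stack Overflow. I searched around, but perhaps I was not using the right keywords. I am new to Pig and am trying to learn it for a new work assignment.

Thanks in advance!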

[Comments]:

    Tags: hadoop mapreduce apache-pig


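    [Solution 1]:

    This is a bit lengthy, but it gets the result. Depending on your file, you can drop the relations you do not need. The script is commented throughout.

    Input file: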

    Pig does not know whether integer values in baseball are stored as ASCII strings, Java
    serialized values, binary-coded decimal, or some other format. So it asks the load func-
    tion, because it is that function’s responsibility to cast bytearrays to other types. In
    general this works nicely, but it does lead to a few corner cases where Pig does not know
    how to cast a bytearray. In particular, if a UDF returns a bytearray, Pig will not know
    how to perform casts on it because that bytearray is not generated by a load function.
    CHAPTER - X
    In a strongly typed computer language (e.g., Java), the user must declare up front the
    type for all variables. In weakly typed languages (e.g., Perl), variables can take on values
    of different type and adapt as the occasion demands.
    CHAPTER - X
    In this example, remember we are pretending that the values for base_on_balls and
    ibbs turn out to be represented as integers internally (that is, the load function con-
    structed them as integers). If Pig were weakly typed, the output of unintended would
    be records with one field typed as an integer. As it is, Pig will output records with one
    field typed as a double. Pig will make a guess and then do its best to massage the data
    into the types it guessed.
    

    Pig script:

    A = LOAD 'file' as (line:chararray);
    B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
    -- We need to split on CHAPTER - X, but LOAD gives us one tuple per line. So group
    -- everything and convert that bag to a string, which gives a single tuple with _ as the delimiter.
    C = GROUP B ALL;
    D = FOREACH C GENERATE BagToString(B) as (line:chararray);
    -- The text no longer contains any commas, so convert the chapter delimiter (now "CHAPTER  X",
    -- because the hyphen was stripped above) to a comma. TOKENIZE can then split on the comma,
    -- giving one tuple per chapter, which is what RANK needs.
    E = FOREACH D GENERATE REPLACE(line,'_CHAPTER  X_',',') AS (line:chararray);
    F = FOREACH E GENERATE REPLACE(line,'_',' ') AS (line:chararray); -- remove the delimiter created by BagToString
    -- Create one tuple per chapter.
    G = FOREACH F GENERATE FLATTEN(TOKENIZE(line,',')) AS (line:chararray);
    -- Rank each chapter so that every word can be tagged with its chapter number when counting.
    H = RANK G;
    J = FOREACH H GENERATE rank_G,FLATTEN(TOKENIZE(line)) as (line:chararray);
    J1 = GROUP J BY (rank_G, line);
    J2 = FOREACH J1 GENERATE COUNT(J) AS (cnt:long),FLATTEN(group.line) as (word:chararray),FLATTEN(group.rank_G) as (rnk:long);
    -- J2 now has at most one row per word within each chapter.
    -- Grouping by word and keeping only the words found in more than 2 chapters (that is, all 3
    -- chapters of the sample input) guarantees that the word is present in every chapter.
    J3 = GROUP J2 BY word;
    J4 = FOREACH J3 GENERATE SUM(J2.cnt) AS (sumval:long),COUNT(J2) as (cnt:long),FLATTEN(group) as (word:chararray);
    J5 = FILTER J4 BY cnt > 2;
    J6 = FOREACH J5 GENERATE word,sumval;
    dump J6;
    -- Result columns: word, total count across chapters
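
The `cnt > 2` filter works only because the sample input has exactly three chapters. As a hedged sketch, the threshold could instead be derived from the data by counting the tuples in `G` and using that single value as a scalar in the filter. The names `G_all`, `chapter_total` and `n` are introduced here for illustration and are not part of the original answer. `G` and `J4` are assumed to be defined exactly as in the script.

```pig
-- Assumes G (one tuple per chapter) and J4 (sumval, cnt, word) exist as in the script.
-- G_all, chapter_total and n are illustrative names, not part of the original answer.
G_all = GROUP G ALL;
chapter_total = FOREACH G_all GENERATE COUNT(G) AS n;
-- chapter_total holds exactly one row, so chapter_total.n can be used as a scalar.
-- This statement would take the place of the original J5 line.
J5 = FILTER J4 BY cnt == chapter_total.n;
```

With this variant the same script should keep working if the input has a different number of chapters, since nothing about the chapter count is hard-coded.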
    

    Output:

    (a,8)
    (In,5)
    (as,6)
    (the,9)
    (values,4)
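
Grouping in Pig is case-sensitive, so `In` and `in` are counted as separate words, which is why `In` appears capitalized in the output. If case-insensitive matching is wanted (an assumption about the desired behavior, not something the original answer does), one possible adjustment is to lower-case each chapter's text just before tokenizing it, using the built-in `LOWER` function in place of the original `J` statement:

```pig
-- Variation on the original J statement: lower-case the chapter text before
-- splitting it into words, so that "In" and "in" fall into the same group.
J = FOREACH H GENERATE rank_G, FLATTEN(TOKENIZE(LOWER(line))) AS (line:chararray);
```

Lower-casing at this step, rather than right after `LOAD`, leaves the `CHAPTER  X` marker untouched for the `REPLACE` that splits the text into chapters.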
    

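    [Discussion]:

    • Thank you very much for this very neat answer. It taught me a lot of new techniques in Pig, and I was able to extend the solution further, do some extra things with it, and learn more. Thanks again.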