【Question title】: Log analysis with Apache Pig
【Posted】: 2013-12-19 07:58:32
【Question】:

I have a log with lines like this:

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

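The first column (in24.inetnebr.com) is the host, the second column (01/Aug/1995:00:00:01 -0400) is the timestamp, and the third column (GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0) is the downloaded page.

How can I use Pig to find the last two pages downloaded by each host?

Thank you very much for your help!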
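
For readers starting from the raw text file rather than a table, the column layout described above could be extracted roughly as follows. This is only a sketch: the path 'access_log' and all relation names are hypothetical, and it assumes the built-in REGEX_EXTRACT_ALL and ToDate functions of a reasonably recent Pig release (the datetime type needs Pig 0.11 or later) and an English default locale so that month names such as "Aug" parse.

```pig
-- Sketch only: 'access_log' is a hypothetical path to the raw log file.
lines = LOAD 'access_log' USING TextLoader() AS (line:chararray);

-- Capture host, timestamp and request; the two "-" fields are skipped.
extracted = FOREACH lines GENERATE
    REGEX_EXTRACT_ALL(line,
        '^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] "([^"]*)" .*$') AS t;

-- REGEX_EXTRACT_ALL returns null for lines that do not match; drop them.
matched = FILTER extracted BY t IS NOT NULL;

fields = FOREACH matched GENERATE
    FLATTEN(t) AS (host:chararray, ts:chararray, request:chararray);

-- 'Z' in the pattern is meant to consume the numeric zone offset (-0400).
parsed = FOREACH fields GENERATE
    host, ToDate(ts, 'dd/MMM/yyyy:HH:mm:ss Z') AS ts, request;

DUMP parsed;
```

The request field still contains the HTTP method and protocol version; splitting it further is left out here to keep the sketch short.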

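【Question comments】:

  • I have made a little progress: I now have rows of the form (host, date, address), with the date cast to a date type. How do I select the last two addresses for each host? Thanks in advance.

Tags: hadoop apache-pig log-analysis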


【Solution 1】:

I have solved the problem. Here is the script, for reference:

REGISTER piggybank.jar;
DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

raw = LOAD 'nasa' USING org.apache.hcatalog.pig.HCatLoader();

-- Cast the columns to chararray so that string functions can be used on them
rawCasted = FOREACH raw GENERATE (chararray)host AS host, (chararray)xdate AS xdate, (chararray)address AS address;

-- Cut the date out of the timestamp and keep only the columns that are used
rawParsed = FOREACH rawCasted GENERATE host, SUBSTRING(xdate,1,20) AS xdate, address;

-- Drop incomplete rows, i.e. rows without a date
rawFiltered = FILTER rawParsed BY xdate IS NOT NULL;

-- Convert the date string to a datetime value
analysisTable = FOREACH rawFiltered GENERATE host, ToDate(xdate, 'dd/MMM/yyyy:HH:mm:ss') AS xdate, address;

aTgrouped = GROUP analysisTable BY host;

resultsB = FOREACH aTgrouped {
    -- Sort this host's rows newest first and keep the last two pages
    elems = ORDER analysisTable BY xdate DESC;
    two = LIMIT elems 2;

    -- The most recent page
    fstB = ORDER two BY xdate DESC;
    fst = LIMIT fstB 1;

    -- The page before it
    sndB = ORDER two BY xdate ASC;
    snd = LIMIT sndB 1;

    -- Emit the host together with both pages
    GENERATE FLATTEN(group), fst.address, snd.address;
};
DUMP resultsB;
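
If one output row per page is acceptable instead of one row per host, the same nested ORDER and LIMIT idea can be written more compactly. The following is a minimal sketch, not the script above: the input file 'nasa_parsed.tsv' (tab-separated host, timestamp, address) and all relation names are hypothetical, and it assumes the timestamp strings already have the form dd/MMM/yyyy:HH:mm:ss.

```pig
-- Hypothetical input: tab-separated (host, timestamp, address) rows.
events = LOAD 'nasa_parsed.tsv'
         AS (host:chararray, ts:chararray, address:chararray);

-- Convert the timestamp string to a datetime so it sorts chronologically.
typed = FOREACH events GENERATE
    host, ToDate(ts, 'dd/MMM/yyyy:HH:mm:ss') AS xdate, address;

byHost = GROUP typed BY host;

-- For each host, sort newest first and keep at most two rows.
lastTwo = FOREACH byHost {
    sorted = ORDER typed BY xdate DESC;
    newest = LIMIT sorted 2;
    GENERATE group AS host,
             FLATTEN(newest.(xdate, address)) AS (xdate, address);
};

DUMP lastTwo;
```

A host with a single request produces one row here, whereas the script above repeats that page in both output columns.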

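【Comments】:

  • I have run four analyses on this NASA dataset (two with Pig, two with Hive). I can provide a link to the dataset and the code for the other three analyses if anyone is interested.
  • Could you provide a link to the analyses?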