【Question title】: How to use regex in Hive to parse Apache log time stamp?
【Posted】: 2016-04-27 19:19:39
【Question description】:
My log file records look like this:
107.344.154.200 - - [23/Aug/2005:00:03:14 -0400] "GET /images/theimage.gif HTTP/1.0" 200 11401
I have this syntax to parse the log file:
CREATE TABLE logfile (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s" )
STORED AS TEXTFILE;
What regex syntax can I use to split the time [23/Aug/2005:00:03:14 -0400] into day, month, year, hour, minute, and second?
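【Comments】:
- Just a friendly tip: you may want to read the How-To-Ask Guide so that your questions are always easy to answer and as clear as possible. Be sure to include any effort you have made to solve the problem and what happened when you tried those fixes. Also, don't forget to show your code and any error messages!
Tags: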
regex
logging
hive
timestamp
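【Solution 1】:
Description
This regex will do the following:
- Parse the log entry and find the date and time
- Capture the individual date parts: day, month, year, hour, minute, second, and UTC offset
Regex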
\[(\d{2})/([a-zA-Z]{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s(-\d{4})]
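Note that, depending on the language, you may need to escape the forward slashes by replacing / with \/. Whether this is required varies from language to language.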
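As a quick check of the pattern outside Hive, here is a minimal Python sketch. It assumes Python's `re` module, where `/` needs no escaping; the names `TS_PATTERN`, `parse_parts`, and `sample` are illustrative and not part of the original answer. The pattern uses only constructs that Java's regex engine (the one Hive relies on) also supports, but the quoting rules inside a Hive string literal may differ.

```python
import re

# Same pattern as in the answer; '/' does not need escaping in Python.
TS_PATTERN = re.compile(
    r"\[(\d{2})/([a-zA-Z]{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s(-\d{4})]"
)


def parse_parts(line):
    """Return (day, month, year, hour, minute, second, offset) or None."""
    m = TS_PATTERN.search(line)
    return m.groups() if m else None


sample = (
    '107.344.154.200 - - [23/Aug/2005:00:03:14 -0400] '
    '"GET /images/theimage.gif HTTP/1.0" 200 11401'
)
print(parse_parts(sample))
```

One thing to watch: the last group, `(-\d{4})`, only matches negative offsets, so a line logged with `+0100` would not match unless that group is widened to something like `([+-]\d{4})`.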
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [a-zA-Z]{3}              any character of: 'a' to 'z', 'A' to 'Z'
                             (3 times)
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    \d{4}                    digits (0-9) (4 times)
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \5:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \5
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \6:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \6
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (                        group and capture to \7:
----------------------------------------------------------------------
    -                        '-'
----------------------------------------------------------------------
    \d{4}                    digits (0-9) (4 times)
----------------------------------------------------------------------
  )                        end of \7
----------------------------------------------------------------------
  ]                        ']'
----------------------------------------------------------------------
Sample text
107.344.154.200 - - [23/Aug/2005:00:03:14 -0400] "GET /images/theimage.gif HTTP/1.0" 200 11401
Live demo
https://regex101.com/r/hF4fP8/1
Sample matches
[0][0] = [23/Aug/2005:00:03:14 -0400]
[0][1] = 23
[0][2] = Aug
[0][3] = 2005
[0][4] = 00
[0][5] = 03
[0][6] = 14
[0][7] = -0400
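Once the timestamp has been isolated, the pieces can be combined into a single time value. The following is a minimal sketch rather than part of the original answer: instead of joining the seven captured groups by hand, it passes the full match (`[0][0]` above) to Python's standard `datetime.strptime`. It assumes an English locale for the `%b` month abbreviation, and `to_datetime` is a hypothetical helper name.

```python
from datetime import datetime


def to_datetime(bracketed):
    """Parse '[23/Aug/2005:00:03:14 -0400]' into a timezone-aware datetime."""
    # The square brackets are matched literally by the format string.
    return datetime.strptime(bracketed, "[%d/%b/%Y:%H:%M:%S %z]")


ts = to_datetime("[23/Aug/2005:00:03:14 -0400]")
print(ts.isoformat())
```

The resulting object exposes `ts.day`, `ts.month`, `ts.year`, `ts.hour`, `ts.minute`, and `ts.second` as integers, and `ts.utcoffset()` carries the `-0400` offset.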