【Question title】: Parsing server logs with Python
【Posted】: 2018-09-18 12:55:04
【Question description】:

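I have been trying to parse some server logs using Python and regular expressions. I want to extract the user-agent string from each line and eventually put the results into a Pandas DataFrame or a simple Excel spreadsheet.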

So the excerpt below:

14/Aug/2018:00:44:50 +0000] 330 95.144.101.0, 34.255.205.1  GET pixelg.adswizz.com  /one.png    200 -   AlexaMediaPlayer/2.0.201528.0 (Linux;Android 5.1.1) ExoPlayerLib/1.5.9  client=VillaPlus&oid=2069&cid=22599&ad=54063&cr=August2018&target=25plus&action=ae&eventId=&cb=8874209&listenerId=f78d5ea146e92c4666efd2a389a8d2e8f6174bfc6777496e5e22735c426c&zone=679 -   pixelg.adswizz.com  https   533 TLSv1.2 DHE-RSA-AES128-GCM-SHA256,
15/Aug/2018:23:03:17 +0000] 330 79.77.250.195, 34.245.112.20    GET pixelg.adswizz.com  /one.png    200 -   Smooth/38 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X)  devicemap=mobile_tablet -   pixelg.adswizz.com  http    357 -   -   0.000,
15/Aug/2018:23:17:01 +0000] 330 77.100.181.37   GET pixelg.adswizz.com  /one.png    200 https://www.bonne-terre-data-layer.com/tag-manager.html?consumer=m.skybet.com   Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 SkyBet/6.8b474 (Sky Bet Mobile App)  client=SkyBet&event_id=Summer17&action=clientsitevisit&event=/my-bets   -   pixelg.adswizz.com  https   605 TLSv1.2 DHE-RSA-AES128-GCM-SHA256   0.000,
14/Aug/2018:01:00:55 +0000] 330 86.178.205.6, 34.244.204.228    GET pixelg.adswizz.com  /one.png    200 -   Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 client=MyHermes&oid=&cid=22731&ad=54477&cr=Hermes&target=selfemploy&action=ae&eventId=&cb=7699694&listenerId=0610d2ed750ab9f692daff922e1b2c04&zone=87   -   pixelg.adswizz.com  https   546 TLSv1.2 DHE-RSA-AES128-GCM-SHA256

would become a list:

AlexaMediaPlayer/2.0.201528.0 (Linux;Android 5.1.1) ExoPlayerLib/1.5.9,
Smooth/38 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X),
Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 SkyBet/6.8b474 (Sky Bet Mobile App),
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 

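Specifically, I am stuck on how to write a regular expression that captures the user-agent string across these different line formats. I expect the code to look something like this: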

import re
listofLines = ["[14/Aug/2018:00:44:50 +0000]    330 95.144.101.0, 34.255.205.1  GET pixelg.adswizz.com  /one.png    200 -   AlexaMediaPlayer/2.0.201528.0 (Linux;Android 5.1.1) ExoPlayerLib/1.5.9  client=VillaPlus&oid=2069&cid=22599&ad=54063&cr=August2018&target=25plus&action=ae&eventId=&cb=8874209&listenerId=f78d5ea146e92c4666efd2a389a8d2e8f6174bfc6777496e5e22735c426c&zone=679 -   pixelg.adswizz.com  https   533 TLSv1.2 DHE-RSA-AES128-GCM-SHA256",
               "[15/Aug/2018:23:03:17 +0000]    330 79.77.250.195, 34.245.112.20    GET pixelg.adswizz.com  /one.png    200 -   Smooth/38 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X)  devicemap=mobile_tablet -   pixelg.adswizz.com  http    357 -   -   0.000",
               "[15/Aug/2018:23:17:01 +0000]    330 77.100.181.37   GET pixelg.adswizz.com  /one.png    200 https://www.bonne-terre-data-layer.com/tag-manager.html?consumer=m.skybet.com   Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 SkyBet/6.8b474 (Sky Bet Mobile App)  client=SkyBet&event_id=Summer17&action=clientsitevisit&event=/my-bets   -   pixelg.adswizz.com  https   605 TLSv1.2 DHE-RSA-AES128-GCM-SHA256   0.000",
               "[14/Aug/2018:01:00:55 +0000]    330 86.178.205.6, 34.244.204.228    GET pixelg.adswizz.com  /one.png    200 -   Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 client=MyHermes&oid=&cid=22731&ad=54477&cr=Hermes&target=selfemploy&action=ae&eventId=&cb=7699694&listenerId=0610d2ed750ab9f692daff922e1b2c04&zone=87   -   pixelg.adswizz.com  https   546 TLSv1.2 DHE-RSA-AES128-GCM-SHA256"]

regexuseragent = r"[200 |200    -   ]"


for line in listofLines:
    if re.findall(regexuseragent, line):
        print(regexuseragent)
    else:
        print("no useragent")

【Comments】:

  • It looks like you could open the log file in a spreadsheet and choose "TAB" as the delimiter
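
The same idea can be tried without a spreadsheet, using Python's standard `csv` module with a tab delimiter. This is only a minimal sketch: it assumes the real log file is tab-separated, and `log_file` below is a hypothetical in-memory stand-in holding one shortened sample line rather than the actual log.

```python
import csv
import io

# Hypothetical stand-in for an opened log file; the single line is a
# shortened copy of one excerpt line, with explicit "\t" separators.
log_file = io.StringIO(
    "[14/Aug/2018:01:00:55 +0000]\t330\t86.178.205.6, 34.244.204.228\tGET\t"
    "pixelg.adswizz.com\t/one.png\t200\t-\t"
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/68.0.3440.106 Safari/537.36\tclient=MyHermes\t-\tpixelg.adswizz.com\thttps\t546\n"
)

# QUOTE_NONE keeps any quote characters inside a field as literal text.
rows = list(csv.reader(log_file, delimiter="\t", quoting=csv.QUOTE_NONE))

# Re-write the rows as comma-separated text, which Excel can open directly.
# Fields containing commas (the IP list, the user agent) get quoted.
out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```

With a real file, `open(path, newline="")` would take the place of the `io.StringIO` objects.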

Tags: python database pandas


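【Solution 1】:

Not every string-processing problem is a regular-expression problem.

Your input lines appear to be tab-delimited. Split on tabs and take whichever index you want, e.g.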

agent_string = line.split("\t")[8]
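
Applied to a whole list of lines, that one-liner might look like the sketch below. It assumes the fields really are separated by tab characters and that the user agent sits at index 8, as in the excerpt. The names `sample_lines`, `USER_AGENT_INDEX`, and `extract_user_agent` are illustrative, and the sample data is a hand-written stand-in with explicit `\t` separators.

```python
# Hypothetical sample data: one line copied from the excerpt with explicit
# "\t" separators, plus a malformed line to exercise the length check.
sample_lines = [
    "[15/Aug/2018:23:03:17 +0000]\t330\t79.77.250.195, 34.245.112.20\tGET\t"
    "pixelg.adswizz.com\t/one.png\t200\t-\t"
    "Smooth/38 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X)\t"
    "devicemap=mobile_tablet\t-\tpixelg.adswizz.com\thttp\t357\t-\t-\t0.000",
    "a line without tabs",
]

USER_AGENT_INDEX = 8  # assumed position of the user-agent field


def extract_user_agent(line, index=USER_AGENT_INDEX):
    """Return the field at `index`, or None if the line has too few fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= index:
        return None
    return fields[index].strip()


user_agents = [extract_user_agent(line) for line in sample_lines]
print(user_agents)
```

Returning `None` for short lines avoids an `IndexError` on malformed input. The resulting list can then be passed to Pandas or written to a spreadsheet.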

【Discussion】:
