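【Posted】: 2018-09-18 12:55:04
【Question】:
I have been trying to parse some server logs with Python and regular expressions. I want to pull the user agent string out of each line and eventually load the results into a Pandas DataFrame or a simple Excel spreadsheet.
So the following excerpt: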
14/Aug/2018:00:44:50 +0000] 330 95.144.101.0, 34.255.205.1 GET pixelg.adswizz.com /one.png 200 - AlexaMediaPlayer/2.0.201528.0 (Linux;Android 5.1.1) ExoPlayerLib/1.5.9 client=VillaPlus&oid=2069&cid=22599&ad=54063&cr=August2018&target=25plus&action=ae&eventId=&cb=8874209&listenerId=f78d5ea146e92c4666efd2a389a8d2e8f6174bfc6777496e5e22735c426c&zone=679 - pixelg.adswizz.com https 533 TLSv1.2 DHE-RSA-AES128-GCM-SHA256,
15/Aug/2018:23:03:17 +0000] 330 79.77.250.195, 34.245.112.20 GET pixelg.adswizz.com /one.png 200 - Smooth/38 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) devicemap=mobile_tablet - pixelg.adswizz.com http 357 - - 0.000,
15/Aug/2018:23:17:01 +0000] 330 77.100.181.37 GET pixelg.adswizz.com /one.png 200 https://www.bonne-terre-data-layer.com/tag-manager.html?consumer=m.skybet.com Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 SkyBet/6.8b474 (Sky Bet Mobile App) client=SkyBet&event_id=Summer17&action=clientsitevisit&event=/my-bets - pixelg.adswizz.com https 605 TLSv1.2 DHE-RSA-AES128-GCM-SHA256 0.000,
14/Aug/2018:01:00:55 +0000] 330 86.178.205.6, 34.244.204.228 GET pixelg.adswizz.com /one.png 200 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 client=MyHermes&oid=&cid=22731&ad=54477&cr=Hermes&target=selfemploy&action=ae&eventId=&cb=7699694&listenerId=0610d2ed750ab9f692daff922e1b2c04&zone=87 - pixelg.adswizz.com https 546 TLSv1.2 DHE-RSA-AES128-GCM-SHA256
would become a list:
AlexaMediaPlayer/2.0.201528.0 (Linux;Android 5.1.1) ExoPlayerLib/1.5.9,
Smooth/38 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X),
Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 SkyBet/6.8b474 (Sky Bet Mobile App),
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
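Specifically, I am stuck on how to write a regular expression that captures the user agent string across these different line formats. I would like the code to look something like this: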
import re
listofLines = ["[14/Aug/2018:00:44:50 +0000] 330 95.144.101.0, 34.255.205.1 GET pixelg.adswizz.com /one.png 200 - AlexaMediaPlayer/2.0.201528.0 (Linux;Android 5.1.1) ExoPlayerLib/1.5.9 client=VillaPlus&oid=2069&cid=22599&ad=54063&cr=August2018&target=25plus&action=ae&eventId=&cb=8874209&listenerId=f78d5ea146e92c4666efd2a389a8d2e8f6174bfc6777496e5e22735c426c&zone=679 - pixelg.adswizz.com https 533 TLSv1.2 DHE-RSA-AES128-GCM-SHA256",
"[15/Aug/2018:23:03:17 +0000] 330 79.77.250.195, 34.245.112.20 GET pixelg.adswizz.com /one.png 200 - Smooth/38 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) devicemap=mobile_tablet - pixelg.adswizz.com http 357 - - 0.000",
"[15/Aug/2018:23:17:01 +0000] 330 77.100.181.37 GET pixelg.adswizz.com /one.png 200 https://www.bonne-terre-data-layer.com/tag-manager.html?consumer=m.skybet.com Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 SkyBet/6.8b474 (Sky Bet Mobile App) client=SkyBet&event_id=Summer17&action=clientsitevisit&event=/my-bets - pixelg.adswizz.com https 605 TLSv1.2 DHE-RSA-AES128-GCM-SHA256 0.000",
"[14/Aug/2018:01:00:55 +0000] 330 86.178.205.6, 34.244.204.228 GET pixelg.adswizz.com /one.png 200 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 client=MyHermes&oid=&cid=22731&ad=54477&cr=Hermes&target=selfemploy&action=ae&eventId=&cb=7699694&listenerId=0610d2ed750ab9f692daff922e1b2c04&zone=87 - pixelg.adswizz.com https 546 TLSv1.2 DHE-RSA-AES128-GCM-SHA256"]
regexuseragent = r"[200 |200 - ]"
for line in listofLines:
    if re.findall(regexuseragent,line):
        print(regexuseragent)
    else: print("no useragent")
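Note that `[200 |200 - ]` is a character class, so it matches any single one of the characters `2`, `0`, space, `|` or `-` rather than the text "200 - ". A minimal sketch of one possible pattern follows. It assumes every line has the layout seen in the four samples: HTTP method, host, path, three-digit status, a one-token referer (`-` or a URL), then the user agent, then a query-string token containing `=`, then a lone `-`. The names `LOG_PATTERN` and `extract_user_agent` are made up for illustration, and a user agent that itself contains a `key=value` token followed by ` - ` would be cut short.

```python
import re

# Assumed layout (inferred from the sample lines only):
#   METHOD host path status referer <user agent> <query with '='> - ...
LOG_PATTERN = re.compile(
    r"\b(?:GET|POST|HEAD|PUT|DELETE|OPTIONS|PATCH)\s+"  # HTTP method
    r"\S+\s+"            # host
    r"\S+\s+"            # path
    r"\d{3}\s+"          # status code
    r"\S+\s+"            # referer: "-" or a URL without spaces
    r"(?P<ua>.+?)"       # user agent, matched lazily
    r"\s+\S+=\S*\s+-\s"  # query-string token, then a lone "-"
)


def extract_user_agent(line):
    """Return the user agent found in one log line, or None if no match."""
    match = LOG_PATTERN.search(line)
    return match.group("ua") if match else None


# Two of the sample lines from the question: one with "-" as the referer,
# one with a URL as the referer.
SAMPLE_LINES = [
    "[15/Aug/2018:23:03:17 +0000] 330 79.77.250.195, 34.245.112.20 GET pixelg.adswizz.com /one.png 200 - Smooth/38 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) devicemap=mobile_tablet - pixelg.adswizz.com http 357 - - 0.000",
    "[15/Aug/2018:23:17:01 +0000] 330 77.100.181.37 GET pixelg.adswizz.com /one.png 200 https://www.bonne-terre-data-layer.com/tag-manager.html?consumer=m.skybet.com Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 SkyBet/6.8b474 (Sky Bet Mobile App) client=SkyBet&event_id=Summer17&action=clientsitevisit&event=/my-bets - pixelg.adswizz.com https 605 TLSv1.2 DHE-RSA-AES128-GCM-SHA256 0.000",
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(extract_user_agent(line) or "no useragent")
```

The resulting list of strings could then be handed to `pandas.DataFrame({"user_agent": agents})` if a DataFrame is the goal. Anchoring on the method and status code, instead of searching for `200` alone, keeps the leading `330` field from being mistaken for a status.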
【Comments】:
-
It looks like you could open the log file in a spreadsheet and select "TAB" as the delimiter.
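If the raw file really is tab-separated (the tabs are not visible in the pasted excerpt, so this is only an assumption), the same idea can be sketched in Python with the standard `csv` module. The column index `USER_AGENT_COLUMN = 8`, the function name, and the tab-joined sample row are hypothetical and would need checking against the actual file.

```python
import csv
import io

# Hypothetical position of the user agent field, counting from 0.
# Verify this against the real log before relying on it.
USER_AGENT_COLUMN = 8


def user_agents_from_tsv(handle, column=USER_AGENT_COLUMN):
    """Read tab-separated log rows and return the chosen column of each row."""
    # QUOTE_NONE keeps any quote characters inside a field as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row[column] for row in reader if len(row) > column]


# One sample line from the question, rebuilt with tabs between the fields
# (an assumed field split, for illustration only).
SAMPLE_TSV = "\t".join([
    "[15/Aug/2018:23:03:17 +0000]", "330", "79.77.250.195, 34.245.112.20",
    "GET", "pixelg.adswizz.com", "/one.png", "200", "-",
    "Smooth/38 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X)",
    "devicemap=mobile_tablet", "-", "pixelg.adswizz.com", "http", "357",
    "-", "-", "0.000",
]) + "\n"

if __name__ == "__main__":
    print(user_agents_from_tsv(io.StringIO(SAMPLE_TSV)))
```

With a real file, `open(path, newline="")` would take the place of `io.StringIO(SAMPLE_TSV)`.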