Processing a WARC File Using Logstash, Elasticsearch, and Kibana

【Question title】: Processing a Warc File using Logstash, ElasticSearch, and Kibana
【Posted】: 2016-11-21 08:01:47
【Question】:

I want to parse a WARC file with Logstash and feed the result into Elasticsearch so that I can visualize it with Kibana. I tried this:

input {
  file {
    path => "/tmp/access_log"
    start_position => "beginning"
  }
}

filter {
  if [path] =~ "access" {
    mutate { replace => { "type" => "apache_access" } }
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
  stdout { codec => rubydebug }
}

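This works for reading Apache logs and displaying them. I would like to know how to take a WARC file instead and visualize it with Kibana.
Here is a sample of the WARC file I want to ingest.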

WARC/0.17
WARC-Type: metadata
WARC-Target-URI: http://www.archive.org/robots.txt
WARC-Date: 2008-04-30T20:48:25Z
WARC-Concurrent-To: <urn:uuid:e7c9eff8-f5bc-4aeb-b3d2-9d3df99afb30>
WARC-Record-ID: <urn:uuid:545709ad-90c5-4c08-9eed-092bdf2e33a7>
Content-Type: text/anvl
Content-Length: 66

via: http://www.archive.org/
hopsFromSeed: P
fetchTimeMs: 47



WARC/0.17
WARC-Type: response
WARC-Target-URI: http://www.archive.org/
WARC-Date: 2008-04-30T20:48:26Z
WARC-Payload-Digest: sha1:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV
WARC-IP-Address: 207.241.229.39
WARC-Record-ID: <urn:uuid:4042c21b-d898-43f0-9c95-b50da2d1aa42>
Content-Type: application/http; msgtype=response
Content-Length: 680

HTTP/1.1 200 OK
Date: Wed, 30 Apr 2008 20:48:25 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g
Last-Modified: Wed, 09 Jan 2008 23:18:29 GMT
ETag: "47ac-16e-4f9e5b40"
Accept-Ranges: bytes
Content-Length: 366
Connection: close
Content-Type: text/html; charset=UTF-8

<html>
<head>
<meta http-equiv="Refresh" content="0;URL=http://www.archive.org/index.php"/>
<script>
document.location="http://www.archive.org/index.php";
</script>
</head>
<body>
<img width="70" height="56" src="http://www.archive.org/images/logoc.jpg"/><br/>
Please visit our website at:
<a href="http://www.archive.org">http://www.archive.org</a>
</body>
</html>

The complete sample file is here: Sample WARC Text in Text File Format
I hope to hear from you soon. I would be glad to get this solved.

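【Comments】:

I don't think Logstash is the right tool for this job. Logstash is meant for log files, or for files in which every line has the same format, not for files like this one. Alternatively, if you only want to extract information from certain lines, you could point out which information you want.
@baudsp Thanks for your comments. But let me say that if I want to bulk-insert information into Elasticsearch, I think Logstash is convenient. I have seen many videos about this on the internet. My problem still has no solution, though.
"The problem I am facing." Which is? It is not clear from your question what you want to extract from the file.
@baudsp Mostly I want to extract the URL, the date, and the connection. That's all. If I can extract these, I can extract anything.
If that's the case, it might be possible. Could you point out, in your sample (the one in the question or the linked one), the exact text of the values you want to get? Since the sample contains several URLs and several dates, I would like to be sure what you want before writing a configuration.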

Tags: java elasticsearch logstash kibana

【Solution 1】:

This filter keeps only the lines that match "^WARC-Target-URI", "^HTTP/1.1", or "^Date: ", and then extracts information from those lines.

input {
    file {
        path => "/tmp/access_log"
        start_position => "beginning"
    }
}


filter {
    # Drop every line that is not a target URI, an HTTP status line, or an HTTP Date header.
    if [message] !~ "^WARC-Target-URI" and [message] !~ "^HTTP\/1.1" and [message] !~ "^Date: " {
        drop {}
    }

    # Try the patterns in order and store the captured part in "date", "url", or "response".
    grok {
        match => {
            "message" => ["Date: %{GREEDYDATA:date}", "WARC-Target-URI: %{GREEDYDATA:url}", "HTTP/1.1 %{NUMBER:response}"]
        }
    }

    # For "Wed, 30 Apr 2008 20:48:25 GMT"
    date {
        match => ["date", "EEE, dd MMM YYYY HH:mm:ss ZZZ"]
        target => "date"
        locale => "en"
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "webinfo"
    }
}

From the sample file, this Logstash configuration inserts the following JSON documents into Elasticsearch:

{"message":"WARC-Target-URI: http://www.archive.org/robots.txt","@version":"1","@timestamp":"2016-11-22T12:55:48.151Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/robots.txt"}
{"message":"WARC-Target-URI: http://www.archive.org/","@version":"1","@timestamp":"2016-11-22T12:55:48.151Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/"}
{"message":"HTTP/1.1 200 OK","@version":"1","@timestamp":"2016-11-22T12:55:48.167Z","path":"D:\\better.txt","host":"FREIFDKT0021127","response":"200"}
{"message":"Date: Wed, 30 Apr 2008 20:48:25 GMT","@version":"1","@timestamp":"2016-11-22T12:55:48.167Z","path":"D:\\better.txt","host":"FREIFDKT0021127","date":"2008-04-30T20:48:25.000Z"}
{"message":"WARC-Target-URI: http://www.archive.org/","@version":"1","@timestamp":"2016-11-22T12:55:48.183Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/"}
{"message":"WARC-Target-URI: http://www.archive.org/","@version":"1","@timestamp":"2016-11-22T12:55:48.183Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/"}
{"message":"WARC-Target-URI: http://www.archive.org/index.php","@version":"1","@timestamp":"2016-11-22T12:55:48.183Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/index.php"}
{"message":"HTTP/1.1 200 OK","@version":"1","@timestamp":"2016-11-22T12:55:48.198Z","path":"D:\\better.txt","host":"FREIFDKT0021127","response":"200"}
{"message":"Date: Wed, 30 Apr 2008 20:48:25 GMT","@version":"1","@timestamp":"2016-11-22T12:55:48.198Z","path":"D:\\better.txt","host":"FREIFDKT0021127","date":"2008-04-30T20:48:25.000Z"}
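To make it easier to see why each input line ends up as its own document with a single extracted field, here is a rough Python emulation of the filter section. It is only a sketch and not how Logstash works internally: the function name `process` and the hand-written regexes are assumptions made for illustration, the `\d+` used for the status code is narrower than grok's real NUMBER pattern, and the date parsing assumes the default C/English locale for day and month names.

```python
import re
from datetime import datetime, timezone

# Lines the filter keeps; everything else is dropped (mirrors the `drop {}` branch).
KEEP = re.compile(r"^(WARC-Target-URI|HTTP/1\.1|Date: )")

# Hand-written stand-ins for the three grok patterns, tried in the same order.
PATTERNS = [
    re.compile(r"Date: (?P<date>.*)"),
    re.compile(r"WARC-Target-URI: (?P<url>.*)"),
    re.compile(r"HTTP/1\.1 (?P<response>\d+)"),  # simplified NUMBER (assumption)
]


def process(line):
    """Return the event for one input line, or None if the line is dropped."""
    if not KEEP.search(line):
        return None
    event = {"message": line}
    for pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            event.update(match.groupdict())
            break  # stop at the first pattern that matches
    if "date" in event:
        # "Wed, 30 Apr 2008 20:48:25 GMT" -> ISO 8601 timestamp in UTC
        parsed = datetime.strptime(event["date"], "%a, %d %b %Y %H:%M:%S GMT")
        parsed = parsed.replace(tzinfo=timezone.utc)
        event["date"] = parsed.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return event


sample = [
    "WARC-Type: response",
    "WARC-Target-URI: http://www.archive.org/",
    "WARC-Date: 2008-04-30T20:48:26Z",
    "HTTP/1.1 200 OK",
    "Date: Wed, 30 Apr 2008 20:48:25 GMT",
]
events = [e for e in map(process, sample) if e is not None]
for event in events:
    print(event)
```

Because the file input emits one event per line, the URL, the status code, and the date of a single WARC record land in three separate documents. Nothing in this sketch (or in the configuration above) ties them back together into one record.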

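【Comments】:

Thank you very much. I have a few questions, though, and I hope you can answer them. If the input is a file (not JSON), will this output be written to Elasticsearch under the index webinfo? I would be glad if you could edit the answer to show that. Thanks again. Also, what is this GREEDYDATA? This is the first time I have come across it.
I am trying to insert this output into Elasticsearch, but it shows me an error.
Please tell me how to get the results into Elasticsearch. I tried, but there are errors. I think something needs to be checked. Is it required that an index already exists, with the appropriate data types, before inserting into it?
@JafferWilson I added the input and the output. Note that the elasticsearch configuration is for Logstash 2+ with Elasticsearch running locally. The grok filter uses regular expressions to extract parts of a field (here, the message field). To do so, it uses patterns that stand for regular expressions (for example GREEDYDATA = .*, see github.com/logstash-plugins/logstash-patterns-core/blob/master/…).
Thanks for the clarification. I will look into it. I think I made a mistake when writing the logs; I will use your code.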
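The remark that GREEDYDATA stands for `.*` can be made concrete with a small Python sketch that expands `%{NAME:field}` references into named regex groups. This is an illustration of the idea only, not the grok implementation: `grok_to_regex` and the `GROK_PATTERNS` table are hypothetical names, and the NUMBER entry is a deliberately reduced stand-in rather than the library's actual definition.

```python
import re

# Tiny pattern table for illustration. GREEDYDATA is `.*` as stated in the
# comment above; the NUMBER regex is a simplified assumption.
GROK_PATTERNS = {
    "GREEDYDATA": r".*",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
}

# Matches references of the form %{NAME:field}.
GROK_REF = re.compile(r"%\{(?P<name>\w+):(?P<field>\w+)\}")


def grok_to_regex(expression):
    """Expand %{NAME:field} references into Python named groups."""
    def expand(ref):
        body = GROK_PATTERNS[ref.group("name")]
        return "(?P<{0}>{1})".format(ref.group("field"), body)
    return GROK_REF.sub(expand, expression)


url_regex = grok_to_regex("WARC-Target-URI: %{GREEDYDATA:url}")
print(url_regex)

match = re.search(url_regex, "WARC-Target-URI: http://www.archive.org/robots.txt")
print(match.group("url"))

status = re.search(grok_to_regex("HTTP/1.1 %{NUMBER:response}"), "HTTP/1.1 200 OK")
print(status.group("response"))
```

Text outside the `%{...}` references is passed through unchanged and is therefore still treated as a regular expression, which is why the dot in `HTTP/1.1` matches any character here.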