Parsing out multiple lines with LPeg in Lua

【Question title】: Parsing out multiple lines with LPeg in Lua
【Posted】: 2013-10-24 23:35:23
【Question】:

I have some text files containing multi-line blocks, for example

2011/01/01 13:13:13,<AB>, Some Certain Text,=,
[    
certain text
         [
                  0: 0 0 0 0 0 0 0 0 
                  8: 0 0 0 0 0 0 0 0 
                 16: 0 0 0 9 343 3938 9433 8756 
                 24: 6270 4472 3182 2503 1768 1140 836 496 
                 32: 326 273 349 269 144 121 94 82 
                 40: 64 80 66 59 56 47 50 46 
                 48: 64 35 42 53 42 40 41 34 
                 56: 35 41 39 39 47 30 30 39 
                 Total count: 12345
        ]
    certain text
]
some text
2011/01/01 14:14:14,<AB>, Some Certain Text,=,
[
 certain text
   [
              0: 0 0 0 0 0 0 0 0 
              8: 0 0 0 0 0 0 0 0 
             16: 0 0 0 4 212 3079 8890 8941 
             24: 6177 4359 3625 2420 1639 974 594 438 
             32: 323 286 318 296 206 132 96 85 
             40: 65 73 62 53 47 55 49 52 
             48: 29 44 44 41 43 36 50 36 
             56: 40 30 29 40 35 30 25 31 
             64: 47 31 25 29 24 30 35 31 
             72: 28 31 17 37 35 30 20 33 
             80: 28 20 37 25 21 23 25 36 
             88: 27 35 22 23 15 24 34 28
             Total count: 123456 
    ]
    certain text
some text
]

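These variable-length blocks are embedded between other text. I want to read out all the numbers after the : and store them in separate arrays. In this case there would be two arrays: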

array1 = { 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 343 3938 9433 8756 6270 4472 3182 2503 1768 1140 836 496 326 273 349 269 144 121 94 82 64 80 66 59 56 47 50 46 64 35 42 53 42 40 41 34 35 41 39 39 47 30 30 39 12345 }

array2 = { 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 212 3079 8890 8941 6177 4359 3625 2420 1639 974 594 438 323 286 318 296 206 132 96 85 65 73 62 53 47 55 49 52 29 44 44 41 43 36 50 36 40 30 29 40 35 30 25 31 47 31 25 29 24 30 35 31 28 31 17 37 35 30 20 33 28 20 37 25 21 23 25 36 27 35 22 23 15 24 34 28 123456 }

I found that lpeg might be a lightweight way to do this, but I am completely new to PEG and LPeg. Please help!

【Comments】:

Tags: parsing logging lua text-parsing lpeg

【Solution 1】:

LPeg version:

local lpeg            = require "lpeg"
local lpegmatch       = lpeg.match
local C, Ct, P, R, S  = lpeg.C, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S
local Cg              = lpeg.Cg

local data_to_arrays

do
  local colon    = P":"
  local lbrak    = P"["
  local rbrak    = P"]"
  local digits   = R"09"^1
  local eol      = P"\n\r" + P"\r\n" + P"\n" + P"\r"
  local ws       = S" \t\v"
  local optws    = ws^0
  local getnum   = C(digits) / tonumber * optws
  local start    = lbrak * optws * eol
  local stop     = optws * rbrak
  local line     = optws * digits * colon * optws
                 * getnum * getnum * getnum * getnum
                 * getnum * getnum * getnum * getnum
                 * eol
  local count    = optws * P"Total count:" * optws * getnum * eol
  local inner    = Ct(line^1 * count^-1)
--local inner    = Ct(line^1 * Cg(count, "count")^-1)
  local array    = start * inner * stop
  local extract  = Ct((array + 1)^0)

  data_to_arrays = function (data)
    return lpegmatch (extract, data)
  end
end

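This only works if every line of a data block contains exactly eight integers. Depending on how well-formed your input is, that may be a curse or a blessing ;-)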
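If the eight-per-line restriction turns out to be a curse, one possible relaxation is to accept one or more integers per line. The following is only a sketch of that idea, not part of the original answer; the names `extract_flexible` and `sample` are made up for illustration.

```lua
local lpeg = require "lpeg"
local C, Ct, P, R, S = lpeg.C, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S

local digits = R"09"^1
local eol    = P"\r\n" + P"\n\r" + P"\n" + P"\r"
local optws  = S" \t\v"^0
local getnum = C(digits) / tonumber * optws

-- "NN:" row label followed by one or more integers (instead of exactly eight)
local line   = optws * digits * P":" * optws * getnum^1 * eol
local count  = optws * P"Total count:" * optws * getnum * eol
local inner  = Ct(line^1 * count^-1)
local array  = P"[" * optws * eol * inner * optws * P"]"

-- hypothetical name: collects one table per data block, skipping other text
local extract_flexible = Ct((array + 1)^0)

-- hypothetical sample with rows of different lengths
local sample = "[\n 0: 1 2 3\n 8: 4 5\n Total count: 15\n]\n"
local result = lpeg.match(extract_flexible, sample)

for i, arr in ipairs(result) do
  print(i, table.concat(arr, " "))
end
```

The trade-off is that a malformed row with too few or too many integers is no longer rejected, so the stricter pattern remains the safer choice when the format is known to be fixed.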

And a test file:

data = [[
some text
[    
some text
         [
                  0: 0 0 0 0 0 0 0 0 
                  8: 0 0 0 0 0 0 0 0 
                 16: 0 0 0 9 343 3938 9433 8756 
                 24: 6270 4472 3182 2503 1768 1140 836 496 
                 32: 326 273 349 269 144 121 94 82 
                 40: 64 80 66 59 56 47 50 46 
                 48: 64 35 42 53 42 40 41 34 
                 56: 35 41 39 39 47 30 30 39 
                 Total count: 12345
        ]
    some text
]
some text
[
 some text
   [
              0: 0 0 0 0 0 0 0 0 
              8: 0 0 0 0 0 0 0 0 
             16: 0 0 0 4 212 3079 8890 8941 
             24: 6177 4359 3625 2420 1639 974 594 438 
             32: 323 286 318 296 206 132 96 85 
             40: 65 73 62 53 47 55 49 52 
             48: 29 44 44 41 43 36 50 36 
             56: 40 30 29 40 35 30 25 31 
             64: 47 31 25 29 24 30 35 31 
             72: 28 31 17 37 35 30 20 33 
             80: 28 20 37 25 21 23 25 36 
             88: 27 35 22 23 15 24 34 28 
    ]
    some text
some text
]
]]

local arrays = data_to_arrays (data)

for n = 1, #arrays do
  local ar   = arrays[n]
  local size = #ar
  io.write (string.format ("[%d] = { --[[size: %d items]]\n  ", n, size))
  for i = 1, size do
    io.write (string.format ("%d,%s", ar[i], (i % 5 == 0) and "\n  " or " "))
  end
  if ar.count ~= nil then
    io.write (string.format ("\n  [\"count\"] = %d,", ar.count))
  end
  io.write (string.format ("\n}\n"))
end

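【Comments】:

Hi @phg, yes, the input arrays have exactly 8 integers per line. But this text file is over 100 MB. How can I read the file in? I tried local assert(io.open(file_path)). It could not read the file as a string. Should I read the whole file into a string?
f = io.open(filename, "r") if f then data = f:read"*all" f:close() end will read everything into memory. If that doesn't work, you may have to process the file in chunks.
Hi @phg, yes, that reads the file contents fine. But in the real text file the layout is [some text[data array]some text]. Your lpeg pattern doesn't work in that case. I failed to modify your lpeg to match this condition. Can you help me?
@Decula Your updated example parses fine here. Can you post the part that doesn't work?
I just realized my data arrays come in two different types. One has a total count..... I added `local lower = R"az"^1 local upper = R"AZ"^1 local words = lower+upper` and changed local line = optws * digits * colon * optws * getnum * getnum * getnum * getnum * getnum * getnum * getnum * getnum * eol * letter.. It still didn't work. I just want to append the total count to the end of the array.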
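For the 100 MB case raised in the comments, processing the file piecewise could look roughly like the sketch below. It uses only the standard string library rather than LPeg, and it assumes each data row sits on its own line and each block ends with a line starting with ]. The helper name `collect_arrays` and the sample text are hypothetical.

```lua
-- Hypothetical helper: takes a line iterator (for example io.lines(filename))
-- and returns one array per data block, without loading the whole file.
local function collect_arrays(lines)
  local arrays, current = {}, nil
  for line in lines do
    local rest  = line:match("^%s*%d+:%s*(.*)$")          -- "NN: n n n ..." row
    local total = line:match("^%s*Total count:%s*(%d+)")  -- optional total row
    if rest then
      if not current then
        current = {}
        arrays[#arrays + 1] = current
      end
      for n in rest:gmatch("%d+") do
        current[#current + 1] = tonumber(n)
      end
    elseif total and current then
      current[#current + 1] = tonumber(total)
    elseif line:match("^%s*%]") then
      current = nil  -- a closing bracket ends the current block
    end
  end
  return arrays
end

-- Small in-memory sample standing in for a real file
local sample = "head\n[\n text\n [\n 0: 1 2\n 8: 3 4\n Total count: 10\n ]\n]\n"
            .. "more\n[\n[\n 0: 5 6\n]\n]\n"
local arrays = collect_arrays(sample:gmatch("[^\n]+"))

for i, arr in ipairs(arrays) do
  print(i, table.concat(arr, " "))
end
```

With a real file, the call would be something like collect_arrays(io.lines(filename)), which reads one line at a time.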

【Solution 2】:

Try this code, which does not use LPeg:

-- assume T contains the text
local a={}
local i=0
for b in T:gmatch("%b[]") do
        b=b:gsub("%d+:","")
        i=i+1
        local t={}
        local j=0
        for n in b:gmatch("%d+") do
                j=j+1; t[j]=tonumber(n)
        end
        a[i]=t
end

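【Comments】:

Hi @lhf, actually the text file is [some text[data array]some text]; %b[] will capture the outer []. How do I capture the inner data array?
%b[] works well... but I'd really like to learn lpeg for other cases~~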
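One way to reach the inner data array with %b[] is to strip the outer brackets and apply the same pattern again. This is a sketch under the assumption that each outer block contains at most one nested block; `inner_arrays` and the sample text are hypothetical names, not from the thread.

```lua
-- Hypothetical helper: for every outer [...] block, descend into the nested
-- [...] block (if any) and collect the integers that are not row labels.
local function inner_arrays(text)
  local result = {}
  for outer in text:gmatch("%b[]") do
    -- drop the outer brackets, then look for a nested balanced pair
    local inner = outer:sub(2, -2):match("%b[]") or outer
    -- remove the "NN:" row labels before collecting numbers
    local body = inner:gsub("%d+:", "")
    local numbers = {}
    for n in body:gmatch("%d+") do
      numbers[#numbers + 1] = tonumber(n)
    end
    result[#result + 1] = numbers
  end
  return result
end

-- The digits 9 and 7 sit in the outer text and should be ignored
local sample = "a [ x 9 [ 0: 1 2\n 8: 3 ] y 7 ] b [ [ 0: 4 ] ]"
local found = inner_arrays(sample)

for i, arr in ipairs(found) do
  print(i, table.concat(arr, " "))
end
```

Restricting the number scan to the inner block also keeps stray digits in the surrounding text out of the result.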

【Solution 3】:

My pure Lua string-library solution would look something like this:

local bracket_pattern = "%b[]" --pattern for getting into brackets
local number_pattern = "(%d+)%s+" --pattern for parsing numbers
local output_array = {} --output 2-dimensional array
local i = 1
local j = 1
local tmp_number
local tmp_sub_str

for tmp_sub_str in file_content:gmatch(bracket_pattern) do --iterating through [string]
    table.insert(output_array, i, {}) --adding new [string] group
    for tmp_number in tmp_sub_str:gmatch(number_pattern) do --iterating through numberWHITESPACE
        table.insert(output_array[i], tonumber(tmp_number)) --adding [string] group element (number)
    end
    i = i + 1
end

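Edit: this also works with the updated file format.

【Comments】:

【Solution 4】:

phg has already given a nice LPeg solution to your problem, but here is another one using LPeg's re module. The syntax is closer to BNF and the operators are more "regex"-like, so this solution may be easier to follow.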

re = require 're'

function dump(t)
  io.write '{'
  for _, v in ipairs(t) do
    io.write(v, ',')
  end
  io.write '}\n'
end

local textformat = [[
  data_in   <-  block+
  block     <-  text '[' block_content ']'
  block_content <- {| data_arr |} / (block / text)*
  data_arr  <- (text ':' nums whitesp)+
  text      <- whitesp [%w' ']+ whitesp
  nums      <- (' '+ {digits} -> tonumber)+
  digits    <- %d+
  whitesp   <- %s*
]]
local parser = re.compile(textformat, {tonumber = tonumber})
local arr1, arr2 = parser:match(data)

dump(arr1)
dump(arr2)

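Each data array block is captured into a separate table and returned as one of the outputs of match.

data is the same input as above; two blocks are matched and captured, so 2 tables are returned. Inspecting the two tables gives: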

{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,343,3938,9433,8756,6270,4472,3182,2503,1768,1140,836,496,326,273,349,269,144,121,94,82,64,80,66,59,56,47,50,46,64,35,42,53,42,40,41,34,35,41,39,39,47,30,30,39,12345,}
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,212,3079,8890,8941,6177,4359,3625,2420,1639,974,594,438,323,286,318,296,206,132,96,85,65,73,62,53,47,55,49,52,29,44,44,41,43,36,50,36,40,30,29,40,35,30,25,31,47,31,25,29,24,30,35,31,28,31,17,37,35,30,20,33,28,20,37,25,21,23,25,36,27,35,22,23,15,24,34,28,}

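【Comments】:

I found this, which handles timestamps with BNF-style LPeg... but I haven't managed to get it working.
@Decula Note that the grammar above is only a rough approximation of how to parse the input, based on observing the input in the original question. Since you know the full range of the format being parsed better, you should refine the grammar to match it more closely.
Hi @greatwolf, I'm still struggling with BNF grammar... Actually, we have lots of odd logs and images to process... and we already have one Googler in Milpitas as a contractor. We really need a C, C++ & Lua expert like you. I see you've answered most of my questions. If you're interested, please email me.

【Solution 5】:

I know this is a late reply, but with far less grammar to define, the following pattern finds an opening [ and captures every number that is not suffixed by : until it reaches the closing ]. It then repeats the whole block until nothing more matches.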

local patt = re.compile([=[
    data    <- {| block |}+
    block   <- ('[' ((%d+ ':') / { %d+ } -> int / [^]%d]+)+ ']') / ([^[]+ block)
]=], { int = tonumber })

You can capture all the returned arrays into a table at once with something like

local a = { patt:match[=[ ... ]=] }
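As a rough illustration of how the pattern behaves, here is the same grammar applied to a tiny made-up input. The sample string is an assumption for demonstration only; the grammar itself is copied unchanged from the answer above.

```lua
local re = require "re"

-- Same grammar as in the answer; `int` converts each captured number
local patt = re.compile([=[
    data    <- {| block |}+
    block   <- ('[' ((%d+ ':') / { %d+ } -> int / [^]%d]+)+ ']') / ([^[]+ block)
]=], { int = tonumber })

-- Hypothetical sample: two blocks separated by ordinary text
local sample = "x [ 0: 1 2 ] y [ 8: 3 ]"

-- Each matched block comes back as its own table; wrap them in one table
local a = { patt:match(sample) }

for i, arr in ipairs(a) do
  print(i, table.concat(arr, " "))
end
```

Note that the class [^]%d]+ also consumes a nested [, so for the nested layout in the question the match runs from the outer [ to the first ]. Numbers in the text between the two opening brackets would therefore be captured too, if there were any.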

【Comments】: