[Posted]: 2016-04-11 07:00:29
[Question]:
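I am hand-writing a parser for a simple regular expression engine.
The engine supports a .. z, |, *, concatenation, and parentheses.
Here is the CFG I came up with: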
exp = concat factor1
factor1 = "|" exp | e
concat = term factor2
factor2 = concat | e
term = element factor3
factor3 = "*" | e
element = "(" exp ")" | a .. z
which is equivalent to:
S = T X
X = "|" S | E
T = F Y
Y = T | E
F = U Z
Z = "*" | E
U = "(" S ")" | a .. z
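Alternation and closure are easy to handle: I look ahead one token and choose the production based on it. Concatenation, however, seems impossible to handle by looking ahead, because it is implicit (there is no operator token).
How should I handle concatenation, or is something wrong with my grammar?
Here is my OCaml code for parsing: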
type regex =
  | Closure of regex
  | Char of char
  | Concatenation of regex * regex
  | Alternation of regex * regex
  (*| Epsilon*)

exception IllegalExpression of string

type token =
  | End
  | Alphabet of char
  | Star
  | LParen
  | RParen
  | Pipe

let rec parse_S (l : token list) : (regex * token list) =
  let (a1, l1) = parse_T l in
  let (t, rest) = lookahead l1 in
  match t with
  | Pipe ->
    let (a2, l2) = parse_S rest in
    (Alternation (a1, a2), l2)
  | _ -> (a1, l1)

and parse_T (l : token list) : (regex * token list) =
  let (a1, l1) = parse_F l in
  let (t, rest) = lookahead l1 in
  match t with
  | Alphabet c -> (Concatenation (a1, Char c), rest)
  | LParen ->
    (let (a, l1) = parse_S rest in
     let (t1, l2) = lookahead l1 in
     match t1 with
     | RParen -> (Concatenation (a1, a), l2)
     | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ ->
    let (a2, rest) = parse_T l1 in
    (Concatenation (a1, a2), rest)

and parse_F (l : token list) : (regex * token list) =
  let (a1, l1) = parse_U l in
  let (t, rest) = lookahead l1 in
  match t with
  | Star -> (Closure a1, rest)
  | _ -> (a1, l1)

and parse_U (l : token list) : (regex * token list) =
  let (t, rest) = lookahead l in
  match t with
  | Alphabet c -> (Char c, rest)
  | LParen ->
    (let (a, l1) = parse_S rest in
     let (t1, l2) = lookahead l1 in
     match t1 with
     | RParen -> (a, l2)
     | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ -> raise (IllegalExpression "Unknown token")
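The parser above calls `lookahead`, which the question does not show. A minimal sketch of what such a helper might look like follows, together with a hypothetical `tokenize` function for producing the token list from a string. Both are assumptions for illustration, not the asker's actual code; the `token` type and exception are repeated only so the block compiles on its own.

```ocaml
(* Repeated from the question so this block is self-contained. *)
type token =
  | End
  | Alphabet of char
  | Star
  | LParen
  | RParen
  | Pipe

exception IllegalExpression of string

(* Hypothetical helper: split off the next token without deciding
   whether to consume it. An empty list is treated as End. *)
let lookahead (l : token list) : token * token list =
  match l with
  | [] -> (End, [])
  | t :: rest -> (t, rest)

(* Hypothetical tokenizer: one token per input character,
   with End appended as the last token. *)
let tokenize (s : string) : token list =
  let token_of_char c =
    match c with
    | 'a' .. 'z' -> Alphabet c
    | '*' -> Star
    | '(' -> LParen
    | ')' -> RParen
    | '|' -> Pipe
    | _ ->
      raise (IllegalExpression (Printf.sprintf "Unexpected character %c" c))
  in
  List.init (String.length s) (fun i -> token_of_char s.[i]) @ [End]
```

Returning the remaining list alongside the token lets each parse function choose between continuing from `rest` (token consumed) and continuing from the original list (token left for the caller).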
[Comments]:
-
You just need to construct the FIRST sets, as with any other LL grammar. So FIRST(factor2) = FIRST(concat) = FIRST(term) = FIRST(element) = { (, a, ..., z }
-
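I'll just say here that I really don't see the point of LL(1) parsers. There are very good LALR(1) generator tools, including ones written for OCaml, and LR parsing doesn't require you to alter the grammar until it associates incorrectly and becomes unreadable. Yes, that's an opinion.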
-
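@rici Hi, thanks for the reply. Would you mind elaborating a bit more? I changed my parse_T function so that it now uses LParen and char as lookahead tokens. But when I test "a(b|c)*", the "*" is not recognized by my parser.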
-
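I suppose that would be a problem with your function that parses factor3. I'm afraid it has been more than a decade since I last touched OCaml, and that (plus the bias I expressed in my second comment) is why I haven't tried to answer your question.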
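To make the FIRST-set suggestion from the comments concrete, here is one way it might be applied to the grammar in the question. This is a sketch, not a confirmed fix from the thread: `parse_T` peeks at the next token and recurses only if it is in FIRST(T) = { (, a .. z }, without consuming it, so the following factor still goes through `parse_F` and can pick up a trailing `*`. The `lookahead` helper is an assumed definition, since the question does not show it.

```ocaml
type regex =
  | Closure of regex
  | Char of char
  | Concatenation of regex * regex
  | Alternation of regex * regex

type token = End | Alphabet of char | Star | LParen | RParen | Pipe

exception IllegalExpression of string

(* Assumed helper: peek at the next token; an empty list reads as End. *)
let lookahead (l : token list) : token * token list =
  match l with
  | [] -> (End, [])
  | t :: rest -> (t, rest)

(* S = T X,  X = "|" S | E *)
let rec parse_S (l : token list) : regex * token list =
  let (a1, l1) = parse_T l in
  match lookahead l1 with
  | (Pipe, rest) ->
    let (a2, l2) = parse_S rest in
    (Alternation (a1, a2), l2)
  | _ -> (a1, l1)

(* T = F Y,  Y = T | E.
   Choose Y = T only when the next token is in FIRST(T); the token is
   not consumed here, so parse_T continues from l1, not from rest. *)
and parse_T (l : token list) : regex * token list =
  let (a1, l1) = parse_F l in
  match lookahead l1 with
  | (Alphabet _, _) | (LParen, _) ->
    let (a2, l2) = parse_T l1 in
    (Concatenation (a1, a2), l2)
  | _ -> (a1, l1)

(* F = U Z,  Z = "*" | E *)
and parse_F (l : token list) : regex * token list =
  let (a1, l1) = parse_U l in
  match lookahead l1 with
  | (Star, rest) -> (Closure a1, rest)
  | _ -> (a1, l1)

(* U = "(" S ")" | a .. z *)
and parse_U (l : token list) : regex * token list =
  match lookahead l with
  | (Alphabet c, rest) -> (Char c, rest)
  | (LParen, rest) ->
    let (a, l1) = parse_S rest in
    (match lookahead l1 with
     | (RParen, l2) -> (a, l2)
     | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ -> raise (IllegalExpression "Unknown token")

(* The token list for "a(b|c)*", written out by hand. *)
let example =
  parse_S [Alphabet 'a'; LParen; Alphabet 'b'; Pipe; Alphabet 'c';
           RParen; Star; End]
```

In the version of `parse_T` shown in the question, the `Alphabet c` and `LParen` branches build the second operand directly instead of going back through `parse_F`, which would explain why the `*` after `(b|c)` is left unparsed. Note that this grammar makes concatenation and alternation right-associative, which is harmless for regular expressions but is the kind of distortion the second comment objects to.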
Tags: regex parsing syntax language-agnostic ocaml