Mapping a list of strings into a hierarchical structure of objects: answers

【Question title】: mapping list of string into hierarchical structure of objects
【Posted】: 2012-04-12 10:57:36
【Question description】:

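This is not a homework problem. A friend of mine was asked this question in an interview test.

I have a list of lines read from a file as input. Each line starts with an identifier such as A, B, NN, C, or DD. Based on the identifier, I need to map the list of records into a single object A that contains a hierarchy of objects.

Hierarchy description: each A can have zero or more B children. Each B can have zero or more NN and C children. Likewise, each C can have zero or more NN and DD children, and each DD can have zero or more NN children.

The mapping classes and their hierarchy:

Every class has a value field that holds the String value of the current line.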

**A - will have list of B**

    class A {
        List<B> bList;
        String value;

        public A(String value) {
            this.value = value;
        }

        public void addB(B b) {
            if (bList == null) {
                bList = new ArrayList<B>();
            }
            bList.add(b);
        }
    }


**B - will have list of NN and list of C**

    class B {
        List<C> cList;
        List<NN> nnList;
        String value;

        public B(String value) {
            this.value = value;
        }

        public void addNN(NN nn) {
            if (nnList == null) {
                nnList = new ArrayList<NN>();
            }
            nnList.add(nn);
        }

        public void addC(C c) {
            if (cList == null) {
                cList = new ArrayList<C>();
            }
            cList.add(c);
        }
    }

**C - will have list of DDs and NNs**

    class C {
        List<DD> ddList;
        List<NN> nnList;
        String value;

        public C(String value) {
            this.value = value;
        }

        public void addDD(DD dd) {
            if (ddList == null) {
                ddList = new ArrayList<DD>();
            }
            ddList.add(dd);
        }

        public void addNN(NN nn) {
            if (nnList == null) {
                nnList = new ArrayList<NN>();
            }
            nnList.add(nn);
        }
    }

**DD - will have list of NNs**

    class DD {
        String value;
        List<NN> nnList;

        public DD(String value) {
            this.value = value;
        }

        public void addNN(NN nn) {
            if (nnList == null) {
                nnList = new ArrayList<NN>();
            }
            nnList.add(nn);
        }
    }

**NN- will hold the line only**

    class NN {
        String value;

        public NN(String value) {
            this.value = value;
        }
    }

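What I have done so far:

The method public A parse(List<String> lines) reads the input list and returns the object A. Since there can be multiple B occurrences, I created a separate method, parseB, to parse each occurrence.

In the parseB method, the loop runs from i = startIndex + 1 while i < lines.size() and checks the start of each line. Each "NN" occurrence is added to the current B object. If "C" is detected at the start of a line, it calls another method, parseC. The loop breaks when we detect "B" or "A" at the start of a line.

Similar logic is used in parseC_DD.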

public class GTTest {    
    public A parse(List<String> lines) {
        A a;
        for (int i = 0; i < lines.size(); i++) {
            String curLine = lines.get(i);
            if (curLine.startsWith("A")) { 
                a = new A(curLine);
                continue;
            }
            if (curLine.startsWith("B")) {
                i = parseB(lines, i); // returns index i to skip all the lines that are read inside parseB(...)
                continue;
            }
        }
        return a; // return mapped object
    }

    private int parseB(List<String> lines, int startIndex) {
        int i;
        B b = new B(lines.get(startIndex));
        for (i = startIndex + 1; i < lines.size(); i++) {
            String curLine = lines.get(i);
            if (curLine.startsWith("NN")) {
                b.addNN(new NN(curLine));
                continue;
            }
            if (curLine.startsWith("C")) {
                i = parseC(b, lines, i);
                continue;
            }
            a.addB(b);
            if (curLine.startsWith("B") || curLine.startsWith("A")) { //ending condition
                System.out.println("B A "+curLine);
                --i;
                break;
            }
        }
        return i; // return nextIndex to read
    }

    private int parseC(B b, List<String> lines, int startIndex) {

        int i;
        C c = new C(lines.get(startIndex));

        for (i = startIndex + 1; i < lines.size(); i++) {
            String curLine = lines.get(i);
            if (curLine.startsWith("NN")) {
                c.addNN(new NN(curLine));
                continue;
            }           

            if (curLine.startsWith("DD")) {
                i = parseC_DD(c, lines, i);
                continue;
            }

            b.addC(c);
            if (curLine.startsWith("C") || curLine.startsWith("A") || curLine.startsWith("B")) {
                System.out.println("C A B "+curLine);
                --i;
                break;
            }
        }
        return i;//return next index

    }

    private int parseC_DD(C c, List<String> lines, int startIndex) {
        int i;
        DD d = new DD(lines.get(startIndex));
        c.addDD(d);
        for (i = startIndex; i < lines.size(); i++) {
            String curLine = lines.get(i);
            if (curLine.startsWith("NN")) {
                d.addNN(new NN(curLine));
                continue;
            }
            if (curLine.startsWith("DD")) {
                d=new DD(curLine);
                continue;
            }       
            c.addDD(d);
            if (curLine.startsWith("NN") || curLine.startsWith("C") || curLine.startsWith("A") || curLine.startsWith("B")) {
                System.out.println("NN C A B "+curLine);
                --i;
                break;
            }

        }
        return i;//return next index

    }
public static void main(String[] args) {
        GTTest gt = new GTTest();
        List<String> list = new ArrayList<String>();
        list.add("A1");
        list.add("B1");
        list.add("NN1");
        list.add("NN2");
        list.add("C1");
        list.add("NNXX");
        list.add("DD1");
        list.add("DD2");
        list.add("NN3");
        list.add("NN4");
        list.add("DD3");
        list.add("NN5");
        list.add("B2");
        list.add("NN6");
        list.add("C2");
        list.add("DD4");
        list.add("DD5");
        list.add("NN7");
        list.add("NN8");
        list.add("DD6");
        list.add("NN7");
        list.add("C3");
        list.add("DD7");
        list.add("DD8");
        A a = gt.parse(list);
            //show values of a 

    }
}

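My logic is not working properly. Is there another way you can figure this out? Do you have any suggestions or improvements for my approach?

【Question discussion】:

"My logic is not working." This sentence conveys zero information. Please state what result you expect, what you get instead, and why you think you should get the former rather than the latter.

Tags: parsing logic hierarchy

【Solution 1】:

Use a hierarchy of objects: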


    public interface Node {
        Node getParent();
        Node getLastChild();
        boolean addChild(Node n);
        void setValue(String value);
        Deque<Node> getChildren();
    }

    private static abstract class NodeBase implements Node {
        ...     
        abstract boolean canInsert(Node n);    
        public String toString() {
            return value;
        }
        ...    
    }

    public static class A extends NodeBase {
        boolean canInsert(Node n) {
            return n instanceof B;
        }
    }
    public static class B extends NodeBase {
        boolean canInsert(Node n) {
            return n instanceof NN || n instanceof C;
        }
    }

    ...

    public static class NN extends NodeBase {
        boolean canInsert(Node n) {
            return false;
        }
    }

Create a tree class:

public class MyTree {

    Node root;
    Node lastInserted = null;

    public void insert(String label) {
        Node n = NodeFactory.create(label);

        if (lastInserted == null) {
            root = n;
            lastInserted = n;
            return;
        }
        Node current = lastInserted;
        while (!current.addChild(n)) {
            current = current.getParent();
            if (current == null) {
                throw new RuntimeException("Impossible to insert " + n);
            }
        }
        lastInserted = n;
    }
    ...
}

Then print the tree:


public class MyTree {
    ...
    public static void main(String[] args) {
        List<String> input;
        ...
        MyTree tree = new MyTree();
        for (String line : input) {
            tree.insert(line);
        }
        tree.print();
    }

    public void print() {
        printSubTree(root, "");
    }
    private static void printSubTree(Node root, String offset) {
        Deque<Node> children = root.getChildren();
        Iterator<Node> i = children.descendingIterator();
        System.out.println(offset + root);
        while (i.hasNext()) {
            printSubTree(i.next(), offset + " ");
        }
    }
}
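
Since NodeBase and NodeFactory are elided above, here is a minimal, self-contained sketch of the same idea. It is not the answerer's actual code: TreeSketch, the kind field, create, build, and render are all hypothetical names, and the per-class canInsert overrides are collapsed into one switch to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the Node / NodeBase / NodeFactory trio.
class TreeSketch {
    static final class Node {
        final String kind;   // "A", "B", "C", "DD" or "NN"
        final String value;  // the whole input line
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String kind, String value) {
            this.kind = kind;
            this.value = value;
        }

        // Same role as canInsert(): which child kinds each kind accepts.
        boolean canInsert(Node n) {
            switch (kind) {
                case "A":  return n.kind.equals("B");
                case "B":  return n.kind.equals("NN") || n.kind.equals("C");
                case "C":  return n.kind.equals("NN") || n.kind.equals("DD");
                case "DD": return n.kind.equals("NN");
                default:   return false; // NN is a leaf
            }
        }

        boolean addChild(Node n) {
            if (!canInsert(n)) return false;
            n.parent = this;
            children.add(n);
            return true;
        }
    }

    // Assumed factory: picks the node kind from the line's identifier prefix.
    static Node create(String line) {
        for (String kind : new String[] {"NN", "DD", "A", "B", "C"}) {
            if (line.startsWith(kind)) return new Node(kind, line);
        }
        throw new IllegalArgumentException("Unknown identifier: " + line);
    }

    // Walks up from the last inserted node until some ancestor accepts the new one.
    static Node build(List<String> lines) {
        Node root = null;
        Node last = null;
        for (String line : lines) {
            Node n = create(line);
            if (last == null) {
                root = n;
            } else {
                Node current = last;
                while (!current.addChild(n)) {
                    current = current.parent;
                    if (current == null) {
                        throw new IllegalStateException("Impossible to insert " + line);
                    }
                }
            }
            last = n;
        }
        return root;
    }

    // Renders the tree with one space of indentation per level.
    static String render(Node n, String offset) {
        StringBuilder sb = new StringBuilder(offset).append(n.value).append('\n');
        for (Node c : n.children) sb.append(render(c, offset + " "));
        return sb.toString();
    }

    public static void main(String[] args) {
        Node root = build(Arrays.asList("A1", "B1", "NN1", "C1", "DD1", "NN2", "B2"));
        System.out.print(render(root, ""));
    }
}
```

In this sketch a new line always attaches to the deepest open node that accepts it, so an NN that follows a DD becomes a child of that DD rather than of the enclosing C.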

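【Discussion】:

Thanks for the excellent answer. Nice tree solution. But it requires a lot of changes to the classes A, B, C, DD, and NN.
You can decouple the A, B, and C classes from Node, leaving only the canInsert() method. Or, if you inject the insertion strategy into the Tree class, you can even make them completely empty.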
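
One way to read the suggestion about injecting the insertion strategy is sketched below. This is an assumption about what the commenter meant, not code from the thread: StrategyTreeSketch, kindOf, and questionRules are hypothetical names, and the strategy is modelled as a java.util.function.BiPredicate over the parent line and the child line, so the node type needs no knowledge of A, B, C, DD, or NN.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

// Hypothetical sketch: the tree receives its insertion rules from outside.
class StrategyTreeSketch {
    // Generic node that knows nothing about the identifiers.
    static final class Node {
        final String value;
        Node parent;
        final List<Node> children = new ArrayList<>();
        Node(String value) { this.value = value; }
    }

    private final BiPredicate<String, String> canInsert; // (parent line, child line)
    Node root;
    private Node lastInserted;

    StrategyTreeSketch(BiPredicate<String, String> canInsert) {
        this.canInsert = canInsert;
    }

    void insert(String line) {
        Node n = new Node(line);
        if (lastInserted == null) {
            root = n;
            lastInserted = n;
            return;
        }
        // Climb from the last inserted node until the strategy accepts the pair.
        Node current = lastInserted;
        while (current != null && !canInsert.test(current.value, line)) {
            current = current.parent;
        }
        if (current == null) {
            throw new IllegalStateException("Impossible to insert " + line);
        }
        n.parent = current;
        current.children.add(n);
        lastInserted = n;
    }

    // Assumed helper: extracts the identifier prefix of a line.
    static String kindOf(String line) {
        for (String kind : new String[] {"NN", "DD", "A", "B", "C"}) {
            if (line.startsWith(kind)) return kind;
        }
        throw new IllegalArgumentException("Unknown identifier: " + line);
    }

    // One possible strategy: the parent/child rules stated in the question.
    static BiPredicate<String, String> questionRules() {
        Map<String, Set<String>> allowed = new HashMap<>();
        allowed.put("A", new HashSet<String>(Arrays.asList("B")));
        allowed.put("B", new HashSet<String>(Arrays.asList("NN", "C")));
        allowed.put("C", new HashSet<String>(Arrays.asList("NN", "DD")));
        allowed.put("DD", new HashSet<String>(Arrays.asList("NN")));
        allowed.put("NN", new HashSet<String>());
        return (parent, child) -> allowed.get(kindOf(parent)).contains(kindOf(child));
    }

    public static void main(String[] args) {
        StrategyTreeSketch tree = new StrategyTreeSketch(questionRules());
        for (String line : Arrays.asList("A1", "B1", "C1", "DD1", "NN1", "C2")) {
            tree.insert(line);
        }
        System.out.println(tree.root.children.get(0).children.size() + " children under B1");
    }
}
```

With the rules held in a table, supporting a different hierarchy would only mean passing a different predicate; the tree class itself stays unchanged.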

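【Solution 2】:

A Mealy automaton solution with 5 states: wait for A, seen A, seen B, seen C, and seen DD.

The parsing is done entirely in one method. There is a current node, which is the last node seen other than an NN. Every node except the root has a parent. In state seen X, the current node represents an X (for example, in state seen C, current can be C1 in the example above). The most fiddly state is seen DD, which has the most outgoing edges (B, C, DD, and NN).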

public final class Parser {
    private final static class Token { /* represents A1 etc. */ }
    public final static class Node implements Iterable<Node> {
        /* One Token + Node children, knows its parent */
    }

    private enum State { ExpectA, SeenA, SeenB, SeenC, SeenDD, }

    public Node parse(String text) {
        return parse(Token.parseStream(text));
    }

    private Node parse(Iterable<Token> tokens) {
        State currentState = State.ExpectA;
        Node current = null, root = null;
        while(there are tokens) {
            Token t = iterator.next();
            switch(currentState) {
                /* do stuff for all states */

                /* example snippet for SeenC */
                case SeenC:
                if(t.Prefix.equals("B")) {
                    current.PN.PN.AddChildNode(new Node(t, current.PN.PN));
                    currentState = State.SeenB;
                } else if(t.Prefix.equals("C")) {

            }
        }
        return root;
    }
}

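I'm not happy with those train wrecks that climb up the hierarchy to insert a node somewhere else (current.PN.PN). Eventually, explicit state classes would make the private parse method more readable. The solution then becomes more similar to the one provided by @AlekseyOtrubennikov. Perhaps a straightforward LL approach would yield prettier code. Perhaps it would be best to rewrite the grammar in BNF and delegate the parser creation.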
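
For reference, the five-state automaton outlined in the snippet above could be filled in roughly as follows. This is only a sketch under assumptions: Token is replaced by plain lines, and AutomatonSketch, prefix, up, and unexpected are hypothetical names. The up helper stands in for the current.PN.PN chains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout; only the five states come from the answer.
class AutomatonSketch {
    enum State { EXPECT_A, SEEN_A, SEEN_B, SEEN_C, SEEN_DD }

    static final class Node {
        final String value;
        final Node parent;
        final List<Node> children = new ArrayList<>();

        // Creating a node also links it into its parent's child list.
        Node(String value, Node parent) {
            this.value = value;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }
    }

    static String prefix(String line) {
        for (String p : new String[] {"NN", "DD", "A", "B", "C"}) {
            if (line.startsWith(p)) return p;
        }
        return "?";
    }

    // Walks 'levels' steps up from n; replaces chains like current.PN.PN.
    static Node up(Node n, int levels) {
        for (int i = 0; i < levels; i++) n = n.parent;
        return n;
    }

    static IllegalArgumentException unexpected(String line, State state) {
        return new IllegalArgumentException("Unexpected " + line + " in state " + state);
    }

    static Node parse(List<String> lines) {
        State state = State.EXPECT_A;
        Node root = null;
        Node current = null; // last node that is not an NN
        for (String line : lines) {
            String p = prefix(line);
            switch (state) {
                case EXPECT_A:
                    if (!p.equals("A")) throw unexpected(line, state);
                    root = current = new Node(line, null);
                    state = State.SEEN_A;
                    break;
                case SEEN_A:
                    if (!p.equals("B")) throw unexpected(line, state);
                    current = new Node(line, current);
                    state = State.SEEN_B;
                    break;
                case SEEN_B:
                    if (p.equals("NN")) { new Node(line, current); }
                    else if (p.equals("C")) { current = new Node(line, current); state = State.SEEN_C; }
                    else if (p.equals("B")) { current = new Node(line, up(current, 1)); }
                    else throw unexpected(line, state);
                    break;
                case SEEN_C:
                    if (p.equals("NN")) { new Node(line, current); }
                    else if (p.equals("DD")) { current = new Node(line, current); state = State.SEEN_DD; }
                    else if (p.equals("C")) { current = new Node(line, up(current, 1)); }
                    else if (p.equals("B")) { current = new Node(line, up(current, 2)); state = State.SEEN_B; }
                    else throw unexpected(line, state);
                    break;
                case SEEN_DD:
                    if (p.equals("NN")) { new Node(line, current); }
                    else if (p.equals("DD")) { current = new Node(line, up(current, 1)); }
                    else if (p.equals("C")) { current = new Node(line, up(current, 2)); state = State.SEEN_C; }
                    else if (p.equals("B")) { current = new Node(line, up(current, 3)); state = State.SEEN_B; }
                    else throw unexpected(line, state);
                    break;
            }
        }
        return root; // null when the input is empty
    }

    public static void main(String[] args) {
        Node a = parse(Arrays.asList("A1", "B1", "NN1", "C1", "DD1", "NN2", "DD2", "C2", "B2", "NN3"));
        System.out.println(a.value + " has " + a.children.size() + " B children");
    }
}
```

The number of levels passed to up grows with the depth of the state, which is exactly the coupling between states and tree shape that the answer complains about.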

A simple LL parser, one production rule:

// "B" ("NN" || C)*
private Node rule_2(TokenStream ts, Node parent) {
    // Literal "B"
    Node B = literal(ts, "B", parent);
    if(B == null) {
        // error
        return null;
    }

    while(true) {
        // check for "NN"
        Node nnLit = literal(ts, "NN", B);
        if(nnLit != null)
            B.AddChildNode(nnLit);

        // check for C
        Node c = rule_3(ts, B);
        if(c != null)
            B.AddChildNode(c);

        // finished when both rules did not match anything
        if(nnLit == null && c == null)
            break;
    }

    return B;
}

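TokenStream enhances Iterable<Token> by allowing lookahead into the stream. It is LL(1) because in two cases the parser has to choose between the literal NN and a deep dive into another rule (rule_2 is one of them). It looks nice, although some C# features are missing here...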
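
A complete recursive-descent version of this idea might look like the sketch below. It is an illustration under assumptions rather than the answerer's code: lines play the role of tokens, and LlSketch, peek, take, and the rule methods are hypothetical names. The peek method provides the single token of lookahead that lets each rule choose between the literal NN and descending into a child rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: lines act as tokens; 'pos' plus peek() is the lookahead.
class LlSketch {
    static final class Node {
        final String value;
        final List<Node> children = new ArrayList<>();
        Node(String value) { this.value = value; }
    }

    private final List<String> lines;
    private int pos = 0;

    LlSketch(List<String> lines) { this.lines = lines; }

    static Node parse(List<String> lines) {
        return new LlSketch(lines).ruleA();
    }

    // One line of lookahead: does the next line start with this identifier?
    private boolean peek(String prefix) {
        return pos < lines.size() && lines.get(pos).startsWith(prefix);
    }

    // Consumes the next line and wraps it in a node.
    private Node take() {
        return new Node(lines.get(pos++));
    }

    // "A" B* followed by end of input
    private Node ruleA() {
        if (!peek("A")) throw error("A");
        Node a = take();
        while (peek("B")) a.children.add(ruleB());
        if (pos < lines.size()) throw error("B or end of input");
        return a;
    }

    // "B" ("NN" | C)*
    private Node ruleB() {
        Node b = take();
        while (true) {
            if (peek("NN")) b.children.add(take());
            else if (peek("C")) b.children.add(ruleC());
            else return b;
        }
    }

    // "C" ("NN" | DD)*
    private Node ruleC() {
        Node c = take();
        while (true) {
            if (peek("NN")) c.children.add(take());
            else if (peek("DD")) c.children.add(ruleDD());
            else return c;
        }
    }

    // "DD" "NN"*
    private Node ruleDD() {
        Node dd = take();
        while (peek("NN")) dd.children.add(take());
        return dd;
    }

    private IllegalArgumentException error(String expected) {
        String found = pos < lines.size() ? lines.get(pos) : "end of input";
        return new IllegalArgumentException(
            "line " + (pos + 1) + ": unexpected " + found + ", expecting " + expected);
    }

    public static void main(String[] args) {
        Node a = parse(Arrays.asList("A1", "B1", "NN1", "C1", "NN2", "DD1", "NN3", "B2"));
        System.out.println(a.value + " has " + a.children.size() + " B children");
    }
}
```

Because each rule returns the node it built and the caller attaches it, no rule needs to reach upward through parent links.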

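【Discussion】:

Hi Stefan! LL parsers usually work like the code in my answer. My code uses a tree structure; yours uses a recursion tree.
Yes, the canInsert method is similar to a one token/node lookahead. Somehow, in your solution the production rules and the corresponding nodes are put into one class. And since the nodes know their children, the code can also handle more complex grammars. Hmm, I think I'll have to re-implement your solution :)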

【Solution 3】:

@Stefan and @Aleksey are right: this is a simple parsing problem. You can define your hierarchy constraints in Extended Backus-Naur Form:

A  ::= { B }
B  ::= { NN | C }
C  ::= { NN | DD }
DD ::= { NN }

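This description can be converted into a state machine and implemented that way. But there are plenty of tools that do this for you efficiently: parser generators.

I'm posting my answer only to show that solving this kind of problem is very easy in Haskell (or some other functional language).
Here is the complete program, which reads strings from standard input and prints the parsed tree to standard output.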

-- We are using some standard libraries.
import Control.Applicative ((<$>), (<*>))
import Text.Parsec
import Data.Tree

-- This is EBNF-like description of what to do.
-- You can almost read it like a prose.
yourData = nodeA +>> eof

nodeA  = node "A"  nodeB
nodeB  = node "B" (nodeC  <|> nodeNN)
nodeC  = node "C" (nodeNN <|> nodeDD)
nodeDD = node "DD" nodeNN
nodeNN = (`Node` []) <$> nodeLabel "NN"

node lbl children
  = Node <$> nodeLabel lbl <*> many children

nodeLabel xx = (xx++)
  <$> (string xx >> many digit)
  +>> newline

-- And this is some auxiliary code.
f +>> g = f >>= \x -> g >> return x

main = do
  txt <- getContents
  case parse yourData "" txt of
    Left err  -> print err
    Right res -> putStrLn $ drawTree res

Running it with the data in zz.txt prints this nice tree:

$ ./xxx < zz.txt 
A1
+- B1
|  +- NN1
|  +- NN2
|  `- C1
|     +- NN2
|     +- DD1
|     +- DD2
|     |  +- NN3
|     |  `- NN4
|     `- DD3
|        `- NN5
`- B2
   +- NN6
   +- C2
   |  +- DD4
   |  +- DD5
   |  |  +- NN7
   |  |  `- NN8
   |  `- DD6
   |     `- NN9
   `- C3
      +- DD7
      `- DD8

Here is how it handles malformed input:

$ ./xxx
A1
B2
DD3
(line 3, column 1):
unexpected 'D'
expecting "B" or end of input

【Discussion】:

I guess this is an OOP question.
@Aleksey, it seems you are right. The presence of the parsing tag and the absence of an oop tag confused me.