【Question title】: Why can't I parse my scraped HTML into XML?
【Posted】: 2014-09-08 03:00:34
【Question】:

I am trying to use this function to parse some scraped HTML into valid XML.

My test code (the htmlParse function is copied and pasted from Ben Nadel's blog):

<cfscript>
    // I take an HTML string and parse it into an XML(XHTML)
    // document. This is returned as a standard ColdFusion XML
    // document.
    function htmlParse( htmlContent, disableNamespaces = true ){

        // Create an instance of the Xalan SAX2DOM class as the
        // recipient of the TagSoup SAX (Simple API for XML) compliant
        // events. TagSoup will parse the HTML and announce events as
        // it encounters various HTML nodes. The SAX2DOM instance will
        // listen for such events and construct a DOM tree in response.
        var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

        // Create our TagSoup parser.
        var tagSoupParser = createObject( "java", "org.ccil.cowan.tagsoup.Parser" ).init();

        // Check to see if namespaces are going to be disabled in the
        // parser. If so, then they will not be added to elements.
        if (disableNamespaces){

            // Turn off namespaces - they are lame and nobody likes
            // to perform xmlSearch() methods with them in place.
            tagSoupParser.setFeature(
                tagSoupParser.namespacesFeature,
                javaCast( "boolean", false )
            );

        }

        // Set our DOM builder to be the listener for SAX-based
        // parsing events on our HTML.
        tagSoupParser.setContentHandler( saxDomBuilder );

        // Create our content input. The InputSource encapsulates the
        // means by which the content is read.
        var inputSource = createObject( "java", "org.xml.sax.InputSource" ).init(
            createObject( "java", "java.io.StringReader" ).init( htmlContent )
        );

        // Parse the HTML. This will trigger events which the SAX2DOM
        // builder will translate into a DOM tree.
        tagSoupParser.parse( inputSource );

        // Now that the HTML has been parsed, we have to get a
        // representation that is similar to the XML document that
        // ColdFusion users are used to having. Let's search for the
        // ROOT document and return it.
        return(
            xmlSearch( saxDomBuilder.getDom(), "/node()" )[ 1 ]
        );

    }
</cfscript>
<cfset html='<tr > <td align="center"> <span id="id1" >Compliance Review</span> </td><td class="center"> <span id="id2" >395.8(i)</span> </td><td align="left"> <span id="id3" >Failing to submit a record of duty status within 13 days </span> </td><td class="center" > <span id="id4">4/17/2014</span> </td> </tr>' />
<cfset parsedData = htmlParse(html) />

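(The html value arrives in this format from another function, but for now I am hard-coding the string to track down the problem.)

I get the following error: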

NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist. 
The error occurred in myfilePath/myfileName.cfm: line 42

40 :        // Parse the HTML. This will trigger events which the SAX2DOM
41 :        // builder will translate into a DOM tree.
42 :        tagSoupParser.parse( inputSource );

What is going wrong, and how can I fix it?

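【Comments】:

  • Although you are calling the methods from ColdFusion, you are working with Java directly here. It looks to me like either the input is malformed or there is a bug in the parser.
  • @J.T. - Isn't it supposed to handle "dirty" HTML too, so that the input doesn't have to be well-formed?
  • @froadie Not sure what the problem with TagSoup is, but I have used Jsoup and it works well. These two links may come in handy: raymondcamden.com/2012/4/6/… bennadel.com/blog/…
  • @GauravS - Thanks, I switched to Jsoup and it works well...

Tags: java coldfusion web-scraping html-parsing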


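【Solution 1】:

I haven't used TagSoup, but for years I have used jTidy to take user-supplied HTML from all kinds of sources (including MS Word) and clean it up into XHTML, and it works well.

You can try jTidy on the same document by putting the jTidy JAR on your classpath or by loading it with JavaLoader. Since you are on CF10, you can use this method to include the JAR.

Then, here is how to call jTidy in cfscript: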

jTidy = createObject("java", "org.w3c.tidy.Tidy");

jTidy.setQuiet(false);
jTidy.setIndentContent(true);
jTidy.setSmartIndent(true);
jTidy.setIndentAttributes(true);
jTidy.setWraplen(1024);
jTidy.setXHTML(true);
jTidy.setNumEntities(true);
jTidy.setConvertWindowsChars(true);             
jTidy.setFixBackslash(true);        // changes \ in urls to /
jTidy.setLogicalEmphasis(true);     // uses strong/em instead of b/i
jTidy.setDropEmptyParas(true);

// create the in and out streams for jTidy
// (parseData is the HTML string to clean, e.g. the argument of the wrapping function)
readBuffer = CreateObject("java","java.lang.String").init(parseData).getBytes();
inP = createobject("java","java.io.ByteArrayInputStream").init(readBuffer);
outx = createObject("java", "java.io.ByteArrayOutputStream").init();

// do the parsing
jTidy.parse(inP,outx);
outstr = outx.toString();
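
One detail worth noting about the stream setup above: getBytes() and toString() without arguments rely on the JVM's default charset, so non-ASCII characters in scraped HTML can get mangled on some servers. The plain-Java sketch below shows the same in/out stream pattern with an explicit UTF-8 charset. It is only an illustration: the class and method names are made up, and a simple byte copy stands in for the jTidy.parse(inP, outx) call, so no cleanup happens here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jTidy or ColdFusion.
final class StreamRoundTrip {

    // Encodes the HTML as UTF-8, pushes it through an in/out stream pair,
    // and decodes the result as UTF-8 again.
    static String roundTrip(String html) {
        byte[] readBuffer = html.getBytes(StandardCharsets.UTF_8);
        ByteArrayInputStream in = new ByteArrayInputStream(readBuffer);
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Stand-in for the parser step: copy the bytes through unchanged.
        byte[] chunk = new byte[1024];
        int n;
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            out.write(chunk, 0, n);
        }

        // Decode with the same charset that was used to encode.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String html = "<td>caf\u00e9 4/17/2014</td>";
        System.out.println(roundTrip(html).equals(html));
    }
}
```

In cfscript, the equivalent change would be to pass a charset name to getBytes() and toString(), for example "UTF-8". Whether jTidy's own input and output encoding settings also need adjusting depends on the jTidy version, so treat that part as something to check.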

The jTidy code above returns valid XHTML that you can query with XPath. I wrapped it in a makeValid() function and then ran that against your HTML:

    <cfset html='<tr > <td align="center"> <span id="id1" >Compliance Review</span> </td><td class="center"> <span id="id2" >395.8(i)</span> </td><td align="left"> <span id="id3" >Failing to submit a record of duty status within 13 days </span> </td><td class="center" > <span id="id4">4/17/2014</span> </td> </tr>' />
<cfset out = makeValid(html) />
<cfdump var="#xmlParse(out)#" />

Here is the output:
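
For anyone who wants to try the XPath step outside ColdFusion, here is a minimal Java sketch using only the JDK's built-in XML parser and XPath engine. The class name, the method name, and the sample markup are assumptions made for illustration: the input is a hand-written, well-formed stand-in for the table row in the question, not actual jTidy output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class XhtmlXPathSketch {

    // Parses well-formed markup and returns the trimmed text of every <span>.
    static List<String> spanTexts(String xhtml) {
        try {
            // The factory is namespace-unaware by default, so "//span"
            // matches elements by their plain tag name.
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList spans =
                (NodeList) xpath.evaluate("//span", doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < spans.getLength(); i++) {
                texts.add(spans.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            // Malformed input ends up here; that is why the HTML has to be
            // cleaned into XHTML before this step.
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<table><tr>"
            + "<td align=\"center\"><span id=\"id1\">Compliance Review</span></td>"
            + "<td class=\"center\"><span id=\"id4\">4/17/2014</span></td>"
            + "</tr></table>";
        System.out.println(spanTexts(xhtml));
    }
}
```

Real cleaned-up XHTML may also carry a DOCTYPE and an XHTML namespace declaration, which can need extra handling (namespace-aware queries, or avoiding a DTD download) that this sketch leaves out.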

【Comments】:

  • Following @GauravS's comment, I actually ended up using Jsoup... but thank you, the link on how to include the JAR was useful too.