【Question title】: Issue parsing HTML with jsoup
【Posted】: 2013-06-26 05:58:05
【Question】:

I am trying to parse this HTML using jsoup.

My code is:

doc = Jsoup.connect(htmlUrl).timeout(1000 * 1000).get();

            Elements items = doc.select("item");
            Log.d(TAG, "Items size : " + items.size());
            for (Element item : items) {
                Log.d(TAG, "in for loop of items");

                Element titleElement = item.select("title").first();
                mTitle = titleElement.text().toString();
                Log.d(TAG, "title is : " + mTitle);

                Element linkElement = item.select("link").first();
                mLink = linkElement.text().toString();
                Log.d(TAG, "link is : " + mLink);

                Element descElement = item.select("description").first();
                mDesc = descElement.text().toString();
                Log.d(TAG, "description is : " + mDesc);


            }

I get the following output:

in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is : 
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.<div class="feedflare"> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo" border="0"></img></a> </div><img src="http://feeds.feedburner.com/~r/reuters/audio/newsmakerus/rss/mp3/~4/NX3AY96GfGk" height="1" width="1"/>

But I want the output to be:

in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is : http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/KDcQe4gF-3U/62828262.mp3  
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.

What changes should I make to my code?

How can I achieve this? Please help!

Thanks in advance!

【Comments】:

    Tags: android html-parsing jsoup


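    【Solution 1】:

    The RSS content you are fetching has two problems.

    1. The link text is not inside the <link/> tag; it sits outside it, as text belonging to the enclosing <item>.
    2. The description tag contains escaped HTML content.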

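    Please find the modified code below.

    When I opened the URL in a browser, I also saw clean HTML content from which the required fields can be extracted easily. You can get that version by setting the userAgent in Jsoup to that of a browser. How you fetch the content is up to you.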

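        // Needs: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
        //        org.jsoup.parser.Parser, org.jsoup.select.Elements
        // timeout(0) disables the timeout, so the request waits indefinitely.
        Document doc = Jsoup.connect("http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").timeout(0).get();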
        System.out.println(doc.html());
        System.out.println("================================");
        Elements items = doc.select("item");
        for (Element item : items) {
    
            Element titleElement = item.select("title").first();
            String mTitle = titleElement.text();
            System.out.println("title is : " + mTitle);
    
            /*
             * The link in the rss is as follows
             *  <link />http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/NX3AY96GfGk/59621707.mp3 
             *  which doesn't fall in the <link> element but falls under <item> TextNode
             */
            String mLink = item.ownText();
            System.out.println("link is : " + mLink);
    
            Element descElement = item.select("description").first();
            /*Unescape the html content, Parse it to a doc, and then fetch only the text leaving behind all the html tags in content
             * "/" is a dummy baseURI passed, as we don't care about resolving the links within parsed content.
             */
            String mDesc = Parser.parse(Parser.unescapeEntities(descElement.text(), false), "/").text();
            System.out.println("description is : " + mDesc);
    
        }
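
Both problems come from reading an RSS (XML) feed with an HTML parser. As a point of comparison, here is a minimal sketch that parses the feed as XML with the JDK's built-in DOM parser instead of jsoup. This is a swapped-in technique, not the method used in the answer above. The class name `RssXmlSketch`, its helper methods, and the `SAMPLE` string with its example.com URLs are all hypothetical stand-ins for illustration. jsoup also has an XML parser mode (`Parser.xmlParser()`), which is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RssXmlSketch {

    // Hypothetical stand-in for one feed entry; the URLs are placeholders, not the real feed.
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel><item>"
        + "<title>Sample title</title>"
        + "<link>http://example.com/audio/1.mp3</link>"
        + "<description>Plain summary.&lt;div class=\"feedflare\"&gt;"
        + "&lt;a href=\"http://example.com/x\"&gt;&lt;img src=\"http://example.com/i.gif\"&gt;"
        + "&lt;/img&gt;&lt;/a&gt;&lt;/div&gt;</description>"
        + "</item></channel></rss>";

    /** Text of the first descendant element named tag, or "" when there is none. */
    static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /**
     * Crude tag stripper: removes anything that looks like <...> and collapses whitespace.
     * A rough simplification; it does not decode entities left inside the embedded HTML.
     */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "").replaceAll("\\s+", " ").trim();
    }

    /** Returns {title, link, description} for the first <item> in the feed. */
    static String[] parseFirstItem(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations, a common precaution for untrusted XML.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            Element item = (Element) doc.getElementsByTagName("item").item(0);
            return new String[] {
                firstText(item, "title"),
                // In XML the URL is the content of <link>, so no ownText() workaround is needed.
                firstText(item, "link"),
                // The XML parser has already unescaped &lt; and &gt;, leaving raw markup to strip.
                stripTags(firstText(item, "description"))
            };
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        String[] fields = parseFirstItem(SAMPLE);
        System.out.println("title is : " + fields[0]);
        System.out.println("link is : " + fields[1]);
        System.out.println("description is : " + fields[2]);
    }
}
```

The regex-based `stripTags` is only adequate for simple trailing markup like the feedflare block. For anything more involved, passing the unescaped description through an HTML parser, as the answer's code does, is likely the more robust choice.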
    

    【Discussion】:
