HTML parsing with JSoup

【Question title】: HTML parsing with JSoup
【Posted】: 2012-09-03 22:10:25
【Question】:

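I am trying to parse the HTML of the following URL:

http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-050-thermal-energy-fall-2002/

to get the text of the <p> tag that contains the instructor's name. The information I need is inside the <body> tag, but I cannot retrieve the tag with JSoup. I don't know what I am doing wrong: when I store the tag in an Element object (call it 'b') and call b.getAllElements(), the <p> does not show up as one of the elements. Isn't that what JSoup's getAllElements() method is supposed to do? If not, please explain the hierarchy I am apparently missing, because the parser cannot find the <p> tag that contains the text I need, in this case "Prof. Zoltan Spakovszky".

Any help would be greatly appreciated.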

public void getHomePageLinks()
{
    String html = "http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-050-thermal-energy-fall-2002/";
    org.jsoup.nodes.Document doc = Jsoup.parse(html);

    Elements bodies = doc.select("body");

    for(Element body : bodies )
    {
        System.out.println(body.getAllElements());
    }

}

The output is:

http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-050-thermal-energy-fall-2002/

Isn't this supposed to print every element inside the document's body tag?

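【Comments】:

A code snippet might help.
Your code prints the body and all of its content. However, if you only want to print the body tag (and all of its child tags), you can use this: System.out.println(doc.body()); (see the answers below on how to obtain doc)

Tags: java html html-parsing jsoup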
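
The output shown in the question is just the URL, which is consistent with Jsoup.parse(String) treating its argument as HTML markup rather than as an address to fetch. A minimal sketch of that behavior follows; the class and method names are hypothetical, chosen only for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseVersusConnect {

    // Jsoup.parse(String) parses the given string AS HTML; it does not download anything.
    // A bare URL therefore becomes a single text node inside <body>.
    static String bodyTextOfParsedString(String input) {
        Document doc = Jsoup.parse(input);
        return doc.body().text();
    }

    // Number of elements reported by getAllElements() on <body>
    // (the body element itself plus all of its descendants).
    static int bodyElementCount(String input) {
        return Jsoup.parse(input).body().getAllElements().size();
    }

    public static void main(String[] args) {
        String url = "http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-050-thermal-energy-fall-2002/";
        // Prints the URL string itself, because that is the only "content" that was parsed
        System.out.println(bodyTextOfParsedString(url));
        // Only <body> exists, so there is no <p> to find
        System.out.println(bodyElementCount(url));
    }
}
```

To work with the real page, the document has to be fetched first, for example with Jsoup.connect(url).get(), which needs network access and can throw IOException.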

【Answer 1】:

I know nothing about JSoup, but it looks like, if you want the instructor's name, you can get at it this way:

// select() returns an Elements collection, so take the first match
Element instructor = doc.select("div.chpstaff div p").first();

【Comments】:

Not sure why someone downvoted you, but +1 from me because your answer is correct.

【Answer 2】:

You have probably solved this already, but I had been working on it, so I couldn't resist posting:

import java.io.IOException;
import java.util.logging.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class JavaApplication17 {

    public static void main(String[] args) {
        try {
            String url = "http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-050-thermal-energy-fall-2002/";
            // Fetch the page over HTTP and parse it into a Document
            Document doc = Jsoup.connect(url).get();
            // Print the text of every <p> element on the page
            Elements paragraphs = doc.select("p");
            for (Element p : paragraphs) {
                System.out.println(p.text());
            }
        } catch (IOException ex) {
            Logger.getLogger(JavaApplication17.class.getName())
                  .log(Level.SEVERE, null, ex);
        }
    }
}

Is this what you meant?

【Comments】:

【Answer 3】:

Here is a short example:

// Connect to the website and parse it into a document
Document doc = Jsoup.connect("http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-050-thermal-energy-fall-2002/").get();

// Select all elements you need (see below for documentation)
Elements elements = doc.select("div[class=chpstaff] p");

// Get the text of the first element
String instructor = elements.first().text();

// eg. print the result
System.out.println(instructor);

See the documentation of the jsoup selector API here: Jsoup Codebook
It is not hard to use, but it is very powerful.
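
The two selectors suggested in this thread, "div[class=chpstaff] p" and "div.chpstaff div p", can be tried offline against an inline fragment. This is only a sketch: the markup below is hypothetical, inferred from those selectors, and the real page may be structured differently. The class and helper names are also made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InstructorSelector {

    // Hypothetical markup, assumed from the selectors in the answers; not copied from the real page.
    static final String SAMPLE_HTML =
        "<html><body>"
        + "<p>Course description</p>"
        + "<div class=\"chpstaff\"><div><p>Prof. Zoltan Spakovszky</p></div></div>"
        + "</body></html>";

    // Returns the text of the first element matching cssQuery, or null when nothing matches.
    // Elements.first() returns null on an empty result, so the check avoids a NullPointerException.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select(cssQuery).first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        // Attribute selector: any <p> below a div whose class attribute is "chpstaff"
        System.out.println(firstText(SAMPLE_HTML, "div[class=chpstaff] p"));
        // Class selector with an extra nested div in the path
        System.out.println(firstText(SAMPLE_HTML, "div.chpstaff div p"));
        // A selector that matches nothing yields null instead of throwing
        System.out.println(firstText(SAMPLE_HTML, "div.missing p"));
    }
}
```

Checking for null before calling text() matters on live pages, where the markup can change and a selector may stop matching.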

【Comments】:

【Answer 4】:

Here is the code:

Document document = Jsoup.connect("http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-050-thermal-energy-fall-2002/").get();

Elements elements = document.select("p");
System.out.println(elements.html());

You can select all <p> tags with Jsoup's selector syntax. This returns the text and the tags inside each <p>.
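
The answers in this thread print the selected paragraphs in two ways, html() and text(). A small sketch of the difference, using a made-up fragment (the sample markup and the class name are assumptions for illustration only):

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class HtmlVersusText {

    // Hypothetical sample: a paragraph with a nested <b> tag
    static final String SAMPLE_HTML = "<p>Taught by <b>Prof. Zoltan Spakovszky</b></p>";

    // Parses the given markup and selects every <p> element
    static Elements paragraphs(String html) {
        return Jsoup.parse(html).select("p");
    }

    public static void main(String[] args) {
        Elements ps = paragraphs(SAMPLE_HTML);
        // html(): inner markup of the matched elements, nested tags included
        System.out.println(ps.html());
        // text(): combined text of the matched elements, tags stripped
        System.out.println(ps.text());
    }
}
```

Use text() when only the instructor's name is needed, and html() when the nested markup (links, bold text) should be kept.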

【Comments】:

【Answer 5】:

Elements ele = doc.select("p");
String text = ele.text();
System.out.println(text);

Try this; I think it will work.

【Comments】: