Parsing a string to get content

【Question title】: Parsing a string to get content
【Posted】: 2014-09-09 11:57:40
【Question description】:

I have the following HTML string:

<h3>I only want this content</h3> I don't want this content <b>random content</b>

I only want to get the content of the h3 tag and discard everything else. This is what I have so far:

String getArticleBody = listArt.getChildText("body");
StringBuilder mainArticle = new StringBuilder();
String getSubHeadlineFromArticle;

if(getArticleBody.startsWith("<h3>") && getArticleBody.endsWith("</h3>")){
    mainArticle.append(getSubHeadlineFromArticle);
 }

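But this returns the entire content, which is not what I want. If anyone can help me, I would really appreciate it.

【Question comments】:

You need to store that content.
See: stackoverflow.com/questions/16597303/…

Tags: java html parsing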

【Solution 1】:

Thanks, guys. All of your answers work, but I ended up using Jsoup.

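// Requires: import org.jsoup.Jsoup;
String getArticleBody = listArt.getChildText("body");
org.jsoup.nodes.Document docc = Jsoup.parse(getArticleBody);
// first() returns null when the body contains no <h3> element
org.jsoup.nodes.Element h3Tag = docc.getElementsByTag("h3").first();
String getSubHeadlineFromArticle = (h3Tag != null) ? h3Tag.text() : "";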

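【Comments】:

【Solution 2】:

The other answers already cover how to get the result you want. I'm going to comment your code to explain why it isn't doing that yet. (Note that I renamed your variables, because strings don't get anything; they are a thing.)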

// declare a bunch of variables
String articleBody = listArt.getChildText("body");
StringBuilder mainArticle = new StringBuilder();
String subHeadlineFromArticle; // declared, but never assigned a value

// check to see if the article body consists entirely of a subheadline
if(articleBody.startsWith("<h3>") && articleBody.endsWith("</h3>")){
    // if it does, append subHeadlineFromArticle, which still holds no content
    // (as an unassigned local variable, this line does not even compile)
    mainArticle.append(subHeadlineFromArticle);
}
// if it doesn't, don't do anything

// final result:
//   articleBody = the entire article body
//   mainArticle = no <h3> content (regardless of whether you appended anything)
//   subHeadlineFromArticle = never assigned
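
One way to act on the explanation above is to assign the heading text before appending it. The following is only a minimal sketch, not the original poster's final code: the class name `SubHeadlineFix` and the method `buildMainArticle` are hypothetical, and the article body is passed in as a plain string instead of coming from `listArt.getChildText("body")`.

```java
public class SubHeadlineFix {

    // Hypothetical helper: returns what mainArticle would hold after the fix.
    static String buildMainArticle(String articleBody) {
        StringBuilder mainArticle = new StringBuilder();

        int open = articleBody.indexOf("<h3>");
        int close = articleBody.indexOf("</h3>");

        // Only proceed when both tags exist and appear in the right order.
        if (open >= 0 && close > open) {
            // Store the content first: skip the 4 characters of "<h3>".
            String subHeadlineFromArticle = articleBody.substring(open + 4, close);
            mainArticle.append(subHeadlineFromArticle);
        }
        return mainArticle.toString();
    }

    public static void main(String[] args) {
        String body = "<h3>I only want this content</h3> I don't want this content <b>random content</b>";
        System.out.println(buildMainArticle(body));
    }
}
```

Unlike the `startsWith`/`endsWith` check, this looks for the tags anywhere in the body, so trailing text after `</h3>` does not prevent a match.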

【Comments】:

【Solution 3】:

You need to use a regular expression, like this:

// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public static void main(String[] args) {
    String str = "<h3>asdfsdafsdaf</h3>dsdafsdfsafsadfa<h3>second</h3>";
    // your pattern goes here
    // The ? makes .+ reluctant, so each match stops at the nearest closing tag
    Pattern pattern = Pattern.compile("<h3>(.+?)</h3>");
    Matcher matcher = pattern.matcher(str);
    // Print the captured group of every match
    while (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}

matcher.group(1) returns exactly the text between the h3 tags.
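
The pattern above assumes a bare `<h3>` tag with its content on a single line. As a hedged sketch, the same idea can be wrapped in a helper that collects every match and tolerates attributes, mixed case, and line breaks. The class name `H3Regex` and the method `extractAll` are made up for illustration, and regular expressions remain a fragile way to read HTML in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class H3Regex {

    // "<h3" with optional attributes, then a reluctant capture up to the nearest "</h3>".
    // DOTALL lets '.' match line breaks; CASE_INSENSITIVE also accepts <H3>.
    private static final Pattern H3 = Pattern.compile(
            "<h3(?:\\s[^>]*)?>(.*?)</h3>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the trimmed content of every h3 element found, in document order.
    static List<String> extractAll(String html) {
        List<String> contents = new ArrayList<>();
        Matcher matcher = H3.matcher(html);
        while (matcher.find()) {
            contents.add(matcher.group(1).trim());
        }
        return contents;
    }

    public static void main(String[] args) {
        String str = "<h3>asdfsdafsdaf</h3>dsdafsdfsafsadfa<h3 class=\"sub\">second</h3>";
        for (String content : extractAll(str)) {
            System.out.println(content);
        }
    }
}
```

Returning a list keeps the caller free to take only the first entry or to handle a body with no heading at all, in which case the list is simply empty.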

【Comments】:

【Solution 4】:

Use a regular expression.
This may help you:

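// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String str = "<h3>I only want this content</h3> I don't want this content <b>random content</b>";
final Pattern pattern = Pattern.compile("<h3>(.+?)</h3>");
final Matcher matcher = pattern.matcher(str);
// group(1) throws IllegalStateException unless find() succeeded, so check it first
if (matcher.find()) {
    System.out.println(matcher.group(1)); // prints the text inside the first <h3>
}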

Output:

I only want this content

【Comments】:

【Solution 5】:

Try this:

String result = getArticleBody.substring(getArticleBody.indexOf("<h3>"), getArticleBody.indexOf("</h3>"))
                .replaceFirst("<h3>", "");
System.out.println(result);

【Comments】:

【Solution 6】:

You can use the substring method like this:

String a="<h3>I only want this content</h3> I don't want this content <b>random content</b>";
System.out.println(a.substring(a.indexOf("<h3>")+4,a.indexOf("</h3>")));

Output:

I only want this content
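
Both substring answers throw a `StringIndexOutOfBoundsException` when either tag is missing, because `indexOf` returns -1 in that case. A guarded variant might look like the sketch below; the class name `H3Substring` and the method `firstH3` are hypothetical, and it handles only a bare `<h3>` without attributes.

```java
import java.util.Optional;

public class H3Substring {

    private static final String OPEN = "<h3>";
    private static final String CLOSE = "</h3>";

    // Returns the text of the first <h3>...</h3> pair, or Optional.empty() if there is none.
    static Optional<String> firstH3(String html) {
        int start = html.indexOf(OPEN);
        if (start < 0) {
            return Optional.empty();
        }
        int contentStart = start + OPEN.length();
        // Search for the closing tag only after the opening tag.
        int end = html.indexOf(CLOSE, contentStart);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(html.substring(contentStart, end));
    }

    public static void main(String[] args) {
        String a = "<h3>I only want this content</h3> I don't want this content <b>random content</b>";
        System.out.println(firstH3(a).orElse("(no h3 found)"));
    }
}
```

Starting the second `indexOf` at `contentStart` means a stray `</h3>` earlier in the string cannot produce a negative-length range.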

【Comments】: