【Question Title】: Parsing a string to get content
【Posted】: 2014-09-09 11:57:40
【Question Description】:

I have the following HTML string:

<h3>I only want this content</h3> I don't want this content <b>random content</b>

I only want to get the content of the h3 tag and discard everything else. This is what I have so far:

String getArticleBody = listArt.getChildText("body");
StringBuilder mainArticle = new StringBuilder();
String getSubHeadlineFromArticle;

if(getArticleBody.startsWith("<h3>") && getArticleBody.endsWith("</h3>")){
    mainArticle.append(getSubHeadlineFromArticle);
 }

But this returns the entire content, which is not what I want. If anyone can help me, I would really appreciate it.

【Question Comments】:

Tags: java html parsing


【Solution 1】:

Thanks, guys. All of your answers work, but I ended up using Jsoup.

String getArticleBody = listArt.getChildText("body");
org.jsoup.nodes.Document docc = Jsoup.parse(getArticleBody);
org.jsoup.nodes.Element h3Tag = docc.getElementsByTag("h3").first();
String getSubHeadlineFromArticle = h3Tag.text();

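【Comments】:

    【Solution 2】:

    The other answers already cover how to get the result you want. I'll annotate your code to explain why it doesn't do that yet. (Note that I renamed your variables, because strings don't get anything; they are things.)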

    // declare a bunch of variables
    String articleBody = listArt.getChildText("body");
    StringBuilder mainArticle = new StringBuilder();
    String subHeadlineFromArticle;
    
    // check whether the article body starts with <h3> and ends with </h3>
    if(articleBody.startsWith("<h3>") && articleBody.endsWith("</h3>")){
        // if it does, append subHeadlineFromArticle, which was never assigned a value
        // (as a local variable this does not compile; as a field it would be null)
        mainArticle.append(subHeadlineFromArticle);
    }
    // if it doesn't, don't do anything
    
    // final result:
    //   articleBody = the entire article body
    //   mainArticle = nothing useful (no h3 content was ever extracted)
    //   subHeadlineFromArticle = never assigned
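The annotated code above never extracts anything, and for the sample string in the question the `endsWith("</h3>")` check is false anyway, because the string ends with `</b>`. Below is a minimal sketch of one way to repair that flow: locate the tags first, assign the subheadline, and only then append it. The class name `SubHeadlineFix` and the method `buildMainArticle` are hypothetical, and the string literal in `main` stands in for `listArt.getChildText("body")`.

```java
class SubHeadlineFix {
    // Returns the text of the first <h3>...</h3> pair, or "" if there is none.
    static String buildMainArticle(String articleBody) {
        StringBuilder mainArticle = new StringBuilder();
        int start = articleBody.indexOf("<h3>");
        if (start >= 0) {
            int contentStart = start + "<h3>".length();
            // Search for the closing tag only after the opening tag.
            int end = articleBody.indexOf("</h3>", contentStart);
            if (end >= 0) {
                // Assign the subheadline BEFORE appending it.
                String subHeadlineFromArticle = articleBody.substring(contentStart, end);
                mainArticle.append(subHeadlineFromArticle);
            }
        }
        return mainArticle.toString();
    }

    public static void main(String[] args) {
        // Stand-in for listArt.getChildText("body") (assumption for this sketch).
        String articleBody =
                "<h3>I only want this content</h3> I don't want this content <b>random content</b>";
        System.out.println(buildMainArticle(articleBody));
    }
}
```

This assumes a plain `<h3>` tag with no attributes and lowercase spelling; anything looser is probably better handled by an HTML parser such as Jsoup.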
    

    【Comments】:

      【Solution 3】:

      You need to use a regular expression, like this:

      public static void main(String[] args) {
          String str = "<h3>asdfsdafsdaf</h3>dsdafsdfsafsadfa<h3>second</h3>";
          // your pattern goes here
          // ? is important since you need to catch the nearest closing tag
          Pattern pattern = Pattern.compile("<h3>(.+?)</h3>"); 
          Matcher matcher = pattern.matcher(str);
          while (matcher.find()) System.out.println(matcher.group(1));
      }
      

      matcher.group(1) returns exactly the text between the h3 tags.
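If the matches are needed as values rather than printed, the same loop can collect them into a list. The sketch below is one possible variant, not the answer's own code: the class `H3Regex` and method `h3Contents` are hypothetical names, the pattern is loosened to tolerate attributes on the opening tag, and the `DOTALL` and `CASE_INSENSITIVE` flags are added so that content spanning line breaks and `<H3>` spellings also match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class H3Regex {
    // DOTALL lets '.' match line breaks; CASE_INSENSITIVE also accepts <H3>.
    // The reluctant (.*?) stops at the nearest closing tag.
    private static final Pattern H3 =
            Pattern.compile("<h3[^>]*>(.*?)</h3>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Collects the content of every h3 element, in document order.
    static List<String> h3Contents(String html) {
        List<String> contents = new ArrayList<>();
        Matcher matcher = H3.matcher(html);
        while (matcher.find()) {
            contents.add(matcher.group(1).trim());
        }
        return contents;
    }

    public static void main(String[] args) {
        String str = "<h3>asdfsdafsdaf</h3>dsdafsdfsafsadfa<h3 class=\"sub\">second</h3>";
        System.out.println(h3Contents(str));
    }
}
```

A regular expression like this is only a rough fit for HTML: it does not understand nesting, comments, or malformed markup, so it is best kept for small, predictable fragments.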

      【Comments】:

        【Solution 4】:

        Use a regular expression.
        This might help you:

        String str = "<h3>I only want this content</h3> I don't want this content <b>random content</b>";
        final Pattern pattern = Pattern.compile("<h3>(.+?)</h3>");
        final Matcher matcher = pattern.matcher(str);
        // group(1) throws IllegalStateException unless find() has succeeded, so check its result
        if (matcher.find()) {
            System.out.println(matcher.group(1)); // Prints String I want to extract
        }
        

        Output:

        I only want this content
        

        【Comments】:

          【Solution 5】:

          Try this:

          String result = getArticleBody.substring(getArticleBody.indexOf("<h3>"), getArticleBody.indexOf("</h3>"))
                          .replaceFirst("<h3>", "");
          System.out.println(result);
          

          【Comments】:

            【Solution 6】:

            You can use the substring method like this -

            String a="<h3>I only want this content</h3> I don't want this content <b>random content</b>";
            System.out.println(a.substring(a.indexOf("<h3>")+4,a.indexOf("</h3>")));
            

            Output -

            I only want this content
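One caveat with the one-liner above: `indexOf` returns -1 when a tag is missing, and `substring` then throws a `StringIndexOutOfBoundsException`. The sketch below shows one way to guard against that and to reuse the same idea for other tags. `TagText` and `firstTagText` are hypothetical names, and the helper assumes tags without attributes.

```java
import java.util.Optional;

class TagText {
    // Returns the text between the first <tag> and the next </tag>,
    // or Optional.empty() if either tag is missing.
    static Optional<String> firstTagText(String html, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int start = html.indexOf(open);
        if (start < 0) {
            return Optional.empty();
        }
        int contentStart = start + open.length();
        int end = html.indexOf(close, contentStart);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(html.substring(contentStart, end));
    }

    public static void main(String[] args) {
        String a = "<h3>I only want this content</h3> I don't want this content <b>random content</b>";
        System.out.println(firstTagText(a, "h3").orElse("(no h3 found)"));
        System.out.println(firstTagText("plain text", "h3").orElse("(no h3 found)"));
    }
}
```

Returning an `Optional` leaves it to the caller to decide what a missing heading means, instead of failing with an exception deep inside string handling.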
            

            【Comments】:
