【Question title】: How to use the PHP DOM parser
【Posted】: 2010-10-31 22:38:17
【Question】:

I'm new to DOM parsing in PHP.
I have an HTML file that I want to parse. It contains a bunch of DIVs like this:

<div id="interestingbox"> 
   <div id="interestingdetails" class="txtnormal">
        <div>Content1</div>
        <div>Content2</div>
   </div>
</div>

<div id="interestingbox"> 
......

I'm trying to get the contents of the many div boxes using PHP. How can I do this with a DOM parser?

Thanks!

【Question comments】:

    Tags: php dom html-parsing


    【Solution 1】:

    I would start with simplehtmldom:

    // file_get_html() needs a full URL (with scheme) or a local file path
    $html = file_get_html('http://example.com/');
    foreach ($html->find('div[id=interestingbox]') as $result)
    {
        echo $result->innertext;
    }
    

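    【Comments】:

    • This one works really well.
    【Solution 2】:

    First, I have to tell you that you can't use the same id on two different divs; classes exist for that purpose. Every element should have a unique id.

    Code to get the contents of the div with id="interestingbox":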

    $html = '
    <html>
    <head></head>
    <body>
    <div id="interestingbox"> 
       <div id="interestingdetails" class="txtnormal">
            <div>Content1</div>
            <div>Content2</div>
       </div>
    </div>
    
    <div id="interestingbox2"><a href="#">a link</a></div>
    </body>
    </html>';
    
    
    $dom_document = new DOMDocument();
    
    $dom_document->loadHTML($html);
    
    //use DOMXpath to navigate the html with the DOM
    $dom_xpath = new DOMXpath($dom_document);
    
    // if you want to get the div with id=interestingbox
    $elements = $dom_xpath->query("*/div[@id='interestingbox']");
    
    // query() returns false on failure, not null
    if ($elements !== false) {
    
      foreach ($elements as $element) {
        echo "\n[". $element->nodeName. "]";
    
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
          echo $node->nodeValue. "\n";
        }
    
      }
    }
    
    //OUTPUT
    [div]  {
            Content1
            Content2
    }
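
    On the duplicate id point: the question's markup repeats id="interestingbox", and an XPath attribute test will still match every such div even though the markup is invalid. The following is a minimal sketch under that assumption; the sample markup and variable names are made up for illustration, and the calls to libxml_use_internal_errors() are there only to keep any parser warnings about the markup out of the output.

```php
<?php
// Hypothetical markup that repeats an id, as in the question.
$html = '<html><body>
<div id="interestingbox"><div>Content1</div></div>
<div id="interestingbox"><div>Content2</div></div>
</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// The attribute test matches every div carrying this id value.
$boxes = $xpath->query("//div[@id='interestingbox']");

$texts = [];
foreach ($boxes as $box) {
    // textContent is the concatenated text of the node and its descendants.
    $texts[] = trim($box->textContent);
}
print_r($texts);
```

    Fixing the markup to use a class is still the better option where you control the HTML; the sketch only helps when you have to parse a page as it is.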
    

    Example using classes:

    $html = '
    <html>
    <head></head>
    <body>
    <div class="interestingbox"> 
       <div id="interestingdetails" class="txtnormal">
            <div>Content1</div>
            <div>Content2</div>
       </div>
    </div>
    
    <div class="interestingbox"><a href="#">a link</a></div>
    </body>
    </html>';
    
    //the same as before.. just change the xpath
    
    [...]
    
    $elements = $dom_xpath->query("*/div[@class='interestingbox']");
    
    [...]
    
    //OUTPUT
    [div]  {
            Content1
            Content2
    }
    
    [div]  {
    a link
    }
    

    For details, see the DOMXPath page.
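
    One caveat with @class='interestingbox' is that it only matches when the attribute is exactly that string, so a div with class="interestingbox featured" would be missed. A possible sketch of a token-based match follows, which also collects the text of each box's direct children into an array. The extra "featured" and "boring" classes and the variable names are assumptions added for the example.

```php
<?php
// Hypothetical markup: the first box carries a second class name.
$html = '<html><body>
<div class="interestingbox featured"><div>Content1</div><div>Content2</div></div>
<div class="interestingbox"><a href="#">a link</a></div>
<div class="boring"><div>skip me</div></div>
</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Pad the class list with spaces so "interestingbox" is matched as a whole token.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' interestingbox ')]";

$result = [];
foreach ($xpath->query($query) as $box) {
    $items = [];
    // "*" evaluated relative to $box selects only its direct child elements.
    foreach ($xpath->query('*', $box) as $child) {
        $items[] = trim($child->textContent);
    }
    $result[] = $items;
}
print_r($result);
```

    Passing $box as the second argument to query() keeps the inner lookup scoped to that box, which is what lets each box's contents land in its own sub-array.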

    【Comments】:

      【Solution 3】:

      A very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue

      // Return the markup of all children of $node, serialized as XML
      function innerXML($node)
      {
          $doc  = $node->ownerDocument;
          $frag = $doc->createDocumentFragment();
          foreach ($node->childNodes as $child) {
              $frag->appendChild($child->cloneNode(true));
          }
          return $doc->saveXML($frag);
      }

      $dom = new DOMDocument();
      $dom->loadXML('
      <html>
      <body>
      <table>
      <tr>
          <td id="foo">
              The first bit of Data I want
              <br />The second bit of Data I want
              <br />The third bit of Data I want
          </td>
      </tr>
      </table>
      </body>
      </html>
      ');

      $xpath = new DOMXPath($dom);
      $node = $xpath->evaluate("/html/body//td[@id='foo']");

      $dataString = innerXML($node->item(0));
      // saveXML() writes empty elements as <br/>, so split on that form
      $dataArr = explode("<br/>", $dataString);

      $dataUno = $dataArr[0];
      $dataDos = $dataArr[1];
      $dataTres = $dataArr[2];

      echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
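
      The function above serializes as XML. For a document loaded with loadHTML(), a similar helper can be sketched with DOMDocument::saveHTML(), which accepts a node argument. The helper name innerHTML and the sample table are assumptions for illustration, not part of the linked thread.

```php
<?php
// Hypothetical helper: serialize the children of $node as HTML markup.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><table><tr><td id="foo">first<br>second<br>third</td></tr></table></body></html>');

$xpath = new DOMXPath($doc);
$cell = $xpath->query("//td[@id='foo']")->item(0);

// In HTML output the line break is written as <br>, so split on that form.
$parts = array_map('trim', explode('<br>', innerHTML($cell)));
print_r($parts);
```

      Splitting on a serialized tag is fragile, since the delimiter depends on how the serializer writes it; iterating over the text nodes of the cell would avoid that dependency.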
      

      【Comments】:

        【Solution 4】:

        WebExtractor: https://github.com/knyga/webextractor. It can parse pages using CSS, regex, and XPath selectors.

        See the package and its test examples:

        use WebExtractor\DataExtractor\DataExtractorFactory;
        use WebExtractor\DataExtractor\DataExtractorTypes;
        use WebExtractor\Client\Client;

        $factory = DataExtractorFactory::getFactory();
        $extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
        $client = new Client;
        $content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
        $extractor->setContent($content);
        $h1 = $extractor->setSelector('h1')->extract();

        【Comments】:
