Finding nested links with Simple HTML DOM (recursion)

【Question title】: Finding nested links with Simple HTML DOM (recursion)
【Posted】: 2017-06-20 10:48:42
【Question description】:

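I'm new to programming, so here is my problem. I'm trying to build a recursive PHP spider with the Simple HTML DOM parser that crawls a given website and returns a list of pages with 2xx, 3xx, 4xx and 5xx status. I've been looking for a solution for days but (probably because of my inexperience) I haven't found anything that works. My current code finds all the links on the root/index page, but I'd like to recursively find the links inside those previously found links, and so on, say down to level 5. Taking the root page as level 0, the recursive function I wrote only shows the level 1 links, repeated 5 times. Any help is appreciated. Thanks.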

<?php
  echo "<strong><h1>Sitemap</h1></strong><br>";

  include_once('simple_html_dom.php');

  $url = "http://www.gnet.it/";
  $html = new simple_html_dom();
  $html->load_file($url);
  echo "<strong><h2>Int Links</h2></strong><br>";
  foreach($html->find("a") as $a)
  {
    if((!(preg_match('#^(?:https?|ftp)://.+$#', $a->href)))&&($a->href != null)&&($a->href != "javascript:;")&&($a->href != "#"))
    {
    echo "<strong>" . $a->href . "</strong><br>";
    }
  }

  echo "<strong><h2>Ext Links</h2></strong><br>";
  foreach($html->find("a") as $a)
  {
    if(((preg_match('#^(?:https?|ftp)://.+$#', $a->href)))&&($a->href != null)&&($a->href != "javascript:;")&&($a->href != "#"))
    {
    echo "<strong>" . $a->href . "</strong><br>";
    }
  }


//recursion

    $depth = 1;
    $maxDepth = 5;
    $recurl = "$a->href";
    $rechtml = new simple_html_dom();
    $rechtml->load_file($recurl);
      while($depth <= $maxDepth){
        echo "<strong><h2>Link annidati livello $depth</h2></strong><br>";
        foreach($rechtml->find("a") as $a)
        {
          if(($a->href != null))
          {
          echo "<strong>" . $a->href . "</strong><br>";
          }
        }
        $depth++;
      }


//csv

  echo "<strong><h1>Google Crawl Errors from CSV</h1></strong><br>";
  echo "<table>\n\n";
$f = fopen("CrawlErrors.csv", "r");
while (($line = fgetcsv($f)) !== false) {
        echo "<tr>";
        foreach ($line as $cell) {
                echo "<td>" . htmlspecialchars($cell) . "</td>";
        }
        echo "</tr>\n";
}
fclose($f);
echo "\n</table>";
?>

【Question discussion】:
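
In the code above, `$rechtml` is loaded once, before the `while` loop, so every pass prints the links of that same page. That matches the "level 1 repeated 5 times" symptom. Below is a minimal sketch of one way to walk the site level by level instead. It is iterative rather than recursive, and `crawl_levels` and `$getLinks` are hypothetical names: `$getLinks` is assumed to be a callable that takes a URL and returns the absolute links found on that page (for example, a wrapper around simple_html_dom).

```php
<?php
// Hypothetical sketch: collect links level by level up to $maxDepth.
// $getLinks is an assumed callable: URL in, array of absolute link URLs out.
function crawl_levels(string $rootUrl, int $maxDepth, callable $getLinks): array
{
    $seen = [$rootUrl => true];   // URLs already queued, to avoid loops
    $levels = [0 => [$rootUrl]];  // level number => URLs first found there

    for ($depth = 1; $depth <= $maxDepth; $depth++) {
        $next = [];
        // Visit every page of the previous level, not one fixed page.
        foreach ($levels[$depth - 1] as $pageUrl) {
            foreach ($getLinks($pageUrl) as $link) {
                if (!isset($seen[$link])) {
                    $seen[$link] = true;
                    $next[] = $link;
                }
            }
        }
        if ($next === []) {
            break; // no new links, nothing left to visit
        }
        $levels[$depth] = $next;
    }

    return $levels;
}
```

Passing the link extractor in as a callable keeps the depth logic separate from the HTML parsing, so the loop can be checked against a small in-memory link map before it is pointed at a live site.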

Tags: php parsing recursion web-crawler

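【Solution 1】:

Try this:

I call this routine in a basic crawler to recursively find all the links on a site. You will have to add some logic to stop it from crawling external sites that the pages of your site link to, otherwise it will run forever!

Note: I got most of this code from another SO thread, so the answer is already out there.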

function crawl_page($url, $depth = 2){

// strip trailing slash from URL
if(substr($url, -1) == '/') {
    $url= substr($url, 0, -1);
}

// which URLs have we already crawled?
static $seen = array();
if (isset($seen[$url]) || $depth === 0) {
    return;
}
$seen[$url] = true;

$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);

$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
    $href = $element->getAttribute('href');
    if (0 !== strpos($href, 'http')) {
        // build the URLs to the same standard - with http:// etc
        $path = '/' . ltrim($href, '/');
        if (extension_loaded('http')) {
            $href = http_build_url($url, array('path' => $path));
        } else {
            $parts = parse_url($url);
            $href = $parts['scheme'] . '://';
            if (isset($parts['user']) && isset($parts['pass'])) {
                $href .= $parts['user'] . ':' . $parts['pass'] . '@';
            }
            $href .= $parts['host'];
            if (isset($parts['port'])) {
                $href .= ':' . $parts['port'];
            }
            $href .= $path;
        }
    }
    crawl_page($href, $depth - 1);
}

// pull out the actual page name without any parent dirs
$pos = strrpos($url, '/');
$slug = $pos === false ? "root" : substr($url, $pos + 1);

echo "slug:" . $slug . "<br>";
}
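
The routine above follows every link it finds, including links to other sites, which is the situation the answer warns about. Here is a minimal sketch of a same-host check that could guard the recursive call. `is_same_host` is a hypothetical helper, not part of the original code, and treating `www.example.com` and `example.com` as the same site is an assumption.

```php
<?php
// Hypothetical helper: true only when $href stays on the same host as $baseUrl.
function is_same_host(string $href, string $baseUrl): bool
{
    // Skip non-web schemes such as mailto: or javascript:.
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if (is_string($scheme) && !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return false;
    }

    $hrefHost = parse_url($href, PHP_URL_HOST);
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);

    // Relative links carry no host, so they belong to the current site.
    if (!is_string($hrefHost)) {
        return true;
    }
    if (!is_string($baseHost)) {
        return false;
    }

    // Host names are case-insensitive; a leading "www." is ignored (assumption).
    $normalize = function (string $host): string {
        $host = strtolower($host);
        return strpos($host, 'www.') === 0 ? substr($host, 4) : $host;
    };

    return $normalize($hrefHost) === $normalize($baseHost);
}
```

Inside the `foreach` of `crawl_page`, a line such as `if (!is_same_host($href, $url)) { continue; }` placed just before the recursive call would skip external links.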

【Discussion】:
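
The question asks for pages listed by 2xx, 3xx, 4xx and 5xx, which the routine above does not produce (it only prints each slug). Below is a minimal sketch of the grouping step. All three function names are hypothetical, and `$fetchStatusLine` is assumed to be a callable that returns the HTTP status line for a URL, such as "HTTP/1.1 404 Not Found".

```php
<?php
// Hypothetical helper: pull the numeric code out of an HTTP status line.
// Returns null when the line does not look like a status line.
function parse_status_code(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical helper: map a status code to its class label, e.g. 404 => "4xx".
function status_class(int $code): string
{
    if ($code < 100 || $code > 599) {
        return 'unknown';
    }
    return intdiv($code, 100) . 'xx';
}

// Group URLs by status class.
// $fetchStatusLine is an assumed callable: URL in, status line string out.
function group_by_status(array $urls, callable $fetchStatusLine): array
{
    $groups = [];
    foreach ($urls as $url) {
        $code = parse_status_code((string) $fetchStatusLine($url));
        $label = $code === null ? 'unknown' : status_class($code);
        $groups[$label][] = $url;
    }
    return $groups;
}
```

In a real run the callable could wrap PHP's `get_headers($url)`, whose element 0 is the status line of the first response. `get_headers` returns `false` on failure, so the wrapper would need to turn that into an empty string.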