【Question title】: How to extract links and titles from a .html page?
【Posted】: 2010-12-12 18:35:58
【Question description】:

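For my website, I'd like to add a new feature.

I want users to be able to upload their bookmarks backup file (from any browser, if possible) so that I can import it into their profile and they don't have to enter every bookmark manually.

The only part I'm missing is extracting the title and URL from the uploaded file. Can anyone give me a clue where to start or what to read?

I used the search option, and "How to extract data from a raw HTML file?" is the most related question I found, but it doesn't cover this.

I really don't mind whether it uses jquery or php.

Thank you very much.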

【Question discussion】:

Tags: php html string hyperlink web-crawler


【Solution 1】:

Thank you everyone, I got it!

The final code:

$html = file_get_contents('bookmarks.html');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their URLs
foreach ($links as $link){
    //Show the anchor text, then the "href" attribute.
    echo $link->nodeValue, ': ';
    echo $link->getAttribute('href'), '<br>';
}

This displays the anchor text and the href of every link in the .html file.

Again, thank you very much.

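【Discussion】:

  • Is the @ prefix on $dom->load($html); deliberate? Otherwise, neat code that I'm now using in a project :)
  • @benjaminhull It's there to prevent the code from throwing any warnings ;)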
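
As a side note on the @ discussion above, here is a minimal sketch of another way to keep the parser quiet: libxml_use_internal_errors() buffers the warnings instead of printing them. The helper name extractLinks() and the sample markup are assumptions for illustration, not part of the answer.

```php
<?php
// Sketch: buffer libxml parse warnings instead of silencing them with @.
// extractLinks() is a hypothetical helper name, not from the answer above.
function extractLinks(string $html): array
{
    $previous = libxml_use_internal_errors(true); // buffer warnings internally
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();                        // discard the buffered warnings
    libxml_use_internal_errors($previous);        // restore the earlier setting

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        // nodeValue is the anchor text, href is the target URL.
        $links[] = ['title' => trim($a->nodeValue), 'url' => $a->getAttribute('href')];
    }
    return $links;
}

print_r(extractLinks('<p><a href="http://example.com/">Example</a></p>'));
```

Returning an array rather than echoing makes it easier to insert the pairs into a user's profile later.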
【Solution 2】:

This might be enough:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node)
{
  echo $node->nodeValue.': '.$node->getAttribute("href")."\n";
}

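【Discussion】:

  • Where $html is the path to the file? Thanks for the quick answer :D
  • @Toni, $html is a string containing the HTML. You can use $dom->loadHTMLFile() to load directly from a file. (You may need to prefix it with @ to suppress warnings.)
  • Wow! Thank you very much! It seems almost done! I can get the links, but I'm having trouble with the name or title (I've tried both)
  • I'm not sure what you mean by name or title. $node->nodeValue is the name of the bookmark.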
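
Putting the two comments above together, a rough sketch might look like this: load the file with loadHTMLFile() and treat nodeValue as the bookmark name. The function name readBookmarks() and the temporary demo file are assumptions made up for the example.

```php
<?php
// Sketch: load bookmarks straight from a file and keep name/URL pairs.
// readBookmarks() is a hypothetical helper name.
function readBookmarks(string $path): array
{
    $dom = new DOMDocument;
    // @ hides the warnings libxml raises for non-standard markup.
    if (!@$dom->loadHTMLFile($path)) {
        return [];
    }
    $bookmarks = [];
    foreach ($dom->getElementsByTagName('a') as $node) {
        $href = $node->getAttribute('href');
        if ($href !== '') {
            // nodeValue is the text between <a> and </a>: the bookmark's name.
            $bookmarks[] = ['name' => trim($node->nodeValue), 'url' => $href];
        }
    }
    return $bookmarks;
}

// Demo with a throwaway file standing in for an uploaded bookmarks.html.
$tmp = tempnam(sys_get_temp_dir(), 'bm');
file_put_contents($tmp, '<html><body><a href="http://example.com/">Example</a></body></html>');
print_r(readBookmarks($tmp));
unlink($tmp);
```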
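【Solution 3】:

Assuming the stored links are in an html file, the best solution is probably to use an html parser such as PHP Simple HTML DOM Parser (I have never tried it myself). (The other option is to search with basic string functions or regular expressions, and you should probably never use regex to parse html.)

After reading the html file with the parser, use its functions to find the a tags:

From the tutorial: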

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>'; 
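
If installing a third-party parser is not an option, the same "find all a tags" step can be sketched with PHP's built-in DOMXPath. This swaps in a different technique from the tutorial snippet above, and findHrefs() is a made-up helper name.

```php
<?php
// Sketch: built-in DOMXPath as a stand-in for a third-party parser's find('a').
// findHrefs() is a hypothetical helper name.
function findHrefs(string $html): array
{
    $dom = new DOMDocument;
    @$dom->loadHTML($html);            // @ hides warnings about sloppy markup
    $xpath = new DOMXPath($dom);

    $hrefs = [];
    // Select only <a> elements that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $element) {
        $hrefs[] = $element->getAttribute('href');
    }
    return $hrefs;
}

echo implode("\n", findHrefs('<a href="/one">1</a> <a name="x">2</a> <a href="/two">3</a>')), "\n";
```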

【Discussion】:

    【Solution 4】:

    Here is an example you can use in your case:

    $content = file_get_contents('bookmarks.html');
    

    Run this:

    <?php
    
    $content = '<html>
    
    <title>Random Website I am Crawling</title>
    
    <body>
    
    Click <a href="http://clicklink.com">here</a> for foobar
    
    Another site is http://foobar.com
    
    </body>
    
    </html>';
    
    $regex = "((https?|ftp)\:\/\/)?"; // SCHEME
    $regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
    $regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
    $regex .= "(\:[0-9]{2,5})?"; // Port
    $regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
    $regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
    $regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor
    
    
    $matches = array(); //create array
    $pattern = "/$regex/";
    
    preg_match_all($pattern, $content, $matches); 
    
    print_r(array_values(array_unique($matches[0])));
    echo "<br><br>";
    echo implode("<br>", array_values(array_unique($matches[0])));
    

    Output:

    Array
    (
        [0] => http://clicklink.com
        [1] => http://foobar.com
    )
    

    http://clicklink.com

    http://foobar.com

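    【Discussion】:

      【Solution 5】:

      I wanted to create a CSV of link paths and their text from html pages, so that I could scrape menus etc. from a site.

      In this example you specify the domain you're interested in, so you don't pick up off-site links, and it then generates a CSV for each document: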

      /**
       * Extracts links to the given domain from the files and creates CSVs of the links
       */
      
      
      $LinkExtractor = new LinkExtractor('https://www.example.co.uk');
      
      $LinkExtractor->extract(__DIR__ . '/hamburger.htm');
      $LinkExtractor->extract(__DIR__ . '/navbar.htm');
      $LinkExtractor->extract(__DIR__ . '/footer.htm');
      
      class LinkExtractor {
          public $domain;
      
          public function __construct($domain) {
            $this->domain = $domain;
          }
      
          public function extract($file) {
              $html = file_get_contents($file);
              //Create a new DOM document
              $dom = new DOMDocument;
      
              //Parse the HTML. The @ is used to suppress any parsing errors
              //that will be thrown if the $html string isn't valid XHTML.
              @$dom->loadHTML($html);
      
              //Get all links. You could also use any other tag name here,
              //like 'img' or 'table', to extract other tags.
              $links = $dom->getElementsByTagName('a');
      
              $results = [];
              //Iterate over the extracted links and collect the on-site ones
              foreach ($links as $link){
                  //Extract and put the matching links in an array for the CSV
                  $href = $link->getAttribute('href');
                  $parts = parse_url($href);
                  //Skip malformed URLs and relative links, which have no host to compare
                  if (!empty($parts['path']) && !empty($parts['host']) && strpos($this->domain, $parts['host']) !== false) {
                      $results[$parts['path']] = [$parts['path'], $link->nodeValue];
                  }
              }
              }
      
              asort($results);
              // Make the CSV
              $fp = fopen($file .'.csv', 'w');
              foreach ($results as $fields) {
                  fputcsv($fp, $fields);
              }
              fclose($fp);
          }
      }
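
One thing to watch in the class above: strpos() is a substring test, so a link to a host such as example.co would also pass when the domain is https://www.example.co.uk. A possible stricter check, sketched under the assumption that only exact host matches should count, could look like this. isSameHost() is a hypothetical helper and not part of the answer.

```php
<?php
// Sketch: compare hosts exactly instead of with a substring test.
// isSameHost() is a hypothetical helper, not part of the LinkExtractor class above.
function isSameHost(string $href, string $domain): bool
{
    $linkHost = parse_url($href, PHP_URL_HOST);   // null for relative URLs, false if malformed
    $siteHost = parse_url($domain, PHP_URL_HOST);
    if (!is_string($linkHost) || !is_string($siteHost)) {
        return false;
    }
    // Host names are case-insensitive, so compare without regard to case.
    return strcasecmp($linkHost, $siteHost) === 0;
}

var_dump(isSameHost('https://www.example.co.uk/menu', 'https://www.example.co.uk')); // bool(true)
var_dump(isSameHost('/menu', 'https://www.example.co.uk'));                          // bool(false)
var_dump(isSameHost('https://example.co/menu', 'https://www.example.co.uk'));        // bool(false)
```

Relative links are rejected here, which matches the behaviour of the class as long as it requires a host.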
      

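      【Discussion】:

      • What the OP asked for was already answered years ago in the same way as your answer. The fact that you are implementing a larger task is irrelevant to this question. If you can find a question that asks for all the functionality you provide, please post your answer there. Stack Overflow's usefulness as a research tool suffers when many posts on the same page give the same answer, because thorough researchers waste time reading redundant advice.
      【Solution 6】: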
      $html = file_get_contents('your file path');

      $dom = new DOMDocument;
      @$dom->loadHTML($html);

      $styles = $dom->getElementsByTagName('link');
      $links = $dom->getElementsByTagName('a');
      $scripts = $dom->getElementsByTagName('script');

      foreach ($styles as $style) {
          if ($style->getAttribute('href') != "#") {
              echo $style->getAttribute('href');
              echo '<br>';
          }
      }

      foreach ($links as $link) {
          if ($link->getAttribute('href') != "#") {
              echo $link->getAttribute('href');
              echo '<br>';
          }
      }

      foreach ($scripts as $script) {
          echo $script->getAttribute('src');
          echo '<br>';
      }
      

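      【Discussion】:

      • Styling failed and the answer is hard to read. Please edit your answer and make it more readable.
      【Solution 7】:

      This is something I did for one of my clients, written as a function that can be used anywhere.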

      function getValidUrlsFrompage($source)
        {
          $links = [];
          $content = file_get_contents($source);
          $content = strip_tags($content, "<a>");
          $subString = preg_split("/<\/a>/", $content);
          foreach ($subString as $val) {
            if (strpos($val, "<a href=") !== FALSE) {
              $val = preg_replace("/.*<a\s+href=\"/sm", "", $val);
              $val = preg_replace("/\".*/", "", $val);
              $val = trim($val);
            }
            if (strlen($val) > 0 && filter_var($val, FILTER_VALIDATE_URL)) {
              if (!in_array($val, $links)) {
                $links[] = $val;
              }
            }
          }
          return $links;
        }
      

      And use it like this:

      $links = getValidUrlsFrompage("https://www.w3resource.com/");
      

      And the expected output is an array of URLs (the answer says 99, although the sample below lists indexes 0 to 100):

      Array ( [0] => https://www.w3resource.com [1] => https://www.w3resource.com/html/HTML-tutorials.php [2] => https://www.w3resource.com/css/CSS-tutorials.php [3] => https://www.w3resource.com/javascript/javascript.php [4] => https://www.w3resource.com/html5/introduction.php [5] => https://www.w3resource.com/schema.org/introduction.php [6] => https://www.w3resource.com/phpjs/use-php-functions-in-javascript.php [7] => https://www.w3resource.com/twitter-bootstrap/tutorial.php [8] => https://www.w3resource.com/responsive-web-design/overview.php [9] => https://www.w3resource.com/zurb-foundation3/introduction.php [10] => https://www.w3resource.com/pure/ [11] => https://www.w3resource.com/html5-canvas/ [12] => https://www.w3resource.com/course/javascript-course.html [13] => https://www.w3resource.com/icon/ [14] => https://www.w3resource.com/linux-system-administration/installation.php [15] => https://www.w3resource.com/linux-system-administration/linux-commands-introduction.php [16] => https://www.w3resource.com/php/php-home.php [17] => https://www.w3resource.com/python/python-tutorial.php [18] => https://www.w3resource.com/java-tutorial/ [19] => https://www.w3resource.com/node.js/node.js-tutorials.php [20] => https://www.w3resource.com/ruby/ [21] => https://www.w3resource.com/c-programming/programming-in-c.php [22] => https://www.w3resource.com/sql/tutorials.php [23] => https://www.w3resource.com/mysql/mysql-tutorials.php [24] => https://w3resource.com/PostgreSQL/tutorial.php [25] => https://www.w3resource.com/sqlite/ [26] => https://www.w3resource.com/mongodb/nosql.php [27] => https://www.w3resource.com/API/google-plus/tutorial.php [28] => https://www.w3resource.com/API/youtube/tutorial.php [29] => https://www.w3resource.com/API/google-maps/index.php [30] => https://www.w3resource.com/API/flickr/tutorial.php [31] => https://www.w3resource.com/API/last.fm/tutorial.php [32] => https://www.w3resource.com/API/twitter-rest-api/ [33] => 
https://www.w3resource.com/xml/xml.php [34] => https://www.w3resource.com/JSON/introduction.php [35] => https://www.w3resource.com/ajax/introduction.php [36] => https://www.w3resource.com/html-css-exercise/index.php [37] => https://www.w3resource.com/javascript-exercises/ [38] => https://www.w3resource.com/jquery-exercises/ [39] => https://www.w3resource.com/jquery-ui-exercises/ [40] => https://www.w3resource.com/coffeescript-exercises/ [41] => https://www.w3resource.com/php-exercises/ [42] => https://www.w3resource.com/python-exercises/ [43] => https://www.w3resource.com/c-programming-exercises/ [44] => https://www.w3resource.com/csharp-exercises/ [45] => https://www.w3resource.com/java-exercises/ [46] => https://www.w3resource.com/sql-exercises/ [47] => https://www.w3resource.com/oracle-exercises/ [48] => https://www.w3resource.com/mysql-exercises/ [49] => https://www.w3resource.com/sqlite-exercises/ [50] => https://www.w3resource.com/postgresql-exercises/ [51] => https://www.w3resource.com/mongodb-exercises/ [52] => https://www.w3resource.com/twitter-bootstrap/examples.php [53] => https://www.w3resource.com/euler-project/ [54] => https://w3resource.com/w3skills/html5-quiz/ [55] => https://w3resource.com/w3skills/php-fundamentals/ [56] => https://w3resource.com/w3skills/sql-beginner/ [57] => https://w3resource.com/w3skills/python-beginner-quiz/ [58] => https://w3resource.com/w3skills/mysql-basic-quiz/ [59] => https://w3resource.com/w3skills/javascript-basic-skill-test/ [60] => https://w3resource.com/w3skills/javascript-advanced-quiz/ [61] => https://w3resource.com/w3skills/javascript-quiz-part-iii/ [62] => https://w3resource.com/w3skills/mongodb-basic-quiz/ [63] => https://www.w3resource.com/form-template/ [64] => https://www.w3resource.com/slides/ [65] => https://www.w3resource.com/convert/number/binary-to-decimal.php [66] => https://www.w3resource.com/excel/ [67] => https://www.w3resource.com/video-tutorial/php/some-basics-of-php.php [68] => 
https://www.w3resource.com/video-tutorial/javascript/list-of-tutorial.php [69] => https://www.w3resource.com/web-development-tools/firebug-tutorials.php [70] => https://www.w3resource.com/web-development-tools/useful-web-development-tools.php [71] => https://www.facebook.com/w3resource [72] => https://twitter.com/w3resource [73] => https://plus.google.com/+W3resource [74] => https://in.linkedin.com/in/w3resource [75] => https://feeds.feedburner.com/W3resource [76] => https://www.w3resource.com/ruby-exercises/ [77] => https://www.w3resource.com/graphics/matplotlib/ [78] => https://www.w3resource.com/python-exercises/numpy/index.php [79] => https://www.w3resource.com/python-exercises/pandas/index.php [80] => https://w3resource.com/plsql-exercises/ [81] => https://w3resource.com/swift-programming-exercises/ [82] => https://www.w3resource.com/angular/getting-started-with-angular.php [83] => https://www.w3resource.com/react/react-js-overview.php [84] => https://www.w3resource.com/vue/installation.php [85] => https://www.w3resource.com/jest/jest-getting-started.php [86] => https://www.w3resource.com/numpy/ [87] => https://www.w3resource.com/php/composer/a-gentle-introduction-to-composer.php [88] => https://www.w3resource.com/php/PHPUnit/a-gentle-introduction-to-unit-test-and-testing.php [89] => https://www.w3resource.com/laravel/laravel-tutorial.php [90] => https://www.w3resource.com/oracle/index.php [91] => https://www.w3resource.com/redis/index.php [92] => https://www.w3resource.com/cpp-exercises/ [93] => https://www.w3resource.com/r-programming-exercises/ [94] => https://w3resource.com/w3skills/ [95] => https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_US [96] => https://www.w3resource.com/privacy.php [97] => https://www.w3resource.com/about.php [98] => https://www.w3resource.com/contact.php [99] => https://www.w3resource.com/feedback.php [100] => https://www.w3resource.com/advertise.php )
      

      Hope this helps someone. Here is a gist - https://gist.github.com/ManiruzzamanAkash/74cffb9ffdfc92f57bd9cf214cf13491

      【Discussion】:

      • What if class, rel, target, id, ... come before the href attribute?
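
      One way to address the attribute-order concern raised in the comment above is to read href through the DOM rather than with string replacement. This is only a sketch of that idea: getValidUrlsFromHtml() is a hypothetical variant that takes an HTML string instead of a URL, and it keeps the answer's FILTER_VALIDATE_URL check.

```php
<?php
// Sketch: read href through the DOM so attribute order does not matter.
// getValidUrlsFromHtml() is a hypothetical variant that takes an HTML string.
function getValidUrlsFromHtml(string $html): array
{
    $dom = new DOMDocument;
    @$dom->loadHTML($html);            // @ hides warnings about sloppy markup
    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $url = trim($a->getAttribute('href'));
        // Keep only absolute, well-formed URLs, each one once.
        if ($url !== '' && filter_var($url, FILTER_VALIDATE_URL) && !in_array($url, $links, true)) {
            $links[] = $url;
        }
    }
    return $links;
}

print_r(getValidUrlsFromHtml(
    '<a class="nav" target="_blank" href="https://example.com/a">A</a>' .
    '<a href="https://example.com/a">again</a><a href="#top">top</a>'
));
```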