Function to scrape page keywords, description and title? Answers

【Question title】: Function to scrape page keywords, description and title?
【Posted】: 2012-06-15 03:30:19
【Question description】:

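I wrote 3 simple functions to scrape the title, description and keywords of a simple HTML page. This is the first function, which scrapes the title: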

function getPageTitle ($url)
{
    $content = $url;
    if (eregi("<title>(.*)</title>", $content, $array)) {
        $title = $array[1];
        return $title;
    }
}

It works fine. These are the 2 functions for scraping the description and keywords, and these are the ones that do not work:

function getPageKeywords($url)
{
    $content = $url; 
    if ( preg_match('/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+keywords[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i', $content, $array)) { 
        $keywords = $array[1];  
        return $keywords; 
    }  
}
function getPageDesc($url)
{
    $content = $url; 
    if ( preg_match('/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+description[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i', $content, $array)) { 
        $desc = $array[1];  
        return $desc; 
    }  
}

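I know there may be something wrong with the preg_match lines, but I really don't know what. I have tried many things and it still does not work.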

【Comments】:

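Please note: eregi is deprecated. php.net/manual/en/function.eregi.php
Parsing HTML with regular expressions runs into trouble with anything more complex than simple tag pairs; once you start parsing tag attributes, you need to switch to PHP DOM: php.net/manual/en/book.dom.php. The problem is that the name and content attributes must appear in exactly the order your pattern matches them.
A third important point: just because something is on a web page does not mean you have the right to use the data in any way you like (without permission).
Have you tried Simple HTML DOM parser? It works like jQuery DOM parsing.
Tony the Pony is coming for you... he's hungry.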
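
The attribute-order point from the comments can be checked directly. The sketch below reuses the keywords pattern from the question unchanged and runs it against two made-up sample tags (they are assumptions for illustration, not taken from any real page) that differ only in attribute order.

```php
<?php
// The keywords pattern from the question, unchanged.
$pattern = '/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+keywords[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i';

// Hypothetical sample tags: same attributes, different order.
$nameFirst    = '<meta name="keywords" content="php, regex">';
$contentFirst = '<meta content="php, regex" name="keywords">';

// preg_match() returns 1 on a match and 0 when nothing matches.
$matchedNameFirst    = preg_match($pattern, $nameFirst, $m1);
$matchedContentFirst = preg_match($pattern, $contentFirst, $m2);

var_dump($matchedNameFirst);     // name before content
var_dump($matchedContentFirst);  // content before name
```

The pattern only succeeds when name comes before content, so any page that writes the attributes the other way round (or puts another attribute between them) yields nothing.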

Tags: php regex web-scraping

【Solution 1】:

Why not use get_meta_tags? The example below is from the PHP documentation.

<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');

// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author'];       // name
echo $tags['keywords'];     // php documentation
echo $tags['description'];  // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>

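Note that the argument can be a URL or the path to a local file. get_meta_tags() expects a file name, so an HTML string that is already in memory has to be written to a file first.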
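
If the HTML is already in a string, one possible workaround is to write it to a temporary file and hand that path to get_meta_tags(). This is only a sketch: the helper name metaTagsFromString and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: get_meta_tags() reads from a file name or URL,
// so the HTML string is written to a temporary file first.
function metaTagsFromString(string $html): array
{
    $path = tempnam(sys_get_temp_dir(), 'meta');
    if ($path === false) {
        return [];
    }
    file_put_contents($path, $html);
    $tags = get_meta_tags($path);
    unlink($path);

    // get_meta_tags() returns false if the file cannot be read.
    return $tags === false ? [] : $tags;
}

// Made-up sample document for illustration.
$html = <<<'HTML'
<html>
<head>
<title>Sample page</title>
<meta name="keywords" content="php, scraping">
<meta name="description" content="A sample page">
</head>
<body></body>
</html>
HTML;

$tags = metaTagsFromString($html);
echo $tags['keywords'] ?? '', PHP_EOL;
echo $tags['description'] ?? '', PHP_EOL;
```

The `??` fallback matters in practice because many pages do not declare these tags at all.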

【Discussion】:

【Solution 2】:

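It is better to parse HTML with PHP's native DOMDocument than with regular expressions. Regex can work too, but these days a lot of websites no longer even include keywords and description tags, so you cannot always rely on them being there. Here is how to do it with DOMDocument: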

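<?php
$source = file_get_contents('http://php.net');

$dom = new DOMDocument("1.0", "UTF-8");
// Parse-time option, so it has to be set before loading
$dom->preserveWhiteSpace = false;
// @ suppresses the warnings libxml raises on malformed real-world HTML
@$dom->loadHTML($source);

// Get the title; item(0) is null when the page has no <title>
$titleNode = $dom->getElementsByTagName('title')->item(0);
$title = $titleNode ? $titleNode->nodeValue : '';

// Get the description and keywords from the <meta> tags
$description = '';
$keywords = '';
foreach ($dom->getElementsByTagName('meta') as $meta) {
    if ($meta->getAttribute('name') == 'description') { $description = $meta->getAttribute('content'); }
    if ($meta->getAttribute('name') == 'keywords')    { $keywords = $meta->getAttribute('content'); }
}

echo $title, PHP_EOL;
echo $description, PHP_EOL;
echo $keywords, PHP_EOL;
?>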
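
To mirror the three functions from the question, the same DOMDocument approach can be wrapped in a single helper that works on HTML that has already been fetched. This is a rough sketch rather than part of the answer: the function name extractPageMeta, the returned array shape, and the sample markup are all assumptions.

```php
<?php
// Hypothetical helper (not from the answer): extracts title, description
// and keywords from an HTML string that has already been fetched.
function extractPageMeta(string $html): array
{
    $result = ['title' => '', 'description' => '', 'keywords' => ''];
    if (trim($html) === '') {
        return $result; // nothing to parse
    }

    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is null when the page has no <title> element.
    $titleNode = $dom->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $result['title'] = trim($titleNode->textContent);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        // Compare case-insensitively: pages write "Keywords", "DESCRIPTION", etc.
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $result[$name] = $meta->getAttribute('content');
        }
    }
    return $result;
}

// Made-up sample; note that content comes before name in the first tag.
$html = '<html><head><title> Sample page </title>'
      . '<meta content="php, dom" name="Keywords">'
      . '<meta name="description" content="A made-up page">'
      . '</head><body><p>Hello</p></body></html>';

$meta = extractPageMeta($html);
print_r($meta);
```

Because the attributes are read by name, their order inside the tag no longer matters, and missing tags simply leave the corresponding entry as an empty string.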

【Discussion】: