Is there any PHP function to find a URL's title or description? Answers

[Question title]: Is there any php function to find any url title or description?
[Posted]: 2021-07-10 18:41:58
[Question]:

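I'm new to data scraping and I'm working on scraping titles from URLs. I want to write a function that takes a url/link as the request and, in return, gives me the <title> </title>, og:title, and all the other meta properties.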

I'm trying to scrape only the title with this function:

/**
     * @param Request $request
     * @return \Illuminate\Http\JsonResponse
     *
     * @throws ValidationException
     */
    public function getTitle(Request $request)
    {
        $this->validate($request, [
            'link' => 'required',
        ]);

        $link = $request->input('link');

        $str = @file_get_contents($link);
        if(strlen($str)>0){
            $str = trim(preg_replace('/\s+/', ' ', $str));
            preg_match("/\<title\>(.*)\<\/title\>/i",$str,$title);
            $result = $title[1];
        }

        return Response::json([
            'message' => 'Get title',
            'data'    => $result,
        ], \Symfony\Component\HttpFoundation\Response::HTTP_OK);
    }

Route:

Route::post('request-title', 'BuyShipRequestController@getTitle');

An example of what I enter in the input field:

Amazon-url

And what I want in the response:

<title>Amazon.com: Seagate Portable 2TB External Hard Drive Portable HDD – USB 3.0 for PC, Mac, PS4, &amp; Xbox - 1-Year Rescue Service (STGX2000400): Computers &amp; Accessories</title>

and

<meta name="description"/> , <meta name="title"/>, <meta name="keywords" /> , link

In return, I only want the content or value of those meta properties.

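[Comments on the question]:

Does this answer your question? How to parse HTML in PHP?
It's unclear what your question is. Can you describe what getTitle() currently does wrong? Where do you need help? See stackoverflow.com/help/how-to-ask for more information.
No, I want to get the title and all meta properties for any link, any web link.
getTitle() is not correct: it only returns the title of a link, and sometimes it doesn't return any <title></title> at all. I want all the meta properties as well as the link and the title.
I need help finding the meta property content, together with the title and link, for any link.
"Is there a PHP function?" No. Instead, as others have said, you have to parse the HTML.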

Tags: php laravel laravel-5 web-scraping metadata

[Solution 1]:

A very simple way to do this without external libraries is to query the HTML document with XPath:

XPath expression	Result
//div	Returns all div tags
//meta	Returns all meta tags
//meta[@name]	Returns all meta tags having a 'name' attribute

In PHP, XPath is available through DOMXPath. Since XPath works on a DOM tree, we first need a DOMDocument:

$dom = new DomDocument;
$dom->loadHTML($some_html);

$xpath = new DomXPath($dom);
$xpath->query(".//meta");
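
As a small illustration of the expressions in the table, the sketch below runs them against a made-up HTML string. The markup and the variable names are assumptions for this example only; they are not taken from the Amazon page.

```php
<?php
// Hypothetical sample markup, only for demonstration.
$some_html = '<html><head><title>Example page</title>'
    . '<meta name="description" content="A short description">'
    . '<meta property="og:title" content="Example OG title">'
    . '</head><body><div>Hello</div></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($some_html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Count the matches for each expression from the table.
$all_meta   = $xpath->query('//meta')->length;
$named_meta = $xpath->query('//meta[@name]')->length;
$all_divs   = $xpath->query('//div')->length;

// Read a single attribute value from the first matching node.
$description = $xpath->query('//meta[@name="description"]')
    ->item(0)
    ->getAttribute('content');

echo "{$all_meta} meta tags, {$named_meta} with a name attribute, {$all_divs} div\n";
echo $description, "\n";
```

Note that the og:title tag uses a property attribute rather than name, so //meta[@name] does not match it.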

So, given the file you provided...

$html = file_get_contents('amazon.html');

...we can write a basic function that queries it for a set of tags:

function get_from_html(string $html, array $tags) {

    $collect = [];

    // Turn off default error reporting so we're not drowning  
    // in errors when the HTML is malformed. We can get a 
    // hold of them anytime via libxml_get_errors().
    // Cf. https://www.php.net/libxml_use_internal_errors
    libxml_use_internal_errors(true);
    
    // Turn HTML string into a DOM tree.    
    $dom = new DomDocument;
    $dom->loadHTML($html);
    
    // Set up XPath
    $xpath = new DomXPath($dom);

    // Query the DOM tree for the given set of tags.
    foreach ($tags as $tag) {

        // You can do *a lot* more with XPath, cf. this cheat sheet:
        // https://gist.github.com/LeCoupa/8c305ec8c713aad07b14 
        $result = $xpath->query("//{$tag}"); 

        if ($result instanceof DOMNodeList) {

            $collect[$tag] = $result;
        }
    }

    // Clear errors to free up memory, cf.
    // https://www.php.net/manual/de/function.libxml-use-internal-errors.php#78236
    libxml_clear_errors();

    return $collect;
}
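
The $html used here was read from a local file. If it should come straight from a URL, as in the question, file_get_contents() accepts a stream context. The following is only a sketch: fetch_html(), the timeout, and the User-Agent string are assumptions made for illustration, and some sites may still refuse or block automated requests.

```php
<?php
/**
 * Hypothetical helper: fetch a document and return its HTML,
 * or null when the request fails or the body is empty.
 */
function fetch_html(string $url, int $timeout = 10): ?string
{
    // HTTP options; they are ignored for non-HTTP sources such as local files.
    $context = stream_context_create([
        'http' => [
            'method'          => 'GET',
            'timeout'         => $timeout,
            'follow_location' => 1,
            // Assumed value; pick a User-Agent that identifies your client.
            'user_agent'      => 'Mozilla/5.0 (compatible; ExampleBot/1.0)',
        ],
    ]);

    // Suppress the warning and check the return value instead.
    $html = @file_get_contents($url, false, $context);

    return ($html === false || $html === '') ? null : $html;
}
```

A caller can then treat null as "nothing to parse" instead of passing an empty string on to loadHTML().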

When called...

$results = get_from_html($html, ['title', 'meta']);

...it returns an array of iterable DOMNodeList objects, which you can easily evaluate further (e.g., to inspect the attributes of every node in a list):

// For demonstration purposes, just walk the results and turn each found node
// back to its HTML representation.
// 
// For real world stuff, cf.:
// - https://www.php.net/manual/en/class.domnodelist.php
// - https://www.php.net/manual/en/class.domnode.php
// - https://www.php.net/manual/en/class.domelement.php
if (!empty($results)) {

    foreach ($results as $key => $nodes) {

        if ($key == 'title') {

            $node = $nodes->item(0);

            // item(0) is null when the page has no <title>.
            if ($node !== null) {

                // Get HTML, cf. https://stackoverflow.com/a/12909924/3323348
                // Output: <title>Amazon.com: Seagate (...)</title>
                var_dump($node->ownerDocument->saveHTML($node));
            }
        }

        if ($key == 'meta') {

            foreach ($nodes as $node) {

                // Get HTML, cf. https://stackoverflow.com/a/12909924/3323348
                // Output: <meta (...)>
                var_dump($node->ownerDocument->saveHTML($node));

                // Or get an attribute
                if ($node->hasAttribute('name')) {
 
                    // Output: "keywords", or "description", or...  
                    var_dump($node->getAttribute('name'));
                }
            }
        }
    }
}
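
Since the question asks only for the title text and the content of each meta tag, the same approach can be condensed into a single lookup array. This is a minimal sketch under assumptions: extract_page_metadata() and the sample markup are hypothetical names and data introduced for illustration, and meta tags are keyed by their name attribute or, failing that, their property attribute (which is what og: tags use).

```php
<?php
/**
 * Hypothetical helper: returns the page title plus the content of every
 * meta tag, keyed by its name or property attribute.
 */
function extract_page_metadata(string $html): array
{
    $data = ['title' => null, 'meta' => []];

    if (trim($html) === '') {
        return $data; // loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // First <title> element, if there is one.
    $title = $xpath->query('//title')->item(0);
    if ($title !== null) {
        $data['title'] = trim($title->textContent);
    }

    // Every <meta> that has a content attribute and a name or property.
    foreach ($xpath->query('//meta[@content]') as $meta) {
        $key = $meta->getAttribute('name') ?: $meta->getAttribute('property');
        if ($key !== '') {
            $data['meta'][$key] = $meta->getAttribute('content');
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $data;
}

// Made-up sample document for demonstration.
$sample = '<html><head><title> Sample &amp; Co </title>'
    . '<meta name="description" content="Demo page">'
    . '<meta name="keywords" content="a, b">'
    . '<meta property="og:title" content="Sample OG">'
    . '<meta http-equiv="refresh" content="30">'
    . '</head><body></body></html>';

$metadata = extract_page_metadata($sample);
print_r($metadata);
```

In the controller from the question, an array like this could be returned as the 'data' entry of the JSON response. PHP also ships get_meta_tags(), but it only reads meta tags that carry a name attribute, so og: properties and the title would still need a parser like the one above.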

More on XPath:

https://github.com/code4libtoronto/2016-07-28-librarycarpentrylessons/blob/master/xpath-xquery/lesson.md (introduction, tutorial)
https://devhints.io/xpath (cheat sheet)
https://gist.github.com/LeCoupa/8c305ec8c713aad07b14 (cheat sheet)

[Comments]: