【Question title】: Parsing an HTML page using curl and xpath in PHP
【Posted】: 2017-02-23 21:41:44
【Question】:

I need to parse the web page https://www.galliera.it/118 to get the numbers shown under the coloured bars.

This is my code (it doesn't work!) ...

<?php
    ini_set('display_errors', 1);

    $url = 'https://www.galliera.it/118';

    print "The url ... ".$url;
    echo '<br>';
    echo '<br>';

    //#Set CURL parameters ...
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, '');
    $data = curl_exec($ch);
    curl_close($ch);

    //print "Data ... ".$data;
    //echo '<br>';
    //echo '<br>';

    $dom = new DOMDocument();
    @$dom->loadHTML($data);

    $xpath = new DOMXPath($dom);

    // This is the xpath for a number under a bar ....
    // /html/body/div[2]/div[1]/div/div/ul/li[6]/span
    // How may I get it?
    // The following code doesn't work, it's only to show my goals ..

    $greenWaitingNumber = $xpath->query('/html/body/div[2]/div[1]/div/div/ul/li[6]/span');
    $theText = (string).$greenWaitingNumber;

    print "Data ... ".$theText;
    echo '<br>';
    echo '<br>';

?>

Any suggestions, examples or alternatives?

【Comments】:

  • "It doesn't work": can you be more specific? Also, (string).$greenWaitingNumber is invalid syntax, and you cannot echo a DOMElement like that (a SimpleXMLElement can be echoed when you use SimpleXML).
  • You're right, sorry. I get a blank page and the web console shows "Error 500". I think the problem is in the line $theText = (string).$greenWaitingNumber; but I'm not sure whether the $xpath->query call is correct either (note that I got the XPath from the browser's "Inspect element" feature).
  • Because of the index notation, your XPath targets one specific value. To get all of them you need something more general at the start: /html/body/div/div/div/div/ul/li[6]/span
  • OK, thanks. So $greenWaitingNumber = $xpath->query('/html/body/div[2]/div[1]/div/div/ul/li[6]/span'); is correct, I guess. In that case, how can I print the value of $greenWaitingNumber?
  • $greenWaitingNumber = $xpath->query('/html/body/div[2]/div[1]/div/div/ul/li[6]/span'); $theText = $greenWaitingNumber[0]->nodeValue; will give you "2"
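
The point made in the comments above can be sketched as follows: `DOMXPath::query()` returns a `DOMNodeList`, which cannot be cast to a string, so the text has to be read from an individual node. The markup below is a made-up stand-in for the real page (the class names come from the answer output further down, the structure is an assumption), and `firstNodeText` is a hypothetical helper name.

```php
<?php
// Hypothetical sample markup; the real page structure may differ.
$html = <<<HTML
<html><body>
  <ul>
    <li class="rosso"><span>1</span></li>
    <li class="verde"><span>7</span></li>
  </ul>
</body></html>
HTML;

/**
 * Return the trimmed text of the first node matching $expression,
 * or null when the query fails or nothing matches.
 */
function firstNodeText(DOMXPath $xpath, string $expression): ?string
{
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

libxml_use_internal_errors(true);   // buffer parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$green = firstNodeText($xpath, '//li[@class="verde"]/span');
$missing = firstNodeText($xpath, '//li[@class="blu"]/span');

echo $green === null ? 'not found' : $green, PHP_EOL;
```

Checking the list length before calling `item(0)` avoids a fatal error when the page layout changes and the expression no longer matches anything.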

Tags: php parsing curl xpath web-scraping


【Solution 1】:

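Here is a PHP script that collects the data you asked for into a well-organized array. Look at the script's output and change the structure as needed. Cheers!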

$html = file_get_contents("https://www.galliera.it/118");

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// find all divs class row
$rows = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' row ')]");

$data = array();
foreach ($rows as $row) {
    $groupName = $row->getElementsByTagName('h2')->item(0)->textContent;
    $data[$groupName] = array();

    // find all div class box
    $boxes = $finder->query("./*[contains(concat(' ', normalize-space(@class), ' '), ' box ')]", $row);
    foreach ($boxes as $box) {
        $subgroupName = $box->getElementsByTagName('h3')->item(0)->textContent;
        $data[$groupName][$subgroupName] = array();

        $listItems = $box->getElementsByTagName('li');
        foreach ($listItems as $k => $li) {

            $class = $li->getAttribute('class');
            $text = $li->textContent;

            if (!strlen(trim($text))) {
                // this should be the graph bar, so skip it
                continue;
            }

            // I see only integer values, so I cast to int; change the type or drop the cast if you need to
            $data[$groupName][$subgroupName][] = array('type' => $class, 'value' => (int) $text);
        }
    }
}

echo '<pre>' . print_r($data, true) . '</pre>';

The output looks something like this:

Array
(
    [SAN MARTINO - 15:30] => Array
        (
            [ATTESA: 22] => Array
                (
                    [0] => Array
                        (
                            [type] => rosso
                            [value] => 1
                        )

                    [1] => Array
                        (
                            [type] => giallo
                            [value] => 12
                        )

                    [2] => Array
                        (
                            [type] => verde
                            [value] => 7
                        )

                    [3] => Array
                        (
                            [type] => bianco
                            [value] => 2
                        )

                )

            [VISITA: 45] => Array
                (
                    [0] => Array
                        (
                            [type] => rosso
                            [value] => 5
                        )
...
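
The script above calls `$dom->loadHTML($html)` directly, and real-world pages are rarely valid HTML, so the parser may emit a warning for each problem it finds. The question silences these with `@`. A sketch of an alternative, which uses libxml's internal error buffer so the messages stay available for inspection, is shown below. The sloppy sample markup and the `loadHtmlQuietly` helper name are assumptions made for illustration.

```php
<?php
/**
 * Parse $html without emitting warnings. Any parser messages are
 * handed back through $errors instead of being printed.
 */
function loadHtmlQuietly(string $html, ?array &$errors = null): DOMDocument
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $errors = array_map(
        function (LibXMLError $e) { return trim($e->message); },
        libxml_get_errors()
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);   // restore the caller's setting
    return $dom;
}

// Hypothetical markup with unclosed elements, not the real page.
$html = '<html><body><div class="row"><h2>SAN MARTINO</h2>'
      . '<ul><li class="verde">7</ul></div><p>unclosed</body></html>';

$dom = loadHtmlQuietly($html, $errors);
$h2 = $dom->getElementsByTagName('h2')->item(0);

echo $h2->textContent, PHP_EOL;
echo count($errors), ' parser message(s) collected', PHP_EOL;
```

Whether any messages are collected for a given input depends on the installed libxml2 version, so the count is best treated as diagnostic information only.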

【Comments】:

    【Solution 2】:

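    This may help simplify the XPath expression for this particular case.

    The query below finds the span elements directly under every li element whose class attribute is exactly "verde".

    The // notation means "match at any level of the document", so you don't have to build the query from the root.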

    /* @var $node DOMElement */
    $greenWaitingNumber = $xpath->query('//li[@class="verde"]/span');
    foreach( $greenWaitingNumber as $node )
    {
      echo $node->nodeValue;
    }
    

    *Note: this will not handle class="verde foo bar".
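
A sketch of how the multi-class case could be handled, reusing the `contains(concat(...))` idiom from Solution 1 so that `verde` is matched as a whole token among several classes. The sample markup and the `classSelector` helper name are assumptions for illustration.

```php
<?php
/**
 * Build an XPath predicate that matches $class as a whole token inside
 * a space-separated class attribute. Hypothetical helper; $class is
 * assumed to contain no quotes or spaces.
 */
function classSelector(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')";
}

// Made-up sample markup, not the real page.
$html = '<ul>'
      . '<li class="verde"><span>7</span></li>'
      . '<li class="verde foo bar"><span>3</span></li>'
      . '<li class="verdes"><span>9</span></li>'
      . '</ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact attribute comparison versus token-aware matching.
$exact = $xpath->query('//li[@class="verde"]/span');
$token = $xpath->query('//li[' . classSelector('verde') . ']/span');

$values = [];
foreach ($token as $node) {
    $values[] = (int) $node->nodeValue;
}

echo $exact->length, ' exact match, ', $token->length, ' token matches', PHP_EOL;
```

The exact comparison finds only the first item, while the token-aware predicate also finds `verde foo bar` and still rejects `verdes`.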


    If you are only interested in one specific value...

    $greenWaitingNumber = $xpath->query('/html/body/div[2]/div[1]/div/div/ul/li[6]/span');
    $theText = $greenWaitingNumber[0]->nodeValue;
    

    $theText will then hold "2".

    【Comments】:
