【Question title】:How to get a string in an HTML page using PHP and XPath (POST request?)
【Posted】:2017-12-19 19:43:10
【Question】:

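I'm trying to scrape this web page...

https://www.aslteramo.it/SISWebOnLine/ProntoSoccorso.aspx

...using PHP and XPath, to get the numeric values shown under the red, yellow, green and white circles.

(Note: if you browse the page you may see different values. That is fine; they change dynamically.)

I'm trying to print the value with this PHP sample code...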

<?php
    ini_set('display_errors', 'On');
    error_reporting(E_ALL);

    $url = 'http://www.aslteramo.it/SISWebOnLine/ProntoSoccorso.aspx';

    $xpath_for_parsing = '/html/body/div/form/div[3]/div[2]/div[3]/div/div/div[2]/table/tbody/tr[2]/td[4]/table/tbody/tr[1]/td';


    //#Set CURL parameters: pay attention to the PROXY config !!!!
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, '');
    $data = curl_exec($ch);
    curl_close($ch);

    $dom = new DOMDocument();
    @$dom->loadHTML($data);

    $xpath = new DOMXPath($dom);

    $colorWaitingNumber = $xpath->query($xpath_for_parsing);
    $theValue =  'N.D.';
    foreach( $colorWaitingNumber as $node )
    {
      $theValue = $node->nodeValue;
    }

    print $theValue;
?>

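Note that to get the XPath of an element you have to disable JavaScript in your browser, because right-click is disabled on the page.

I see that there is a POST request in the page...

...but I don't know how to modify my code to perform that request, and then how to extract my values...

Any help would be appreciated.

Thank you in advance.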

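【Comments】:

  • This isn't just a post, it's a collection of posts, whose post_data is encoded (and probably encrypted), specifically so that people can't do what you are trying to do.
  • You haven't actually explained what the problem is. Are you unable to curl the page? Or unable to locate the elements with XPath?
  • My goal is to locate the elements using XPath. If I call the URL with cURL (as a GET), it works, but the elements are not in that page because they are loaded by a POST request, and I don't know how to make that request with cURL...

Tags: php xpath web-scraping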


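【Answer 1】:

"I see that there is a POST request in the page..."

You can't get the data because it is fetched by a POST request when the page loads. You need to perform the same POST request: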

$curl = curl_init();

curl_setopt_array($curl, array(
  CURLOPT_URL => "https://www.aslteramo.it/SISWebOnLine/ProntoSoccorso.aspx",
  CURLOPT_RETURNTRANSFER => true,
  CURLOPT_ENCODING => "",
  CURLOPT_MAXREDIRS => 10,
  CURLOPT_TIMEOUT => 30,
  CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
  CURLOPT_CUSTOMREQUEST => "POST",
  // this is to emulate the page behavior
  CURLOPT_POSTFIELDS => "ctl00%24ScriptManager1=ctl00%24MainContent%24UpdatePanel1%7Cctl00%24MainContent%24Timer1&__EVENTTARGET=ctl00%24MainContent%24Timer1&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTYxOTg2MDY2NA9kFgJmD2QWAgIDD2QWBgIDDzwrAA0CAA8WAh4LXyFEYXRhQm91bmRnZAwUKwAGBRMwOjAsMDoxLDA6MiwwOjMsMDo0FCsAAhYQHgRUZXh0BQ1Ib21lIHBhZ2UgQVNMHgVWYWx1ZQUNSG9tZSBwYWdlIEFTTB4LTmF2aWdhdGVVcmwFF2h0dHA6Ly93d3cuYXNsdGVyYW1vLml0HgdUb29sVGlwBRxQYWdpbmEgaW5pemlhbGUgZGVsIHNpdG8gQVNMHgdFbmFibGVkZx4KU2VsZWN0YWJsZWceCERhdGFQYXRoBRdodHRwOi8vd3d3LmFzbHRlcmFtby5pdB4JRGF0YUJvdW5kZ2QUKwACFhIfBWcfBmcfCGcfBwUhL3Npc3dlYm9ubGluZS9wcm9udG9zb2Njb3Jzby5hc3B4HwEFD1Byb250byBTb2Njb3Jzbx8CBQ9Qcm9udG8gU29jY29yc28fBAUeVGVtcGkgZCdhdHRlc2EgUHJvbnRvIFNvY2NvcnNvHghTZWxlY3RlZGcfAwUhL1NJU1dlYk9uTGluZS9Qcm9udG9Tb2Njb3Jzby5hc3B4ZBQrAAIWEB8BBQ5UZW1waSBkJ2F0dGVzYR8CBQ5UZW1waSBkJ2F0dGVzYR8DBSAvU0lTV2ViT25MaW5lL1RlbXBpRGlhdHRlc2EuYXNweB8EBShUZW1waSBkJ2F0dGVzYSBwcmVzdGF6aW9uaSBhbWJ1bGF0b3JpYWxpHwVnHwZnHwcFIC9zaXN3ZWJvbmxpbmUvdGVtcGlkaWF0dGVzYS5hc3B4HwhnZBQrAAIWEB8BBRZMaXN0YSBkJ0F0dGVzYSBFeC1Qb3N0HwIFFkxpc3RhIGQnQXR0ZXNhIEV4LVBvc3QfAwUpamF2YXNjcmlwdDpvcGVuV2ViRm9ybSgnV2ViRXhQb3N0LmFzcHgnKTsfBAUnTW9uaXRvcmFnZ2lvIExpc3RhIGQnQXR0ZXNhIC0gKEV4LVBvc3QpHwVnHwZnHwcFKWphdmFzY3JpcHQ6b3BlbndlYmZvcm0oJ3dlYmV4cG9zdC5hc3B4Jyk7HwhnZBQrAAIWEB8BBR5BdHRpdml0w6AgbGliZXJvLXByb2Zlc3Npb25hbGUfAgUeQXR0aXZpdMOgIGxpYmVyby1wcm9mZXNzaW9uYWxlHwMFHy9TSVNXZWJPbkxpbmUvQXR0aXZpdGFBbHBpLmFzcHgfBAUeQXR0aXZpdMOgIGxpYmVyby1wcm9mZXNzaW9uYWxlHwVnHwZnHwcFHy9zaXN3ZWJvbmxpbmUvYXR0aXZpdGFhbHBpLmFzcHgfCGdkZAIJDw8WAh8BBQ9Qcm9udG8gU29jY29yc29kZAILD2QWAgIBD2QWAmYPZBYGAgEPFgIfBWdkAgsPPCsADQBkAg0PFgIfBWdkGAMFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBSBjdGwwMCRNYWluQ29udGVudCRJbWdCdG5BZ2dpb3JuYQUVY3RsMDAkTWFpbkNvbnRlbnQkd3d3D2dkBRBjdGwwMCRuYXZpZ2F0aW9uDw9kBQ9Qcm9udG8gU29jY29yc29kTUucCs6%2BZyLbulTAFPNo569%2B%2BDE%3D&__VIEWSTATEGENERATOR=1A2B14D6&__EVENTVALIDATION=%2FwEWAgK27duvDwKDm%2B%2FCCycw%2FWHLOR5AmzLF035J86RYL0wa&__ASYNCPOST=true",
  CURLOPT_HTTPHEADER => array(
    "cache-control: no-cache",
    "content-type: application/x-www-form-urlencoded"
  ),
));

$response = curl_exec($curl);
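
The `__VIEWSTATE` and `__EVENTVALIDATION` values hardcoded above were captured from a single page load, so they may stop being accepted once the page changes. A minimal sketch of collecting them at run time, assuming the initial GET response exposes them as hidden `<input>` fields (the usual ASP.NET WebForms layout, not verified against this site). The helper names `extractHiddenFields` and `buildTimerPostBody` are hypothetical; the control names are copied from the POST body above.

```php
<?php
/**
 * Collect every named <input type="hidden"> as a name => value map.
 * Hypothetical helper, not part of the original answer.
 */
function extractHiddenFields(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($dom);
    $fields = [];
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

/**
 * Build the URL-encoded body for the timer postback.
 * Assumption: the page's own hidden fields plus these four entries are
 * enough to reproduce the request shown in the answer.
 */
function buildTimerPostBody(array $hiddenFields): string
{
    $fields = array_merge($hiddenFields, [
        'ctl00$ScriptManager1' => 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$Timer1',
        '__EVENTTARGET'        => 'ctl00$MainContent$Timer1',
        '__EVENTARGUMENT'      => '',
        '__ASYNCPOST'          => 'true',
    ]);
    // http_build_query() percent-encodes both keys and values.
    return http_build_query($fields);
}
```

With the HTML of a plain GET in a variable such as `$pageHtml` (a placeholder name), the result of `buildTimerPostBody(extractHiddenFields($pageHtml))` could be passed as `CURLOPT_POSTFIELDS` in place of the long literal string.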

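Then apply your XPath to the response:

$dom = new DOMDocument();
// Load the body returned by curl_exec() above ($response, not $data).
@$dom->loadHTML($response);

$xpath = new DOMXPath($dom);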
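
One thing to watch when reusing the XPath from the question: paths copied from a browser's inspector usually contain `tbody` steps that the browser inserted itself, and they start at `/html/body`. `DOMDocument::loadHTML()` does not add `tbody`, and the POST response may contain only part of the page, so a shorter relative expression may be needed. A minimal sketch on made-up markup (the `$sample` fragment and the `queryText` helper are hypothetical, not taken from the real page):

```php
<?php
/**
 * Return the trimmed text of every node matched by an XPath expression.
 * Hypothetical helper for illustration.
 */
function queryText(string $html, string $expression): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($dom);
    $nodes  = $xpath->query($expression);
    $values = [];
    if ($nodes !== false) {
        foreach ($nodes as $node) {
            $values[] = trim($node->nodeValue);
        }
    }
    return $values;
}

// Made-up fragment: a table written without <tbody>, as servers often emit it.
$sample = '<div id="panel"><table><tr><td>red</td><td>3</td></tr></table></div>';

// Browser-style path with an explicit tbody step.
$withTbody = queryText($sample, '//table/tbody/tr/td');

// Relative path that does not depend on tbody being present.
$withoutTbody = queryText($sample, '//table//tr/td');

var_dump($withTbody, $withoutTbody);
```

If the first query comes back empty while the second returns the cells, the `tbody` steps are the likely cause, and the expression from the question would need the same adjustment.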

Hope this helps.

