【Question title】: Screen Scraping in PHP using SimpleHTMLDom
【Posted】: 2017-01-27 05:33:55
【Question】:

I am trying to screen scrape content in PHP and assign it to an array. I need to use the "SimpleHTMLDom" library. For reference, see: Parse html table using file_get_contents to php array

Desired result:

  • The hospital name, plus the background color (CSS) of each <text> (if it exists!)
    • All five <text> are needed (NULL if there is no background color)

Array:

Hospital 1    
--> NULL    
--> #ff0000    
--> 08:50    
--> NULL    
--> NULL

Hospital 2    
--> #ffff00    
--> 08:50    
--> NULL    
--> NULL    
--> NULL

PHP:

 <?php
 require('simple_html_dom.php');
 $table = array();


$html = file_get_html('https://www.miemssalert.com/chats/Default.aspx?hdRegion=3');
foreach($html->find('table#tblHospitals tr') as $row) {
   $hospital = $row->find('td.Chats',0)->plaintext;
   $color = $row->getAttribute('td.Chats style',2);
   $time = $row->find('td.Chats',2)->plaintext;
   //$text = $row->getAttribute('alt');

$table[$hospital][$color][$time][$text] = true;

}

 echo '<pre>';
 print_r($table);
 echo '</pre>';
?>

HTML of the DOM (this is a small sample of the page):

    <div id="Page1" style="display: none; width: 100%;">
                                <div id="HospitalUpdatePanel">

                                        <table id="tblHospitals" cellspacing="0" cellpadding="1" align="Left" rules="all" border="1" style="border-color:Black;border-width:1px;border-style:Solid;width:100%;border-collapse:collapse;table-layout: fixed;">
        <tr>
            <th title="Hospital" class="Chats" style="background-color:Silver;font-weight:bold;width:25%;">Hospital</th><th title="The emergency department temporarily requests that it receive absolutely no patients in need of urgent medical care. Yellow alert is initiated because the Emergency dept is experiencing a temporary overwhelming overload such that priority II and III patients may not be managed safely. Prior to diverting pediatric patients, medical consultation is advised for pediatric patient transports when emergency departments are on yellow alert." class="Chats" style="font-weight:bold;width:9%;background-color:#ffff00;color:#000000;">Yellow Alert</th><th title="The hospital has no ECG monitored beds available. These ECG monitored beds will include all in-patient critical care areas and telemetry beds." class="Chats" style="font-weight:bold;width:9%;background-color:#ff0000;color:#000000;">Red Alert</th><th title="The emergency department reports that their facility has, in effect, suspended operation and can receive absolutely no patients due to a situation such as a power-outage, fire, gas leak, bomb scare, etc." class="Chats" style="font-weight:bold;width:9%;background-color:#006600;color:#ffffff;">Mini Disaster</th><th title="An ALS/BLS unit is being held in the emergency department of a hospital due to lack of an available bed. (This does not replace Yellow Alert.)" class="Chats" style="font-weight:bold;width:9%;background-color:#ff6600;color:#000000;">ReRoute</th><th title="The hospital's ability to function as a trauma center has been exceeded. (This decision is at the discretion of the facility.)" class="Chats" style="font-weight:bold;width:9%;background-color:#9933cc;color:#ffffff;">Trauma ByPass</th><th title="The hospital's capacity has been exceeded." class="Chats" style="font-weight:bold;width:9%;background-color:#000000;color:#ffffff;">Capacity</th>
        </tr><tr>
            <td class="Chats">Anne Arundel Medical Center</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Baltimore Washington Medical Center</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Bon Secours Hospital</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Carroll Hospital Center</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Franklin Square (MedStar)</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Good Samaritan Hospital (MedStar)</td><td class="Chats"></td><td class="Chats" style="background-color:#ff0000;color:#000000;">08:50</td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Greater Baltimore Medical Center</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Harbor Hospital (MedStar)</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Harford Memorial Hospital (UMUCH)</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Howard County General Hospital (JHM)</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Johns Hopkins Bayview Medical Center</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Johns Hopkins Hospital</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Johns Hopkins Hospital (Pediatric ED)</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Mercy Medical Center</td><td class="Chats" style="background-color:#ffff00;color:#000000;">08:50</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Midtown (UM)</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Northwest Hospital</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">R Adams Cowley Shock Trauma Center</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td>
        </tr><tr>
            <td class="Chats">Sinai Hospital of Baltimore</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">St. Agnes Hospital</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">St. Joseph’s  (UM)</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Union Memorial Hospital  (MedStar)</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">University of Maryland Medical Center</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr><tr>
            <td class="Chats">Upper Chesapeake Medical Center (UMUCH)</td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats"></td><td class="Chats-null"></td><td class="Chats-null"></td>
        </tr>
    </table>
                                        <span id="lblHospitalsErrorMessage" style="color:Red;font-weight:bold;visibility: hidden;"></span>

</div>
                            </div>

With the revised PHP above, here is the output, which is still not the desired result:

[Good Samaritan Hospital (MedStar)] => Array
    (
        [0] => Array
            (
                [11:58] => Array
                    (
                        [0] => 1
                    )

            )

    )

【Comments】:

    Tags: php simple-html-dom


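    【Solution 1】:

    There are a few problems with the posted code:

    1. The find() method takes a CSS selector, not HTML markup. To find <table id="tblHospitals">, use table#tblHospitals, and so on.
    2. foreach($html->find('table#tblHospitals') as $row) iterates over the single table element, not over its rows. You probably want a selector that picks the actual row elements, for example: table#tblHospitals tr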
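The difference between the two selectors can be seen in a minimal sketch. It assumes simple_html_dom.php is on the include path, and it uses a tiny made-up table (the names "Hospital A" and "Hospital B" are placeholders) instead of the live page:

```php
<?php
// Assumes simple_html_dom.php (the library from the question) is on the include path.
require_once 'simple_html_dom.php';

// Tiny stand-in for the page; the real table has many more rows and columns.
$html = str_get_html(
    '<table id="tblHospitals">'
    . '<tr><td class="Chats">Hospital A</td></tr>'
    . '<tr><td class="Chats">Hospital B</td></tr>'
    . '</table>'
);

// Matches the table element itself, so a foreach over this result runs once.
$tables = $html->find('table#tblHospitals');

// Matches each row inside the table, so a foreach over this result runs once per row.
$rows = $html->find('table#tblHospitals tr');

echo count($tables), ' table, ', count($rows), " rows\n";
```

Looping over `$rows` rather than `$tables` is what makes `$row->find('td.Chats', 0)` refer to the first cell of each hospital row.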

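    【Discussion】:

    • I just get: Array ( [] => Array ( [] => Array
    • Revised the PHP in the question: it still returns only the first hospital name, with blank elements below it?
    • Revised the PHP in the question: it now gets all the hospital names and times, but I still need the CSS background-color rather than the plain text?
    • The background color is in the style attribute. You can use getAttribute() for that. See their API documentation: simplehtmldom.sourceforge.net/manual_api.htm
    • Can you give an example? I can't work out how to combine plaintext with getAttribute('background-color'), or where to put it in the code. I'm not sure which child element would contain the color?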
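One way to read the getAttribute() suggestion: there is no 'background-color' attribute to ask for; the cell has a style attribute whose value is a string of CSS declarations, and the color has to be picked out of that string. The sketch below shows only the string handling. The helper name `parseInlineStyle` is hypothetical (it is not part of simple_html_dom), and it assumes simple declarations with no semicolons inside values:

```php
<?php
// Hypothetical helper (not part of simple_html_dom): split an inline style
// string such as "background-color:#ff0000;color:#000000;" into an array.
function parseInlineStyle(string $style): array
{
    $declarations = [];
    foreach (explode(';', $style) as $declaration) {
        // Skip fragments without a colon, e.g. the empty one after the trailing semicolon.
        if (strpos($declaration, ':') === false) {
            continue;
        }
        [$property, $value] = explode(':', $declaration, 2);
        $declarations[strtolower(trim($property))] = trim($value);
    }
    return $declarations;
}

// The style string below is copied from the sample HTML in the question.
$style = 'background-color:#ff0000;color:#000000;';
$css = parseInlineStyle($style);

// A style without a background-color declaration yields null here.
$color = $css['background-color'] ?? null;
echo $color, "\n"; // prints "#ff0000"
```

Inside the scraping loop, the input would come from something like `$cell->getAttribute('style')`. Many cells in the sample have no style attribute at all, so it is worth checking with `is_string()` before passing the value to the helper.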
    【Solution 2】:

    It turned out to take only two lines of code:

    <?php
    require('simple_html_dom.php');
    
    $html = file_get_html('https://www.miemssalert.com/chats/Default.aspx?hdRegion=3');
    foreach($html->find('table#tblHospitals tr td.Chats') as $e)
        echo $e->plaintext . $e->getAttribute('style') . '<hr>';
    ?>
    

    The resulting array looks like this:

    array(37) {
      ["Anne Arundel Medical Center"]=>
      array(1) {
        [0]=>
        bool(true)
      }
      [""]=>
      array(1) {
        [0]=>
        bool(true)
      }
      ["Baltimore Washington Medical Center"]=>
      array(1) {
        [0]=>
        bool(true)
      }
      ["04:31"]=>
      array(1) {
        ["background-color:#ffff00;color:#000000;"]=>
        bool(true)
      }
      ["Bon Secours Hospital"]=>
      array(1) {
        [0]=>
        bool(true)
      }
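The dump above is keyed by cell text, so hospital names, times, and styles end up mixed at the same level. A possible way to get closer to the structure asked for in the question is to loop over rows first and then over the cells of each row. This is a sketch under assumptions, not the poster's final code: it assumes simple_html_dom.php is on the include path, it runs against a trimmed copy of the sample HTML (two alert columns instead of six), and the `color` / `time` keys are one interpretation of the desired result:

```php
<?php
// Assumes simple_html_dom.php (the library from the question) is on the include path.
require_once 'simple_html_dom.php';

// Trimmed sample of the table from the question (two alert columns instead of six).
$sample = '<table id="tblHospitals">'
    . '<tr><th class="Chats">Hospital</th><th class="Chats">Yellow Alert</th><th class="Chats">Red Alert</th></tr>'
    . '<tr><td class="Chats">Good Samaritan Hospital (MedStar)</td><td class="Chats"></td>'
    . '<td class="Chats" style="background-color:#ff0000;color:#000000;">08:50</td></tr>'
    . '<tr><td class="Chats">Mercy Medical Center</td>'
    . '<td class="Chats" style="background-color:#ffff00;color:#000000;">08:50</td><td class="Chats"></td></tr>'
    . '</table>';

$html = str_get_html($sample);
$table = [];

foreach ($html->find('table#tblHospitals tr') as $row) {
    $cells = $row->find('td');
    if (count($cells) === 0) {
        continue; // the header row contains only <th> cells
    }

    $hospital = trim($cells[0]->plaintext);
    $table[$hospital] = [];

    // Every cell after the first is an alert column.
    foreach (array_slice($cells, 1) as $cell) {
        $style = $cell->getAttribute('style');
        $hasColor = is_string($style)
            && preg_match('/background-color\s*:\s*([^;]+)/i', $style, $match) === 1;

        // NULL for cells without a background color, as in the desired result.
        $table[$hospital][] = $hasColor
            ? ['color' => trim($match[1]), 'time' => trim($cell->plaintext)]
            : null;
    }
}

print_r($table);
```

To run this against the live page, `str_get_html($sample)` would be replaced by the `file_get_html(...)` call from the question. The plain `td` selector is used so that the `Chats-null` cells are also counted as columns and each hospital keeps one entry per alert column.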
    
