【Question title】: Unable to split a string that is parsed from a webpage?
【Posted】: 2012-08-20 21:27:02
【Question】:

I'm parsing a webpage and extracting the "name" field from it. Here is the parsing code:

foreach($dom->getElementsByTagName('table') as $table) {
    if($table->getAttribute('class')=='dataTable'){
        foreach($table->getElementsByTagName('tr') as $tr){
            if(isset($tr->getElementsByTagName('td')->item(0)->nodeValue)){
                $out[$i]['name'] = $tr->getElementsByTagName('td')->item(0)->nodeValue;
            }
        }
    }
}

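In the webpage I'm parsing, the node value for the name has the form "Mark&nbspSmith". When I get the result, the 'name' value shows as "Mark Smith" in the IDE and as "Mark┬áSmith" at the command prompt.

Now I want to split the 'name' string so that I get the first name ('Mark') and the last name ('Smith') separately.

I tried: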

explode(" ", $out[$i]['name']) as well as
explode(" " , $out[$i]['name'])

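But nothing seems to work for me. Can you help me split the string into first name and last name?

I hope my question is clear.

【Comments】:

  • You have some strange brace conventions.
  • I put the closing braces in on purpose, because the code isn't complete.. I'm doing some other operations that aren't relevant to this topic.. so I left them out.. @PeeHaa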

Tags: php html parsing html-parsing string-split


【Answer 1】:

Try this to work around the &nbsp problem:

explode(chr(0xC2).chr(0xA0), $str)

In UTF-8, the non-breaking space is encoded as two bytes: 0xC2 and 0xA0.

Reference - PHP Parsing Problem - &nbsp; and Â
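
A minimal sketch of that idea, assuming the parsed value looks like the "Mark&nbsp;Smith" example from the question. The variable names here are illustrative and not taken from the original code:

```php
<?php
// Hypothetical sample: the kind of value DOMDocument returns for "Mark&nbsp;Smith".
$name = "Mark\xC2\xA0Smith";

// Inspect the raw bytes; the non-breaking space appears as "c2a0".
$hex = bin2hex($name);

// Split on the two-byte UTF-8 sequence (equivalent to chr(0xC2).chr(0xA0)).
$parts = explode("\xC2\xA0", $name);

// Alternatively, normalise to a plain space first and then split.
$normalised = str_replace("\xC2\xA0", ' ', $name);
$parts2 = explode(' ', $normalised);

print_r($parts);
```

Dumping the value with bin2hex() first is a quick way to check which separator bytes the string really contains before choosing a delimiter.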

【Comments】:

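  • Please consider adding at least some words to explain to the OP, and to further readers of your answer, why and how it answers the original question.
【Answer 2】: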

@user1518659 Try this. To fix the problem, just replace &nbsp; with a space before passing the source to DOMDocument. I've also added the split into first name and last name :) Hope it helps.

<?php 
header('Content-Type: text/html; charset=utf-8'); // Required if you're outputting, as the description contains UTF-8 characters
//Load the source (input)
$html_source = file_get_contents('http://www.reuters.com/finance/stocks/companyOfficers?symbol=AOS');
$html_source = str_replace('&nbsp;',' ',$html_source);

//Dom document
$dom = new DOMDocument('1.0');
@$dom->loadHTML($html_source);

$out =array();
$i=0;
foreach($dom->getElementsByTagName('table') as $table) {
    if($table->getAttribute('class')=='dataTable'){

        foreach($table->getElementsByTagName('tr') as $tr){
            if(isset($tr->getElementsByTagName('td')->item(0)->nodeValue)){

                $out[$i]['fullname'] = $tr->getElementsByTagName('td')->item(0)->nodeValue;

                $name = explode(' ',$out[$i]['fullname']);
                $out[$i]['first_name'] = $name[0];
                $out[$i]['last_name'] = isset($name[1]) ? $name[1] : ''; // guard against single-word names

                if(!isset($tr->getElementsByTagName('td')->item(2)->nodeValue)){

                    foreach ($out as $key=>$value){
                        if($value['fullname'] == $tr->getElementsByTagName('td')->item(0)->nodeValue &&
                        !is_numeric(substr($tr->getElementsByTagName('td')->item(1)->nodeValue,0,1)) && 
                        $tr->getElementsByTagName('td')->item(1)->nodeValue != "--" ){
                            $out[$key]['description']= $tr->getElementsByTagName('td')->item(1)->nodeValue;
                        }
                    }

                }else{
                    if(!isset($tr->getElementsByTagName('td')->item(2)->nodeValue)){continue;}
                    if(isset($tr->getElementsByTagName('td')->item(3)->nodeValue)){
                        $out[$i]['age']= $tr->getElementsByTagName('td')->item(1)->nodeValue;
                        $out[$i]['since']= $tr->getElementsByTagName('td')->item(2)->nodeValue;
                        $out[$i]['position']= $tr->getElementsByTagName('td')->item(3)->nodeValue;
                    }
                }
                $i++;
            }
        }
    }
}

//Clean up
$return = array();
foreach ($out as $key=>$row){
    if(isset($row['fullname']) && isset($row['age']) && isset($row['since']) && isset($row['position']) && isset($row['description'])){
        $return[$key] = $out[$key];
    }
}


print_r($return);

/*
Array
(
    [0] => Array
        (
            [fullname] => Paul Jones
            [first_name] => Paul
            [last_name] => Jones
            [age] => 63
            [since] => 2011
            [position] => Chairman of the Board, Chief Executive Officer
            [description] => Mr. Paul W. Jones serves as the Chairman of the Board, Chief Executive Officer of A. O. Smith Corp. He has been a director of company since 2004. He is a member of the Investment Policy Committee of the Board. He was elected chairman of the board, president and chief executive officer effective December 31, 2005. He was president and chief operating officer from 2004 to 2005. Prior to joining the company, he was chairman and chief executive officer of U.S. Can Company, Inc. from 1998 to 2002. He previously was president and chief executive officer of Greenfield Industries, Inc. from 1993 to 1998 and president from 1989 to 1992. Mr. Jones has been a director of Federal Signal Corporation since 1998, where he chairs the Nominating and Governance Committee and is a member of the Compensation and Benefits Committee and the Executive Committee, and Integrys Energy Group, Inc. since 2011, where he is a member of the Compensation and Financial Committees. He was also a director of Bucyrus International, Inc. from 2006 until its acquisition by Caterpillar, Inc. in 2011, and chaired the Compensation Committee.
        )

    [1] => Array
        (
            [fullname] => Ajita Rajendra
            [first_name] => Ajita
            [last_name] => Rajendra
            [age] => 60
            [since] => 2011
            [position] => President, Chief Operating Officer, Director
            [description] => Mr. Ajita G. Rajendra serves as the President, Chief Operating Officer and Director of A. O. Smith Corp. He was elected a director of company in December 2011, based on the recommendation of the Nominating and Governance Committee, following his election as President and Chief Operating Officer in September 2011. Mr. Rajendra joined the company as President of A. O. Smith Water Products Company in 2005, and was named Executive Vice President of the company in 2006. Prior to joining the company, Mr. Rajendra was Senior Vice President at Kennametal, Inc., a manufacturer of cutting tools, from 1998 to 2004. Mr. Rajendra also serves on the board of Donaldson Company, Inc., where he is a member of the Audit Committee and Human Resources Committee. Further, Mr. Rajendra was a director of Industrial Distribution Group, Inc. from 2007 until its acquisition by Eiger Holdco, LLC in 2008.
        )
        ...
        ...
*/
?>

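【Comments】:

  • Thanks @Lawrece Cherone .. that worked like a charm.. +1
  • Sorry, no offense :), you're always welcome to ask questions here @SO.
  • Now I'm trying to insert the data from the result array into a MySQL database. I'd like to know whether it's possible to insert about 1000 rows at once with a single query, rather than appending each value to the end of a mile-long string and then executing it.
  • If you use PDO, you can easily create a model to handle your database input. You'd need to open a new question with the code you have so far.
  • Also, in the result array, a few details in the description field come out in an invalid format. For example, the original description on the webpage contains (“Bemis”), but in the parsed result it shows as (├ó┬Ç┬£Bemis├ó┬Ç┬¥). Check the page source of the URL. What could be the reason, and how can it be fixed? I also tried $html_source = str_replace('“','"',$html_source); $html_source = str_replace('”','"',$html_source); but it still isn't handled correctly.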
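
One possible cause of the garbled quotes in the last comment is that loadHTML() was not told the source is UTF-8, so multi-byte characters were read as single-byte ones. The sketch below is an assumption about the cause rather than a confirmed diagnosis of that page. It uses a hypothetical fragment and prepends a charset `<meta>` tag as a hint to the parser:

```php
<?php
// Hypothetical fragment containing UTF-8 curly quotes (bytes E2 80 9C and E2 80 9D).
$html = "<p>\xE2\x80\x9CBemis\xE2\x80\x9D</p>";

// Without a charset hint the parser may fall back to a single-byte encoding.
$plain = new DOMDocument();
@$plain->loadHTML($html);
$without = $plain->getElementsByTagName('p')->item(0)->nodeValue;

// With a charset declaration in front, the bytes are read as UTF-8.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$hinted = new DOMDocument();
@$hinted->loadHTML($meta . $html);
$with = $hinted->getElementsByTagName('p')->item(0)->nodeValue;

// Compare the raw bytes of both results.
var_dump(bin2hex($without), bin2hex($with));
```

Comparing the two hex dumps shows whether the characters were already damaged during parsing or only look wrong because the console cannot display UTF-8.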
【Answer 3】:

The non-breaking space character entity you have is broken. It should be &nbsp;. Note the semicolon.

To answer your question, you can simply do:

var_dump(explode('&nbsp', $out[$i]['name']));

Or, if the entity is fixed:

var_dump(explode('&nbsp;', $out[$i]['name']));

【Comments】:

  • Still can't get it.. What I'm trying now is $out[$i]['name'] = $tr->getElementsByTagName('td')->item(0)->nodeValue; $chars = explode('&nbsp', $out[$i]['name']); $out[$i]['f_name'] = $chars[0]; $out[$i]['l_name'] = $chars[1]; and the result I get is [name] => Gene Wulf [f_name] => Gene Wulf [l_name] => so it isn't splitting.
  • Works for me. So either the name you gave as an example in your code isn't the real string, or there is another bug in your code.
  • @user1518659: consider my answer.
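
A likely reason the split on the entity text finds nothing is that the DOM parser has already decoded the entity, so nodeValue holds the non-breaking space character itself rather than the literal text "&nbsp;". The sketch below illustrates this with a hypothetical one-row table standing in for the real page:

```php
<?php
// Hypothetical one-row table standing in for the real page.
$html = '<table class="dataTable"><tr><td>Gene&nbsp;Wulf</td></tr></table>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses parser warnings

$value = $dom->getElementsByTagName('td')->item(0)->nodeValue;

// The entity text is gone after parsing, so there is nothing to split on.
$byEntity = explode('&nbsp;', $value);

// The decoded character (bytes 0xC2 0xA0 in UTF-8) is what the string contains.
$byBytes = explode("\xC2\xA0", $value);

print_r($byBytes);
```

Splitting on the entity works only on the raw HTML source, before it has been passed through the DOM.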
【Answer 4】:

This should do it:

<?php

$a = preg_split( "/[\s]|[&nbsp;]|[ ]/", $out[$i]['name'] ); // $a is an array with the values: Mark, Smith

explode() accepts only a single delimiter for splitting a string, whereas preg_split() can split on several different delimiters.

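【Comments】:

  • This is almost perfect, but PHP doesn't recognise '┬'. When I try to include '┬', it only contains [-á] instead of [┬á]. How can I fix this?
  • "Mark┬áSmith" is split into "Mark┬" and "Smith", but "Gene┬áWulf" is split into "Ge" and "e┬", and "Mathias┬áSandoval" into 'Mathia' and '┬' .. not the expected result, as you can see.. What are we missing?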
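
The odd splits reported above are consistent with how square brackets work in a regular expression: `[&nbsp;]` is a character class that matches any single one of the characters `&`, `n`, `b`, `s`, `p` and `;`, so the "n" in "Gene" and the "s" in "Mathias" act as delimiters. The sketch below shows this, together with one possible corrected pattern. The helper name `split_name` is hypothetical, and the fix assumes the input is valid UTF-8:

```php
<?php
// [&nbsp;] matches any ONE of & n b s p ; so the "n" in "Gene" is a delimiter.
$buggy = preg_split('/[&nbsp;]/', 'Gene Wulf');

// Hypothetical helper: split on runs of whitespace or U+00A0 in UTF-8 mode.
function split_name($name)
{
    // The /u modifier makes \x{00A0} match the non-breaking space as one character.
    return preg_split('/[\s\x{00A0}]+/u', $name, -1, PREG_SPLIT_NO_EMPTY);
}

$fixed = split_name("Gene\xC2\xA0Wulf");

print_r($buggy);
print_r($fixed);
```

Without the `/u` modifier the pattern works byte by byte, which is why only half of the two-byte non-breaking space was removed in the comment above.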