【Posted】: 2013-04-05 00:02:34
【Question】:
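I am trying to extract data from web pages in order to insert it into a database. The data I am interested in sits inside divs with class="company". Each page has 15 or fewer such divs, and there are many pages to extract from, so I am looking for a way to extract the data automatically.
A div with class="company" looks like this (a page has 15 or fewer of these divs, each with different data):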
<div class="company" id="company-6666"> <!-- EXTRACT 'company-6666' from id="company-6666" -->
<div class="top clearfix">
<div class="name clearfix">
<h2>
<a href="/company-name">Company Name</a> <!-- EXTRACT 'Company Name' from contents of A element and EXTRACT '/company-name' from href attribute -->
<a href="/branches-list-link?parent_id=6666" class="branches">Branches <span>(5)</span></a> <!-- EXTRACT '/branches-list-link?parent_id=6666' from href attribute -->
</h2>
</div>
</div>
<div class="inner clearfix has-logo">
<div class="clearfix">
<div class="logo">
<a href="/company-name">
<img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" border="0" alt="" /> <!-- EXTRACT '/graphics/company/logo/listing/123456.jpg?_ts=1365390237' from src attribute -->
</a>
</div>
<div class="info">
<div class="address">StreetName 500, 7777 City, County</div> <!-- EXTRACT 'StreetName 500, 7777 City, County' from contents of class="address" div -->
<div class="clearfix">
<div class="slogan">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.</div> <!-- EXTRACT 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.' from contents of class="slogan" div -->
</div>
</div>
</div>
<div class="actions-bar clearfix">
<ul>
<li><span class="phone-number">6666666</span></li> <!-- EXTRACT '6666666' from contents of class="phone-number" span -->
<li><a href="mailto:mail@mail.com" target="_blank" title="mail@mail.com" class="email">mail@mail.com</a></li> <!-- EXTRACT 'mail@mail.com' from contents of class="email" A element -->
<li><a href="http://www.webpage.com" target="_blank" title="www.webpage.com" class="redirect url">www.webpage.com</a></li> <!-- EXTRACT 'www.webpage.com' from contents of class="redirect url" A element -->
</ul>
</div>
</div>
</div>
So far I have the following PHP code ($output holds the page's HTML):
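<?php
// $output holds the page's HTML source.
$doc = new DOMDocument();
// Collect parser warnings about imperfect HTML instead of silencing them with @.
libxml_use_internal_errors(true);
$doc->loadHTML($output);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
// Select every element whose class attribute is exactly "company".
$elements = $xpath->query("//*[@class='company']");
// query() returns false (not null) when the expression is invalid.
if ($elements !== false) {
    foreach ($elements as $element) {
        // Prints all text inside the div as one string.
        echo $element->nodeValue;
    }
}
?>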
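It seems to get all 15 divs with class="company", but I don't know how to extract the individual values mentioned above (in the comments of the HTML code).
Not every div (I mean the divs with class="company") contains all the values shown in the HTML block. So I have to query whether the specific element holding the data I am interested in exists inside the company div and, if it does, check that it is not empty (that there is text between the tags). If it exists and is not empty, I add it to a variable.
Once the values are extracted, I want to assign them to PHP variables so that I can use them afterwards. It would be even better if the extracted values were put into an array like this: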
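The "exists and is not empty" check described above could be wrapped in a small helper. The following is only a sketch: the function name `firstValue` and the inline sample markup are made up for illustration, and the nullable return type assumes PHP 7.1 or newer. It runs an XPath query relative to one company div and returns `null` when the node is missing or holds only whitespace.

```php
<?php
// Hypothetical helper (the name is illustrative): returns the trimmed value of
// the first node matched by $query relative to $context, or null when the
// node is missing or contains only whitespace.
function firstValue(DOMXPath $xpath, string $query, DOMNode $context): ?string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null; // invalid expression, or no such node in this company div
    }
    $value = trim((string) $nodes->item(0)->nodeValue);
    return $value === '' ? null : $value;
}

// Made-up sample: the second company div has a blank slogan and no address.
$html = '<div class="company" id="company-1">'
      . '<div class="address">StreetName 500</div><div class="slogan">Hello</div></div>'
      . '<div class="company" id="company-2"><div class="slogan">  </div></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query("//div[@class='company']") as $company) {
    // The leading "." keeps each query inside the current company div.
    $rows[] = [
        'address' => firstValue($xpath, ".//div[@class='address']", $company),
        'slogan'  => firstValue($xpath, ".//div[@class='slogan']", $company),
    ];
}
print_r($rows);
```

Passing the company node as the second argument of `query()` together with a path starting with `.` is what scopes the lookup to that one div. Without the leading dot, `//div[...]` would search the whole document again.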
$result = array(
    // 1st div's data
    0 => array(
        'company name' => 'company name',
        'company link' => 'company link',
        'company id' => 'company id',
        'company branches' => 'branches link',
        'company logo' => 'logo',
        'company address' => 'address',
        'company slogan' => 'slogan',
        'company webpage' => 'webpage',
        'company email' => 'email',
        'company phone' => 'phone'
    ),
    // 2nd div's data
    1 => array(
        'company name' => 'company name',
        'company link' => 'company link',
        'company id' => 'company id',
        'company branches' => 'branches link',
        'company logo' => 'logo',
        'company address' => 'address',
        'company slogan' => 'slogan',
        'company webpage' => 'webpage',
        'company email' => 'email',
        'company phone' => 'phone'
    ),
    // ...
);
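One possible way to build an array of this shape is to map each result key to an XPath expression relative to the company div and evaluate it with `DOMXPath::evaluate()` wrapped in XPath's `string()` function, which yields an empty string when nothing matches. This is a sketch under assumptions: the `$fields` map and its XPath expressions are guesses based on the sample markup, `$output` is replaced by a shortened copy of that markup, and the second div (`company-7777`, "Other Company") is invented to show a sparse entry.

```php
<?php
// Stand-in for the downloaded page: a shortened copy of the question's markup
// plus one made-up, sparse company div.
$output = <<<'HTML'
<div class="company" id="company-6666">
  <h2>
    <a href="/company-name">Company Name</a>
    <a href="/branches-list-link?parent_id=6666" class="branches">Branches <span>(5)</span></a>
  </h2>
  <div class="logo"><a href="/company-name"><img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" alt="" /></a></div>
  <div class="address">StreetName 500, 7777 City, County</div>
  <div class="slogan">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.</div>
  <ul>
    <li><span class="phone-number">6666666</span></li>
    <li><a href="mailto:mail@mail.com" class="email">mail@mail.com</a></li>
    <li><a href="http://www.webpage.com" class="redirect url">www.webpage.com</a></li>
  </ul>
</div>
<div class="company" id="company-7777">
  <h2><a href="/other-company">Other Company</a></h2>
</div>
HTML;

// Result key => XPath relative to one company div (assumed from the sample markup).
$fields = [
    'company name'     => './/h2/a[1]',
    'company link'     => './/h2/a[1]/@href',
    'company id'       => '@id',
    'company branches' => ".//a[@class='branches']/@href",
    'company logo'     => ".//div[@class='logo']//img/@src",
    'company address'  => ".//div[@class='address']",
    'company slogan'   => ".//div[@class='slogan']",
    // class="redirect url" holds two class names, so match the "url" token.
    'company webpage'  => ".//a[contains(concat(' ', normalize-space(@class), ' '), ' url ')]",
    'company email'    => ".//a[@class='email']",
    'company phone'    => ".//span[@class='phone-number']",
];

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($output);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$result = [];
foreach ($xpath->query("//div[@class='company']") as $company) {
    $row = [];
    foreach ($fields as $key => $expr) {
        // string() returns the first match's text, or '' when nothing matches.
        $value = trim($xpath->evaluate("string($expr)", $company));
        $row[$key] = $value === '' ? null : $value;
    }
    $result[] = $row;
}
print_r($result);
```

Missing or empty values end up as `null` here, which tends to map cleanly onto nullable database columns. Storing an empty string instead would be an equally reasonable choice.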
【Comments】:
- Wait for it... I know it's coming... who has "the answer"? We all know which one I mean!
Tags: php html dom xpath extract