How to fetch content from a webpage? Answers

【Question title】: How to fetch content from a webpage?
【Posted】: 2009-07-14 08:54:12
【Question description】:

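I want to fetch the contents of a div from a web page and use it on my own page.

I have the URL http://www.freebase.com/search?limit=30&start=0&query=cancer
and I want to fetch the contents of the div whose id is article-1001. How can I do this in PHP or jQuery?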

【Question comments】:

Tags: php jquery

【Solution 1】:

If you want to use PHP, take a look at Simple HTML DOM. It is a convenient single-file include. Its docs give an example of scraping Slashdot:

$html = file_get_html('http://slashdot.org/');

// Find all article blocks
foreach($html->find('div.article') as $article) {
    $item['title']     = $article->find('div.title', 0)->plaintext;
    $item['intro']    = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

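Regex is never good at (and should not be used for) parsing HTML. HTML is not a regular language, and you end up with huge regular expressions for things that are simple in jQuery or in the library above.

Edit:
So you would want to use something like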

$html = file_get_html('http://www.freebase.com/search?limit=30&start=0&query=cancer');
$text = $html->find('div[id=article-1001]',0)->plaintext;

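【Comments】:

That Simple HTML DOM is going to make me change my last project :) For the better :) Thanks :)
@Antonio: If you like Simple HTML DOM, check out phpQuery: code.google.com/p/phpquery. That project is actually supported.
@Andrew Moore -> Thanks for the heads-up. I will.

【Solution 2】:

If this is really about Freebase topics rather than fetching HTML from websites in general, using the API and getting familiar with MQL should be the better solution, because it lets you easily restrict your search to specific types.

Example: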

[{
  "/common/topic/article": {
    "guid":     null,
    "limit":    1,
    "optional": true
  },
  "/common/topic/image": {
    "id":       null,
    "limit":    1,
    "optional": true
  },
  "id":     null,
  "name":   null,
  "name~=": "*Cancer*",
  "type":   "/user/radiusrs/default_domain/astrological_sign"
}]

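This can be passed to mqlread directly and returns a JSON list of things that might match the zodiac sign "Cancer". You can then simply fetch the article and image, if needed, using trans_raw and/or trans_blurb. :)

【Comments】:

【Solution 3】:

In PHP you would probably fetch the page (perhaps with cURL or something similar) and then parse the HTML, which may not be the easiest thing to do, but I guess there are libraries out there that can help you.

【Comments】:

【Solution 4】:

Use the following: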

$("#LoadIntoThisDiv").load("http://www.freebase.com/search?limit=30&start=0&query=cancer #article-1001");

There is an example of this on the jQuery website here.
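
For readers wondering why one string carries both a URL and a selector: .load() treats everything after the first space in its argument as a selector and inserts only the matching part of the fetched page. The sketch below imitates that split in plain JavaScript. The helper name splitLoadTarget is hypothetical and is not part of jQuery; it only illustrates the convention.

```javascript
// Hypothetical helper (not part of jQuery): imitates how .load() separates
// the URL from the optional selector that follows the first space.
function splitLoadTarget(target) {
  const firstSpace = target.indexOf(" ");
  if (firstSpace === -1) {
    // No space: the whole string is the URL and the full page is inserted.
    return { url: target, selector: null };
  }
  return {
    url: target.slice(0, firstSpace),
    selector: target.slice(firstSpace).trim(),
  };
}

const parts = splitLoadTarget(
  "http://www.freebase.com/search?limit=30&start=0&query=cancer #article-1001"
);
console.log(parts.url);      // the part that is requested
console.log(parts.selector); // the part used to filter the response
```

The split only decides what gets inserted; the request itself is still an ordinary Ajax call made by the browser.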

【Comments】:

Does this work for domains other than the one that serves the page? I don't think so.

【Solution 5】:

PHP:

$content = file_get_contents('http://www.freebase.com/search?limit=30&start=0&query=cancer');

$match = preg_match("#id=\"article-1001\".*</div>#", $content, $matches);

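The regex may not work as-is, but it is an example or a direction you can use; just play with it :)

【Comments】:

That regex won't work for the given example: the div spans multiple lines, and .* is too greedy.
Actually, .* matches every character except a newline, so that is why it doesn't work: the expression terminates after the div is closed. Beyond that, I have learned in the past that parsing HTML is one of the hardest parts... Anyway, I gave him an example; he should still put in some work ;)

【Solution 6】:

PHP is server-side and jQuery is client-side, so it really depends on what you want to achieve. Also note that, because of the same-origin policy, you generally cannot make an Ajax request to another domain from JavaScript (but you can proxy it through your own server).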
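
To make "another domain" concrete: two URLs share an origin only when scheme, host, and port all match. Here is a minimal sketch using the standard URL class. The helper name isSameOrigin and the www.example.com page URL are assumptions made for illustration.

```javascript
// Hypothetical helper: compares the origin (scheme + host + port) of two URLs.
function isSameOrigin(pageUrl, targetUrl) {
  return new URL(pageUrl).origin === new URL(targetUrl).origin;
}

// Assumed address of the page that runs the script.
const page = "http://www.example.com/index.html";

console.log(isSameOrigin(page, "http://www.example.com/data/list.html"));       // true
console.log(isSameOrigin(page, "http://www.freebase.com/search?query=cancer")); // false (host differs)
console.log(isSameOrigin(page, "https://www.example.com/index.html"));          // false (scheme differs)
console.log(isSameOrigin(page, "http://www.example.com:8080/"));                // false (port differs)
```

A request to a URL on your own server passes this check, which is why a server-side proxy gets around the restriction.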

jQuery aside, here is a simple way to do it in PHP that works for the case you gave:

$url="http://www.freebase.com/search?limit=30&start=0&query=cancer";
$html=file_get_contents($url);

if (preg_match('{<div id="article-1001".*?>(.*?)</div>}s', $html, $matches))
{
    $content=$matches[1];
}

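Note the 's' modifier, which makes . match newlines, and the .*? idiom, which makes the inner match non-greedy so that it only consumes up to the next </div>.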
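
The effect of both choices can be checked in isolation. The sketch below uses JavaScript regexes, where the s flag plays the same role for . as the PCRE 's' modifier above. The sample markup is made up for the demonstration and is not taken from the Freebase page.

```javascript
// Made-up sample markup: the target div spans several lines and is
// followed by a second div.
const html = [
  '<div id="article-1001" class="article">',
  "  First line",
  "  second line",
  "</div>",
  '<div id="article-1002">other</div>',
].join("\n");

// Without the s flag, . cannot cross a newline, so a multi-line div never matches.
const noDotAll = html.match(/<div id="article-1001".*?>(.*?)<\/div>/);

// With s but a greedy group, the match runs on to the LAST </div> in the input.
const greedy = html.match(/<div id="article-1001".*?>(.*)<\/div>/s);

// With s and a lazy group, the match stops at the first </div>.
const lazy = html.match(/<div id="article-1001".*?>(.*?)<\/div>/s);

console.log(noDotAll);                            // null
console.log(greedy[1].includes("article-1002"));  // true
console.log(lazy[1].trim());

// The lazy match also stops too early when the target contains a nested div.
const nested = '<div id="article-1001"><div>inner</div> tail</div>';
const cut = nested.match(/<div id="article-1001".*?>(.*?)<\/div>/s);
console.log(cut[1]);                              // "<div>inner"
```

The last case shows the limit of the pattern: it is only reliable when the target div contains no nested div.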

This works for your case, but regular expressions are generally not suited to this task. You can load the HTML into a DOMDocument and access it that way.

$doc = new DOMDocument();
$doc->loadHTML($html);
$div=$doc->getElementById("article-1001");

【Comments】: