Web crawler for unstructured data: answers

【Question title】: Web-crawler for unstructured data
【Posted】: 2016-03-19 00:06:44
【Question description】:

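Are there any web crawlers suited to parsing many unstructured websites (news, articles) and extracting the main block of content from them without previously defined rules?

I mean, when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP work. I have a lot of websites, and it would take forever to study their DOM structure and write rules for each of them.

I tried using Scrapy to get all the text without tags and scripts inside the body, but it includes a lot of irrelevant material such as menu items, ad blocks, etc.

site_body = selector.xpath('//body').extract_first()

But doing NLP over this kind of content will not be very precise.

So are there any other tools or approaches for this kind of task?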

【Question comments】:

Have you tried a visual approach? I suggest checking out Portia.

Tags: web-scraping scrapy web-crawler nlp

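【Solution 1】:

I try to solve this problem with pattern matching. That way you annotate the source of the webpage itself and use it as the sample to match against, so you don't need to write special rules.

For example, if you look at the source of this page, you see: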

<td class="postcell">
<div>
    <div class="post-text" itemprop="text">

<p>Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?</p>

Then you remove your text and add {.} to mark that place as relevant, which gives:

<td class="postcell">
<div>
<div class="post-text" itemprop="text">
{.}

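(Usually you need the closing tags as well, but for a single element they are not necessary.)

Then pass it as a pattern to Xidel (SO seems to block the default user agent, so it needs to be changed):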

xidel 'http://stackoverflow.com/questions/36066030/web-crawler-for-unstructured-data' --user-agent "Mozilla/5.0 (compatible; Xidel)"  -e '<td class="postcell"><div><div class="post-text" itemprop="text">{.}'

and it outputs your text:

Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?

I mean when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP stuff. I have a lot of websites and it will take forever to look into their DOM model and write rules for each of them.

I was trying to use Scrapy and get all text without tags and scripts, placed in a body, but it include a lot of un-relevant stuff, like menu items, ad blocks, etc.

site_body = selector.xpath('//body').extract_first()


But doing NLP over such kind of content will not be very precise.

So is there any other tools or approaches for doing such tasks?
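
To make the idea concrete, here is a rough Python sketch of what such a template match does. It is not Xidel's implementation: it is a simplified imitation built on the standard library's html.parser, and it assumes the template is a plain chain of opening tags ending in {.}, each one a direct child of the previous. The names extract_by_template, TemplateParser and ChainExtractor are made up for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TemplateParser(HTMLParser):
    """Reads the annotated template into a chain of (tag, attrs) pairs."""

    def __init__(self):
        super().__init__()
        self.chain = []

    def handle_starttag(self, tag, attrs):
        self.chain.append((tag, dict(attrs)))


class ChainExtractor(HTMLParser):
    """Collects the text of the first element whose ancestry matches the chain."""

    def __init__(self, chain):
        super().__init__()
        self.chain = chain
        self.stack = []            # currently open elements as (tag, attrs)
        self.capture_depth = None  # stack depth of the matched element
        self.done = False
        self.parts = []

    def _matches(self):
        # The innermost open elements must match the template, level by level.
        if len(self.stack) < len(self.chain):
            return False
        tail = self.stack[-len(self.chain):]
        return all(
            tag == want_tag and all(attrs.get(k) == v for k, v in want.items())
            for (tag, attrs), (want_tag, want) in zip(tail, self.chain)
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, dict(attrs)))
        if self.capture_depth is None and not self.done and self._matches():
            self.capture_depth = len(self.stack)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        if self.capture_depth is not None and len(self.stack) < self.capture_depth:
            self.capture_depth = None
            self.done = True

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.parts.append(data)


def extract_by_template(template, page):
    """Return the text found where the template's {.} marker sits."""
    tp = TemplateParser()
    tp.feed(template)
    tp.close()
    ex = ChainExtractor(tp.chain)
    ex.feed(page)
    ex.close()
    return "".join(ex.parts).strip()


# Hypothetical page and template, shaped like the example above.
template = '<td class="postcell"><div><div class="post-text" itemprop="text">{.}'
page = ('<html><body><div class="menu">Home</div><table><tr>'
        '<td class="postcell"><div><div class="post-text" itemprop="text">'
        '<p>Main <b>content</b> here.</p></div></div></td></tr></table>'
        '<div class="ad">Buy</div></body></html>')
print(extract_by_template(template, page))
```

The menu and ad blocks are skipped because their ancestry never matches the copied chain of tags, which is why the template can be pasted from the page source instead of being written as a rule.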

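【Comments】:

With this approach you still need to define all those div blocks and their IDs. That does not work for hundreds of websites.
The idea is that you don't have to write them yourself; you just copy them from the webpage. I have used this approach for over 200 library webpages.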

【Solution 2】:

You can use Beautiful Soup and its get_text() inside your parse() callback:

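from bs4 import BeautifulSoup

# parse() is the callback method of your Scrapy spider class
def parse(self, response):
    # Build a tree from the raw HTML and keep only its text nodes
    soup = BeautifulSoup(response.body, 'html.parser')
    yield {'body': soup.get_text()}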

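You can also remove the things you don't want manually (although you may find that you want to keep some of the markup, e.g. <h1> or <b> tags can be useful signals):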

# Remove invisible tags
#for i in soup.findAll(lambda tag: tag.name in ['script', 'link', 'meta']):
#     i.extract()

You can do something similar to whitelist some tags.

【Comments】:
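
As a rough illustration of combining a blacklist with a whitelist, the sketch below drops the text inside unwanted containers and separately records the text inside "signal" tags. To stay dependency-free it uses the standard library's html.parser rather than Beautiful Soup, so treat it as a stand-in for the same idea. The tag sets SKIP_TAGS and SIGNAL_TAGS and the function clean_html are assumptions chosen for this example, not part of the answer above.

```python
from html.parser import HTMLParser

# Assumed blacklist: containers whose text is rarely part of the article.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Assumed whitelist: tags whose text is recorded as a possible signal.
SIGNAL_TAGS = {"h1", "h2", "b", "strong"}


class VisibleText(HTMLParser):
    """Collects text outside SKIP_TAGS and notes text inside SIGNAL_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # > 0 while inside a blacklisted element
        self.signal_depth = 0  # > 0 while inside a whitelisted element
        self.text = []
        self.signals = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in SIGNAL_TAGS:
            self.signal_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in SIGNAL_TAGS and self.signal_depth:
            self.signal_depth -= 1

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk or self.skip_depth:
            return
        self.text.append(chunk)
        if self.signal_depth:
            self.signals.append(chunk)


def clean_html(html):
    """Return (visible_text, signal_chunks) for an HTML string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text), parser.signals


# Hypothetical input used only to exercise the sketch.
sample = ('<html><head><script>var x = 1;</script>'
          '<style>p {color: red}</style></head><body>'
          '<nav><a href="/">Home</a></nav><h1>Title</h1>'
          '<p>Body <b>bold</b> text.</p><footer>Copyright</footer>'
          '</body></html>')
body, signals = clean_html(sample)
print(body)
print(signals)
```

With Beautiful Soup the blacklist half is usually done by looping over soup.find_all([...]) and calling decompose() on each match before get_text(). Either way, a fixed tag list only removes the obvious boilerplate and will not catch menus or ads built from plain div elements.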