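I try to solve this with pattern matching: you annotate the source of the web page itself and use it as the sample to match against, so you do not need to write special rules.
For example, if you look at the source of this page, you will see: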
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
<p>Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?</p>
Then you remove the text and put {.} in its place to mark that spot as the relevant one, which gives:
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
{.}
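(Usually you need the closing tags as well, but for a single element they are not necessary.)
Then you pass that as the pattern to Xidel (SO seems to block the default user agent, so it has to be changed):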
xidel 'http://stackoverflow.com/questions/36066030/web-crawler-for-unstructured-data' --user-agent "Mozilla/5.0 (compatible; Xidel)" -e '<td class="postcell"><div><div class="post-text" itemprop="text">{.}'
and it outputs your text:
Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?
I mean when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP stuff. I have a lot of websites and it will take forever to look into their DOM model and write rules for each of them.
I was trying to use Scrapy and get all text without tags and scripts, placed in a body, but it include a lot of un-relevant stuff, like menu items, ad blocks, etc.
site_body = selector.xpath('//body').extract_first()
But doing NLP over such kind of content will not be very precise.
So is there any other tools or approaches for doing such tasks?
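To make the idea behind the pattern concrete, here is a minimal Python sketch of the same kind of matching, using only the standard library. It is not how Xidel works internally, and it is much cruder: the names `PatternExtractor`, `extract` and `steps` are hypothetical, the pattern is given as a list of (tag, attributes) pairs instead of annotated HTML, nesting is matched loosely (any descendant rather than direct child), and the input is assumed to be well-formed with every non-void element closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are ignored for nesting purposes.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PatternExtractor(HTMLParser):
    """Collect the text inside the element reached by a chain of nested steps.

    Hypothetical helper for illustration only, not part of Xidel.
    """

    def __init__(self, steps):
        super().__init__()
        self.steps = steps   # e.g. [("td", {"class": "postcell"}), ("div", {})]
        self.stack = []      # one bool per open element: did it advance the match?
        self.matched = 0     # number of steps matched by currently open elements
        self.chunks = []     # text pieces found at the marked place

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        advanced = False
        if self.matched < len(self.steps):
            want_tag, want_attrs = self.steps[self.matched]
            have = dict(attrs)
            if tag == want_tag and all(have.get(k) == v for k, v in want_attrs.items()):
                self.matched += 1
                advanced = True
        self.stack.append(advanced)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        # Closing an element that advanced the match undoes that step.
        if self.stack.pop():
            self.matched -= 1

    def handle_data(self, data):
        # Text counts only while every step of the pattern is open around it.
        if self.matched == len(self.steps):
            self.chunks.append(data)


def extract(html, steps):
    """Return the text found at the place the steps lead to."""
    parser = PatternExtractor(steps)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# The three steps mirror the three opening tags of the pattern in the answer.
steps = [
    ("td", {"class": "postcell"}),
    ("div", {}),
    ("div", {"class": "post-text", "itemprop": "text"}),
]

# Made-up sample markup; a real page would be downloaded first.
sample = (
    '<table><tr>'
    '<td class="postcell"><div><div class="post-text" itemprop="text">'
    '<p>Hello <b>world</b></p></div></div></td>'
    '<td class="other"><div><div class="post-text">sidebar</div></div></td>'
    '</tr></table>'
)

result = extract(sample, steps)
print(result)
```

The second table cell is skipped because its first step never matches, which is the point of annotating a sample of the page: the surrounding tags act as the rule. Real pages are rarely this tidy, so a tolerant parser or Xidel itself is the more practical choice.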