【Posted】: 2016-04-28 23:03:27
【Question】:
I'm trying to scrape the ads on a website...
Take this site as an example.
I'm trying to get the ad located at
/html/body[@class='single single-post postid-171 single-format-standard custom-background hasGoogleVoiceExt']/div[@id='site']/div[@id='site-out']/div[@id='site-fixed']/div[@id='content-out']/div[@id='content-in']/div[@id='main-content-wrap']/div[@id='main-content-contain']/div[@id='content-wrap']/div[@class='sec-marg-out4 relative']/div[@class='sec-marg-in4']/article[@class='post-171 post type-post status-publish format-standard hentry category-uncategorized']/div[@id='post-area']/div[@class='post-body-out']/div[@class='post-body-in']/div[@id='content-area']/div[@class='content-area-cont left relative']/div[@class='sec-marg-out relative']/div[@class='sec-marg-in']/div[@class='content-area-out']/div[@class='content-area-in']/div[@class='content-main left relative']/div[@id='article-ad']/div[1]/div[@id='ac_110238']/div[@class='ac_adbox']/div[@class='ac_adbox_inner']
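that is, the "ac_container" or "ac-adbox" element.
When I visit the page in a browser I can see the ad, but when I fetch the HTML with scrapy,
all I get in its place is a script: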
<div id="contentad110238"></div>
<script type="text/javascript">
(function(d) {
    var params =
    {
        id: "d12cd6f3-b896-443b-9140-07e35e66e222",
        d: "YmVzdHlsaW5nLmNvbQ==",
        wid: "110238",
        cb: (new Date()).getTime()
    };
    var qs=[];
    for(var key in params) qs.push(key+'='+encodeURIComponent(params[key]));
    var s = d.createElement('script');s.type='text/javascript';s.async=true;
    var p = 'https:' == document.location.protocol ? 'https' : 'http';
    s.src = p + "://api.content.ad/Scripts/widget2.aspx?" + qs.join('&');
    d.getElementById("contentad110238").appendChild(s);
})(document);
</script> </div>
How can I scrape this? Any help would be greatly appreciated... I'm guessing I need to use a JS renderer in python or scrapy... any recommendations?
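One renderer I have seen paired with scrapy is Splash, a separate rendering service with an HTTP API. The sketch below is only a guess at how that could look: it assumes a Splash instance is listening on localhost:8050 (nothing in this post sets one up) and builds the request URL for its `render.html` endpoint, which returns the page's HTML after JavaScript has run. `PAGE_URL` is a placeholder and `splash_render_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: a Splash instance is running locally on its default port.
SPLASH_BASE = "http://localhost:8050"
# Placeholder: substitute the real article URL here.
PAGE_URL = "http://example.com/some-post/"


def splash_render_url(page_url, wait=2.0, splash_base=SPLASH_BASE):
    """Build a render.html URL asking Splash to load `page_url`.

    `wait` is the number of seconds Splash waits after the page loads,
    which gives asynchronously injected scripts time to run.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


print(splash_render_url(PAGE_URL))
```

A spider could request this URL instead of the page itself and then apply the XPath to the rendered response. Whether the ad shows up within the chosen `wait` time is something I would have to test.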
【Comments】:
Tags: javascript python css web-scraping scrapy