【问题标题】:how can I use scrapy to extract data from webpage that has a database如何使用scrapy从具有数据库的网页中提取数据
【发布时间】:2013-06-26 16:03:17
【问题描述】:

您好,我正在使用 scrapy 访问我们的 Intranet 网站并进行一些抓取,一切似乎都在工作我能够访问它,但是当我将数据提取到 csv 文件中时,csv 文件为空我没有收到任何错误我运行这个。 HTML 中的每一列(目录号、设备 ID、订阅者 ID 等)都有数据,所以我如何使用 scrapy 获取这些数据。

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from carrier.items import CarrierItem

class CarrierSpider(InitSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    def init_request(self):
    #"""This function is called before crawling starts."""
    return Request(url=self.login_page, callback=self.login)

    def login(self, response):
    #"""Generate a login request."""
    return FormRequest.from_response(response,
            formdata={'txtUserName': 'xxxx', 'txtPassword': 'secret'},
            callback=self.check_login_response)

    def check_login_response(self, response):
    #"""Check the response returned by a login request to see if we aresuccessfully logged in."""
    if "logout" in response.body:
        self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
        # Now the crawling can begin..

        return self.initialized() 

    else:
        self.log("\n\n\nFailed, Bad password :(\n\n\n")
        # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//form[@id=\'listForm\']/table/')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('thead/th/a/text()').extract()
        item['link'] = site.select('thead/th/a/@href').extract()
        items.append(item)
    return items

来自页面的html

<table width="100%" cellspacing="0" cellpadding="2" border="0">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td valign="top" align="left">
<html>
<html>
<div class="clr"></div>
<table cellspacing="1" cellpadding="0" border="0">
<tbody>
<tr>
<tr>
<td align="left" colspan="2">
<div class="list">
<form id="listForm" name="listForm" method="POST" action="">
<table>
<thead>
<th class="first"></th>
<th>
<a href="/dis/?&orderby=mdn&order=asc">Directory Number</a>
</th>
<th>
<a href="/dis/?&orderby=hardwareId&order=asc">Equipment ID</a>
</th>
<th>
<a href="/dis/?&orderby=subscriberId&order=asc">Subscriber ID</a>
</th>
<th class="arrow_up">
<th>
<a href="/dis/?&orderby=sessUpldTime&order=asc">Session Upload Time</a>
</th>
<th>
<a href="/dis/?&orderby=upldRsn&order=asc">Upload Reason</a>
</th>
<th>
<a href="/dis/?&orderby=prof&order=asc">Profile ID</a>
</th>
<th>
<img width="1" height="1" src="/dis/img/spacer.gif" alt="">
</th>
<th class="last">
<img width="1" height="1" src="/dis/img/spacer.gif" alt="">
</th>
</thead>
<tbody>
</table>
</form>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>

运行爬虫时的输出

C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv
-t csv
2013-06-28 14:10:41-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-06-28 14:10:41-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-06-28 14:10:42-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2013-06-28 14:10:42-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2013-06-28 14:10:42-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-06-28 14:10:42-0500 [dis] INFO: Spider opened
2013-06-28 14:10:42-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0
 items (at 0 items/min)
2013-06-28 14:10:42-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2013-06-28 14:10:42-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-06-28 14:10:42-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
bs.att.com:8080/dis/login.jsp> (referer: None)
2013-06-28 14:10:42-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01
.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/d
is/login>
2013-06-28 14:10:43-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
bs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login
.jsp)
2013-06-28 14:10:43-0500 [dis] DEBUG:


    Successfully logged in. Let's start crawling!



2013-06-28 14:10:44-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
bs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-06-28 14:10:44-0500 [dis] DEBUG:


     We got data!



2013-06-28 14:10:44-0500 [dis] INFO: Closing spider (finished)
2013-06-28 14:10:44-0500 [dis] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1382,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 146604,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 3,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 6, 28, 19, 10, 44, 469000),
     'log_count/DEBUG': 12,
     'log_count/INFO': 4,
     'request_depth_max': 2,
     'response_received_count': 3,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 6, 28, 19, 10, 42, 15000)}
2013-06-28 14:10:44-0500 [dis] INFO: Spider closed (finished)

C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>

【问题讨论】:

  • 是您在浏览器中看到的 html 还是从您的抓取中得到的?
  • 没关系。 alecxe 有正确的答案。
  • @ejk314 太好了,别忘了采纳答案,谢谢。

标签: python database web-scraping screen-scraping scrapy


【解决方案1】:

问题在于您的 xpath 表达式,它们应该是相对的 (.//):

item['title'] = site.select('.//thead/th/a/text()').extract()
item['link'] = site.select('.//thead/th/a/@href').extract()

更新:

在聊天中讨论问题后发现scrapy接收到的页面是页面加载后使用js转换的xml。

以下是帮助解析 xml 以获取必要数据的方法:

def parse(self, response):
    xhs = XmlXPathSelector(response)

    columns = hxs.select('//table[3]/header/column'')
    for column in columns:
        item = CarrierItem()
        item['title'] = column.select('.//text()').extract()
        item['link'] = column.select('.//@uri').extract()
        yield item

【讨论】:

  • 将 (.//) 添加到标题和链接 xpath 我收到错误文件 "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy \sel ector\lxmlsel.py",第 47 行,在 select raise ValueError("Invalid XPath: %s" % xpath) exceptions.ValueError: Invalid XPath: //form[@id='listForm']/table/跨度>
  • 当然,最后用斜线:将sites = hxs.select('//form[@id=\'listForm\']/table/') 替换为sites = hxs.select('//form[@id=\'listForm\']/table')
  • 我没有收到任何错误,但我的 csv 文件中没有显示任何内容。在我的 csv 中,我会得到正确的列名和链接,或者只是数据
  • 嗯,其他一切看起来都不错。你能检查一下它是否真的使用//form[@id="listForm"]/table xpath 找到了smth?
  • 好的,我将其更改为 //form[@id="listForm"]/table 似乎没有任何效果我得到相同的结果没有错误但没有任何输出
猜你喜欢
  • 2022-10-12
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2014-11-07
  • 2016-04-29
  • 1970-01-01
  • 2023-03-09
相关资源
最近更新 更多