【Question Title】: scrapy and mysql
【Posted】: 2013-02-09 07:36:06
【Question】:

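I am trying to get Scrapy to insert crawled data into MySQL. My code crawls fine and collects the data in a buffer without errors, but the database never gets updated. No luck, and no errors either.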

pipeline.py

from twisted.enterprise import adbapi
from scrapy import log  # needed for the log.msg / log.err calls below
import datetime
import MySQLdb.cursors

class SQLStorePipeline(object):

    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb', db='craigs',
                user='bra', passwd='boobs', cursorclass=MySQLdb.cursors.DictCursor,
                charset='utf8', use_unicode=True)

    def process_item(self, items, spider):
        # run db query in thread pool
        query = self.dbpool.runInteraction(self._conditional_insert, items)
        query.addErrback(self.handle_error)

        return items

    def _conditional_insert(self, tx, items):
        # Create the record if it doesn't exist.
        # This whole block runs in its own thread.
        tx.execute("select * from scraped where link = %s", (items['link'][0], ))
        result = tx.fetchone()
        if result:
            log.msg("Item already stored in db: %s" % items, level=log.DEBUG)
        else:
            tx.execute(\
                "insert into scraped (posting_id, email, location, text, title) "
                "values (%s, %s, %s, %s, %s)",
                (items['posting_id'][0],
                items['email'][1],
                items['location'][2],
                items['text'][3],
                items['title'][4],
                )

            )
            log.msg("Item stored in db: %s" % items, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)

Spider code

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigs.items import CraigsItem

class MySpider(CrawlSpider):
    name = "craigs"
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)), follow=True, callback='parse_profile')]

    def parse_profile(self, response):
        items = []
        img = CraigsItem()
        hxs = HtmlXPathSelector(response)
        img['title'] = hxs.select('//h2[contains(@class, "postingtitle")]/text()').extract()
        img['posting_id'] = hxs.select('//html/body/article/section/section[2]/div/p/text()').extract()
        items.append(img)
        return items[0]
        return img[0]

settings.py

BOT_NAME = 'craigs' 
BOT_VERSION = '1.0' 
SPIDER_MODULES = ['craigs.spiders'] 
NEWSPIDER_MODULE = 'craigs.spiders' 
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

【Comments】:

  • Try adding a print statement to your process_item and _conditional_insert functions to see whether they are being called. Also, what does your settings.py file look like?
  • settings.py: BOT_NAME = 'craigs' BOT_VERSION = '1.0' SPIDER_MODULES = ['craigs.spiders'] NEWSPIDER_MODULE = 'craigs.spiders' USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
  • Additional print statements in pipelines.py never show up, so that code is not being executed.

Tags: mysql scrapy


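【Solution 1】:

The pipeline code is never called because the pipeline has not been activated. According to the Item Pipelines page in the documentation, activation is done by adding a new section to settings.py. For example: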

ITEM_PIPELINES = [
    'craigs.pipeline.SQLStorePipeline',
]

Also, your parse_profile function should simply return img. You only need to return a list of items when a single response page produces more than one item.
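
As a side note on the activation setting: newer Scrapy releases expect ITEM_PIPELINES to be a dict that maps each pipeline's dotted path to an integer order value, rather than a list. A minimal sketch, assuming the pipeline class lives in a module named pipeline.py inside the craigs package (if the file is pipelines.py, the path would be craigs.pipelines.SQLStorePipeline); the value 300 is an arbitrary choice for illustration:

```python
# settings.py (sketch; the order value 300 is an arbitrary assumption)
ITEM_PIPELINES = {
    # dotted path to the pipeline class -> order value (lower values run first)
    "craigs.pipeline.SQLStorePipeline": 300,
}
```

With a single pipeline the exact number does not matter; it only determines ordering when several pipelines are enabled.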

【Comments】:

  • Did this answer help solve the problem?
  • Edited the answer following the edits suggested by Akhter Wahab. Cheers, mate!
【Solution 2】:

Activate the pipeline in the settings, and use yield instead of return.
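
To illustrate the yield suggestion, here is a rough sketch in plain Python. It does not import Scrapy: the dict fake_response and its "postings" key are hypothetical stand-ins for a real response object, and the field names are borrowed from the question. The point is only that a callback written as a generator can hand back any number of items, one at a time:

```python
def parse_profile(response):
    """Yield one item per posting found in the (stand-in) response."""
    for posting in response["postings"]:
        # Each yielded dict plays the role of one scraped item.
        yield {
            "title": posting["title"],
            "posting_id": posting["posting_id"],
        }


# Hypothetical stand-in data; a real spider would receive a Scrapy response.
fake_response = {
    "postings": [
        {"title": "Bike for sale", "posting_id": "101"},
        {"title": "Desk, like new", "posting_id": "102"},
    ]
}

# The caller iterates over the generator and receives every item in turn.
items = list(parse_profile(fake_response))
```

A callback that uses return can only hand back one value, so returning items[0] drops everything after the first item, whereas the generator version passes each item along.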

【Comments】:

【Solution 3】:

You should COMMIT the current transaction, which makes the changes permanent.

After

tx.execute(\
            "insert into scraped (posting_id, email, location, text, title) "
            "values (%s, %s, %s, %s, %s)",
            (items['posting_id'][0],
            items['email'][1],
            items['location'][2],
            items['text'][3],
            items['title'][4],
            )
        )

you have to call

db.commit()

where db is something like this:

db = MySQLdb.connect(host="localhost",user = "root", passwd = "1234", db="database_name")

Please give it a try.
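
The effect of a missing commit can be reproduced with a small self-contained sketch. It uses the standard-library sqlite3 module as a stand-in for MySQLdb (an assumption made so the example runs without a MySQL server); both follow the Python DB-API, although sqlite3 uses "?" placeholders where MySQLdb uses "%s". The helper names insert_row and count_rows are made up for this illustration. Note also that Twisted's adbapi runInteraction, as used in the question, commits the transaction itself when the interaction function returns without raising, so the explicit commit matters mainly when working with a raw connection like the one shown in this answer.

```python
import os
import sqlite3
import tempfile


def insert_row(db_path, title, commit):
    """Insert one row, optionally commit, then close the connection."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS scraped (title TEXT)")
        # sqlite3 uses "?" placeholders; MySQLdb would use "%s" here.
        conn.execute("INSERT INTO scraped (title) VALUES (?)", (title,))
        if commit:
            conn.commit()  # make the pending insert permanent
    finally:
        conn.close()  # closing without a commit discards the pending insert


def count_rows(db_path):
    """Open a fresh connection and count the rows that were really stored."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM scraped").fetchone()[0]
    finally:
        conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

insert_row(db_path, "not committed", commit=False)
rows_without_commit = count_rows(db_path)

insert_row(db_path, "committed", commit=True)
rows_with_commit = count_rows(db_path)
```

Here the first insert raises no error yet leaves nothing behind, which matches the "no luck, no errors" symptom one would see when a raw connection is closed without committing.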

【Comments】:
