【问题标题】:How to reset crawlee URL cache?How to reset crawlee URL cache?
【发布时间】:2022-12-27 13:28:34
【问题描述】:

I'm running a crawler that is called via an expressjs call.

When I call the same route again, my crawler runs again but shows that all routes have already finished. I'm even removing the './storage' folder

I read the documentation but can't seem to get the purgeDefaultStorages() to work.

How would I go about "resetting" crawlee so that there's no cached results?

import express from 'express'
import { PlaywrightCrawler, purgeDefaultStorages, enqueueLinks, Configuration } from 'crawlee';

const app = express();

let crawler

let run = async () => {
    const config = new Configuration({ 'persistStorage': false, persistStorage: false }); //tested with quotes and no quotes.
    Configuration.set('persistStorage', false) //add this direct config to see if that might work too.
     crawler = new PlaywrightCrawler({
        launchContext: {
            launchOptions: {
                headless: true,
            },
        },

    }, config);
    crawler.router.addDefaultHandler(async ({ request, page, enqueueLinks }) => {
        console.log(`Title of ${request.loadedUrl} ': img: ${request.id}`);
        await enqueueLinks({
            strategy: 'same-domain'
        });
    });

    await crawler.run(['http://localhost:8088/']);

    try {
        await config.getStorageClient().purge()
        await config.getStorageClient().teardown() //tested adding this too just incase.
        console.log('purging')
    } catch (e) {
        console.log(e)
    }
}

app.get('/', async (req, res) => {
    try {
        await run();
        res.status(200)
    } catch (e) {
        res.status(500)
    }

});

const PORT = process.env.PORT || 8889;
app.listen(PORT, () => {
    console.log(
        `The container started successfully and is listening for HTTP requests on ${PORT}`
    );

【问题讨论】:

    标签: javascript express crawlee


    【解决方案1】:

    you will have to use purgeOnStart in addition to purgeDefaultStorages or you can just set environment variables and you are good to go Set CRAWLEE_PURGE_ON_START environment variable to 0 or false Set CRAWLEE_PERSIST_STORAGE environment variable to 0 or false

    this will solve your current issue here is the link for your reference

    【讨论】:

      猜你喜欢
      • 2023-01-06
      • 2022-12-02
      • 2022-12-02
      • 2011-12-23
      • 2022-12-02
      • 2022-12-02
      • 2022-12-01
      • 2011-04-06
      • 2022-12-16
      相关资源
      最近更新 更多