【Question title】: Node Puppeteer, page.on( "request" ) throws "Request is already handled!"
【Posted】: 2021-08-27 08:14:03
【Question description】:

I am using puppeteer-extra with Node.js to iterate over multiple URLs.

I am trying to block some resource types from loading on each iteration, and I get the following error.

PS C:\Users\someuser\Desktop\Project> node temp.js
-- running
C:\Users\someuser\node_modules\puppeteer\lib\cjs\puppeteer\common\assert.js:26
        throw new Error(message);
              ^

Error: Request is already handled!
    at Object.exports.assert (C:\Users\someuser\node_modules\puppeteer\lib\cjs\puppeteer\common\assert.js:26:15)
    at HTTPRequest.continue (C:\Users\someuser\node_modules\puppeteer\lib\cjs\puppeteer\common\HTTPRequest.js:217:21)
    at PuppeteerBlocker.onRequest (C:\Users\someuser\node_modules\@cliqz\adblocker-puppeteer\dist\cjs\adblocker.js:225:33)
    at BlockingContext.onRequest (C:\Users\someuser\node_modules\@cliqz\adblocker-puppeteer\dist\cjs\adblocker.js:64:47)
    at C:\Users\someuser\node_modules\puppeteer\lib\cjs\vendor\mitt\src\index.js:51:62
    at Array.map (<anonymous>)
    at Object.emit (C:\Users\someuser\node_modules\puppeteer\lib\cjs\vendor\mitt\src\index.js:51:43)
    at Page.emit (C:\Users\someuser\node_modules\puppeteer\lib\cjs\puppeteer\common\EventEmitter.js:72:22)
    at C:\Users\someuser\node_modules\puppeteer\lib\cjs\puppeteer\common\Page.js:143:100
    at C:\Users\someuser\node_modules\puppeteer\lib\cjs\vendor\mitt\src\index.js:51:62

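I can't understand why the request has already been handled, since the actual request, page.goto, is made inside the for loop. Does anyone have a hint?

Here is the complete project: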

const puppeteer = require( 'puppeteer-extra' );

const StealthPlugin = require( 'puppeteer-extra-plugin-stealth' );
puppeteer.use( StealthPlugin() );

const AdblockerPlugin = require( 'puppeteer-extra-plugin-adblocker' );
puppeteer.use( AdblockerPlugin( { blockTrackers: true } ) );

puppeteer.launch( { headless: true } ).then( async browser => {

    console.log( '--\xa0running' );

    console.time( '--\xa0process' );

    const page = await browser.newPage();

    await page.setRequestInterception( true );
    
    page.on( 'request', ( request ) => {
        if ( [ 'image', 'stylesheet', 'font', 'script' ].indexOf( request.resourceType() ) ) {
            request.abort();
        } else {
            request.continue();
        };
    } );

    for ( var i = 1; i <= 20; i++ ) {

        console.time( '--\xa0iteration\xa0' + i ); // ... timer start 
    
        await page.goto( 'https://www.someurl.it/shop/s%2D' + i, { waitUntil: 'load' } );
    
        const title = await page.title();
    
        console.log( title.includes( '404' ) ? false : title );
    
        console.timeEnd( '--\xa0iteration\xa0' + i ); // ... timer end 
    
    };

    await browser.close();

    console.timeEnd( '--\xa0process' );
  
    console.log( '--\xa0ending' );

} );

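【Comments】:

  • What happens if you add await before request.abort(); and request.continue(); (since both methods return promises)?
  • Sadly, await didn't do the trick. The GitHub thread is interesting, but it didn't solve the problem. Thanks for your comments.

Tags: javascript node.js puppeteer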


【Solution 1】:

Adding a return statement solved my problem.

page.on( 'request', ( request ) => {
    if ( [ 'image', 'stylesheet', 'font', 'script' ].indexOf( request.resourceType() ) !== -1 ) {
        return request.abort();
    }
    request.continue();
} );
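
Besides the return, this snippet also compares the result of indexOf with -1, while the question's handler uses the bare indexOf result as the condition. The following is a minimal sketch of why that matters, using plain arrays and no Puppeteer; the names BLOCKED, buggyShouldBlock and shouldBlock are made up for illustration.

```javascript
// Resource types to block (taken from the question).
const BLOCKED = [ 'image', 'stylesheet', 'font', 'script' ];

// Condition as written in the question: relies on the truthiness of indexOf().
// indexOf() returns 0 for the first element (falsy) and -1 when not found (truthy).
const buggyShouldBlock = ( type ) => Boolean( BLOCKED.indexOf( type ) );

// Explicit membership test, equivalent to `indexOf( type ) !== -1`.
const shouldBlock = ( type ) => BLOCKED.includes( type );

for ( const type of [ 'image', 'script', 'document' ] ) {
    console.log( type, buggyShouldBlock( type ), shouldBlock( type ) );
}
// image false true
// script true true
// document true false
```

With the bare indexOf condition, 'image' requests are allowed through and 'document' requests (the page itself) are aborted, which is the opposite of what was intended.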

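【Comments】:

    【Solution 2】:

    Request interception has to be enabled on every new page.

    Here is the full list of resource types you can intercept: stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other.

    Note:
    Most of the time, blocking all resources can negatively affect your crawler.

    I recommend blocking only image, media and font. (In some cases, blocking stylesheet can affect Puppeteer click actions.)

    Example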

    /**
     * Puppeteer, Headless Chrome Node.js API
     * 
     * @link https://github.com/puppeteer/puppeteer
     * 
     * @package npm install puppeteer
     */
    const puppeteer = require( 'puppeteer' );
    
    const brewery = async ( page ) => {
    
        await page.setRequestInterception( true );
    
        page.on( 'request', r => {
    
            /**
             * @see https://stackoverflow.com/a/47166637/3645650
             */
            if ( [
                //'stylesheet', 
                'image', 
                'media', 
                'font',
            ].indexOf( r.resourceType() ) !== -1 ) {
    
                r.abort();
    
            } else {
    
                r.continue();
    
            };
    
        } );
    
    };
    
    ( async () => {
    
        // ... start
        let start = new Date();
        console.log( '--\xa0process:\xa0start' );
    
        const browser = await puppeteer.launch( { 
            headless: true 
        } );
    
        const page = await browser.newPage();
        
        await brewery( page );
    
        await page.goto( 'https://github.com/login' );
        await page.screenshot( { path: Date.now() + '.png' } );
        console.log( '--\xa0process:\xa0screenshot' );
    
        // ... end
        await browser.close().then( () => {
            var end = ( new Date() - start ) / 1000;
            console.log( '--\xa0process:\xa0end,\xa0runtime\xa0' + end + '\xa0seconds' );
        } );  
    
    } ) ()
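
The stack trace in the question shows continue() being called from @cliqz/adblocker-puppeteer, which suggests that more than one listener was resolving the same request. One possible way to make a handler tolerant of that is to check whether the request has already been resolved before touching it. The sketch below assumes a Puppeteer version that provides request.isInterceptResolutionHandled() (it is absent in older releases, hence the typeof check); handleRequest and BLOCKED_TYPES are hypothetical names, not part of Puppeteer.

```javascript
// Resource types to block, as recommended above.
const BLOCKED_TYPES = new Set( [ 'image', 'media', 'font' ] );

/**
 * Hypothetical helper (not part of Puppeteer): aborts or continues a request
 * unless another listener has already resolved it.
 * Returns 'skipped', 'aborted' or 'continued' so the decision is observable.
 */
const handleRequest = async ( request ) => {
    // isInterceptResolutionHandled() only exists in newer Puppeteer versions,
    // so check that it is available before calling it.
    if ( typeof request.isInterceptResolutionHandled === 'function'
        && request.isInterceptResolutionHandled() ) {
        return 'skipped';
    }
    if ( BLOCKED_TYPES.has( request.resourceType() ) ) {
        await request.abort();
        return 'aborted';
    }
    await request.continue();
    return 'continued';
};

// Usage with a real page (assumes `page` comes from browser.newPage()):
//   await page.setRequestInterception( true );
//   page.on( 'request', handleRequest );
```

Note that this guard only protects the handler it is written in. A listener registered by a plugin still needs its own check, so if the plugin's listener runs after this one, the error can still occur there.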
    

    【Comments】:
