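【Question title】: Calling nightmarejs in a loop
【Posted】: 2021-11-16 20:10:41
【Question】:

I am currently trying to scrape some data from Gartner Peer Insights. Here is a sample URL: GPI

On that page, what I want to scrape is each review's short description and its detailed description, by iterating over the ul list. This is being done with nightmarejs and cheerio.

I have the following code: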

const got = require('got');
import { html } from 'cheerio';
import cheerio = require('cheerio');
import { children } from 'cheerio/lib/api/traversing';

const Nightmare = require('nightmare');


import { AppConfigService } from './../src/modules/common/services/app-config/app-config.service';
import { APP_CONST } from './../src/modules/common/utils/app.constant';
import { HtmlTestService } from './../src/modules/scrapper/utils/html-test.service';

const ngjs = new Nightmare({ show: true, waitTimeout: 1800000, gotoTimeout: 1800000, loadTimeout: 1700000, executionTimeout: 1800000 });
    


(async function() {
const reviewsMainPageUrl = 'https://gartner.com' + reviewRelativePath; // Please assume the URL provided above

    const respBody = await getRespFromWebScrapingApi(reviewsMainPageUrl);
    const new$ = cheerio.load(respBody);
    const completeNew = cheerio.load(new$.html());
    const data = completeNew('.uxd-truncate-text').text();
    //console.log('data:', data) // Just checking if I am getting proper data

    // Now there are two loops -  one for the reviews in a page and another one for the whole set of pages
    // const browser = await puppeteer.launch({headless: false})
    // const page = await browser.newPage();
    const readReviewList = completeNew('.read-review-link').children();
  // readReviewList is the one I am planning to iterate over, albeit with a caveat described in the note at the end
   await scrapeFullReviewNMJS(reviewsMainPageUrl, readReviewList);

})()

async function scrapeFullReviewNMJS(reviewsMainPageUrl: string, readReviewList) {
    
    
    await scrapeOneReview(reviewsMainPageUrl)
}

async function ngjsResult(reviewsMainPageUrl, index=1) {
        console.log('call to result', index)
        return new Promise((resolve, reject) => {
            ngjs
            .goto(reviewsMainPageUrl)
            .wait('#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul')
            .click(`#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul > li:nth-child(${index}) > div > div.read-review-link > button`)
            .wait('#review-nav > li.active > a')
            .evaluate(() => {
                // getting scrape items from review detail page
                let reviwedCustomerDetails: Record<string, string> = {};

                let completeReviewDetails: Record<string, string> = {};
                let otherQAs: Record<string, string>[] = [];

                // construct completeReviewDetails
                const reviewTitle = document.querySelector('div.category.headline.condensed > h2')?.textContent;
                const overallRating = document.querySelector('div.avgStarIcon > span > span').getAttribute('style')?.substr(7)?.replace(/[%]/, '')?.replace(';', '');
                const reviewRatingUseful = document.querySelector('#review-helpful')?.textContent;
                const completeReview = document.querySelector('#sub-head > p > span.commentSuffix')?.textContent;

                completeReviewDetails = {
                    reviewTitle: reviewTitle,
                    reviewRatingUsefulNess: reviewRatingUseful,
                    reviewOverallRating: overallRating,
                    reviewCompleteDetail: completeReview,
                };

                // construct reviewerProfile
                const reviewerProfile = document.querySelector('#profile > div > div.user-info.row > div > div.reviewer-title.row > div.col-xs-10.title > span')?.textContent;
                const reviewerIndustry = document.querySelector('#industry > span')?.textContent;
                const reviewerRole = document.querySelector('#roles > span')?.textContent;
                const reviewerIndustrySize = document.querySelector('#companySize > span')?.textContent;
                const reviewerImplementationStratergy = document.querySelector('#profile > div > div.user-info.row > div >' + ' div:nth-child(3) > span')?.textContent;

                reviwedCustomerDetails = {
                    reviewerProfile: reviewerProfile,
                    reviewerIndustry: reviewerIndustry,
                    reviewerRole: reviewerRole,
                    reviewerIndustrySize: reviewerIndustrySize,
                    reviewerImplementationStratergy: reviewerImplementationStratergy,
                };
                // construct otherQAs
                return { rd: completeReviewDetails, rP: reviwedCustomerDetails, url: window.location.href };
            })
            
            .then( async data => {
                console.log('getting data:',data )
                
                resolve(data)
            }).catch(err => {
                console.log('err:', err)
                resolve({})
            })  
        })
        
    
    
}

async function scrapeOneReview(reviewsMainPageUrl) {
    let detailsList = []
    let proceed = true;
    let firstAttempt = true;
    
// THIS IS THE PART WITH THE ISSUE. When I make a single nightmarejs call, it works fine.
        const ds = await ngjsResult(reviewsMainPageUrl)
        
        detailsList.push(ds)
        console.log('compl:', detailsList)  
// But looping over it is a problem. One workaround I found is to chain the calls as below (in place of the single call above), but it is not dynamic:
        const ds = await ngjsResult(reviewsMainPageUrl, 1).then(async data => {
          return await ngjsResult(reviewsMainPageUrl, 2) 
         })
        // a .then() has to be appended for every item in the list
        detailsList.push(ds)
        console.log('compl:', detailsList)  
    
        
        
        return Promise.resolve(detailsList)
    }

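Note: the ul contains a list of li items plus several standalone span and div elements, so do not assume that iterating over the ul's children yields only the desired li items. For the present purpose, let's say I want to iterate over the first two reviews.

    Is there a proper way to loop over this and get the desired result?

    Update:

    I did try a for loop like this: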

    for(let i=1; i<15; i++) {
            const ds = await ngjsResult(reviewsMainPageUrl, i)
            detailsList.push(ds)
        }
        
    

    But I get an error like this:

    Error: navigation error
        at unserializeError (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:162:13)
        at EventEmitter.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:89:13)
        at Object.onceWrapper (events.js:520:26)
        at EventEmitter.emit (events.js:400:28)
        at EventEmitter.emit (domain.js:470:12)
        at ChildProcess.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:49:10)
        at ChildProcess.emit (events.js:400:28)
        at ChildProcess.emit (domain.js:470:12)
        at emit (internal/child_process.js:910:12)
        at processTicksAndRejections (internal/process/task_queues.js:83:21) {
      code: -3,
      details: 'ERR_ABORTED',
      url: 'https://www.gartner.com/reviews/market/unified-communications-as-a-service-worldwide/vendor/ringcentral/product/ringcentral-office/reviews?marketSeoName=unified-communications-as-a-service-worldwide&vendorSeoName=ringcentral&productSeoName=ringcentral-office'
    }
    

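    I know that indices like 2 and 4 may not have a corresponding li selector (there is no li:nth-child(2) or li:nth-child(4)), because, as I said above, when debugging in Chrome I can see that the ul element has other HTML elements such as span and div among its children. But the error above occurs for every index, even for valid li selectors such as li:nth-child(3) or 6, 7, 8...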
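
To make the index problem concrete, here is a small sketch. The helper `liNthChildIndices` is hypothetical and not part of the code above. Given the tag names of the ul's children in document order, it returns the 1-based positions that a selector of the form `li:nth-child(n)` can actually match. The tag sequence in the usage note is made up for illustration and is not taken from the real page.

```typescript
// Hypothetical helper: returns the 1-based child positions occupied by <li> elements.
// `li:nth-child(n)` only matches when the n-th child of the parent is itself an <li>.
function liNthChildIndices(childTags: string[]): number[] {
  const indices: number[] = [];
  childTags.forEach((tag, i) => {
    if (tag.toLowerCase() === 'li') {
      indices.push(i + 1); // CSS nth-child positions start at 1
    }
  });
  return indices;
}
```

For an assumed child sequence of `['li', 'span', 'li', 'div', 'li']`, the valid positions are 1, 3 and 5, so a loop over these values would skip the selectors that cannot match.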


      Tags: node.js web-scraping nightmare


      【Answer 1】:

      You can have multiple nightmare instances, which will be faster.
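
One reading of this advice, as a minimal sketch: give each review index its own browser instance instead of reusing the single shared `ngjs` object, and await the calls one after another in an ordinary loop. The helper `scrapeSequentially` and its types are hypothetical. It takes the scraping function as a parameter so that the sketch does not depend on Nightmare itself. In the question's code, `scrapeOne` would create a `new Nightmare(...)`, run the `goto`/`wait`/`click`/`evaluate` chain, and call `.end()` before returning. Whether this removes the `ERR_ABORTED` navigation error on this particular site is an assumption and has not been verified here.

```typescript
// Hypothetical signature: scrape the review at a given 1-based list position.
type ScrapeOne<T> = (url: string, index: number) => Promise<T>;

interface ScrapeOutcome<T> {
  index: number;
  data?: T;
  error?: string;
}

// Runs scrapeOne for each index strictly one after another.
// A failure for one index is recorded and does not stop the loop.
async function scrapeSequentially<T>(
  url: string,
  indices: number[],
  scrapeOne: ScrapeOne<T>,
): Promise<ScrapeOutcome<T>[]> {
  const outcomes: ScrapeOutcome<T>[] = [];
  for (const index of indices) {
    try {
      const data = await scrapeOne(url, index);
      outcomes.push({ index, data });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      outcomes.push({ index, error: message });
    }
  }
  return outcomes;
}
```

Recording the error per index, rather than resolving with an empty object, keeps it visible which list positions failed and which simply had no matching li.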

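      In my experience, this problem is more manageable if you have two types of scrapers:

      1. Base collector scraper
        This collects all the base page details and the description URLs, and puts them into some storage, perhaps a database.
      2. Description URL scraper
        On startup, this checks for description URLs that have not been scraped yet, then runs to fetch the details, possibly several in parallel.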
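
The two-stage split above can be sketched as follows. Everything here is an assumption made for illustration: `UrlStore` is an in-memory stand-in for the storage (a database in practice), `scrapePending` plays the role of the description URL scraper, and `fetchDetails` is whatever function actually loads one review page.

```typescript
type Status = 'pending' | 'done' | 'failed';

interface UrlRecord<T> {
  url: string;
  status: Status;
  details?: T;
}

// In-memory stand-in for the storage the answer mentions.
class UrlStore<T> {
  private records = new Map<string, UrlRecord<T>>();

  // Stage 1 (base collector) calls this; URLs already known are not re-queued.
  add(url: string): void {
    if (!this.records.has(url)) {
      this.records.set(url, { url, status: 'pending' });
    }
  }

  pending(): UrlRecord<T>[] {
    return Array.from(this.records.values()).filter((r) => r.status === 'pending');
  }

  all(): UrlRecord<T>[] {
    return Array.from(this.records.values());
  }
}

// Stage 2: scrape every pending URL with at most `concurrency` requests in flight.
async function scrapePending<T>(
  store: UrlStore<T>,
  fetchDetails: (url: string) => Promise<T>,
  concurrency: number,
): Promise<void> {
  const queue = store.pending();
  // Each worker pulls the next record from the shared queue until it is empty.
  const worker = async (): Promise<void> => {
    let record = queue.shift();
    while (record !== undefined) {
      try {
        record.details = await fetchDetails(record.url);
        record.status = 'done';
      } catch {
        record.status = 'failed';
      }
      record = queue.shift();
    }
  };
  const workerCount = Math.max(1, Math.min(concurrency, queue.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
}
```

Because the status lives in the store, a second run of `scrapePending` only touches URLs that are still pending, which is what makes the "not yet scraped" check in stage 2 possible.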

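      This involves some overhead for maintaining the records, but it pays off well: it helps you implement mechanisms such as retries, a maximum number of articles to collect, informed decisions, and so on.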
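
As an example of one such mechanism, here is a minimal retry sketch. The helper `withRetry` and its parameters are hypothetical and not taken from any library. It re-runs an async operation a fixed number of times with a fixed pause between attempts.

```typescript
// Retries an async operation up to `maxAttempts` times, waiting `delayMs` between tries.
// Rethrows the last error if every attempt fails.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  delayMs: number,
): Promise<T> {
  if (maxAttempts < 1) {
    throw new RangeError('maxAttempts must be at least 1');
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Pause before the next attempt; no pause after the final failure.
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Note that the question's `ngjsResult` resolves with `{}` on failure instead of rejecting, so it would have to reject (or the caller would have to treat `{}` as a failure) before a wrapper like this could retry it.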

      This mixes very well with Apache NiFi, where you can scale on the go (in real time). Also, if you design it properly, all the vitals and statistics are easy to see.
