Calling nightmarejs in a loop: answers

【Question title】: Calling nightmarejs in a loop
【Posted】: 2021-11-16 20:10:41
【Question description】:

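I am currently trying to scrape some data from Gartner Peer Insights. Here is a sample URL: GPI

As you can see there, I want to scrape the short description and the detailed description of each review by iterating through the ul list. This is done using nightmarejs and cheerio.

I have the following code: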

const got = require('got');
import { html } from 'cheerio';
import cheerio = require('cheerio');
import { children } from 'cheerio/lib/api/traversing';

const Nightmare = require('nightmare');


import { AppConfigService } from './../src/modules/common/services/app-config/app-config.service';
import { APP_CONST } from './../src/modules/common/utils/app.constant';
import { HtmlTestService } from './../src/modules/scrapper/utils/html-test.service';

const ngjs = new Nightmare({ show: true, waitTimeout: 1800000, gotoTimeout: 1800000, loadTimeout: 1700000, executionTimeout: 1800000 });
    


(async function() {
const reviewsMainPageUrl = 'https://gartner.com' + reviewRelativePath; // Please assume the URL provided above

    const respBody = await getRespFromWebScrapingApi(reviewsMainPageUrl);
    const new$ = cheerio.load(respBody);
    const completeNew = cheerio.load(new$.html());
    const data = completeNew('.uxd-truncate-text').text();
    //console.log('data:', data) // Just checking if I am getting proper data

    // Now there are two loops -  one for the reviews in a page and another one for the whole set of pages
    // const browser = await puppeteer.launch({headless: false})
    // const page = await browser.newPage();
    const readReviewList = completeNew('.read-review-link').children();
  // readReviewList is the list I plan to iterate over, albeit with a caveat described in the note at the end
   await scrapeFullReviewNMJS(reviewsMainPageUrl, readReviewList);

})()

async function scrapeFullReviewNMJS(reviewsMainPageUrl: string, readReviewList) {
    
    
    await scrapeOneReview(reviewsMainPageUrl)
}

async function ngjsResult(reviewsMainPageUrl, index=1) {
        console.log('call to result', index)
        return new Promise((resolve, reject) => {
            ngjs
            .goto(reviewsMainPageUrl)
            .wait('#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul')
            .click(`#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul > li:nth-child(${index}) > div > div.read-review-link > button`)
            .wait('#review-nav > li.active > a')
            .evaluate(() => {
                // getting scrape items from review detail page
                let reviwedCustomerDetails: Record<string, string> = {};

                let completeReviewDetails: Record<string, string> = {};
                let otherQAs: Record<string, string>[] = [];

                // construct completeReviewDetails
                const reviewTitle = document.querySelector('div.category.headline.condensed > h2')?.textContent;
                const overallRating = document.querySelector('div.avgStarIcon > span > span').getAttribute('style')?.substr(7)?.replace(/[%]/, '')?.replace(';', '');
                const reviewRatingUseful = document.querySelector('#review-helpful')?.textContent;
                const completeReview = document.querySelector('#sub-head > p > span.commentSuffix')?.textContent;

                completeReviewDetails = {
                    reviewTitle: reviewTitle,
                    reviewRatingUsefulNess: reviewRatingUseful,
                    reviewOverallRating: overallRating,
                    reviewCompleteDetail: completeReview,
                };

                // construct reviewerProfile
                const reviewerProfile = document.querySelector('#profile > div > div.user-info.row > div > div.reviewer-title.row > div.col-xs-10.title > span')?.textContent;
                const reviewerIndustry = document.querySelector('#industry > span')?.textContent;
                const reviewerRole = document.querySelector('#roles > span')?.textContent;
                const reviewerIndustrySize = document.querySelector('#companySize > span')?.textContent;
                const reviewerImplementationStratergy = document.querySelector('#profile > div > div.user-info.row > div >' + ' div:nth-child(3) > span')?.textContent;

                reviwedCustomerDetails = {
                    reviewerProfile: reviewerProfile,
                    reviewerIndustry: reviewerIndustry,
                    reviewerRole: reviewerRole,
                    reviewerIndustrySize: reviewerIndustrySize,
                    reviewerImplementationStratergy: reviewerImplementationStratergy,
                };
                // construct otherQAs
                return { rd: completeReviewDetails, rP: reviwedCustomerDetails, url: window.location.href };
            })
            
            .then( async data => {
                console.log('getting data:',data )
                
                resolve(data)
            }).catch(err => {
                console.log('err:', err)
                resolve({})
            })  
        })
        
    
    
}

async function scrapeOneReview(reviewsMainPageUrl) {
    let detailsList = []
    let proceed = true;
    let firstAttempt = true;
    
// THIS IS THE PART WITH THE ISSUE. When I call it as a single nightmarejs call, it works fine.
        const ds = await ngjsResult(reviewsMainPageUrl)
        
        detailsList.push(ds)
        console.log('compl:', detailsList)  
// But if I want to loop through it, there is a problem. One way I have found around it is the chaining below, but it is not dynamic:
        const dsChained = await ngjsResult(reviewsMainPageUrl, 1).then(async data => {
          return await ngjsResult(reviewsMainPageUrl, 2)
         })
        // a .then() has to be appended for every item in the list
        detailsList.push(dsChained)
        console.log('compl:', detailsList)
    
        
        
        return Promise.resolve(detailsList)
    }

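Note: the ul contains a list of li items plus a few standalone span and div elements, so do not assume that iterating over the children of the ul yields only the required li items. For the current purpose, let's say I want to iterate over the first two reviews.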

Is there a proper way to loop over this and get the desired result?

Update:

I did try a for loop like this:

for(let i=1; i<15; i++) {
        const ds = await ngjsResult(reviewsMainPageUrl, i)
        detailsList.push(ds)
    }

But I get an error like this:

Error: navigation error
    at unserializeError (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:162:13)
    at EventEmitter.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:89:13)
    at Object.onceWrapper (events.js:520:26)
    at EventEmitter.emit (events.js:400:28)
    at EventEmitter.emit (domain.js:470:12)
    at ChildProcess.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:49:10)
    at ChildProcess.emit (events.js:400:28)
    at ChildProcess.emit (domain.js:470:12)
    at emit (internal/child_process.js:910:12)
    at processTicksAndRejections (internal/process/task_queues.js:83:21) {
  code: -3,
  details: 'ERR_ABORTED',
  url: 'https://www.gartner.com/reviews/market/unified-communications-as-a-service-worldwide/vendor/ringcentral/product/ringcentral-office/reviews?marketSeoName=unified-communications-as-a-service-worldwide&vendorSeoName=ringcentral&productSeoName=ringcentral-office'
}

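I understand that indexes like 2 and 4 may not have a corresponding li selector (there will be no li:nth-child(2) or li:nth-child(4)) because, as I said above, when I debug in Chrome I can see that the ul element has other HTML elements such as span and div among its children. But the error above occurs for every index, even for valid li selectors such as li:nth-child(3) or 6, 7, 8...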
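
To illustrate the index problem just described, here is a small sketch that is not part of the original code. It assumes the tag names of the ul's direct children have already been read in DOM order (in a browser context, something like `Array.from(ul.children).map(el => el.tagName)`), and it returns only the positions that are valid for `li:nth-child(n)`. The function name `liNthChildIndices` is hypothetical.

```typescript
// Given the tag names of a ul's direct children in DOM order, return the
// 1-based positions occupied by <li> elements. Those are the only values
// of n for which `li:nth-child(n)` matches something.
function liNthChildIndices(childTagNames: string[]): number[] {
  const indices: number[] = [];
  childTagNames.forEach((tag, position) => {
    if (tag.toLowerCase() === 'li') {
      indices.push(position + 1); // nth-child counts from 1
    }
  });
  return indices;
}

// Example: li items interleaved with span and div siblings.
console.log(liNthChildIndices(['LI', 'SPAN', 'LI', 'DIV', 'LI'])); // [ 1, 3, 5 ]
```

Looping over the returned indices instead of a fixed range such as 1 to 14 would at least avoid clicking selectors that cannot match. It does not by itself explain the navigation error.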

【Question discussion】:

Tags: node.js web-scraping nightmare

【Solution 1】:

You can have multiple nightmare instances, which would be faster.
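
A minimal sketch of that idea, not the answerer's code: each review index gets its own short-lived browser session that is always closed afterwards, so no page state carries over between iterations. `Session`, `createSession` and `scrapeWithFreshSessions` are hypothetical names. With Nightmare, `createSession` would presumably construct a new instance, `run` would wrap the goto/wait/click/evaluate chain from the question, and `close` would call `.end()`. Whether this removes the navigation error shown in the question is not confirmed here.

```typescript
// Minimal shape of what one scrape needs from a browser session.
interface Session<T> {
  run(index: number): Promise<T>; // scrape the review at this position
  close(): Promise<void>;         // release the underlying browser
}

// Scrape each index with a brand-new session. A failure for one index is
// recorded as null so the remaining indices are still attempted.
async function scrapeWithFreshSessions<T>(
  indices: number[],
  createSession: () => Session<T>,
): Promise<Array<T | null>> {
  const results: Array<T | null> = [];
  for (const index of indices) {
    const session = createSession();
    try {
      results.push(await session.run(index));
    } catch (err) {
      console.log('scrape failed for index', index, err);
      results.push(null);
    } finally {
      await session.close(); // always close, even after an error
    }
  }
  return results;
}
```

The sessions run one after another here to keep the sketch simple. The speed benefit the answer mentions would come from running several sessions at the same time.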

In my experience, this problem is more manageable if you have two types of scrapers:

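1. Base collector scraper
   This collects all the base page details and the description URLs, and puts them in some storage, perhaps a database.
2. Description URL scraper
   On startup this checks for description URLs that have not been scraped yet, then runs to fetch the details, possibly several in parallel.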
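
The two scrapers above can be sketched as follows. This is an illustration under stated assumptions, not the answerer's implementation: `UrlStore` is a hypothetical in-memory stand-in for the storage, and `scrapeDetail` is a hypothetical callback that would do the actual per-URL browser work.

```typescript
type Status = 'pending' | 'done' | 'failed';

interface UrlRecord {
  url: string;
  status: Status;
  details?: unknown;
}

// In-memory stand-in for the storage the answer mentions (e.g. a database).
class UrlStore {
  private records: UrlRecord[] = [];

  // Record a description URL once; re-adding a known URL is a no-op.
  add(url: string): void {
    if (!this.records.some(r => r.url === url)) {
      this.records.push({ url, status: 'pending' });
    }
  }

  pending(): UrlRecord[] {
    return this.records.filter(r => r.status === 'pending');
  }

  all(): UrlRecord[] {
    return this.records.slice();
  }
}

// Stage 1: the base collector only stores which description URLs exist.
function collectBase(store: UrlStore, descriptionUrls: string[]): void {
  descriptionUrls.forEach(url => store.add(url));
}

// Stage 2: scrape every URL not yet scraped, at most `parallel` at a time.
async function scrapePending(
  store: UrlStore,
  scrapeDetail: (url: string) => Promise<unknown>,
  parallel = 2,
): Promise<void> {
  const queue = store.pending();
  const worker = async (): Promise<void> => {
    while (queue.length > 0) {
      const record = queue.shift() as UrlRecord;
      try {
        record.details = await scrapeDetail(record.url);
        record.status = 'done';
      } catch (err) {
        record.status = 'failed'; // kept in the store for later inspection
      }
    }
  };
  const workers: Promise<void>[] = [];
  for (let i = 0; i < parallel; i++) workers.push(worker());
  await Promise.all(workers);
}
```

Because each record carries its own status, a later run can pick up only what is still pending, which is the bookkeeping the answer refers to.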

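This involves some overhead for maintaining records, but it pays off well and helps implement mechanisms such as retries, a cap on the number of articles collected, informed decisions, and so on.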
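
As one example of such a mechanism, a retry wrapper might look like the sketch below. The name `withRetry` and its parameters are hypothetical, and the fixed delay between attempts is an assumption rather than something the answer specifies.

```typescript
// Run `task` up to `maxAttempts` times, waiting `delayMs` between attempts.
// Resolves with the first successful result, or rethrows the last error.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  delayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Pause before the next attempt; no pause after the final failure.
        await new Promise<void>(resolve => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

A per-URL scrape could then be wrapped as `withRetry(() => scrapeDetail(url))`, where `scrapeDetail` stands for whatever function performs the actual scrape.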

This also mixes well with Apache NiFi, where you can scale on the go (in real time). In addition, if you design it well, all the vitals and statistics are easy to see.
