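【Posted】: 2021-11-16 20:10:41
【Question】:
I am currently trying to scrape some data from Gartner Peer Insights. Here is a sample URL: GPI
As you can see there, I want to scrape the short description of each review and the detailed description of each review by iterating over the ul list. This is being done with nightmarejs and cheerio.
I have the following code: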
const got = require('got');
import { html } from 'cheerio';
import cheerio = require('cheerio');
import { children } from 'cheerio/lib/api/traversing';
const Nightmare = require('nightmare');
import { AppConfigService } from './../src/modules/common/services/app-config/app-config.service';
import { APP_CONST } from './../src/modules/common/utils/app.constant';
import { HtmlTestService } from './../src/modules/scrapper/utils/html-test.service';
const ngjs = new Nightmare({ show: true, waitTimeout: 1800000, gotoTimeout: 1800000, loadTimeout: 1700000, executionTimeout: 1800000 });
(async function() {
  const reviewsMainPageUrl = 'https://gartner.com' + reviewRelativePath; // Please assume the URL provided above
  const respBody = await getRespFromWebScrapingApi(reviewsMainPageUrl);
  const new$ = cheerio.load(respBody);
  const completeNew = cheerio.load(new$.html());
  const data = completeNew('.uxd-truncate-text').text();
  // console.log('data:', data) // Just checking that I am getting proper data
  // There are two loops: one over the reviews in a page and another over the whole set of pages
  // const browser = await puppeteer.launch({headless: false})
  // const page = await browser.newPage();
  const readReviewList = completeNew('.read-review-link').children();
  // readReviewList is what I plan to iterate over, albeit with a caveat described in the note at the end
  await scrapeFullReviewNMJS(reviewsMainPageUrl, readReviewList);
})()
async function scrapeFullReviewNMJS(reviewsMainPageUrl: string, readReviewList) {
  await scrapeOneReview(reviewsMainPageUrl)
}
async function ngjsResult(reviewsMainPageUrl, index=1) {
  console.log('call to result', index)
  return new Promise((resolve, reject) => {
    ngjs
      .goto(reviewsMainPageUrl)
      .wait('#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul')
      .click(`#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul > li:nth-child(${index}) > div > div.read-review-link > button`)
      .wait('#review-nav > li.active > a')
      .evaluate(() => {
        // Collect the items to scrape from the review detail page
        let reviwedCustomerDetails: Record<string, string> = {};
        let completeReviewDetails: Record<string, string> = {};
        let otherQAs: Record<string, string>[] = [];
        // Construct completeReviewDetails
        const reviewTitle = document.querySelector('div.category.headline.condensed > h2')?.textContent;
        // Optional chaining after querySelector avoids a TypeError when the rating element is missing
        const overallRating = document.querySelector('div.avgStarIcon > span > span')?.getAttribute('style')?.substr(7)?.replace(/[%]/, '')?.replace(';', '');
        const reviewRatingUseful = document.querySelector('#review-helpful')?.textContent;
        const completeReview = document.querySelector('#sub-head > p > span.commentSuffix')?.textContent;
        completeReviewDetails = {
          reviewTitle: reviewTitle,
          reviewRatingUsefulNess: reviewRatingUseful,
          reviewOverallRating: overallRating,
          reviewCompleteDetail: completeReview,
        };
        // Construct the reviewer profile
        const reviewerProfile = document.querySelector('#profile > div > div.user-info.row > div > div.reviewer-title.row > div.col-xs-10.title > span')?.textContent;
        const reviewerIndustry = document.querySelector('#industry > span')?.textContent;
        const reviewerRole = document.querySelector('#roles > span')?.textContent;
        const reviewerIndustrySize = document.querySelector('#companySize > span')?.textContent;
        const reviewerImplementationStratergy = document.querySelector('#profile > div > div.user-info.row > div >' + ' div:nth-child(3) > span')?.textContent;
        reviwedCustomerDetails = {
          reviewerProfile: reviewerProfile,
          reviewerIndustry: reviewerIndustry,
          reviewerRole: reviewerRole,
          reviewerIndustrySize: reviewerIndustrySize,
          reviewerImplementationStratergy: reviewerImplementationStratergy,
        };
        // construct otherQAs
        return { rd: completeReviewDetails, rP: reviwedCustomerDetails, url: window.location.href };
      })
      .then( async data => {
        console.log('getting data:',data )
        resolve(data)
      }).catch(err => {
        console.log('err:', err)
        resolve({})
      })
  })
}
async function scrapeOneReview(reviewsMainPageUrl) {
  let detailsList = []
  let proceed = true;
  let firstAttempt = true;
  // THIS IS THE PART WITH THE ISSUE. When I make a single nightmarejs call, it works fine:
  const ds = await ngjsResult(reviewsMainPageUrl)
  detailsList.push(ds)
  console.log('compl:', detailsList)
  // But looping over the list is a problem. One workaround I have found is chaining
  // the calls as below, but it is not dynamic:
  const chained = await ngjsResult(reviewsMainPageUrl, 1).then(async data => {
    return await ngjsResult(reviewsMainPageUrl, 2)
  })
  // A .then() has to be appended for every entry in the list
  detailsList.push(chained)
  console.log('compl:', detailsList)
  return Promise.resolve(detailsList)
}
Note: the ul contains a list.
Is there a proper way to loop over this and get the desired result?
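For reference, the shape of loop being asked about can be sketched independently of Nightmare. This is only a minimal sketch under stated assumptions: `scrapeAt` is a hypothetical stand-in for `ngjsResult` that is assumed to reject when an index has no matching `li` (the `ngjsResult` shown above resolves with `{}` instead), and `scrapeSequentially` and `Outcome` are illustrative names. It shows strictly sequential awaiting with per-index error capture and does not by itself address any navigation problem.

```typescript
// Hypothetical stand-in for ngjsResult: resolves with the scraped record for a
// 1-based index, or rejects when nothing can be scraped at that index.
type ScrapeFn = (url: string, index: number) => Promise<Record<string, unknown>>;

interface Outcome {
  index: number;
  data?: Record<string, unknown>;
  error?: string;
}

// Calls scrapeAt for indexes 1..count strictly one after another and records
// a failure for an index instead of stopping at the first rejected call.
async function scrapeSequentially(
  url: string,
  count: number,
  scrapeAt: ScrapeFn
): Promise<Outcome[]> {
  const outcomes: Outcome[] = [];
  for (let index = 1; index <= count; index++) {
    try {
      const data = await scrapeAt(url, index);
      outcomes.push({ index, data });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      outcomes.push({ index, error: message });
    }
  }
  return outcomes;
}
```

Because each call is awaited before the next one starts, only one scrape is in flight at a time, which matters when a single browser instance is shared between calls.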
Update:
I did try a for loop like this:
for(let i=1; i<15; i++) {
  const ds = await ngjsResult(reviewsMainPageUrl, i)
  detailsList.push(ds)
}
But I get an error like this:
Error: navigation error
at unserializeError (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:162:13)
at EventEmitter.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:89:13)
at Object.onceWrapper (events.js:520:26)
at EventEmitter.emit (events.js:400:28)
at EventEmitter.emit (domain.js:470:12)
at ChildProcess.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:49:10)
at ChildProcess.emit (events.js:400:28)
at ChildProcess.emit (domain.js:470:12)
at emit (internal/child_process.js:910:12)
at processTicksAndRejections (internal/process/task_queues.js:83:21) {
code: -3,
details: 'ERR_ABORTED',
url: 'https://www.gartner.com/reviews/market/unified-communications-as-a-service-worldwide/vendor/ringcentral/product/ringcentral-office/reviews?marketSeoName=unified-communications-as-a-service-worldwide&vendorSeoName=ringcentral&productSeoName=ringcentral-office'
}
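I understand that indexes such as 2 and 4 may not have a corresponding li selector (there will be no li:nth-child(2) or li:nth-child(4)) because, as I said above, when I debug in Chrome I can see that the ul element has other HTML elements, such as span and div, among its children. But the error above occurs for every index, even for valid li selectors such as li:nth-child(3) or 6, 7, 8...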
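To illustrate the point about `li:nth-child(n)` only matching some indexes: `:nth-child` counts every element child of the `ul`, so positions occupied by a span or div never match an `li` selector. A minimal sketch, assuming a hypothetical child order (the real page's order is not shown here) and an illustrative helper name `liNthChildPositions`:

```typescript
// Given the tag names of a <ul>'s element children in document order, return
// the 1-based positions held by <li> elements. These are the only values of n
// for which `ul > li:nth-child(n)` can match.
function liNthChildPositions(childTagNames: string[]): number[] {
  const positions: number[] = [];
  childTagNames.forEach((tag, i) => {
    if (tag.toLowerCase() === "li") {
      positions.push(i + 1);
    }
  });
  return positions;
}

// Hypothetical child order, for illustration only.
const positions = liNthChildPositions(["li", "span", "li", "div", "li"]);
console.log(positions); // [ 1, 3, 5 ]
```

In a browser context the tag names could be collected with something like `Array.from(ul.children, (c) => c.tagName)`, and the resulting positions used in place of a fixed `1..14` range.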
【Discussion】:
Tags: node.js web-scraping nightmare