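【Question title】: Why can't I scrape all data from ecommerce websites?
【Posted】: 2020-09-16 00:14:44
【Question description】:

I am working on a project where I have to scrape data from e-commerce websites, but I can't access the data I want from these sites. For example, when I try to scrape all list items from https://evaly.com.bd/search-results?query=remax%20610d, the only output I get is <li class="ais-InfiniteHits-sentinel"></li>. Also, when I print the site's HTML with print(soup.prettify()), the output does not contain the complete markup. Here is my code for getting all list items: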

from bs4 import BeautifulSoup
import requests
link = "https://evaly.com.bd/search-results?query=remax%20610"

source = requests.get(link).text

soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())
li = soup.find_all("li")
print(li)

This is the output when I run print(soup.prettify()):

<!DOCTYPE html>
<html>
 <head>
  <style data-styled="" data-styled-version="5.2.0">
   .lfkzsQ{background-color:white;-webkit-letter-spacing:0.025em;-moz-letter-spacing:0.025em;-ms-letter-spacing:0.025em;letter-spacing:0.025em;font-weight:500;font-size:15px;height:46px;display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-flex:1;-ms-flex:1;flex:1;padding:0 17px;border:1px solid var(--primary);border-radius:6px 0 0 6px;outline:none;}/*!sc*/
@media (max-width:425px){.lfkzsQ{width:50%;min-width:50%;}}/*!sc*/
data-styled.g87[id="Searchbar__SeachInput-xnx3kr-0"]{content:"lfkzsQ,"}/*!sc*/
.jtCmJd{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-flex-direction:row;-ms-flex-direction:row;flex-direction:row;width:100%;height:100%;border-radius:5px;overflow:hidden;background-color:#f6f6f6;}/*!sc*/
data-styled.g88[id="Searchbar__Container-xnx3kr-1"]{content:"jtCmJd,"}/*!sc*/
.BVXNH{cursor:pointer;display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;padding-right:29px;padding-left:29px;background:var(--primary);color:#fff;}/*!sc*/
@media (max-width:425px){.BVXNH{padding-right:5px;padding-left:5px;}}/*!sc*/
data-styled.g90[id="Searchbar__Button-xnx3kr-3"]{content:"BVXNH,"}/*!sc*/
.XBQPS{font-size:25px;}/*!sc*/
@media (max-width:768px){.XBQPS{font-size:20px;}}/*!sc*/
data-styled.g92[id="Searchbar___StyledMdSearch-xnx3kr-5"]{content:"XBQPS,"}/*!sc*/
.jCIuWZ{display:grid;grid-template-columns:repeat(auto-fill,minmax(200px,1fr));grid-gap:1vw;}/*!sc*/
@media (max-width:768px){.jCIuWZ{grid-template-columns:repeat(auto-fill,minmax(150px,1fr));grid-gap:1vw;}}/*!sc*/
data-styled.g246[id="algoliaConnectComponent__GridP-sc-1c85asy-0"]{content:"jCIuWZ,"}/*!sc*/
.jmbKPm{width:100%;max-width:100px;min-width:0;height:32px;padding:0 16px;-webkit-appearance:none;-moz-appearance:none;appearance:none;background-color:#f5f5fa;font-size:12px;border-radius:4px;}/*!sc*/
data-styled.g247[id="algoliaConnectComponent___StyledInput-sc-1c85asy-1"]{content:"jmbKPm,"}/*!sc*/
.eZHEjD{width:100%;max-width:100px;min-width:0;height:32px;padding:0 16px;-webkit-appearance:none;-moz-appearance:none;appearance:none;background-color:#f5f5fa;font-size:12px;color:#5d6494;border-radius:4px;}/*!sc*/
data-styled.g248[id="algoliaConnectComponent___StyledInput2-sc-1c85asy-2"]{content:"eZHEjD,"}/*!sc*/
.gqxLmc{display:block;height:32px;margin-left:8px;padding-left:16px;padding-right:16px;background:linear-gradient(90deg,#f5515f 0%,#9f041b 100%);color:#fff;border-radius:4px;box-shadow:0 4px 11px 0 rgba(37,44,97,0.15),0 2px 3px 0 rgba(93,100,148,0.2);-webkit-transition:all 0.2s ease-out;transition:all 0.2s ease-out;}/*!sc*/
data-styled.g249[id="algoliaConnectComponent___StyledButton-sc-1c85asy-3"]{content:"gqxLmc,"}/*!sc*/
.gWgnak{display:grid;grid-template-columns:6% 10% auto 25%;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;grid-template-areas:"logo menu search notification";}/*!sc*/
@media (max-width:768px){.gWgnak{grid-template-columns:25% 25% 25% 25%;grid-template-areas:"menu logo logo user" "notification notification notification notification" "search search search search";}.gWgnak .logo{justify-self:center;margin-bottom:1rem;max-width:76px;width:100%;}.gWgnak .menu{position:relative;justify-self:left;}}/*!sc*/
data-styled.g253[id="search-results__GridContainer-sc-6ln6mm-1"]{content:"gWgnak,"}/*!sc*/
.jpeNuX{min-height:3rem;}/*!sc*/
data-styled.g254[id="search-results___StyledDiv-sc-6ln6mm-2"]{content:"jpeNuX,"}/*!sc*/
.ejWvfj{right:30px;bottom:30px;background:linear-gradient(90deg,#f5515f 0%,#9f041b 100%);}/*!sc*/
@media (max-width:767px){.ejWvfj{bottom:75px;}}/*!sc*/
data-styled.g255[id="search-results___StyledButton-sc-6ln6mm-3"]{content:"ejWvfj,"}/*!sc*/
  </style>
  <link href="/static/manifest.json" rel="manifest"/>
  <title>
   E-valy Limited | Online Shopping Mall
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0, shrink-to-fit=no, maximum-scale=1.0, user-scalable=no" name="viewport"/>
  <meta content="E-valy Limited | Online Shopping Mall" property="og:title"/>
  <meta content="article" property="og:type"/>
  <meta content="https://s3-ap-southeast-1.amazonaws.com/media.evaly.com.bd/media/2019-08-04_090235.843922android-icon-200x200.png" property="og:image"/>
  <meta content="450" property="og:image:width"/>
  <meta content="298" property="og:image:height"/>
  <meta content="https://evaly.com.bd" property="og:url"/>
  <meta content="E-valy is an e-commerce site which will be capable of providing every kind of goods and products from every sector to every consumer located in Bangladesh." property="og:description"/>
  <link href="/static/images/icons/favicon.ico" rel="shortcut icon"/>
  <meta content="evaly://" property="al:android:url"/>
  <meta content="Evaly" property="al:android:app_name"/>
  <meta content="bd.com.evaly.evalymarchant" property="al:android:package"/>
  <meta content="14" name="next-head-count"/>
  <link as="style" href="/_next/static/css/d48fe9f040f8d2f97c7e.css" rel="preload"/>
  <link href="/_next/static/css/d48fe9f040f8d2f97c7e.css" rel="stylesheet"/>
  <link as="script" href="/_next/static/RZ7VftogY8QkgPiLg6BPz/pages/_app.js" rel="preload"/>
  <link as="script" href="/_next/static/RZ7VftogY8QkgPiLg6BPz/pages/search-results.js" rel="preload"/>
  <link as="script" href="/_next/static/runtime/webpack-6b3d3cda09a7b5b5debf.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/framework.7dfd02d307191d63a37e.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/b637e9a5.a705a21716e5b01f8145.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/0c9dcbbe.7fbd830a3d684b32423b.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/commons.afffbbb0420dd9af938a.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/6a597b002e9daab94e2e0adeb626acca4f1f6515.28c9d68d9749974f08e1.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/bba5516912876db85383b691379c4486ab998795.071cf6d38264238f2f49.js" rel="preload"/>
  <link as="script" href="/_next/static/runtime/main-3c89e50e2c7d7034f938.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/252f366e.32bec51017e26b1dae31.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/95b64a6e.a74dcc7937bf0c356811.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/d7eeaac4.afdce0938beabe8eef9a.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/2dc48ec14d05924f473dce007726385374c258b9.0a52afc0ae53472a590f.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/3ad14741d7bfb55e1bcea5bfc6670f090f0855af.b5af8ef4be1abd2d5791.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/f6d549f16f3909adbb4f9a302aacab15937bfbda.94c734c42c1caf61b869.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/a9dd91d4607a584382b3e8a70a910ee9fb417c65.cabb84905704185ea6f6.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/4cbc61372435748121077b3b94e57617b6c8338d.5ae2119035f5c9d8c81c.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/411365f484ca502253106aae57d21ae3bb416d15.2f90a1a0cb46996155b4.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/69ef8573555555a232f56c2d2a1de6a4101c15d0.d8f92afd6f8ceb35f607.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/5d7bf10f24bff82d5530a050de689a7c020a359b.36ce757546da64e3337c.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/c8a8012dbcfaeb41f17a667b3a927ba45766e4a2.312913bb8463128a068e.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/c1f80152d80b1129cab9e73f90501b8957be40a7.04f2303ad32c2682fab1.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/8d4460396e9219a79f33af22e0a8f4fe429b291e.cda426e58b75b281586e.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/57f045ed70322177467d785413f62aff844e25d2.ad35b737612878a9f01a.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/0378a7d7ac3f1a3f5f0e99380b068fe3a41b14e6.46f0a10d89a7db3593b1.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/680dd3e5bbe68ece4bf42804461f8830da8bd4e0.d71300269070cc46823a.js" rel="preload"/>
 </head>
 <body>
  <div id="__next">
   <div class="jsx-2334610719 min-h-screen pb-2" style="background-color:#F7F8FA">
    <div class="ais-InstantSearch__root">
     <div class="topbar bg-gray-100 py-1 text-gray-600 hidden md:block">
      <div class="container flex justify-between text-sm">
       <div class="flex">
        <div class="mr-4">
         <a href="https://merchant.evaly.com.bd/">
          <svg class="w-3 h-3 mr-1 inline align-baseline">
           <use href="/static/images/icons.svg#shop" xlink:href="/static/images/icons.svg#shop">
           </use>
          </svg>
          Merchant zone
         </a>
        </div>
        <div class="mr-4">
         <a href="/feeds">
          <svg class="w-3 h-3 mr-1 inline align-baseline">
           <use href="/static/images/icons.svg#newsfeed" xlink:href="/static/images/icons.svg#newsfeed">
           </use>
          </svg>
          News Feed
         </a>
        </div>
        <div class="mr-4">
         <a href="https://play.google.com/store/apps/details?id=bd.com.evaly.evalyshop">
          <svg class="w-3 h-3 mr-1 inline align-baseline">
           <use href="/static/images/icons.svg#mobile" xlink:href="/static/images/icons.svg#mobile">
           </use>
          </svg>
          Download App
         </a>
        </div>
       </div>
       <div class="flex">
        <div class="mr-4">
         <a href="https://www.facebook.com/groups/EvalyHelpDesk/">
          <svg class="w-3 h-3 mr-1 inline align-baseline">
           <use href="/static/images/icons.svg#help" xlink:href="/static/images/icons.svg#help">
           </use>
          </svg>
          <!-- -->
          Help
         </a>
        </div>
        <div>
         <a href="https://www.facebook.com/evaly.com.bd/">
          <svg class="w-3 h-3 mr-1 inline align-baseline">
           <use href="/static/images/icons.svg#facebook" xlink:href="/static/images/icons.svg#facebook">
           </use>
          </svg>
          <!-- -->
          Follow us
         </a>
        </div>
       </div>
      </div>
     </div>
     <div class="bg-white header" style="box-shadow:0 4px 16px 0 rgba(0,0,0,0.04)">
      <div class="search-results__Container-sc-6ln6mm-0 hFUCjp container py-5 px-8">
       <div class="search-results__GridContainer-sc-6ln6mm-1 gWgnak">
        <a class="logo xs:w-1/2" href="/" style="grid-area:logo">
         <img alt="logo" class="" src="/static/images/logo.svg" style="max-width:76px"/>
        </a>
        <button class="text-2xl menu md:block mb-4 md:mb-0" style="grid-area:menu">
         <svg class="m-auto text-gray-700" fill="currentColor" height="1em" stroke="currentColor" stroke-width="0" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg">
          <path d="M3 18h18v-2H3v2zm0-5h18v-2H3v2zm0-7v2h18V6H3z">
          </path>
         </svg>
        </button>
        <div class="md:hidden mb-4" style="grid-area:user;justify-self:right">
         <button class="flex items-center">
          <span class="flex w-full items-center text-gray-700">
           <span>
            <svg color="#1D2531" fill="currentColor" height="25" size="25" stroke="currentColor" stroke-width="0" style="color:#1D2531" viewbox="0 0 1024 1024" width="25" xmlns="http://www.w3.org/2000/svg">
             <path d="M858.5 763.6a374 374 0 0 0-80.6-119.5 375.63 375.63 0 0 0-119.5-80.6c-.4-.2-.8-.3-1.2-.5C719.5 518 760 444.7 760 362c0-137-111-248-248-248S264 225 264 362c0 82.7 40.5 156 102.8 201.1-.4.2-.8.3-1.2.5-44.8 18.9-85 46-119.5 80.6a375.63 375.63 0 0 0-80.6 119.5A371.7 371.7 0 0 0 136 901.8a8 8 0 0 0 8 8.2h60c4.4 0 7.9-3.5 8-7.8 2-77.2 33-149.5 87.8-204.3 56.7-56.7 132-87.9 212.2-87.9s155.5 31.2 212.2 87.9C779 752.7 810 825 812 902.2c.1 4.4 3.6 7.8 8 7.8h60a8 8 0 0 0 8-8.2c-1-47.8-10.9-94.3-29.5-138.2zM512 534c-45.9 0-89.1-17.9-121.6-50.4S340 407.9 340 362c0-45.9 17.9-89.1 50.4-121.6S466.1 190 512 190s89.1 17.9 121.6 50.4S684 316.1 684 362c0 45.9-17.9 89.1-50.4 121.6S557.9 534 512 534z">
             </path>
            </svg>
           </span>
          </span>
         </button>
        </div>
        <div style="grid-area:search">
         <form action="" novalidate="" role="search">
          <div class="Searchbar__Container-xnx3kr-1 jtCmJd">
           <input class="Searchbar__SeachInput-xnx3kr-0 lfkzsQ" placeholder="Search..." type="search" value="remax 610"/>
           <figure class="Searchbar__Button-xnx3kr-3 BVXNH" color="black">
            <svg _css2="
    @media (max-width: ,768px,) {
      ,
            font-size:20px;
          ,
    }
  " class="Searchbar___StyledMdSearch-xnx3kr-5 XBQPS" color="white" fill="currentColor" height="1em" stroke="currentColor" stroke-width="0" style="color:white" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg">
             <path d="M15.5 14h-.79l-.28-.27C15.41 12.59 16 11.11 16 9.5 16 5.91 13.09 3 9.5 3S3 5.91 3 9.5 5.91 16 9.5 16c1.61 0 3.09-.59 4.23-1.57l.27.28v.79l5 
4.99L20.49 19l-4.99-5zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z">
             </path>
            </svg>
           </figure>
          </div>
         </form>
        </div>
        <div class="md:pl-4 notification hidden md:block" style="grid-area:notification">
         <div class="flex justify-between items-center mb-4 mx-16 md:mx-0 md:mb-0 lg:ml-8">
          <button class="text-2xl menu md:hidden">
           <svg class="m-auto" fill="currentColor" height="1em" stroke="currentColor" stroke-width="0" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg">
            <path d="M3 18h18v-2H3v2zm0-5h18v-2H3v2zm0-7v2h18V6H3z">
            </path>
           </svg>
          </button>
          <button class="relative">
           <svg color="#1D2531" fill="currentColor" height="25" size="25" stroke="currentColor" stroke-width="0" style="color:#1D2531" view

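How can I solve these problems? Edit: using Selenium and the Chrome driver would be too time-consuming for my project.

【Comments】:

  • Your problem is probably the most common one in web scraping: if you load a web page via requests.get, none of the JavaScript on the page is executed, and it is probably JavaScript that actually populates the list you are after (this is very likely the case for modern websites). You need something like Selenium with a headless browser, such as Chrome Driver for Selenium, to load the page and execute the scripts on it.
  • Actually, using Selenium and the Chrome driver would be rather time-consuming for my project. Is there any other option besides Beautiful Soup? @Grismar
  • bs4 will work together with Selenium, but there are certainly other options. Whatever you choose, though, you will need a headless browser that runs the JavaScript for you. Of all the alternatives, I'd say Selenium is the easiest to set up and performs reasonably well, but that's just my opinion.
  • I want to build a website with the listing data. Is it possible to use a headless browser to collect the data for the website? @Grismar
  • Anything is possible; the real question is always: is it a good idea? You can definitely use a headless browser to collect the data you need (provided it is legal for you to do so), but if you are running a website that serves that information to third parties or other users, you will probably want to cache the data in a database. How to do that, however, is beyond the scope of a simple StackOverflow question.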
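
One quick way to confirm the point made in the first comment is to check whether the static HTML contains any result items at all. The following is a minimal sketch using only the standard library. The sample HTML strings and the `ais-InfiniteHits-item` class name are assumptions made for illustration; only the sentinel class appears in the question.

```python
from html.parser import HTMLParser


class LiCollector(HTMLParser):
    """Collect the class attribute of every <li> tag in a document."""

    def __init__(self):
        super().__init__()
        self.li_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_classes.append(dict(attrs).get("class") or "")


def has_rendered_hits(html, sentinel="ais-InfiniteHits-sentinel"):
    """Return True if the HTML has at least one <li> that is not the sentinel."""
    parser = LiCollector()
    parser.feed(html)
    return any(sentinel not in cls.split() for cls in parser.li_classes)


# What requests.get() returned in the question: only the empty sentinel.
static_html = '<ul><li class="ais-InfiniteHits-sentinel"></li></ul>'
# Hypothetical markup after JavaScript has run in a browser.
rendered_html = (
    '<ul><li class="ais-InfiniteHits-item">Remax 610D</li>'
    '<li class="ais-InfiniteHits-sentinel"></li></ul>'
)

print(has_rendered_hits(static_html))    # False
print(has_rendered_hits(rendered_html))  # True
```

If the check returns False for the downloaded page, the items are most likely filled in by JavaScript after load, and the data has to come from a browser-driven tool or from the underlying network calls.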

Tags: python html web-scraping beautifulsoup


【Solution 1】:

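Try the approach below using requests and json. I built the script around the API URL, which I found by inspecting the network calls that fire on page load in Chrome, and then created dynamic form data to walk through each page and fetch the data.

What exactly the script does:

  1. First, the script creates form data for querying the API, where page_no, the query string, and max values per facet (the number of results to show) are dynamic; the page_no parameter is incremented by 1 after each traversal.

  2. The request uses the POST method with the created form data and the URL to fetch the data, which is then passed to json to be parsed and loaded.

  3. From the parsed data, the script then iterates over the JSON object where the data actually lives.

  4. Finally, it loops over each page's batch of data and prints the items one by one.

Right now the script displays only a little information; you can access more from the JSON object in the same way I did below.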

import json
import requests
from urllib3.exceptions import InsecureRequestWarning

# verify=False is used below, so silence the TLS warning it would trigger
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


def scrap_evaly_data():
    QUERY = 'remax%20610'  # URL-encoded query string; change it to fetch another product
    MAX_VALUES_PER_FACET = 10  # sent as maxValuesPerFacet (the author uses it as results per page)
    page_no = 0  # default page no.
    URL = 'https://eza2j926q5-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%20(lite)%3B%20react%20(16.13.1)%3B%20react-instantsearch%20(5.7.0)%3B%20JS%20Helper%20(2.28.1)&x-algolia-application-id=EZA2J926Q5&x-algolia-api-key=ca9abeea06c16b7d531694d6783a8f04'  # API URL for querying

    while True:
        print('Hold on, creating new form data...')
        # form data is rebuilt on every pass so that each request asks for the next page
        form_data = {
            "requests":[{"indexName":"products","params":"query=" + QUERY + "&maxValuesPerFacet=" + str(MAX_VALUES_PER_FACET) + "&page=" + str(page_no) + "&highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&facets=%5B%22price%22%2C%22category_name%22%2C%22brand_name%22%2C%22shop_name%22%2C%22color%22%5D&tagFilters="}]
        }
        print('Created new form data, going to fetch data...')
        response = requests.post(URL, json=form_data, verify=False)  # POST the form data as JSON

        result = json.loads(response.text)  # parse the JSON response
        hits = result['results'][0]['hits']  # product information for this page
        if not hits:  # an empty page means there is nothing left to fetch
            break

        for item in hits:
            print('-' * 100)
            print('Brand Name: ', item['brand_name'])
            print('Category Name: ', item['category_name'])
            print('Discount Price: ', item['discounted_price'])
            print('Max Price: ', item['max_price'])
            print('Min Price: ', item['min_price'])
            print('Product Name: ', item['name'])
            print('Product Image: ', item['product_image'])
            print('Shop Item ID: ', item['shop_item_id'])
            print('Shop Name: ', item['shop_name'])
            print('Slug Info: ', item['slug'])
            print('-' * 100)

        page_no += 1  # increment the page number by 1 after each traversal


scrap_evaly_data()
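
The paging steps described above can also be sketched with the network call separated from the loop, so the logic can be tried without hitting the API. This is only an illustration: `build_payload`, `iter_hits`, and `fake_fetch` are hypothetical names, the payload is a trimmed version of the form data in the script, and stopping on an empty `hits` list is an assumption about how the last page is signalled.

```python
from urllib.parse import quote, urlencode


def build_payload(query, page, max_values_per_facet=10):
    """Build the POST body for one results page (trimmed version of the script's form data)."""
    params = urlencode(
        {
            "query": query,  # plain text; urlencode escapes it
            "maxValuesPerFacet": max_values_per_facet,
            "page": page,
        },
        quote_via=quote,  # encode spaces as %20, as in the original query string
    )
    return {"requests": [{"indexName": "products", "params": params}]}


def iter_hits(fetch, query, max_pages=50):
    """Yield hits page by page until a page is empty or max_pages is reached."""
    for page in range(max_pages):
        result = fetch(build_payload(query, page))
        hits = result["results"][0]["hits"]
        if not hits:
            break
        yield from hits


def fake_fetch(payload):
    """Stand-in for requests.post(URL, json=payload).json(), serving two fake pages."""
    params = payload["requests"][0]["params"]
    pages = {"page=0": [{"name": "A"}, {"name": "B"}], "page=1": [{"name": "C"}]}
    for key, hits in pages.items():
        if params.endswith(key):
            return {"results": [{"hits": hits}]}
    return {"results": [{"hits": []}]}


print([hit["name"] for hit in iter_hits(fake_fetch, "remax 610")])  # ['A', 'B', 'C']
```

To use it against the real endpoint, `fake_fetch` would be replaced by a small function that posts the payload with requests and returns the parsed JSON. The `max_pages` cap is a safety net so that an unexpected response cannot keep the loop running indefinitely.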

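【Comments】:

  • Could you paste the code into Pastebin? I get an error when calling the function scrape_evaly_date()
  • If I want to sort from min to max, where should I change the code? @Vin
  • If this worked for you, please upvote as well. Thanks.
  • If I want to scrape a website whose form data is not in a dict, e.g. ajkerdeal.com/searchproduct.aspx, what changes should I make to form_data in the code? @Vin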
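
On the sorting question above, one option is to collect the hits first and sort them on the client side instead of printing them page by page. A minimal sketch, assuming each hit is a dict with a numeric `min_price` field like the one printed by the script; `sort_by_min_price` is a hypothetical helper and the sample records are made up.

```python
def sort_by_min_price(hits, descending=False):
    """Return hits ordered by min_price; hits without a price go last."""
    priced = [h for h in hits if h.get("min_price") is not None]
    unpriced = [h for h in hits if h.get("min_price") is None]
    priced.sort(key=lambda h: float(h["min_price"]), reverse=descending)
    return priced + unpriced


# Made-up records standing in for items collected from the API.
sample_hits = [
    {"name": "shop A", "min_price": 450},
    {"name": "shop B", "min_price": 390},
    {"name": "shop C", "min_price": None},
]

print([h["name"] for h in sort_by_min_price(sample_hits)])
# ['shop B', 'shop A', 'shop C']
```

Sorting after collection keeps the request code unchanged; whether the API itself can return results already ordered by price is not something the original answer covers.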