【Question title】: Unable to scrape this site. How to scrape data from this site?
【Posted】: 2019-08-27 06:01:52
【Question description】:

I am unable to scrape data from this site.

I tried other sites, and they work fine...

from bs4 import BeautifulSoup
from urllib.request import urlopen

response = urlopen("https://www.daraz.com.np/catalog/?spm=a2a0e.searchlistcategory.search.2.3eac4b8amQJ0zd&q=samsung%20m20&_keyori=ss&from=suggest_normal&sugg=samsung%20m20_1_1")

html = response.read()

parsed_html = BeautifulSoup(html, "html.parser")

containers = parsed_html.find_all("div", {"class" : "c2prKC"})

print(len(containers))

【Comments】:

    Tags: python web-scraping beautifulsoup screen-scraping


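    【Solution 1】:

    The content appears to be rendered onto the page by JavaScript after it loads, so the raw HTML that urlopen receives does not contain those elements. You can use Selenium to render the page and Beautiful Soup to extract the elements.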

    from bs4 import BeautifulSoup
    from selenium import webdriver
    import time
    driver = webdriver.Chrome()
    driver.get("https://www.daraz.com.np/catalog/?spm=a2a0e.searchlistcategory.search.2.3eac4b8amQJ0zd&q=samsung%20m20&_keyori=ss&from=suggest_normal&sugg=samsung%20m20_1_1")
    time.sleep(5)
    
    html = driver.page_source
    
    parsed_html = BeautifulSoup(html, "html.parser")
    
    containers = parsed_html.find_all("div", {"class" : "c2prKC"})
    
    print(len(containers))
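
A fixed `time.sleep(5)` can be too short on a slow connection and wasteful on a fast one. As a hedged variation on the snippet above, the sketch below waits explicitly until an element matching a CSS selector is present. The function name `fetch_rendered_html` and its parameters are made up for illustration, and the class name `c2prKC` comes from the question and may no longer exist on the live site.

```python
def fetch_rendered_html(url: str, css_selector: str, timeout: float = 10.0) -> str:
    """Load `url` in Chrome and return the page source once `css_selector` matches."""
    # Imported inside the function so this module can be loaded
    # even on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one matching element exists in the DOM;
        # a TimeoutException is raised if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even when the wait times out.
        driver.quit()
```

Calling `fetch_rendered_html(url, "div.c2prKC")` with the catalog URL would return HTML that can be passed to `BeautifulSoup(html, "html.parser")` exactly as in the snippet above.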
    

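    【Discussion】:

      【Solution 2】:

      The information you want is inside a script tag. You can either use a regular expression or loop over the script tags to get the right string, which can then be parsed as JSON (after a slight modification).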

      import requests
      import json
      from bs4 import BeautifulSoup as bs
      import pandas as pd
      
      headers = {
          'User-Agent' : 'Mozilla/5.0'
      }
      res = requests.get('https://www.daraz.com.np/catalog/?spm=a2a0e.searchlistcategory.search.2.3eac4b8amQJ0zd&q=samsung%20m20&_keyori=ss&from=suggest_normal&sugg=samsung%20m20_1_1', headers = headers)
      soup = bs(res.content, 'lxml')
      for script in soup.select('script'):
          if 'window.pageData=' in script.text:
              script = script.text.replace('window.pageData=','')
              break
      items = json.loads(script)['mods']['listItems']
      results = []
      
      for item in items:
          #print(item)
          #extract other info you want
          row = [item['name'], item['priceShow'], item['productUrl'], item['ratingScore']]
          results.append(row)
      
      df = pd.DataFrame(results, columns = ['Name', 'Price', 'ProductUrl', 'Rating'])
      
      print(df.head())
      

      Regex version:

      import re
      import requests
      import json
      from bs4 import BeautifulSoup as bs
      import pandas as pd
      
      headers = {
          'User-Agent' : 'Mozilla/5.0'
      }
      res = requests.get('https://www.daraz.com.np/catalog/?spm=a2a0e.searchlistcategory.search.2.3eac4b8amQJ0zd&q=samsung%20m20&_keyori=ss&from=suggest_normal&sugg=samsung%20m20_1_1', headers = headers)
      soup = bs(res.content, 'lxml')
      r = re.compile(r'window.pageData=(.*)')
      data = soup.find('script', text=r).text
      script = r.findall(data)[0]
      items = json.loads(script)['mods']['listItems']
      results = []
      
      for item in items:
          row = [item['name'], item['priceShow'], item['productUrl'], item['ratingScore']]
          results.append(row)
      
      df = pd.DataFrame(results, columns = ['Name', 'Price', 'ProductUrl', 'Rating'])
      
      print(df.head())
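
Both versions above pass the text after `window.pageData=` straight to `json.loads`, which fails if anything (a semicolon, more script) follows the JSON object. A minimal sketch of a more tolerant extraction, using only the standard library, is shown below. The helper names `extract_page_data` and `list_items` and the sample HTML are made up for illustration; the `mods` / `listItems` keys are the ones used in this answer and may have changed on the live site.

```python
import json
import re

# Marker taken from the answer above; the live page layout may differ.
PAGE_DATA_MARKER = re.compile(r"window\.pageData\s*=\s*")


def extract_page_data(html: str) -> dict:
    """Return the JSON object assigned to window.pageData in `html`."""
    match = PAGE_DATA_MARKER.search(html)
    if match is None:
        raise ValueError("window.pageData not found in the page source")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (a semicolon, more script, </script>).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


def list_items(page_data: dict) -> list:
    """Collect name, price, URL and rating per item; missing keys become None."""
    items = page_data.get("mods", {}).get("listItems", [])
    keys = ("name", "priceShow", "productUrl", "ratingScore")
    return [{key: item.get(key) for key in keys} for item in items]


# Made-up sample standing in for the real page source (e.g. res.text).
sample_html = (
    '<html><script>window.pageData={"mods": {"listItems": ['
    '{"name": "Phone A", "priceShow": "Rs. 100", "productUrl": "//example.com/a"}'
    "]}};</script></html>"
)
rows = list_items(extract_page_data(sample_html))
print(rows)
```

With the real page, `extract_page_data(res.text)` would replace the sample, and the resulting list of dictionaries can be handed to `pd.DataFrame(rows)` directly.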
      

      【Discussion】:

        【Solution 3】:
        import requests
        import json
        from bs4 import BeautifulSoup as bs
        import pandas as pd
        
        headers = {
            'User-Agent' : 'Mozilla/5.0'
        }
        res = requests.get('https://www.daraz.com.np/catalog/?q=camera&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.71a64360Kgxf1m', headers = headers)
        soup = bs(res.content, 'lxml')
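        scriptData=''
        # every <script> tag on the page; one of them holds window.pageData
        containerSearch = soup.find_all('script')
        for d in containerSearch: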
            if 'window.pageData=' in str(d):
                scriptData=str(d).replace('window.pageData=','')
                break
        scriptData=scriptData.replace('<script>','')
        scriptData=scriptData.replace('</script>','')
        items = json.loads(scriptData)
        name=items['mods']['listItems'][0]['name']
        image=items['mods']['listItems'][0]['image']
        price=items['mods']['listItems'][0]['price']
        priceShow=items['mods']['listItems'][0]['priceShow']
        ratingScore=items['mods']['listItems'][0]['ratingScore']
        productUrl=items['mods']['listItems'][0]['productUrl']
        
        print(name)
        print(price)
        

        【Discussion】:
