【Question title】: Parsing section of website with playwright or requests
【Posted】: 2021-12-01 11:13:03
【Question】:

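I'm trying to parse one section of the Coinbase blog at https://blog.coinbase.com/, namely the first article in the group of nine at the bottom, starting from <div class="streamItem streamItem--section js-streamItem" data-action-scope="_actionscope_6">, to get the latest news. (I don't know how else to do this on Medium, the platform that hosts the Coinbase blog: the main page shows posts with random dates, and so does search.) For some reason I can't. I first tried requests, which works, but only up to this section. Then I tried Playwright with the following code: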

#!/usr/bin/env python
# coding: utf-8
import asyncio

from playwright.async_api import async_playwright


async def parser():
    page_path = "https://blog.coinbase.com/"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # Present a desktop Chrome user agent to the site.
        page = await browser.new_page(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
        await page.goto(page_path)
        # Snapshot the DOM as it stands right after navigation.
        page_content = await page.content()
        await browser.close()
        print(page_content)


asyncio.run(parser())

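Same result: the output is complete only up to this section.

I also tried a scraping tool like the one described at https://scrapingant.com/blog/scrape-dynamic-website-with-python, and it works, but I need to solve this another way, with requests or Playwright, preferably requests.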

【Comments】:

  • Still haven't found a suitable solution

Tags: python python-requests playwright-python


【Solution 1】:

I was able to get the titles of the news articles with the following code:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
  browser = p.chromium.launch(headless=False)
  page = browser.new_page()
  page.goto('https://blog.coinbase.com/', wait_until='domcontentloaded')
  # Select every element that has a data-post-id attribute.
  elements = page.query_selector_all('*[data-post-id]')
  titles = []
  for element in elements:
    # query_selector returns None when no title node matches.
    title_node = element.query_selector('h3 div')
    if title_node is None:
      continue
    title = title_node.text_content()
    # Skip duplicates while preserving page order.
    if title not in titles:
      titles.append(title)
  print(titles)
  browser.close()
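
If the lower section of the page is only filled in as the page is scrolled (an assumption; the question does not confirm why the content stops at that section), scrolling before reading the DOM might help. Here is a minimal sketch of that idea. The helper names scroll_until_stable and fetch_rendered_html, and the default values of max_rounds and pause_ms, are hypothetical and chosen only for illustration.

```python
def scroll_until_stable(page, max_rounds=10, pause_ms=1000):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `page` is expected to behave like a Playwright sync Page
    (it needs `evaluate` and `wait_for_timeout`).
    Returns the number of scroll rounds performed.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    rounds = 0
    for _ in range(max_rounds):
        # Jump to the current bottom, then give lazy content time to load.
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)
        rounds += 1
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was appended, so stop scrolling.
            break
        last_height = new_height
    return rounds


def fetch_rendered_html(url):
    """Load `url` in Chromium, scroll it, and return the resulting HTML."""
    # Imported lazily so scroll_until_stable can be used without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        scroll_until_stable(page)
        html = page.content()
        browser.close()
    return html
```

Calling fetch_rendered_html('https://blog.coinbase.com/') would return the HTML after scrolling. Whether that HTML then contains the missing section depends on how the site actually loads it.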

It may not be exactly what you're looking for, but hopefully it points you in the right direction!
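
Since the question prefers requests, the same title extraction can also be applied to an HTML string, however it was obtained. The sketch below uses only the standard library. It assumes the markup matches the selectors used above (elements with a data-post-id attribute that contain an h3) and that tags are properly nested. It collects the whole h3 text rather than only the nested div. PostTitleParser and extract_titles are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PostTitleParser(HTMLParser):
    """Collect the text of <h3> elements found inside data-post-id elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0          # current nesting depth
        self._post_depth = None  # depth at which the enclosing post opened
        self._h3_depth = None    # depth at which the current <h3> opened
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._depth += 1
        in_post = self._post_depth is not None
        if not in_post and any(name == "data-post-id" for name, _ in attrs):
            self._post_depth = self._depth
        elif in_post and self._h3_depth is None and tag == "h3":
            self._h3_depth = self._depth
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._h3_depth == self._depth:
            # Normalize whitespace and skip duplicates, keeping page order.
            title = " ".join("".join(self._buffer).split())
            if title and title not in self.titles:
                self.titles.append(title)
            self._h3_depth = None
        if self._post_depth == self._depth:
            self._post_depth = None
        self._depth -= 1

    def handle_data(self, data):
        if self._h3_depth is not None:
            self._buffer.append(data)


def extract_titles(html):
    """Return the unique post titles found in `html`, in document order."""
    parser = PostTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles
```

For example, extract_titles(page_content) could be called on the string printed by the question's script. It only finds titles that are actually present in that HTML, so it does not by itself solve the case where the section is missing from the response.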
