【Question title】: Scraping with Selenium and BeautifulSoup
【Posted】: 2021-07-28 07:37:52
【Question】:

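I'm looking for some help with scraping using Selenium in Python. You need a paid account to view this page, so I can't provide a reproducible example.

I am trying to extract the data for the blue dots and black arrows. The data is in this piece of HTML.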

<svg viewBox="0 0 105 68" class="video-summaries__field-arrows" preserveAspectRatio="none" xmlns="http://www.w3.org/2000/svg">
   <defs>
      <marker fill="#000" id="default_arrow" markerWidth="5" markerHeight="4" orient="auto" refX="5" refY="2" stroke="none">
         <polygon points="0 0, 5 2, 0 4"></polygon>
      </marker>
      <marker fill="#0033ff" id="hover_arrow" markerWidth="2.9" markerHeight="2.4" orient="auto" refX="2.5" refY="1.2" stroke="none">
         <polygon points="0 0, 2.9 1.2, 0 2.4"></polygon>
      </marker>
   </defs>
   <path class="videosummaries-arrows" d="M52.5 35.1 37.6 33.3" fill="none" marker-end="url(#default_arrow)" stroke="url(#gradient_0)" style="stroke-width: 0.25;"></path>
   <linearGradient gradientUnits="userSpaceOnUse" id="gradient_0" x1="52.5" x2="37.6" y1="35.1" y2="33.3">
      <stop offset="5%" stop-color="#000" stop-opacity="0.1"></stop>
      <stop offset="100%" stop-color="#000" stop-opacity="1"></stop>
   </linearGradient>
   <path class="videosummaries-arrows" d="M38.2 34.7 76.6 62" fill="none" marker-end="url(#default_arrow)" stroke="url(#gradient_1)" style="stroke-width: 0.25;"></path>
   <linearGradient gradientUnits="userSpaceOnUse" id="gradient_1" x1="38.2" x2="76.6" y1="34.7" y2="62">
      <stop offset="5%" stop-color="#000" stop-opacity="0.1"></stop>
      <stop offset="100%" stop-color="#000" stop-opacity="1"></stop>
   </linearGradient>
   <path class="videosummaries-arrows" d="M61.6 67.8 36.3 63.9" fill="none" marker-end="url(#default_arrow)" stroke="url(#gradient_2)" style="stroke-width: 0.25;"></path>
   <linearGradient gradientUnits="userSpaceOnUse" id="gradient_2" x1="61.6" x2="36.3" y1="67.8" y2="63.9">
      <stop offset="5%" stop-color="#000" stop-opacity="0.1"></stop>
      <stop offset="100%" stop-color="#000" stop-opacity="1"></stop>
   </linearGradient>
   <path class="videosummaries-arrows" d="M36.3 63.9 36.5 26.700000000000003" fill="none" marker-end="url(#default_arrow)" stroke="url(#gradient_3)" style="stroke-width: 0.25;"></path>
   <linearGradient gradientUnits="userSpaceOnUse" id="gradient_3" x1="36.3" x2="36.5" y1="63.9" y2="26.700000000000003">
      <stop offset="5%" stop-color="#000" stop-opacity="0.1"></stop>
      <stop offset="100%" stop-color="#000" stop-opacity="1"></stop>
   </linearGradient>
</svg>

Specifically, I am trying to scrape the x1, x2, y1, y2 values from the linearGradient tags.

I get the page source by running

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("start-maximized")
# Selenium 4 style: the driver path goes through Service, and options= replaces chrome_options=
driver = webdriver.Chrome(service=Service(r'C:\Users\James\OneDrive\Desktop\webdriver\chromedriver.exe'), options=options)
# implicitly_wait is a session-wide setting, not a pause, so one call covers every lookup below
driver.implicitly_wait(10)
driver.get('https://football.instatscout.com/teams/9487/video')
print("Page Title is : %s" % driver.title)
# Log in (credentials left blank here)
driver.find_element(By.NAME, 'email').send_keys('')
driver.find_element(By.NAME, 'pass').send_keys('')
driver.find_element(By.XPATH, '//*[contains(concat( " ", @class, " " ), concat( " ", "hRAqIl", " " ))]').click()
# Tick the checkboxes that make the dots and arrows appear
#driver.find_element(By.CSS_SELECTOR, '.dropdown-btn:nth-child(12) .video-summaries__checkbox_red ').click()
driver.find_element(By.CSS_SELECTOR, '.dropdown-btn:nth-child(12) > .video-summaries__checkbox').click()
driver.find_element(By.XPATH, '//*[contains(concat( " ", @class, " " ), concat( " ", "ixmoFk", " " ))]').click()
driver.find_element(By.XPATH, '//*[contains(concat( " ", @class, " " ), concat( " ", "video-summaries__checkbox-column-inner", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "video-summaries__checkbox-column-row", " " )) and (((count(preceding-sibling::*) + 1) = 10) and parent::*)]//*[contains(concat( " ", @class, " " ), concat( " ", "video-summaries__checkbox", " " ))]').click()
driver.find_element(By.XPATH, '//*[contains(concat( " ", @class, " " ), concat( " ", "dropdown-btn", " " )) and (((count(preceding-sibling::*) + 1) = 12) and parent::*)]//*[contains(concat( " ", @class, " " ), concat( " ", "video-summaries__checkbox_red", " " ))]').click()
# Rendered page markup, including the svg shown above
html = driver.page_source

in Selenium, but I don't know where to go from there.

Ultimately I want to scrape this into a DataFrame with the columns 'Name', 'X1', 'Y1', 'X2', 'Y2'.

【Comments】:

    Tags: python selenium beautifulsoup


    【Solution 1】:

    You can use this to scrape the data:

    from bs4 import BeautifulSoup as bs
    
    html="""
    <svg viewBox="0 0 105 68" class="video-summaries__field-arrows" preserveAspectRatio="none" xmlns="http://www.w3.org/2000/svg">
       <defs>
          <marker fill="#000" id="default_arrow" markerWidth="5" markerHeight="4" orient="auto" refX="5" refY="2" stroke="none">
             <polygon points="0 0, 5 2, 0 4"></polygon>
          </marker>
          <marker fill="#0033ff" id="hover_arrow" markerWidth="2.9" markerHeight="2.4" orient="auto" refX="2.5" refY="1.2" stroke="none">
             <polygon points="0 0, 2.9 1.2, 0 2.4"></polygon>
          </marker>
       </defs>
       <path class="videosummaries-arrows" d="M52.5 35.1 37.6 33.3" fill="none" marker-end="url(#default_arrow)" stroke="url(#gradient_0)" style="stroke-width: 0.25;"></path>
       <linearGradient gradientUnits="userSpaceOnUse" id="gradient_0" x1="52.5" x2="37.6" y1="35.1" y2="33.3">
          <stop offset="5%" stop-color="#000" stop-opacity="0.1"></stop>
          <stop offset="100%" stop-color="#000" stop-opacity="1"></stop>
       </linearGradient>
       <path class="videosummaries-arrows" d="M38.2 34.7 76.6 62" fill="none" marker-end="url(#default_arrow)" stroke="url(#gradient_1)" style="stroke-width: 0.25;"></path>
       <linearGradient gradientUnits="userSpaceOnUse" id="gradient_1" x1="38.2" x2="76.6" y1="34.7" y2="62">
          <stop offset="5%" stop-color="#000" stop-opacity="0.1"></stop>
          <stop offset="100%" stop-color="#000" stop-opacity="1"></stop>
       </linearGradient>
       <path class="videosummaries-arrows" d="M61.6 67.8 36.3 63.9" fill="none" marker-end="url(#default_arrow)" stroke="url(#gradient_2)" style="stroke-width: 0.25;"></path>
       <linearGradient gradientUnits="userSpaceOnUse" id="gradient_2" x1="61.6" x2="36.3" y1="67.8" y2="63.9">
          <stop offset="5%" stop-color="#000" stop-opacity="0.1"></stop>
          <stop offset="100%" stop-color="#000" stop-opacity="1"></stop>
       </linearGradient>
       <path class="videosummaries-arrows" d="M36.3 63.9 36.5 26.700000000000003" fill="none" marker-end="url(#default_arrow)" stroke="url(#gradient_3)" style="stroke-width: 0.25;"></path>
       <linearGradient gradientUnits="userSpaceOnUse" id="gradient_3" x1="36.3" x2="36.5" y1="63.9" y2="26.700000000000003">
          <stop offset="5%" stop-color="#000" stop-opacity="0.1"></stop>
          <stop offset="100%" stop-color="#000" stop-opacity="1"></stop>
       </linearGradient>
       </svg>
    """
    
    # The "xml" parser (requires lxml) keeps the mixed-case tag name linearGradient
    soup = bs(html, "xml")
    for lg in soup.find_all("linearGradient", attrs={"gradientUnits": "userSpaceOnUse"}):
        print(lg["x1"], lg["y1"], lg["x2"], lg["y2"])
    
    """
    52.5 35.1 37.6 33.3
    38.2 34.7 76.6 62
    61.6 67.8 36.3 63.9
    36.3 63.9 36.5 26.700000000000003
    """
    

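    We use the xml parser to scrape the data from the svg. I also tested other parsers, including lxml, without success; Beautiful Soup's HTML parsers convert tag and attribute names to lowercase, so a search for linearGradient matches nothing there. The rest is basic: find the elements by tag name and the gradientUnits attribute, then read the attributes from each element.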
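A small check of the parser behaviour described above, using html.parser on a trimmed one-gradient snippet (shortened here purely for illustration). The mixed-case search comes back empty, while the lowercased tag and attribute names match. This is a sketch of a possible fallback when lxml is not installed, not something tested against the real page.

```python
from bs4 import BeautifulSoup

# Trimmed sample with a single gradient (shortened for illustration)
svg = """
<svg viewBox="0 0 105 68" xmlns="http://www.w3.org/2000/svg">
   <linearGradient gradientUnits="userSpaceOnUse" id="gradient_0" x1="52.5" x2="37.6" y1="35.1" y2="33.3">
      <stop offset="5%" stop-color="#000" stop-opacity="0.1"></stop>
   </linearGradient>
</svg>
"""

soup = BeautifulSoup(svg, "html.parser")

# html.parser stores the tag as "lineargradient", so the mixed-case name finds nothing
mixed_case = soup.find_all("linearGradient")

# Searching with lowercased tag and attribute names does match
lower_case = soup.find_all("lineargradient", attrs={"gradientunits": "userSpaceOnUse"})

print(len(mixed_case), len(lower_case))
for lg in lower_case:
    print(lg["x1"], lg["y1"], lg["x2"], lg["y2"])
```

Attribute values keep their original case, which is why "userSpaceOnUse" is still written in mixed case in the lookup.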

    【Discussion】:

    • I have deleted my reply here to avoid any confusion. Thank you for your answer. It helped me a lot.
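The answer prints the coordinates; for the DataFrame with 'Name', 'X1', 'Y1', 'X2', 'Y2' columns that the question asks for, the same loop can collect rows instead of printing. The following is a minimal sketch, assuming pandas and lxml are installed. The function name gradients_to_frame is hypothetical, and using each gradient's id as 'Name' is only a placeholder, since the thread does not say where the name should come from.

```python
import pandas as pd
from bs4 import BeautifulSoup


def gradients_to_frame(svg_markup: str) -> pd.DataFrame:
    """Collect x1/y1/x2/y2 from every linearGradient into a DataFrame."""
    soup = BeautifulSoup(svg_markup, "xml")  # the xml parser keeps mixed-case tag names
    rows = []
    for lg in soup.find_all("linearGradient", attrs={"gradientUnits": "userSpaceOnUse"}):
        rows.append({
            "Name": lg.get("id"),  # placeholder: the real name source is unknown
            "X1": float(lg["x1"]),
            "Y1": float(lg["y1"]),
            "X2": float(lg["x2"]),
            "Y2": float(lg["y2"]),
        })
    return pd.DataFrame(rows, columns=["Name", "X1", "Y1", "X2", "Y2"])


# Two gradients copied from the svg in the question
sample = """
<svg viewBox="0 0 105 68" xmlns="http://www.w3.org/2000/svg">
   <linearGradient gradientUnits="userSpaceOnUse" id="gradient_0" x1="52.5" x2="37.6" y1="35.1" y2="33.3"></linearGradient>
   <linearGradient gradientUnits="userSpaceOnUse" id="gradient_1" x1="38.2" x2="76.6" y1="34.7" y2="62"></linearGradient>
</svg>
"""

df = gradients_to_frame(sample)
print(df)
```

With Selenium, the markup would come from the browser instead of `sample`. A whole `driver.page_source` is not guaranteed to parse cleanly as XML, so it may be safer to pass only the svg element's markup, for example its `outerHTML` attribute; that part is untested here.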