【Question title】: How to scrape specific words from table row?
【Posted】: 2020-10-01 21:51:45
【Question】:

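I want to scrape only the codes from the table below, using Python.

From this table I only want CPT, CTC, PTC, STC, SPT, HTC, P5TC, P1A, P2A, P3A, P1E, P2E, P3E. The set of codes may change from time to time, for example P4E may be added or P1E removed.

The HTML for the table above is: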

<table class="list">
   <tbody>
      <tr>
         <td>
            <p>PRODUCT<br>DESCRIPTION</p>
         </td>
         <td>
            <p><strong>Time Charter:</strong> CPT, CTC, PTC, STC, SPT, HTC, P5TC<br><strong>Time Charter Trip:</strong> P1A, P2A, P3A,<br>P1E, P2E, P3E</p>
         </td>
         <td><strong>Voyage: </strong>C3E, C4E, C5E, C7E</td>
      </tr>
      <tr>
         <td>
            <p>CONTRACT SIZE</p>
            <p></p>
         </td>
         <td>
            <p>1 day</p>
         </td>
         <td>
            <p>1,000 metric tons</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>MINIMUM TICK</p>
            <p></p>
         </td>
         <td>
            <p>US$ 25</p>
         </td>
         <td>
            <p>US$ 0.01</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>FINAL SETTLEMENT PRICE</p>
            <p></p>
         </td>
         <td colspan="2" rowspan="1">
            <p>The floating price will be the end-of-day price as supplied by the Baltic Exchange.</p>
            <p><br><strong>All products:</strong> Final settlement price will be the mean of the daily Baltic Exchange spot price assessments for every trading day in the expiry month.</p>
            <p><br><strong>Exception for P1A, P2A, P3A:</strong> Final settlement price will be the mean of the last 7 Baltic Exchange spot price assessments in the expiry month.</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>CONTRACT SERIES</p>
         </td>
         <td colspan="2" rowspan="1">
            <p><strong><strong>CTC, CPT, PTC, STC, SPT, HTC, P5TC</strong>:</strong> Months, quarters and calendar years out to a maximum of 72 months</p>
            <p><strong>C3E, C4E, C5E, C7E, P1A, P2A, P3A, P1E, P2E, P3E:</strong> Months, quarters and calendar years out to a maximum of 36 months</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>SETTLEMENT</p>
         </td>
         <td colspan="2" rowspan="1">
            <p>At 13:00 hours (UK time) on the last business day of each month within the contract series</p>
         </td>
      </tr>
   </tbody>
</table>

You can see the codes on the following page:

https://www.eex.com/en/products/global-commodities/freight

【Question discussion】:

    Tags: python selenium xpath beautifulsoup css-selectors


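    【Solution 1】:

    If the variable txt contains the HTML from your question, this script extracts all the required codes:

    import re
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(txt, 'html.parser')
    # Newer soupsieve releases spell this selector ':-soup-contains("Time Charter:")'.
    cell = soup.select_one('td:contains("Time Charter:")')
    # Join text nodes with a space so 'P5TC' is not glued to the text after <br>.
    text = cell.get_text(' ')
    # Codes are 3 or 4 uppercase letters/digits; \b keeps 'P5TC' whole.
    codes = re.findall(r'\b[A-Z\d]{3,4}\b', text)
    
    print(codes)
    

    Prints:

    ['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E']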
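
One detail worth checking when the codes vary in length: a fixed-width pattern such as `[A-Z\d]{3}` cuts `P5TC` down to `P5T`, and `.text` joins the text on either side of a `<br>` without whitespace. The sketch below uses only `re` on two hard-coded strings. `SPACED`, `JOINED` and `extract_codes` are hypothetical names, and the strings are assumed to approximate what `td.get_text(' ')` and `td.text` return for the cell in the question.

```python
import re

# Assumed cell text: SPACED mimics td.get_text(' '), JOINED mimics td.text,
# where nothing separates 'P5TC' from the 'Time Charter Trip:' label.
SPACED = ("Time Charter: CPT, CTC, PTC, STC, SPT, HTC, P5TC "
          "Time Charter Trip: P1A, P2A, P3A, P1E, P2E, P3E")
JOINED = ("Time Charter: CPT, CTC, PTC, STC, SPT, HTC, P5TC"
          "Time Charter Trip: P1A, P2A, P3A,P1E, P2E, P3E")

FIXED_WIDTH = re.compile(r"[A-Z\d]{3}")    # exactly three characters
BOUNDED = re.compile(r"\b[A-Z\d]{3,4}\b")  # three or four, as whole words


def extract_codes(text, pattern=BOUNDED):
    """Return every substring of text that matches pattern, in order."""
    return pattern.findall(text)


print(extract_codes(SPACED, FIXED_WIDTH))  # 'P5TC' is truncated to 'P5T'
print(extract_codes(SPACED))               # 'P5TC' survives intact
print(extract_codes(JOINED))               # 'P5TC' is missed: no word boundary after it
```

The bounded pattern relies on whitespace or punctuation around each code, which is why it is paired with a separator when extracting the cell text.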
    

    Edit: to get the codes from all tables, you can use this script:

    import re
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(txt, 'html.parser')
    all_codes = []
    # Every matching cell contributes to one flat list.
    for td in soup.select('td:contains("Time Charter:")'):
        all_codes.extend(re.findall(r'\b[A-Z\d]{3,4}\b', td.get_text(' ')))
    print(all_codes)
    

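    【Discussion】:

    • Thanks, Andrej. This helps. But if the page contains 2 tables, could you tell me what needs to change? In this code, (text = soup.select_one('td:contains("Time Charter:")').text) only extracts the codes from the first table. The page also contains a second table with codes under Time Charter.
    • The code's output is split into 2 groups. It would be great if you could update the code to return everything in one group. Your output: [['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E'], ['OCPM', 'OCTM', 'OPTM', 'OTSM', 'OPSM', 'OHTM', 'O5PM']] Desired output: ['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E', 'OCPM', 'OCTM', 'OPTM', 'OTSM', 'OPSM', 'OHTM', 'O5PM']
    • Thank you very much. Great answer, @Andrej Kesely. This is a complete solution. Really appreciated.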
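
The flattening asked for in the comments can be sketched with the standard library alone. The sample data is copied from the comment above; `per_table` and `flatten` are hypothetical names used only for illustration.

```python
from itertools import chain

# One list of codes per table, as quoted in the comment above.
per_table = [
    ['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC',
     'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E'],
    ['OCPM', 'OCTM', 'OPTM', 'OTSM', 'OPSM', 'OHTM', 'O5PM'],
]


def flatten(groups):
    """Concatenate the per-table lists into one flat list, keeping order."""
    return list(chain.from_iterable(groups))


all_codes = flatten(per_table)
print(all_codes)
```

Calling `list.extend` inside the loop, as the edited script does, has the same effect without building the nested list first.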
    【Solution 2】:

    If your use case is to scrape all the text:

    You need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following Locator Strategies:

    • Using CSS_SELECTOR

      driver.get('https://www.eex.com/en/products/global-commodities/freight')
      print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p"))).text)
      
    • Using XPATH

      driver.get('https://www.eex.com/en/products/global-commodities/freight')
      print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p"))).text)
      
    • Console output:

      Time Charter: CPT, CTC, PTC, STC, SPT, HTC, P5TC
      Time Charter Trip: P1A, P2A, P3A,
      P1E, P2E, P3E
      
    • Note: you have to add the following imports:

      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      

    Update 1

    If you want to extract CPT, CTC, PTC, STC, SPT, HTC, P5TC, then P1A, P2A, P3A, and then P1E, P2E, P3E individually, you can use the following solutions:

    • To print CPT, CTC, PTC, STC, SPT, HTC, P5TC

      #element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
      element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
      print(driver.execute_script('return arguments[0].childNodes[1].textContent;', element).strip())
      
    • To print P1A, P2A, P3A

      #element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
      element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
      print(driver.execute_script('return arguments[0].childNodes[4].textContent;', element).strip())
      
    • To print P1E, P2E, P3E

      #element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
      element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
      print(driver.execute_script('return arguments[0].lastChild.textContent;', element).strip())
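
The indices 1, 4 and `lastChild` used above follow from how the `<p>` element's children are laid out. As a rough illustration, the sketch below parses a copy of that paragraph with the standard library's `xml.dom.minidom` instead of a browser. This is an assumption-laden stand-in: `<br>` is rewritten as `<br/>` so the fragment is well-formed XML, `FRAGMENT` and `text_of` are hypothetical names, and the DOM of the live page may differ.

```python
from xml.dom.minidom import parseString

# The <p> from the question, with <br> written as <br/> so it parses as XML.
FRAGMENT = (
    "<p><strong>Time Charter:</strong> CPT, CTC, PTC, STC, SPT, HTC, P5TC<br/>"
    "<strong>Time Charter Trip:</strong> P1A, P2A, P3A,<br/>P1E, P2E, P3E</p>"
)

p = parseString(FRAGMENT).documentElement

# List each child with its index to see where the text nodes sit.
for index, node in enumerate(p.childNodes):
    kind = "text" if node.nodeType == node.TEXT_NODE else f"<{node.nodeName}>"
    print(index, kind)


def text_of(node):
    """Return the stripped character data of a text node."""
    return node.data.strip()


first = text_of(p.childNodes[1])   # text after the first <strong>
second = text_of(p.childNodes[4])  # text after the second <strong>
third = text_of(p.lastChild)       # text after the last <br/>
print(first, second, third, sep="\n")
```

Because the positions depend on the exact markup, adding or removing a `<br>` or `<strong>` on the page would shift these indices.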
      

    Update 2

    To print all the items together:

    • Code block:

      driver.get('https://www.eex.com/en/products/global-commodities/freight')
      element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
      first = driver.execute_script('return arguments[0].childNodes[1].textContent;', element).strip()
      second = driver.execute_script('return arguments[0].childNodes[4].textContent;', element).strip()
      third = driver.execute_script('return arguments[0].lastChild.textContent;', element).strip()
      # 'part' avoids shadowing the built-in name 'list'.
      for part in (first, second, third):
          print(part)
      
    • Console output:

      CPT, CTC, PTC, STC, SPT, HTC, P5TC
      P1A, P2A, P3A,
      P1E, P2E, P3E
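
If a single Python list is wanted rather than three printed lines, the three strings can be split on commas. A minimal sketch, assuming `first`, `second` and `third` hold the strings shown in the console output above; `to_code_list` is a hypothetical helper name.

```python
def to_code_list(*parts):
    """Split comma-separated strings and return the non-empty, stripped codes."""
    return [code.strip()
            for part in parts
            for code in part.split(",")
            if code.strip()]


# Values copied from the console output above.
first = "CPT, CTC, PTC, STC, SPT, HTC, P5TC"
second = "P1A, P2A, P3A,"
third = "P1E, P2E, P3E"

codes = to_code_list(first, second, third)
print(codes)
```

The `if code.strip()` filter drops the empty piece left by the trailing comma in the second string.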
      

    【Discussion】:

    • Thanks. But I only want the codes, as a list, like ['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E']
    • @chintanpatel Check out the updated answer and let me know the status.