【Question title】: BeautifulSoup Loop Thru Items
【Posted】: 2019-09-02 02:30:12
【Question】:

I have a page with the following structure:

<div class="cloud-grid margin-bottom-40">
<div class="cloud-grid__col is-6">
  <a href="https://cloud.google.com/bigquery/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="bigQuery" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    BigQuery
  </a>
  <div class="cloud-product-card__sub-headline">
    A fully managed, highly scalable data warehouse with built-in ML.
  </div>
  <a href="https://cloud.google.com/dataflow/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataflow" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Cloud Dataflow
  </a>
  <div class="cloud-product-card__sub-headline">
    Real-time batch and stream data processing.
  </div>
  <a href="https://cloud.google.com/dataproc/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataproc" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Cloud Dataproc
  </a>
  <div class="cloud-product-card__sub-headline">
    Managed Spark and Hadoop service.
  </div>
  <a href="https://cloud.google.com/datalab/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDatalab" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Cloud Datalab
  </a>
  <div class="cloud-product-card__sub-headline">
    Explore, analyze, and visualize large datasets.
  </div>
  <a href="https://cloud.google.com/dataprep/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataprep" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Cloud Dataprep
  </a>
  <div class="cloud-product-card__sub-headline">
    Cloud data service to explore, clean, and prepare data for analysis.
  </div>
  <a href="https://cloud.google.com/pubsub/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudPubSub" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Cloud Pub/Sub
  </a>
  <div class="cloud-product-card__sub-headline">
    Ingest event streams from anywhere, at any scale.
  </div>
</div>
<div class="cloud-grid__col is-6">
  <a href="https://cloud.google.com/composer/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudComposer" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Cloud Composer
  </a>
  <div class="cloud-product-card__sub-headline">
    A fully managed workflow orchestration service built on Apache Airflow.
  </div>
  <a href="https://cloud.google.com/data-fusion/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataFusion" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Cloud Data Fusion
  </a>
  <div class="cloud-product-card__sub-headline">
    Fully managed, code-free data integration.
  </div>
  <a href="https://cloud.google.com/data-catalog/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="dataCatalog" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Data Catalog
  </a>
  <div class="cloud-product-card__sub-headline">
    A fully managed and highly scalable data discovery and metadata
    management service.
  </div>
  <a href="https://cloud.google.com/genomics/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="genomics" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Genomics
  </a>
  <div class="cloud-product-card__sub-headline">
    Power your science with Google Genomics.
  </div>
  <a href="https://marketingplatform.google.com/about/enterprise/#?modal_active=none" target="_blank" rel="noopener" track-type="navigateTo" track-name="link" track-metadata-eventdetail="googleMarketingPlatform" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Google Marketing Platform*
  </a>
  <div class="cloud-product-card__sub-headline">
    Enterprise analytics for better customer experiences.
  </div>
  <a href="https://marketingplatform.google.com/about/data-studio/" target="_blank" rel="noopener" track-type="navigateTo" track-name="link" track-metadata-eventdetail="googleDataStudio" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Google Data Studio*
  </a>
  <div class="cloud-product-card__sub-headline">
    Tell great data stories to support better business decisions.
  </div>
  <a href="https://firebase.google.com/products/performance/" target="_blank" rel="noopener" track-type="navigateTo" track-name="link" track-metadata-eventdetail="firebasePerformanceMonitoring" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
    Firebase Performance Monitoring
  </a>
  <div class="cloud-product-card__sub-headline">
    Gain insight into your app's performance.
  </div>
</div>
</div>

I also have a Python script that takes the HTML and extracts the following elements:

a class="cloud-product-card__headline": get the [href] and the text

div class="cloud-product-card__sub-headline": get the text

Here is my code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_elem, 'html.parser')

listdt = []
for dt in soup.find_all(True, {"class": ["cloud-product-card__headline", "cloud-product-card__sub-headline"]}):
    listdt.append(dt)

for dt in listdt:
    prod_name = dt.find_next('a').text.strip()
    prod_href = dt.find_next('a')['href'] if dt.find_next('a') is not None else '----'
    prod_desc = dt.find_next('div').text.strip()
    print(prod_name + ' - ' + prod_href + ' - ' + prod_desc)

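I managed to retrieve all the results, but they come out very disorganized.

I am trying to scrape the data from https://cloud.google.com/products/ into CSV or JSON format.

【Comments】:

  • What do you mean by "disorganized"? Are the results in a different order than in the HTML?

Tags: python beautifulsoup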


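【Solution 1】:

A slightly different approach: the headlines and descriptions come in equal numbers and the structure is regular, so you can build each product's three items as a list inside a list comprehension. The title and the link both come from the elements with class cloud-product-card__headline, and the description is then next_sibling.next_sibling. You can do some string cleaning on the description before writing it out.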
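To see why the description sits two sibling hops away rather than one, here is a minimal sketch on a one-product fragment of the question's markup. It assumes the built-in html.parser and trims the tracking attributes for brevity; the variable names are illustrative only. It also shows find_next_sibling as a possible alternative that does not depend on counting whitespace nodes.

```python
from bs4 import BeautifulSoup

# A one-product fragment of the markup from the question (attributes trimmed).
html = """
<div class="cloud-grid__col is-6">
  <a href="https://cloud.google.com/bigquery/" class="cloud-product-card__headline">
    BigQuery
  </a>
  <div class="cloud-product-card__sub-headline">
    A fully managed, highly scalable data warehouse with built-in ML.
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
headline = soup.select_one(".cloud-product-card__headline")

# First hop: the whitespace text node between </a> and the next <div>.
first = headline.next_sibling
# Second hop: the description <div> itself.
second = first.next_sibling
# Alternative: find_next_sibling skips text nodes and matches on tag and class.
by_class = headline.find_next_sibling("div", class_="cloud-product-card__sub-headline")

print(repr(first))                  # a whitespace-only string
print(second.get_text(strip=True))  # the description text
print(by_class is second)           # both routes reach the same element
```

The answer's complete script, which applies the same two-hop lookup to every headline on the live page, follows.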

import csv
import re
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://cloud.google.com/products/')
soup = bs(r.content, 'lxml')

# One [title, link, description] row per headline link. The description <div> is
# two siblings on, because the first sibling is a whitespace text node.
products = [
    [i.text.strip(), i['href'], re.sub(r'\n\s+', ' ', i.next_sibling.next_sibling.text.strip())]
    for i in soup.select('.cloud-product-card__headline')
]

with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Title', 'Link', 'Description'])
    w.writerows(products)

Sample output rows: [image not available]
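The question also mentions JSON as a target format. The following is a minimal sketch of that variant, not part of the original answer. It assumes the markup shown in the question, runs on an inline two-product fragment instead of the live page, and pairs each headline with its description through find_next_sibling. The function name extract_products and the dictionary keys are hypothetical choices made for illustration.

```python
import json
import re

from bs4 import BeautifulSoup


def extract_products(html):
    """Return one dict per product headline found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for link in soup.select("a.cloud-product-card__headline"):
        # Look for the following sibling <div> that carries the description class.
        desc = link.find_next_sibling("div", class_="cloud-product-card__sub-headline")
        products.append({
            "title": link.get_text(strip=True),
            "link": link.get("href", ""),
            # Collapse line breaks and indentation into single spaces.
            "description": re.sub(r"\s+", " ", desc.get_text()).strip() if desc else "",
        })
    return products


# Two products taken from the question's markup (tracking attributes trimmed).
sample = """
<div class="cloud-grid__col is-6">
  <a href="https://cloud.google.com/bigquery/" class="cloud-product-card__headline">
    BigQuery
  </a>
  <div class="cloud-product-card__sub-headline">
    A fully managed, highly scalable data warehouse with built-in ML.
  </div>
  <a href="https://cloud.google.com/data-catalog/" class="cloud-product-card__headline">
    Data Catalog
  </a>
  <div class="cloud-product-card__sub-headline">
    A fully managed and highly scalable data discovery and metadata
    management service.
  </div>
</div>
"""

products = extract_products(sample)
print(json.dumps(products, indent=2, ensure_ascii=False))
```

To write a file instead of printing, json.dump(products, f, indent=2) on a file opened with encoding="utf-8" should work the same way. One caveat of this sketch: if a headline had no description of its own, find_next_sibling would pick up the next product's description, so it relies on the regular headline/description pairing that the answer describes.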

【Comments】:

  • Have you tried it?