[Posted]: 2019-09-02 02:30:12
[Question]:
I have a page with the following structure:
<div class="cloud-grid margin-bottom-40">
<div class="cloud-grid__col is-6">
<a href="https://cloud.google.com/bigquery/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="bigQuery" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
BigQuery
</a>
<div class="cloud-product-card__sub-headline">
A fully managed, highly scalable data warehouse with built-in ML.
</div>
<a href="https://cloud.google.com/dataflow/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataflow" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Dataflow
</a>
<div class="cloud-product-card__sub-headline">
Real-time batch and stream data processing.
</div>
<a href="https://cloud.google.com/dataproc/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataproc" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Dataproc
</a>
<div class="cloud-product-card__sub-headline">
Managed Spark and Hadoop service.
</div>
<a href="https://cloud.google.com/datalab/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDatalab" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Datalab
</a>
<div class="cloud-product-card__sub-headline">
Explore, analyze, and visualize large datasets.
</div>
<a href="https://cloud.google.com/dataprep/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataprep" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Dataprep
</a>
<div class="cloud-product-card__sub-headline">
Cloud data service to explore, clean, and prepare data for analysis.
</div>
<a href="https://cloud.google.com/pubsub/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudPubSub" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Pub/Sub
</a>
<div class="cloud-product-card__sub-headline">
Ingest event streams from anywhere, at any scale.
</div>
</div>
<div class="cloud-grid__col is-6">
<a href="https://cloud.google.com/composer/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudComposer" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Composer
</a>
<div class="cloud-product-card__sub-headline">
A fully managed workflow orchestration service built on Apache Airflow.
</div>
<a href="https://cloud.google.com/data-fusion/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="cloudDataFusion" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Cloud Data Fusion
</a>
<div class="cloud-product-card__sub-headline">
Fully managed, code-free data integration.
</div>
<a href="https://cloud.google.com/data-catalog/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="dataCatalog" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Data Catalog
</a>
<div class="cloud-product-card__sub-headline">
A fully managed and highly scalable data discovery and metadata
management service.
</div>
<a href="https://cloud.google.com/genomics/" track-type="navigateTo" track-name="link" track-metadata-eventdetail="genomics" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Genomics
</a>
<div class="cloud-product-card__sub-headline">
Power your science with Google Genomics.
</div>
<a href="https://marketingplatform.google.com/about/enterprise/#?modal_active=none" target="_blank" rel="noopener" track-type="navigateTo" track-name="link" track-metadata-eventdetail="googleMarketingPlatform" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Google Marketing Platform*
</a>
<div class="cloud-product-card__sub-headline">
Enterprise analytics for better customer experiences.
</div>
<a href="https://marketingplatform.google.com/about/data-studio/" target="_blank" rel="noopener" track-type="navigateTo" track-name="link" track-metadata-eventdetail="googleDataStudio" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Google Data Studio*
</a>
<div class="cloud-product-card__sub-headline">
Tell great data stories to support better business decisions.
</div>
<a href="https://firebase.google.com/products/performance/" target="_blank" rel="noopener" track-type="navigateTo" track-name="link" track-metadata-eventdetail="firebasePerformanceMonitoring" track-metadata-position="body" track-metadata-section="dataAnalytics" class="cloud-product-card__headline">
Firebase Performance Monitoring
</a>
<div class="cloud-product-card__sub-headline">
Gain insight into your app's performance.
</div>
</div>
I also have a Python script that takes the HTML and extracts the following elements:
a class="cloud-product-card__headline": get the [href] and the text
div class="cloud-product-card__sub-headline": get the text
Here is my code:
from bs4 import BeautifulSoup

# html_elem holds the HTML shown above
soup = BeautifulSoup(html_elem, 'html.parser')
listdt = []
for dt in soup.find_all(True, {"class": ["cloud-product-card__headline", "cloud-product-card__sub-headline"]}):
    listdt.append(dt)
for dt in listdt:
    prod_name = dt.find_next('a').text.strip()
    prod_href = dt.find_next('a')['href'] if dt.find_next('a') is not None else '----'
    prod_desc = dt.find_next('div').text.strip()
    print(prod_name + ' - ' + prod_href + ' - ' + prod_desc)
I managed to retrieve all the results, but they come out very disorganized.
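A likely cause of the jumbled output: the loop in the question visits both the headline links and the sub-headline divs, and `find_next` searches forward from the current element without including it. Starting from a headline `<a>`, `find_next('a')` therefore returns the next product's link, so names and descriptions end up mismatched. A minimal sketch of one way to pair them, assuming each description is a sibling `<div>` that directly follows its headline link, as in the markup above. `SAMPLE_HTML` and `extract_products` are hypothetical names introduced for illustration, and the sample is a shortened excerpt of the page:

```python
from bs4 import BeautifulSoup

# Hypothetical two-product excerpt of the markup shown in the question.
SAMPLE_HTML = """
<div class="cloud-grid__col is-6">
  <a href="https://cloud.google.com/bigquery/" class="cloud-product-card__headline">
    BigQuery
  </a>
  <div class="cloud-product-card__sub-headline">
    A fully managed, highly scalable data warehouse with built-in ML.
  </div>
  <a href="https://cloud.google.com/dataflow/" class="cloud-product-card__headline">
    Cloud Dataflow
  </a>
  <div class="cloud-product-card__sub-headline">
    Real-time batch and stream data processing.
  </div>
</div>
"""


def extract_products(html):
    """Return one dict per headline link, paired with the description after it."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    # Loop over the headline links only, so each product is visited once.
    for link in soup.find_all("a", class_="cloud-product-card__headline"):
        # Assumption: the description is the next sibling <div> with this class.
        desc = link.find_next_sibling(
            "div", class_="cloud-product-card__sub-headline"
        )
        products.append({
            "name": link.get_text(strip=True),
            "href": link.get("href", "----"),
            # Collapse internal whitespace, since some descriptions wrap lines.
            "description": " ".join(desc.get_text().split()) if desc else "",
        })
    return products


products = extract_products(SAMPLE_HTML)
for p in products:
    print(p["name"], "-", p["href"], "-", p["description"])
```

If a headline had no description of its own, `find_next_sibling` would pick up the next product's description, so this sketch only holds while every link is followed by its own sub-headline div.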
I am trying to scrape the data from https://cloud.google.com/products/ into CSV or JSON format.
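For the CSV or JSON output, a minimal sketch using only the standard library, assuming the scraped rows have already been collected as a list of dicts. The `records` list, its field names, and the `to_json` / `to_csv` helpers are hypothetical, chosen for illustration:

```python
import csv
import io
import json

# Hypothetical records, one dict per product, in the shape a scraper might emit.
records = [
    {
        "name": "BigQuery",
        "href": "https://cloud.google.com/bigquery/",
        "description": "A fully managed, highly scalable data warehouse with built-in ML.",
    },
    {
        "name": "Cloud Dataflow",
        "href": "https://cloud.google.com/dataflow/",
        "description": "Real-time batch and stream data processing.",
    },
]

FIELDS = ["name", "href", "description"]


def to_json(rows):
    """Serialize the rows as a JSON array of objects."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    # Fields that contain commas are quoted automatically by the csv module.
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(records)
csv_text = to_csv(records)
print(csv_text)
```

To write files instead of strings, the same `csv.DictWriter` can be pointed at a file opened with `newline=""`, and `json.dump` can write directly to an open file.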
[Comments]:
- What do you mean by "disorganized"? Are the results in a different order than in the HTML code?
Tags: python beautifulsoup