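[Posted]: 2019-10-05 07:55:34
[Question]:
I'm trying to do some web scraping. I want to write a function that returns the population for each country. I'm trying to scrape the US Census Bureau website, but I can't get the right information back.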
https://www.census.gov/popclock/world/af
<div id ="basic-facts" class = "data-cell">
<div class = "data-contianer">
<div class="data-cell" style = "background-image: url.....">
<p>population</p>
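<h2 data-population="">35.8M</h2>
This is basically what the markup I'm trying to scrape looks like. What I want is the "35.8M".
I've tried several approaches, but all I can get back is the heading itself, "data-population", without any data.
Someone mentioned to me that the site may have some kind of formatting that prevents scraping. In my experience, when scraping is blocked the markup looks different: the value sits in an image, a dynamic element, or something else that is harder to scrape. Does anyone have any thoughts on this?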
# -*- coding: utf-8 -*-
# Tells python what encoding the string is stored in
# Import required libraries
import requests
from bs4 import BeautifulSoup
### Country naming issue: in the URLs on the website, countries are identified by
### a two-letter code, e.g. "au" = australia, "in" = india. To search by country
### name we need something that relates the country name to the two-letter code.
country = 'albania'
countrycodes = {'albania': 'al', 'afghanistan': 'af'}
### Writing them all out would take a long time;
### maybe it's possible to scrape these names?
# Create the URL for the requested location through string concatenation
url = 'https://www.census.gov/popclock/world/' + countrycodes[country]
# Send request to retrieve the web-page using the
# get() function from the requests library
# The page variable stores the response from the web-page
page = requests.get(url)
# Create a BeautifulSoup object with the response from the URL
# Access contents of the web-page using .content
# html.parser is used since our page is in HTML format
soup = BeautifulSoup(page.content, "html.parser")
################################################# ################ Start of the part I'm unsure about
# Locate the element on the page to be scraped
# find() returns the first matching element in the BeautifulSoup object
# 1. First method
population = soup.find(id="basic-facts", class_="data-cell")
# I tried some methods like this and got only errors
# 2. Second method
population = soup.find_all("h2", {"data-population": ""})
for i in population:
    print(i)
# This returns the headings for the table but no data
### Here we need to take out the population data
### It is listed as '<h2 data-population="">35.8M</h2>'
################################################# ################ End
# Extract text from the first matching element using .text
population = population[0].text if population else ''
# Final output
# Return the scraped info
print('The population of ' + country + ' is ' + population)
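I marked off the relevant part of the code with #######. I tried several approaches and have listed two of them.
I'm quite new to coding in general, so please forgive me if I've described this poorly. Thanks for any advice anyone can offer.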
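The country-name lookup described in the code comments could be sketched as a plain dictionary keyed by country name. This is only an illustration: `BASE_URL`, `COUNTRY_CODES` and `build_url` are hypothetical names, and the two entries are the ones that appear in the question. Whether the site's codes match any particular standard is not confirmed here.

```python
# Hypothetical names for illustration; only the two codes from the question are listed.
BASE_URL = "https://www.census.gov/popclock/world/"

# Map lower-cased country names to the two-letter codes used in the URLs.
COUNTRY_CODES = {
    "albania": "al",
    "afghanistan": "af",
}


def build_url(country_name):
    """Return the page URL for a country name; raise KeyError if it is not in the table."""
    code = COUNTRY_CODES[country_name.strip().lower()]
    return BASE_URL + code


print(build_url("Albania"))
```

Keying the dictionary by name rather than by code means the lookup goes in the direction the function needs (name in, code out), and an unknown name fails loudly instead of producing a bad URL.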
[Comments]:
- You can always start by checking the .text of what you fetched, to verify that you actually received the real page...
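The check suggested in this comment might look roughly like the sketch below. It runs on two inline HTML strings instead of the live site: `rendered` is the markup quoted in the question, while `raw` is an assumed example of what a response could look like if the number were filled in later by a script. `population_text` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def population_text(html):
    """Return the text inside <h2 data-population>, or None if no such tag exists."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs value True matches any tag that has the attribute, whatever its value.
    tag = soup.find("h2", attrs={"data-population": True})
    if tag is None:
        return None
    return tag.get_text(strip=True)


# Markup as quoted in the question (what the browser inspector shows).
rendered = '<div id="basic-facts"><p>population</p><h2 data-population="">35.8M</h2></div>'
# Assumed shape of a raw response in which the value has not been filled in.
raw = '<div id="basic-facts"><p>population</p><h2 data-population=""></h2></div>'

print(repr(population_text(rendered)))  # '35.8M'
print(repr(population_text(raw)))       # ''
```

Applied to the real page, the same helper could be called as `population_text(requests.get(url).text)`. If that returns an empty string while the browser shows a number, the tag exists in the downloaded HTML but its content does not, which would suggest the value is added after the page loads.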
Tags: python web-scraping beautifulsoup