Web scraping: not getting back what I want

【Question title】: Web scraping. Not getting back what I want
【Posted】: 2019-10-05 07:55:34
【Question description】:

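I'm trying to do some web scraping: I want to write a function that returns the population of each country. I'm scraping the U.S. Census Bureau site, but I can't get the right information back.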

https://www.census.gov/popclock/world/af

<div id="basic-facts" class="data-cell">
<div class="data-contianer">
   <div class="data-cell" style="background-image: url.....">
      <p>population</p>
      <h2 data-population="">35.8M</h2>

That is roughly what the markup I'm trying to scrape looks like. What I want is the "35.8M".

I have tried several approaches, but all I can get back is the heading itself, "data-population", with none of the data.

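Someone suggested to me that the site may be formatted in a way that prevents scraping. In my experience, when scraping is blocked the markup looks different: the data sits inside an image or a dynamic element, or something else that is harder to scrape. Does anyone have any thoughts on this?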

# -*- coding: utf-8 -*-

# Tells python what encoding the string is stored in
# Import required libraries
import requests
from bs4 import BeautifulSoup

### Country naming issue: in the URLs on the website, countries are identified by
### a two-letter code ("au" = australia, "in" = india). To look a country up by
### name, we need something that maps the country name to its two-letter code.

country = 'albania'
countrycode = {'albania': 'al', 'afghanistan': 'af'}
### this would take long to write
### it all out, maybe it's possible to scrape these names?
# Create url for the requested location through string concatenation
url = 'https://www.census.gov/popclock/world/' + countrycode[country]
# Send request to retrieve the web-page using the 
# get() function from the requests library
# The page variable stores the response from the web-page
page = requests.get(url)

# Create a BeautifulSoup object with the response from the URL
# Access contents of the web-page using .content
# "html.parser" is used since our page is in HTML format

soup=BeautifulSoup(page.content,"html.parser")

################################################# ################ Start of the part I'm unsure about

 # Locate element on page to be scraped
 # find() locates the element in the BeautifulSoup object

 1. First method      

 population = soup.find(id="basic-facts", class="data-cell") 
 #I tried some methods like this. got only errors

 2. Second method

 population = soup.findAll("h2", {"data-population": ""})
 for i in population:
     print(i)

 # this returns the headings for the table but no data

 ### here we need to take out the population data
 ### it is listed as "<h2 data-population = "" >35.8</h2>"

################################################# ################ End

# Extract text from the selected BeautifulSoup object using .text
population = population.text

#Final Output
#Return Scraped info

print('The population of ' + country + ' is ' + population)

I have marked off the relevant part of the code with #######. I tried several approaches and have listed two of them.

I'm very new to coding in general, so please forgive me if I haven't described this well. Thanks for any advice anyone can offer.

【Comments on the question】:

You can always start by checking the .text of what you fetched, to verify that you actually received the page you expected...
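A minimal sketch of that check, assuming the raw response does contain the h2 element but leaves it empty until JavaScript fills it in. The two HTML strings are hand-written stand-ins, not copies of the real page, and the helper names are made up for illustration. It uses only the standard library's html.parser so it runs without network access.

```python
from html.parser import HTMLParser


class PopulationHeadingParser(HTMLParser):
    """Collect the text inside every <h2> that has a data-population attribute."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and "data-population" in dict(attrs):
            self._inside = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.values[-1] += data.strip()


def population_headings(html):
    """Return the text of each matching heading ('' when the heading is empty)."""
    parser = PopulationHeadingParser()
    parser.feed(html)
    parser.close()
    return parser.values


# Stand-in for the raw response: the heading exists but holds no text yet.
raw_html = '<div id="basic-facts"><p>population</p><h2 data-population=""></h2></div>'
# Stand-in for what the browser shows after its scripts have run.
rendered_html = '<div id="basic-facts"><p>population</p><h2 data-population="">35.8M</h2></div>'

print(population_headings(raw_html))       # ['']
print(population_headings(rendered_html))  # ['35.8M']
```

Running the same helper on page.text from the question's code would show which of the two cases the real response falls into.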

Tags: python web-scraping beautifulsoup

【Solution 1】:

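The value is retrieved dynamically by an API call, which you can find in the browser's network tab. Because you are not using a browser, which would have made this call for you, you need to make the request directly yourself.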

import requests

r = requests.get('https://www.census.gov/popclock/apiData_pop.php?get=POP,MPOP0_4,MPOP5_9,MPOP10_14,MPOP15_19,MPOP20_24,MPOP25_29,MPOP30_34,MPOP35_39,MPOP40_44,MPOP45_49,MPOP50_54,MPOP55_59,MPOP60_64,MPOP65_69,MPOP70_74,MPOP75_79,MPOP80_84,MPOP85_89,MPOP90_94,MPOP95_99,MPOP100_,FPOP0_4,FPOP5_9,FPOP10_14,FPOP15_19,FPOP20_24,FPOP25_29,FPOP30_34,FPOP35_39,FPOP40_44,FPOP45_49,FPOP50_54,FPOP55_59,FPOP60_64,FPOP65_69,FPOP70_74,FPOP75_79,FPOP80_84,FPOP85_89,FPOP90_94,FPOP95_99,FPOP100_&key=&YR=2019&FIPS=af').json()

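# r[0] holds the field names and r[1] the matching values; pair them up
data = list(zip(r[0], r[1]))
# The first pair is POP; convert it to millions, rounded to one decimal place
print(round(int(data[0][1]) / 1_000_000, 1))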
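The same idea can be sketched as two small helpers, which makes it easier to loop over country codes. Several things here are assumptions rather than confirmed behaviour: that the endpoint accepts a request for POP alone instead of the full field list, that the response is always a two-row list (field names, then values), and the helper names themselves. The sample payload is hand-written with an illustrative value, so the block runs without network access.

```python
from urllib.parse import urlencode

API_URL = "https://www.census.gov/popclock/apiData_pop.php"


def build_url(fips, year, fields=("POP",)):
    """Build the query URL; asking for POP alone is an assumption."""
    query = {"get": ",".join(fields), "key": "", "YR": year, "FIPS": fips}
    # safe="," keeps the comma-separated field list readable in the URL
    return API_URL + "?" + urlencode(query, safe=",")


def population_in_millions(payload):
    """Read POP from a [field names, values] payload and scale it to millions."""
    field_names, values = payload[0], payload[1]
    row = dict(zip(field_names, values))  # look fields up by name, not position
    return round(int(row["POP"]) / 1_000_000, 1)


# Hand-written stand-in for the decoded JSON (illustrative values only)
sample_payload = [["POP", "YR", "FIPS"], ["35800000", "2019", "af"]]

print(build_url("af", 2019))
print(population_in_millions(sample_payload))  # 35.8
```

With requests installed, the live equivalent would be population_in_millions(requests.get(build_url("af", 2019)).json()). Looking POP up by name avoids relying on it being the first field in the response.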

【Comments】:

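How did you know it was retrieved dynamically from an API? And how did you know which request to pick from the network tab? I wasted 20 minutes trying to figure out why the h2 tag came back empty.
@DarkLeader Turn off JavaScript in the browser (or keep a profile with JS disabled) and compare how the page loads against a JS-enabled browser; with JS disabled, a lot of the content is missing. To find the right call, use Ctrl + F in the network tab to search for a phrase or number that you expect to appear only in the data you're interested in. See 1 and 2. Keep in mind, especially with numbers, that JS may round the numbers shown on the page, while the source holds them in a different format.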
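To illustrate the rounding caveat: a raw count in a response may not literally contain the string shown on the page, so searching the network tab for "35.8M" can find nothing. A rough sketch follows, assuming the page abbreviates counts to one decimal place with a K/M/B suffix. That format, the helper names, and the raw value are all made up for illustration.

```python
def display_form(n):
    """Abbreviate a raw count the way a page might display it (assumed format)."""
    for size, suffix in ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")):
        if n >= size:
            return f"{n / size:.1f}{suffix}"
    return str(n)


def matches_display(raw_value, shown):
    """Check whether a raw value from a response would render as the shown text."""
    return display_form(int(raw_value)) == shown


# Made-up raw value of the kind an API might return as a string
raw = "35812345"

print(display_form(int(raw)))         # 35.8M
print(matches_display(raw, "35.8M"))  # True
print("35.8M" in raw)                 # False: a literal search would miss it
```

In practice this means searching the network tab for the leading digits only (for example "35") or for a field name, rather than for the exact text displayed on the page.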