【Question title】:Scrape tables with Python
【Posted】:2017-01-31 22:20:57
【Question】:

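I am trying to scrape tables and convert them into data tables in Python, but I am having no luck with the US election data. Here is the HTML of the data I want to scrape.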

<tr class="type-republican">
<th class="results-name" scope="row"><a href="xxxxx"><span class="name-combo"><span    class="token token-party"><abbr title="Republican">R</abbr></span> <span    class="token token-winner"><b aria-hidden="true" class="icon icon-check"></b>   <span class="icon-text">Winner</span></span> D. Trump</span></a></th>
<td class="results-percentage"><span class="percentage-combo"><span  class="number">62.9%</span><span class="graph"><span class="bar"><span class="index" style="width:62.9%;"></span></span></span></span></td>
<td class="results-popular">1,306,925</td>
<td class="delegates-cell">9</td>
</tr>
<tr class="type-democrat">
<th class="results-name" scope="row"><a href="xxxxxx"><span class="name-combo"><span   class="token token-party"><abbr title="Democratic">D</abbr></span> H.   Clinton</span></a></th>
<td class="results-percentage"><span class="percentage-combo"><span class="number">34.6%</span><span class="graph"><span class="bar"><span class="index" style="width:34.6%;"></span></span></span></span></td>
<td class="results-popular">718,084</td>
<td class="delegates-cell"></td>
</tr>
<tr class="type-independent">
<th class="results-name" scope="row"><span class="name-combo"><span class="token token-party"><abbr title="Independent">I</abbr></span> G. Johnson</span></th>
<td class="results-percentage"><span class="percentage-combo"><span class="number">2.1%</span><span class="graph"><span class="bar"><span class="index" style="width:2.1%;"></span></span></span></span></td>
<td class="results-popular">43,869</td>
<td class="delegates-cell"></td>
</tr>
<tr class="type-independent">
<th class="results-name" scope="row"><span class="name-combo"><span class="token token-party"><abbr title="Independent">I</abbr></span> J. Stein</span></th>
<td class="results-percentage"><span class="percentage-combo"><span class="number">0.4%</span><span class="graph"><span class="bar"><span class="index" style="width:0.4%;"></span></span></span></span></td>
<td class="results-popular">9,287</td>
<td class="delegates-cell"></td>
</tr>
</tbody>
</table>, <table class="results-table">
<tbody>
<tr class="type-republican">
<th class="results-name" scope="row"><a href="xxxxx"><span class="name-combo"><span class="token token-party"><abbr title="Republican">R</abbr></span> D. Trump</span></a></th>
<td class="results-percentage"><span class="percentage-combo"><span class="number">73.4%</span><span class="graph"><span class="bar"><span class="index" style="width:73.4%;"></span></span></span></span></td>
<td class="results-popular">18,110</td>
</tr>
<tr class="type-democrat">
<th class="results-name" scope="row"><a href="xxxxxx"><span class="name-combo"><span class="token token-party"><abbr title="Democratic">D</abbr></span> H. Clinton</span></a></th>
<td class="results-percentage"><span class="percentage-combo"><span class="number">24.0%</span><span class="graph"><span class="bar"><span class="index" style="width:24.0%;"></span></span></span></span></td>
<td class="results-popular">5,908</td>
</tr>
<tr class="type-independent">
<th class="results-name" scope="row"><span class="name-combo"><span class="token token-party"><abbr title="Independent">I</abbr></span> G. Johnson</span></th>
<td class="results-percentage"><span class="percentage-combo"><span class="number">2.2%</span><span class="graph"><span class="bar"><span class="index" style="width:2.2%;"></span></span></span></span></td>
<td class="results-popular">538</td>
</tr>
<tr class="type-independent">
<th class="results-name" scope="row"><span class="name-combo"><span class="token token-party"><abbr title="Independent">I</abbr></span> J. Stein</span></th>
<td class="results-percentage"><span class="percentage-combo"><span class="number">0.4%</span><span class="graph"><span class="bar"><span class="index" style="width:0.4%;"></span></span></span></span></td>
<td class="results-popular">105</td>
</tr>
</tbody>

And so on. My code looks like this:

import requests
from bs4 import BeautifulSoup

Percentage = []
Count = []
page = requests.get('xxxx')
soup = BeautifulSoup(page.text, "lxml")
table = soup.find('div', class_='content-alpha')
for row in table.find_all('tr'):
    col = row.find_all('td')
    # Append to the lists instead of overwriting them on every row
    Percentage.append(col[0].get_text(strip=True))
    Count.append(col[1].get_text(strip=True))
    print(Count[-1])

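But this only gives me the information from a few of the tables, not all of them. How can I get the information from all the tables, and why do I only get it from a few?

I hope the question is clear.

The HTML is really large, so I have added a link to the website (http://www.politico.com/2016-election/results/map/president/alabama/). I want to scrape the 2016 US election data for every county in Alabama.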

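【Comments】:

  • The class "content-alpha" does not exist in the data you posted. Can you update the question with the data you want to scrape and the expected result?
  • It would be easier to help if you provided the URL you want to scrape.
  • I have added a link to the website.

Tags: python web-scraping beautifulsoup html-table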


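【Solution 1】:

After a while I managed to scrape all the data from this site. The main problem is that the page content is rendered by JavaScript, so it cannot be scraped with BeautifulSoup alone. Instead I used selenium + beautifulsoup4: let the browser render the page, take the resulting HTML, and scrape that.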

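from selenium import webdriver
import time
from bs4 import BeautifulSoup

chrome_path = r"C:\Users\Desktop\chromedriver_win32\chromedriver.exe"
# Selenium 3 style call; Selenium 4+ takes service=Service(chrome_path),
# with Service imported from selenium.webdriver.chrome.service
driver = webdriver.Chrome(chrome_path)
driver.get('http://www.politico.com/2016-election/primary/results/map/president/arizona/')
time.sleep(80)  # give the JavaScript time to render the tables
# Scroll to the bottom so that content loaded on scroll is triggered
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'html.parser')
for posts in soup.find_all('table', {'class': 'results-table'}):
    for tr in posts.find_all('tr'):
        # All non-empty text fragments in the row, in document order
        popular = list(tr.stripped_strings)
        print(popular)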

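Because it is a dynamic web page, I needed to simulate some user actions with Selenium, such as scrolling down the page. I used a long time.sleep() (80 seconds in the code above) so that the page could finish loading, because it loads very slowly. Hope this helps someone.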
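
A fixed sleep is either longer than needed or too short on a slow connection. One alternative is to poll the rendered HTML until the number of result tables stops growing. The sketch below is a hypothetical helper, not part of Selenium: `wait_for_stable_count`, its parameters, and the `class="results-table"` marker string are assumptions for illustration. It accepts any zero-argument callable that returns HTML, so with the driver above it could be called as `wait_for_stable_count(lambda: driver.page_source, 'class="results-table"')`.

```python
import time


def wait_for_stable_count(get_html, marker, timeout=80.0, poll=1.0, settle=3):
    """Poll get_html() until the number of `marker` occurrences stops changing.

    Returns the last HTML string seen once the count has been non-zero and
    unchanged for `settle` consecutive polls. Raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    last_count = -1
    stable_polls = 0
    while time.monotonic() < deadline:
        html = get_html()
        count = html.count(marker)
        if count > 0 and count == last_count:
            stable_polls += 1
            if stable_polls >= settle:
                return html
        else:
            # The count changed (or is still zero), so start counting again
            stable_polls = 0
        last_count = count
        time.sleep(poll)
    raise TimeoutError(f"{marker!r} count did not stabilise within {timeout}s")
```

Selenium also ships its own explicit-wait mechanism, `WebDriverWait` in `selenium.webdriver.support.ui`, which may be a better fit when waiting for one specific element rather than for a whole list of tables.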

【Comments】:

    【Solution 2】:
    import requests
    import bs4
    
    r = requests.get('http://www.politico.com/2016-election/results/map/president/alabama/')
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    contents = soup.find(class_='contrast-white')
    # Each results-group holds a title (the county name) and its results table
    for table in contents.find_all(class_='results-group'):
        title = table.find(class_='title').text
        for tr in table.find_all('tr'):
            # Row strings: party abbreviation, candidate name, percentage, popular vote
            _, name, percentage, popular = list(tr.stripped_strings)
            print(title, name, percentage, popular)
    

    Output:

    Autauga County D. Trump 73.4% 18,110
    Autauga County H. Clinton 24.0% 5,908
    Autauga County G. Johnson 2.2% 538
    Autauga County J. Stein 0.4% 105
    Baldwin County D. Trump 77.4% 72,780
    Baldwin County H. Clinton 19.6% 18,409
    Baldwin County G. Johnson 2.6% 2,448
    Baldwin County J. Stein 0.5% 453
    Barbour County D. Trump 52.3% 5,431
    Barbour County H. Clinton 46.7% 4,848
    Barbour County G. Johnson 0.9% 93
    Barbour County J. Stein 0.2% 18
    Bibb County D. Trump 77.0% 6,733
    Bibb County H. Clinton 21.4% 1,874
    Bibb County G. Johnson 1.4% 124
    Bibb County J. Stein 0.2% 17
    Blount County D. Trump 89.9% 22,808
    Blount County H. Clinton 8.5% 2,150
    Blount County G. Johnson 1.3% 337
    Blount County J. Stein 0.4% 89
    Bullock County H. Clinton 75.1% 3,530
    Bullock County D. Trump 24.2% 1,139
    Bullock County G. Johnson 0.5% 22
    Bullock County J. Stein 0.2% 10
    Butler County D. Trump 56.3% 4,891
    Butler County H. Clinton 42.8% 3,716
    Butler County G. Johnson 0.7% 65
    Butler County J. Stein 0.1% 13
    Calhoun County D. Trump 69.2% 32,803
    Calhoun County H. Clinton 27.9% 13,197
    Calhoun County G. Johnson 2.4% 1,114
    Calhoun County J. Stein 0.6% 262
    Chambers County D. Trump 56.6% 7,803
    Chambers County H. Clinton 41.8% 5,763
    Chambers County G. Johnson 1.2% 168
    Chambers County J. Stein 0.3% 44
    Cherokee County D. Trump 83.9% 8,809
    Cherokee County H. Clinton 14.5% 1,524
    Cherokee County G. Johnson 1.4% 145
    Cherokee County J. Stein 0.2% 25
    

    The remaining tables are empty; there is nothing in them.
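
The values printed above are still strings such as `73.4%` and `18,110`. To get closer to the data table the question asks for, they can be converted to numbers and written out as CSV. This is a minimal sketch using only the standard library; the function names and column headers are illustrative choices, and the two sample rows are copied from the output above.

```python
import csv
import io


def parse_percentage(text):
    """Convert a string like '73.4%' to the float 73.4."""
    return float(text.strip().rstrip("%"))


def parse_count(text):
    """Convert a string like '18,110' to the int 18110."""
    return int(text.strip().replace(",", ""))


def rows_to_csv(rows):
    """Render (county, candidate, percentage, votes) string tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["county", "candidate", "percentage", "votes"])
    for county, candidate, percentage, votes in rows:
        writer.writerow(
            [county, candidate, parse_percentage(percentage), parse_count(votes)]
        )
    return buffer.getvalue()


# Two rows copied from the printed output
sample = [
    ("Autauga County", "D. Trump", "73.4%", "18,110"),
    ("Autauga County", "H. Clinton", "24.0%", "5,908"),
]
print(rows_to_csv(sample))
```

In the scraping loop, the `print(title, name, percentage, popular)` call could instead append the same four values to a list that is passed to `rows_to_csv` at the end.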

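    【Comments】:

    • Thanks for your answer. I'm new to Python, so I have the same question: why does it only scrape part of the page, up to Cherokee County?
    • So there is no way to scrape the rest of the counties?
    • @Extria Because that information is not in the page; we can't create it out of nothing.
    • But if the data is displayed on the website, shouldn't there be a way to scrape it?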
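
One way to check the point made in these comments is to count the rows in each results table of the HTML that `requests` receives. A table with zero rows is a placeholder that the browser presumably fills in later with JavaScript, which would explain why the output stops at Cherokee County. The sketch below uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs; `ResultsTableCounter` and `count_rows_per_table` are hypothetical names, and the sample HTML is made up to mirror the `results-table` markup in the question.

```python
from html.parser import HTMLParser


class ResultsTableCounter(HTMLParser):
    """Count the <tr> rows inside each <table class="results-table">."""

    def __init__(self):
        super().__init__()
        self.rows_per_table = []          # one entry per results table, in page order
        self._in_results_table = False    # True while inside a results table

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "results-table" in classes:
            self.rows_per_table.append(0)
            self._in_results_table = True
        elif tag == "tr" and self._in_results_table:
            self.rows_per_table[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_results_table = False


def count_rows_per_table(html):
    """Return a list with the number of rows found in each results table."""
    parser = ResultsTableCounter()
    parser.feed(html)
    parser.close()
    return parser.rows_per_table


# Made-up sample: one filled table followed by one empty placeholder
sample_html = (
    '<table class="results-table"><tbody>'
    '<tr><td>73.4%</td></tr><tr><td>24.0%</td></tr>'
    '</tbody></table>'
    '<table class="results-table"><tbody></tbody></table>'
)
print(count_rows_per_table(sample_html))
```

Running it first on `r.text` from Solution 2 and then on `driver.page_source` from Solution 1 should show whether the later tables only gain rows after the page has been rendered.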