【Question title】: Extracting HTML with Python Beautiful Soup is not working
【Posted】: 2018-09-22 11:33:58
【Question】:

I want to scrape information that is organized by state and city.

This is the Python script I am using:

import requests
import html5lib
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import pyrebase
import numpy as np
import yagmail
import time
import math
import colorama
import sys
from algoliasearch import algoliasearch

from datetime import datetime, timedelta

def getVendors():
    req = requests.Session()

    defaultlink = 'https://www.collierreporting.com/'

    driver.get(defaultlink)

    vendorsoup = BeautifulSoup(driver.page_source,"html5lib");

    statecontainer = vendorsoup.find_all("li")

    for state in statecontainer:

        stateref = state.find('a')['href']
        statename = state.find('a').contents[0]

        driver.get(stateref)
        statesoup = BeautifulSoup(driver.page_source,"html5lib");

        #GET CITIES
        citycontainer = statesoup.find_all("p")

        for city in citycontainer:
            cityref = city.find('a')['href']
            cityname = city.find('a')

            print( cityref, cityname)

        print(statename)

    print('Get vendors')

getVendors()

I am able to scrape the states from this HTML:

         <div class="content">

        <div class="column_1">
            <ul>
                <li><a href="https://www.collierreporting.com/state/al">Alabama</a></li>
                <li><a href="https://www.collierreporting.com/state/ak">Alaska</a></li>
                <li><a href="https://www.collierreporting.com/state/az">Arizona</a></li>
                <li><a href="https://www.collierreporting.com/state/ak">Arkansas</a></li>
                <li><a href="https://www.collierreporting.com/state/ca">California</a></li>
              
            </ul>
        </div>

    </div>

But when I try to scrape the cities from this HTML, it does not work:

<div class="content">

<div class="column_1">
    <ul>
        <div style="margin-left: 20px;"><span style="font-style: italic;">Select a city to view dossiers.</span>
            <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/alabaster-al">Alabaster</a></p>
            <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/alexander-city-al">Alexander City</a></p>
            <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/alexandria-al">Alexandria</a></p>
            <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/aliceville-al">Aliceville</a></p>
            <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/andalusia-al">Andalusia</a></p>
            <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/anniston-al">Anniston</a></p>
            <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/arab-al">Arab</a></p>
            <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/ardmore-al">Ardmore</a></p>
            <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/ashford-al">Ashford</a></p>
 
        </div>
    </ul>
</div>
</div>

This is the error I get, and I don't know why:

Traceback (most recent call last):
  File "vendors.py", line 120, in <module>
    getVendors()
  File "vendors.py", line 101, in getVendors
    cityref = city.find('a')['href']
TypeError: 'NoneType' object is not subscriptable

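I don't know why this isn't working. I have tried several variations of getting the href and the city name, but all I get is the same "object is not subscriptable" error.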
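
The error can be reproduced in isolation. The sketch below assumes that the state page contains at least one <p> element with no link inside it; the sample markup is hypothetical and only modeled on the snippet above, and "html.parser" is used instead of "html5lib" to keep the example dependency-light. In that situation find('a') returns None, and subscripting None raises the same TypeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two city paragraphs plus one <p> without a link.
html = """
<div>
  <p><a href="https://www.collierreporting.com/city/alabaster-al">Alabaster</a></p>
  <p><a href="https://www.collierreporting.com/city/arab-al">Arab</a></p>
  <p>Paragraph with no link</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for p in soup.find_all("p"):
    link = p.find("a")  # None when the <p> holds no <a>
    try:
        results.append(link["href"])
    except TypeError as exc:
        # Same failure as in the traceback: None cannot be subscripted.
        results.append(f"TypeError: {exc}")

for r in results:
    print(r)
```

If this is what happens on the real page, the first city links print normally and the loop only fails once it reaches a paragraph without an anchor.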

【Comments】:

    Tags: javascript python html


    【Solution 1】:

    I changed the city container to find all <a> tags instead, and was able to get the content as follows:

    citycontainer = statesoup.find_all("a")

    for city in citycontainer:
        cityref = city['href']
        cityname = city.contents[0]
    

    I don't know why this makes a difference, but it works.
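
    One plausible explanation, offered as an assumption since the full page is not shown: find_all("p") can also return paragraphs that contain no link, and for those find('a') gives None, whereas find_all("a") only ever returns anchor tags. The trade-off is that find_all("a") picks up every link on the page, not just the cities. A minimal sketch of a narrower alternative, using a CSS selector on hypothetical sample markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a navigation link, two city paragraphs, one empty <p>.
html = """
<div class="content">
  <a href="https://www.collierreporting.com/">Home</a>
  <p><a href="https://www.collierreporting.com/city/alabaster-al">Alabaster</a></p>
  <p>No link here</p>
  <p><a href="https://www.collierreporting.com/city/arab-al">Arab</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "p > a[href]" matches only <a> tags that carry an href and sit
# directly inside a <p>, so link-less paragraphs and the navigation
# link are skipped without any None check.
cities = [(a.get_text(strip=True), a["href"]) for a in soup.select("p > a[href]")]

for name, href in cities:
    print(name, href)
```

    In the original script the same selector could be applied to statesoup; whether "p > a[href]" is specific enough for the real page is untested here.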

    【Discussion】:
