【Question title】: Scraping data from a site that has no form tag but a text input using python
【Posted】: 2017-05-07 00:52:36
【Question】:

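I'm working on a Python program to scrape data from here. I have done this kind of thing successfully before, but this one is proving to be a challenge. I'm using Beautiful Soup and mechanize. I need to be able to enter a zip code in the text box to generate the results I want.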

Here is the snippet that contains the input text box:

<div id="ContentPlaceHolder1_C001_pnlFindACenter" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ContentPlaceHolder1_C001_btnSearchClient')">
		
        <div style="width: 400px; float: left; padding-top: 5px;">
            <label for="ContentPlaceHolder1_C001_tbUserAddress" style="font-family: Arial; font-size: 13.3333px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-decoration: none; text-transform: none; color: rgb(0, 0, 0); cursor: auto; display: inline-block; position: relative; z-index: 100; margin-right: -121px; left: 2px; top: 0px; opacity: 1;">Address, City or Zip:</label><input name="ctl00$ContentPlaceHolder1$C001$tbUserAddress" type="text" id="ContentPlaceHolder1_C001_tbUserAddress" class="textInField" style="width: 240px; background-image: url(&quot;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAASCAYAAABSO15qAAAAAXNSR0IArs4c6QAAAPhJREFUOBHlU70KgzAQPlMhEvoQTg6OPoOjT+JWOnRqkUKHgqWP4OQbOPokTk6OTkVULNSLVc62oJmbIdzd95NcuGjX2/3YVI/Ts+t0WLE2ut5xsQ0O+90F6UxFjAI8qNcEGONia08e6MNONYwCS7EQAizLmtGUDEzTBNd1fxsYhjEBnHPQNG3KKTYV34F8ec/zwHEciOMYyrIE3/ehKAqIoggo9inGXKmFXwbyBkmSQJqmUNe15IRhCG3byphitm1/eUzDM4qR0TTNjEixGdAnSi3keS5vSk2UDKqqgizLqB4YzvassiKhGtZ/jDMtLOnHz7TE+yf8BaDZXA509yeBAAAAAElFTkSuQmCC&quot;); background-repeat: no-repeat; background-attachment: scroll; background-size: 16px 18px; background-position: 98% 50%; cursor: auto;" data-hasqtip="21" oldtitle="Address, City or Zip:" title="" autocomplete="off" aria-describedby="qtip-21">
            <div id="divDistance" style="display: inline;">
                &nbsp;&nbsp;within&nbsp;&nbsp;
            <select name="ctl00$ContentPlaceHolder1$C001$ddlRadius" id="ContentPlaceHolder1_C001_ddlRadius">
			<option value="5">5</option>
			<option value="10">10</option>
			<option selected="selected" value="25">25</option>
			<option value="50">50</option>
			<option value="100">100</option>

		</select>
                miles
            </div>
        </div>
        <div style="width: 160px; float: left;">
            &nbsp;&nbsp;&nbsp;
            <input type="submit" name="ctl00$ContentPlaceHolder1$C001$btnSearchClient" value="Search" onclick="GeocodeLocation();return false;WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$ContentPlaceHolder1$C001$btnSearchClient&quot;, &quot;&quot;, false, &quot;&quot;, &quot;find-a-center&quot;, false, false))" id="ContentPlaceHolder1_C001_btnSearchClient" class="btnCenter">
        </div>
        <div style="clear: both;">
        </div>
        <div>
            <span onchange="" style="font-size:12px;display: inline;" data-hasqtip="22" oldtitle="<b>AASM SleepTM</b> is an innovative telemedicine system that brings your sleep doctor to you. Featuring a secure, web-based video platform, AASM SleepTM allows you to meet with your sleep doctor from a distance. These live video visits will save you time and money. AASM SleepTM also syncs with Fitbit sleep data and has an integrated sleep diary, enabling you and your doctor to monitor your sleep." title="" aria-describedby="qtip-22"><input id="ContentPlaceHolder1_C001_chkSleepTM" type="checkbox" name="ctl00$ContentPlaceHolder1$C001$chkSleepTM"><label for="ContentPlaceHolder1_C001_chkSleepTM">Only show AASM SleepTM capable sleep centers in my state</label></span>
            <a href="https://sleeptm.com/" style="font-size: 10px; margin-left: 10px; display: inline;" target="_blank" data-hasqtip="23" oldtitle="<b>AASM SleepTM</b> is an innovative telemedicine system that brings your sleep doctor to you. Featuring a secure, web-based video platform, AASM SleepTM allows you to meet with your sleep doctor from a distance. These live video visits will save you time and money. AASM SleepTM also syncs with Fitbit sleep data and has an integrated sleep diary, enabling you and your doctor to monitor your sleep." title="" aria-describedby="qtip-23">What is AASM SleepTM?</a>
        </div>
    
	</div>

These are my attempts so far:

url = 'http://www.sleepeducation.org/find-a-facility'
MILES = '100'
CODE = '33060'

Attempt 1

first = urllib2.Request(url,
                   data=urllib.urlencode({'value': CODE}),
                   headers={'User-Agent': 'Google Chrome',
                            'Cookie': 'name = ctl00$ContentPlaceHolder1$C001$tbUserAddress'})

Attempt 2

post_params = {
       'ctl00$ContentPlaceHolder1$C001$tbUserAddress': CODE
}
first = urllib.urlencode(post_params)

driver = webdriver.Chrome()
driver.get(url)
sbox = driver.find_element_by_class_name("ctl00$ContentPlaceHolder1$C001$tbUserAddress")
sbox.send_keys(CODE)
driver.find_element_by_class_name("ctl00$ContentPlaceHolder1$C001$btnSearchClient").click()

Attempt 3

br = mechanize.Browser()
br.open(url)
br.select_form(name='ctl00$ContentPlaceHolder1$C001$tbUserAddress')
br['value'] = CODE
br.submit()

http = urllib2.urlopen(br.response())
soup = BeautifulSoup(http, "html5lib")

Error = "no form matching name 'ctl00$ContentPlaceHolder1$C001$tbUserAddress'"


Attempt 4

soup.find('input', {'name': 'ctl00$ContentPlaceHolder1$C001$tbUserAddress'})['value'] = CODE
soup.find('input', {'name': 'ctl00$ContentPlaceHolder1$C001$btnSearchClient'}).click()

【Question comments】:

    Tags: python html beautifulsoup mechanize


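    【Solution 1】:

    If I understand your question correctly, you want to send a request with specific parameters and inspect the response. OK, let's look at the request that is sent on submit. Let's open Postman and look at the POST request params.

    We can see that ctl00$ContentPlaceHolder1$C001$tbUserAddress gets the value 100, while ctl00$ContentPlaceHolder1$T6B6681F0008$ddlRadius, ctl00$ContentPlaceHolder1$C001$ddlRadius and ctl00$cphTopBar$T917BC451013$rblRadius get the radius value 25.

    So let's put together a snippet that sends a POST request with this data and gets the desired response.

    I use python requests to send the request and lxml to parse the HTML response.

    I prefer lxml; it is harder to understand, but much faster than BeautifulSoup.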

    import requests
    from lxml import html
    
    input_data = {
        'ctl00$cphTopBar$T917BC451013$rblRadius': 25,
        'ctl00$ContentPlaceHolder1$T6B6681F0008$ddlRadius': 25,
        'ctl00$ContentPlaceHolder1$C001$ddlRadius': 25,
        'ctl00$ContentPlaceHolder1$C001$tbUserAddress': 100
    }
    resp = requests.post('http://www.sleepeducation.org/find-a-facility', data=input_data)
    tree = html.fromstring(resp.text)
    print(tree.xpath('//div[@id="ContentPlaceHolder1_C001_map_canvas"]')[0])
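
The hard-coded dictionary above can be wrapped in a small helper so that the address and radius become parameters. This is only a sketch in Python 3: the field names are copied from the captured request, `build_search_payload` and the constant names are hypothetical, and the list of allowed radii is an assumption taken from the `<select>` options shown in the question (it may not apply to all three radius fields).

```python
# Field names copied from the captured POST request in this answer.
ADDRESS_FIELD = 'ctl00$ContentPlaceHolder1$C001$tbUserAddress'
RADIUS_FIELDS = (
    'ctl00$cphTopBar$T917BC451013$rblRadius',
    'ctl00$ContentPlaceHolder1$T6B6681F0008$ddlRadius',
    'ctl00$ContentPlaceHolder1$C001$ddlRadius',
)
# Assumption: the radii offered by the <select> element in the question.
ALLOWED_RADII = (5, 10, 25, 50, 100)


def build_search_payload(address, radius=25):
    """Return the form data for one search (hypothetical helper)."""
    radius = int(radius)
    if radius not in ALLOWED_RADII:
        raise ValueError('radius must be one of %r' % (ALLOWED_RADII,))
    # Every radius field receives the same value, as in the captured request.
    payload = {name: radius for name in RADIUS_FIELDS}
    payload[ADDRESS_FIELD] = str(address)
    return payload


# Values taken from the question: CODE = '33060', MILES = '100'.
payload = build_search_payload('33060', '100')
```

The resulting dictionary can be passed as `data=payload` to `requests.post`, in place of `input_data`.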
    

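    I don't have enough reputation to post links to the documentation. I'll try to put them in the comments, or you can google "python requests" and "python lxml". You can also do this with BeautifulSoup: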

    import requests
    # bs4 (BeautifulSoup 4) replaces the older BeautifulSoup 3 module
    from bs4 import BeautifulSoup

    input_data = {
        'ctl00$cphTopBar$T917BC451013$rblRadius': 25,
        'ctl00$ContentPlaceHolder1$T6B6681F0008$ddlRadius': 25,
        'ctl00$ContentPlaceHolder1$C001$ddlRadius': 25,
        'ctl00$ContentPlaceHolder1$C001$tbUserAddress': 100
    }
    resp = requests.post('http://www.sleepeducation.org/find-a-facility', data=input_data)
    soup = BeautifulSoup(resp.text, "html.parser")
    # Print the element that holds the results map
    print(soup.find("div", {"id": "ContentPlaceHolder1_C001_map_canvas"}))
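
The markup in the question calls `WebForm_DoPostBackWithOptions`, which suggests an ASP.NET WebForms page. Such pages usually carry hidden inputs such as `__VIEWSTATE` alongside the visible fields. Whether this particular site requires them is an assumption. If the plain POST returns no results, one thing to try is to GET the page first, collect its hidden inputs, and send them back with the search fields. A minimal Python 3 sketch using only the standard library; `HiddenInputCollector`, `collect_hidden_fields` and `SAMPLE_HTML` are hypothetical names and data, not taken from the site:

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get('type') or '').lower() == 'hidden'
        if is_hidden and attributes.get('name'):
            self.fields[attributes['name']] = attributes.get('value') or ''


def collect_hidden_fields(page_html):
    """Return a dict of all hidden form fields found in page_html."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    return collector.fields


# Made-up stand-in for the HTML that a GET of the search page would return.
SAMPLE_HTML = """
<form method="post" action="./find-a-facility">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$ContentPlaceHolder1$C001$tbUserAddress" />
</form>
"""

hidden = collect_hidden_fields(SAMPLE_HTML)
payload = dict(hidden)  # start from the hidden fields ...
payload['ctl00$ContentPlaceHolder1$C001$tbUserAddress'] = '33060'  # ... then add the search field
```

With a real response, `collect_hidden_fields(resp.text)` would take the place of the sample string, and the merged `payload` would be posted as before.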
    

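    【Comments】:

    • Got this error: 'module' object has no attribute 'fromstring'
    • I can't seem to install lxml
    • @KelvinNjeri lxml install: try this installation tutorial. Are you sure you are using html.fromstring? Your error looks like what you would get from calling lxml.fromstring. In this snippet I used lxml==3.5.0 and requests==2.11.1
    【Solution 2】: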

    This worked for me:

    from time import sleep

    from bs4 import BeautifulSoup
    from selenium import webdriver

    url = 'http://www.sleepeducation.org/find-a-facility'
    CODE = '33060'  # zip code from the question
    subButton = 'ContentPlaceHolder1_C001_btnSearchClient'
    addyName = 'ctl00$ContentPlaceHolder1$C001$tbUserAddress'
    addyId = 'ContentPlaceHolder1_C001_tbUserAddress'

    def usingChromeSelenium():
        # Raw string so the backslashes in the Windows path are not read as escapes
        driver = webdriver.Chrome(r'C:\Users\documents\chromedriver.exe')
        driver.get(url)
        sleep(1)  # give the page time to load
        driver.find_element_by_name(addyName).send_keys(CODE)
        driver.find_element_by_id(subButton).click()
        sleep(1)  # give the results time to render
        html = driver.page_source
        return html

    results = usingChromeSelenium()
    soup = BeautifulSoup(results, "html.parser")
    

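    For webdriver.Chrome() you have to download the chromedriver.exe application file and pass the path to that file in the parentheses. It may also work for you without a path.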
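
The fixed `sleep(1)` calls assume that the page and the results are ready within one second, which may not hold on a slow connection. One hedged alternative is to poll for a condition until a timeout expires. The sketch below is Python 3 and standard library only; `wait_until` is a hypothetical helper, not part of Selenium (Selenium ships its own `WebDriverWait` class for this purpose).

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(interval)


# Hypothetical usage with the driver from the snippet above:
# wait_until(lambda: 'ContentPlaceHolder1_C001_map_canvas' in driver.page_source)
```

Compared with a fixed sleep, this returns as soon as the condition holds and fails loudly when it never does.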

    【Comments】:
