【Question title】: Using Selenium with Python, how can I get a var from HTML when it is declared in a JS <script> element?
【Posted】: 2019-03-29 01:06:13
【Question description】:

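I want to read a var that is declared in JavaScript inside the HTML, but it has no id and is not an element. How can I get this data?

Since there is nothing to locate it by except the variable name, I don't know how to do it.

Website HTML: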

<script type="text/javascript">
var imgInfoData = 'data which i want to crawl'

</script>

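My Python code:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# set url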
HOMEPAGE = "https://land.naver.com/info/complexGallery.nhn?newComplex=Y&startImage=Y&rletNo=102235"


#open web
driver = webdriver.Firefox()
driver.wait = WebDriverWait(driver, 2)
driver.get(HOMEPAGE)

#try to get text from html
time.sleep(1)
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.XPATH, '//script["var"]'))).text

【Question discussion】:

Tags: javascript python selenium selenium-webdriver web-crawler


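【Solution 1】:

I checked the site you are scraping, and the script is already included in the HTML page itself, so I don't think you need a webdriver at all. You can use requests and BeautifulSoup instead.

Fetch the HTML with requests: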

res = requests.get(url, headers=headers, params=params)

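Then parse the HTML text with BeautifulSoup, collect the script tags, and find the one that contains var imgInfoData:

soup = BeautifulSoup(res.text, "html5lib")
scripts = soup.find_all('script', attrs={'type': 'text/javascript'})
img_info_data = None
for script in scripts:
    if "var imgInfoData" in script.text:  # this script declares imgInfoData
        # drop the declaration prefix and the trailing semicolon
        img_info_data = script.text.replace("var imgInfoData =", "").strip().rstrip(";")
        break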

Just remove

var imgInfoData =

and

;

from the text to get the string value, or use a regex to pull the JSON string out of the text.
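The regex alternative could look roughly like the sketch below. It assumes the value is a single quoted string literal with no escaped quotes inside it, as in the HTML shown in the question. SAMPLE_HTML, IMG_INFO_RE and extract_img_info_data are hypothetical names used only for illustration, and the real page's script may be laid out differently.

```python
import re

# Hypothetical sample modelled on the question's HTML, not the real page.
SAMPLE_HTML = """
<html><head>
<script type="text/javascript">
var other = 1;
var imgInfoData = 'data which i want to crawl';
</script>
</head><body></body></html>
"""

# Capture whatever sits between matching quotes after "var imgInfoData =".
# The trailing semicolon is optional. Escaped quotes inside the value are
# NOT handled by this pattern (assumption: the value contains none).
IMG_INFO_RE = re.compile(
    r"""var\s+imgInfoData\s*=\s*(['"])(.*?)\1\s*;?""",
    re.DOTALL,
)


def extract_img_info_data(html):
    """Return the string assigned to imgInfoData, or None if it is absent."""
    match = IMG_INFO_RE.search(html)
    return match.group(2) if match else None


print(extract_img_info_data(SAMPLE_HTML))
```

Because the pattern runs against the raw HTML, it works on res.text directly and does not depend on how the parser splits the script tags.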

Complete code:

import requests
from bs4 import BeautifulSoup

def getimgInfoData():
    url = "https://land.naver.com/info/complexGallery.nhn"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    params = {"newComplex":"Y",
              "startImage":"Y",
              "rletNo":"102235"}
    res = requests.get(url, headers=headers, params=params)

    soup = BeautifulSoup(res.text, "html5lib")
    scripts = soup.find_all('script', attrs={'type': 'text/javascript'})
    for script in scripts:
        if "var imgInfoData" in script.text:  # this script declares imgInfoData
            # drop the declaration prefix and the trailing semicolon
            return script.text.replace("var imgInfoData =", "").strip().rstrip(";")
    return None

print(getimgInfoData())

If needed, just convert the result of getimgInfoData() to JSON.

【Discussion】:
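A minimal sketch of that conversion, assuming the extracted text is valid JSON. The sample payload and its imgList and url keys are made up for illustration, since the actual structure of imgInfoData is not shown here. parse_img_info is a hypothetical helper name.

```python
import json


def parse_img_info(raw):
    """Decode the extracted text as JSON; return None if it is missing or invalid."""
    if raw is None:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The text was not JSON (for example, a plain quoted string).
        return None


# Hypothetical payload standing in for the result of getimgInfoData().
sample_raw = '{"imgList": [{"url": "a.jpg"}, {"url": "b.jpg"}]}'
info = parse_img_info(sample_raw)
print(info["imgList"][0]["url"])
```

Returning None on a decode error keeps the caller's check the same as for a page where the variable was not found at all.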
