【Question title】: Scrape variable inside CData with BeautifulSoup
【Posted】: 2018-03-25 02:53:39
【Question description】:

I have a web page that contains the following data inside a CData section, and I want to scrape it.

<script type="text/javascript">//<![CDATA[ 

car.app =


{"lat":26.175625,"lon":-80.13808,"zoom":"13","yellow":"\/img\/icons\/yellow.png","cars":[{"CAR_ID":"715383","ID":"538070521","UID":"0","CARNAME":"MAZDA","TYPE_COLOR":"0","LAT":"26.13437","LON":"-80.11906","COURSE":"100","SPEED":"0","LENGTH":"12","STATE":"OH"}] 

... 
... 
//]]></script>

I want to get the car.app variable inside the CData, but I'm not sure how to parse it in Python.

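import bs4 as bs
import urllib.request

# `url` is the address of the page to scrape (not given in the question).
# FancyURLopener is deprecated, so the browser-like User-Agent is sent
# through a Request object instead.
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
response = urllib.request.urlopen(request)

c = response.read()
soup = bs.BeautifulSoup(c, "html.parser")
print(soup)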

【Question discussion】:

    Tags: python beautifulsoup cdata


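    【Solution 1】:

    I think the only way to solve your problem is to parse that specific tag with BeautifulSoup and then do some string manipulation to pull out the value you need.

    Code: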

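    import bs4 as bs

    # Raw string, so the "\/" sequences in the sample are kept as-is
    # instead of being treated as (invalid) escape sequences.
    c = r'''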
    <script type="text/javascript">//<![CDATA[ 
    
    car.app =
    
    
    {"lat":26.175625,"lon":-80.13808,"zoom":"13","yellow":"\/img\/icons\/yellow.png","cars":[{"CAR_ID":"715383","ID":"538070521","UID":"0","CARNAME":"MAZDA","TYPE_COLOR":"0","LAT":"26.13437","LON":"-80.11906","COURSE":"100","SPEED":"0","LENGTH":"12","STATE":"OH"}] 
    
    ... 
    ... 
    //]]></script>
    '''
    soup = bs.BeautifulSoup(c, "html.parser")
    script = soup.find('script')
    # Keep the text between 'car.app =' and the first '...', then drop the newlines.
    print(script.text.split('car.app =')[1].split('...')[0].replace('\n', ''))
    

    Output:

    {"lat":26.175625,"lon":-80.13808,"zoom":"13","yellow":"\/img\/icons\/yellow.png","cars":[{"CAR_ID":"715383","ID":"538070521","UID":"0","CARNAME":"MAZDA","TYPE_COLOR":"0","LAT":"26.13437","LON":"-80.11906","COURSE":"100","SPEED":"0","LENGTH":"12","STATE":"OH"}] 
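
If a Python object is needed rather than a string, the same idea can be taken one step further. The following is only a sketch under some assumptions: the helper name `extract_car_app` and the `SAMPLE_HTML` constant are made up for illustration, and the sample closes the JSON object with a `}` that is not visible in the truncated page from the question. It looks for the `<script>` that contains `car.app =` and lets `json.JSONDecoder.raw_decode` read exactly one JSON value from there, so whatever follows the object is ignored.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, shortened sample. The closing "}" of the object is an
# assumption, because the page shown in the question is truncated.
SAMPLE_HTML = r'''
<script type="text/javascript">//<![CDATA[

car.app =


{"lat":26.175625,"lon":-80.13808,"zoom":"13","yellow":"\/img\/icons\/yellow.png","cars":[{"CAR_ID":"715383","CARNAME":"MAZDA","STATE":"OH"}]}

//]]></script>
'''


def extract_car_app(html, marker="car.app ="):
    """Return the JSON object assigned to `marker`, or None if it is not found."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        text = script.string or ""
        pos = text.find(marker)
        if pos == -1:
            continue  # this script does not assign car.app
        start = text.find("{", pos + len(marker))
        if start == -1:
            continue  # marker present, but no object after it
        # raw_decode parses one JSON value starting at `start`
        # and ignores any text that comes after it.
        data, _end = json.JSONDecoder().raw_decode(text, start)
        return data
    return None


data = extract_car_app(SAMPLE_HTML)
print(data["cars"][0]["CARNAME"])  # MAZDA
print(data["yellow"])              # /img/icons/yellow.png
```

Decoding the value as JSON also turns the `\/` escapes into plain `/` characters, which the split-based version leaves untouched. If the real page does not contain a complete JSON object after `car.app =`, `raw_decode` raises `json.JSONDecodeError`, so the string approach above may still be the fallback.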
    

    【Discussion】:

    • Yes! I was thinking of the same approach but couldn't work out how to write it. Thanks @Ali!
    • No problem, Centurion :)