【Question title】: Extract from webpage using bs4 and Python
【Posted】: 2017-10-10 09:15:55
【Question】:

How can I extract the number 1 from "Current stream number: 1" on the page below? My current attempt with Python and bs4, shown further down, does not work.

Source of the page I want to scrape:

<head><link href="basic.css" rel="stylesheet" type="text/css"></head>
<body>
<p><b>STATUS</b><br>
<p><b>Device information:</b><br>
Hardware type:  
Exstreamer 110
 (ID 20)<br>
<br>
Firmware: Streaming Client<br>
FW version: B2.17&nbsp;-&nbsp;31/05/2010 (dd/mm/yyyy)<br>
WEB version: 04.00<br>
Bootloader version: 99.19<br>
Setup version: 01.02<br>
Sg version: A8.05&nbsp;-&nbsp;May 31 2010<br>
Fs version: A2.05&nbsp;-&nbsp;31/05/2010 (dd/mm/yyyy)<br>
<p><b>System status:</b><br>
Ticks: 1588923494 ms<br>
Uptime: 10178858 s<br>
<p><b>Streaming status:</b><br>
Volume: 90%<br>
Shuffle:   Off<br>
Repeat:   Off<br>
Output peak level L: -63dBFS<br>
Output peak level R: -57dBFS<br>
Buffer level: 65532 bytes<br>
RTP decoder latency: 0 ms; average 0 ms<br>
Current stream number:   1   <br>
Current URL: http://listen.qkradio.com.au:8382/listen.mp3<br>
Current channel: 0<br>
Stream bitrate: 32 kbps<br>

Code:

from bs4 import BeautifulSoup
import urllib2
import lxml

SERVER = 'http://xx.xx.xx.xx:8080/ixstatus.html'
authinfo = urllib2.HTTPPasswordMgrWithDefaultRealm()
authinfo.add_password(None, SERVER, 'user', 'password')
page = 'http://xxx.xxx.xxx.xxx:8080/ixstatus.html'
handler = urllib2.HTTPBasicAuthHandler(authinfo)
myopener = urllib2.build_opener(handler)
opened = urllib2.install_opener(myopener)
output = urllib2.urlopen(page)
#print output.read()
soup = BeautifulSoup(output.read(), "lxml")
#print(soup)

print "stream number:", soup.select('Current stream number')[0].text

【Comments】:

    Tags: python web-scraping beautifulsoup


    【Solution 1】:

    Your call to select makes BS4 treat the string as a CSS selector, so it looks for something that does not exist: a <number> element inside a <stream> element, inside a <Current> element.
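
To make the selector point concrete, here is a minimal sketch. The HTML strings are shortened stand-ins written for this example (not the real device page), and it uses the built-in "html.parser" so that lxml is not required. It shows that a space-separated selector is read as nested tag names, so it finds nothing on a page where the label is plain text.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the status page: the label is plain text, not a tag.
status_html = (
    "<body><p><b>Streaming status:</b><br>"
    "Current stream number:   1   <br></body>"
)
soup = BeautifulSoup(status_html, "html.parser")

# "a b c" is a descendant selector: a <c> inside a <b> inside an <a>.
matches = soup.select("Current stream number")
print(len(matches))  # no such tags exist, so indexing [0] would raise IndexError

# Hypothetical markup for which the same kind of selector WOULD match.
nested_html = "<current><stream><number>1</number></stream></current>"
nested = BeautifulSoup(nested_html, "html.parser")
nested_text = nested.select("current stream number")[0].text
print(nested_text)
```

Because the first result list is empty, the `[0]` in the question's last line is what fails.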

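    The HTML has no class or id attributes that could be used to locate the data you need. Your (probably) best option is to go through the paragraphs and use a regular expression to find a substring of the form: Current stream number: some_number

    Here is how I would do it: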

    import re
    import bs4
    
    # placeholder: put the HTML of the page to scrape here
    page = "html code to scrape"
    
    # this pattern will be used to find the data we want;
    # the capture group holds the digits after the label
    pattern = r'\s*Current\s+stream\s+number:\s*(\d+)'
    
    soup = bs4.BeautifulSoup(page, 'lxml')
    
    # collect every match from every paragraph on the page
    paragraphs = soup.find_all('p')
    data = []
    for para in paragraphs:
        found = re.finditer(pattern, para.text, re.IGNORECASE)
        data.extend([x.group(1) for x in found])
    
    
    print(data)
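
The list printed above holds strings; under Python 2 a one-element result is displayed as [u'1']. If a single integer is wanted instead, a minimal sketch could look like the following. The helper name `extract_stream_number` and the `sample` text are made up for this example, and the function takes plain text, which is assumed to come from something like `soup.get_text()`.

```python
import re

# Same label as in the pattern above; the group captures the digits.
STREAM_RE = re.compile(r"Current\s+stream\s+number:\s*(\d+)", re.IGNORECASE)


def extract_stream_number(text):
    """Return the stream number as an int, or None if the label is absent."""
    match = STREAM_RE.search(text)
    return int(match.group(1)) if match else None


# Stand-in for the text of the status page.
sample = "Buffer level: 65532 bytes\nCurrent stream number:   1   \nCurrent channel: 0"
stream_number = extract_stream_number(sample)
print(stream_number)
```

With the list from the snippet above, `int(data[0])` gives the same value, provided the list is not empty.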
    

    【Comments】:

    • The response I get is [u'1']. How can I get just the value 1?