HTML Parsing: from source to text in Python

【Question title】: HTML Parsing: from source to text in Python
【Posted】: 2016-04-17 14:37:13
【Question description】:

I have read this question (Python HTML parsing from url), but there are still some things I don't understand. Here is the code:

import urllib.request
from html.parser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:" + tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :" + tag)

    def handle_data(self, data):
        print("Encountered some data  :" + data)

parser = MyHTMLParser()
info = "http://www.calendario-365.it/js/365.php?page=moon"
response = urllib.request.urlopen(info)
content = response.read()
# response.read() returns bytes; decode to text (UTF-8 assumed) instead of calling str(content)
parser.feed(content.decode("utf-8", errors="replace"))

Running this code against my site gives me this: http://pastebin.com/m4YV38uM. I would like to save the following into variables:

10,6 giorni

82%

How can I do that? Thanks for your answers. Python version: 3.5.

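【Question discussion】:

Tags: python html parsing python-3.x

【Solution 1】:

Regular expressions are neat, and parsers such as lxml and BeautifulSoup are convenient, but for this particular problem I don't mind using HTMLParser. Even if you end up not using it, here is the procedure. There is one subtlety in using it. Suppose the target element looks like this (you didn't show the actual element, so this is my assumption):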

<div id="x" class="y"> 82% </div>

Then implement the methods as follows:

from html.parser import HTMLParser

class My(HTMLParser):
    def __init__(self):
        super().__init__()  # initialise the base parser state
        self.percent = 0
        self.flag = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # {"id": someid, "class": someclass}
        if attrs.get("id") == "x":  # or attrs.get("class") == "y"
            self.flag = True  # we entered the target

    def handle_data(self, data):
        if self.flag:  # we are inside our target
            self.percent = data  # do str -> int conversion

    def handle_endtag(self, tag):
        if self.flag:  # reached the end of target
            self.flag = False

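For each value you want to capture:

Add an instance attribute (like self.percent)
Add a flag (like self.flag)
Implement the corresponding logic in the three methods: recognise the entry, extract the data, and recognise the exit.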
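
The steps above could be sketched for two values as follows. This is only a minimal sketch: the ids "age" and "percent", the class name MoonParser, and the sample markup are assumptions made for illustration, since the real page's markup is not shown in the question.

```python
from html.parser import HTMLParser


class MoonParser(HTMLParser):
    """Capture two values, each with its own attribute and flag."""

    def __init__(self):
        super().__init__()
        self.age = None          # will hold text such as "10,6 giorni"
        self.percent = None      # will hold text such as "82%"
        self.in_age = False      # flag: inside the (hypothetical) age element
        self.in_percent = False  # flag: inside the (hypothetical) percent element

    def handle_starttag(self, tag, attrs):
        elem_id = dict(attrs).get("id")
        if elem_id == "age":          # hypothetical id
            self.in_age = True
        elif elem_id == "percent":    # hypothetical id
            self.in_percent = True

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return                    # ignore whitespace-only chunks
        if self.in_age:
            self.age = text
        elif self.in_percent:
            self.percent = text

    def handle_endtag(self, tag):
        # Any end tag closes the current target (fine for flat markup only).
        self.in_age = False
        self.in_percent = False


# Assumed sample markup, not the real page.
sample = '<div id="age"> 10,6 giorni </div><div id="percent"> 82% </div>'
parser = MoonParser()
parser.feed(sample)
parser.close()

# Optional conversion of the captured strings to numbers.
age_days = float(parser.age.split()[0].replace(",", "."))
percent_value = int(parser.percent.rstrip("%"))
print(parser.age, parser.percent, age_days, percent_value)
```

Note that this sketch clears both flags on any end tag, so it would stop capturing early if the target element contained nested tags; a depth counter would be needed in that case.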

【讨论】：

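【Solution 2】:

OK, if you are looking for a simple solution, you can run a regular expression over the result, or use one to limit your output. I can't tell exactly how you are outputting this data, but you may want to try the following patterns: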

"\d?\d,\d giorni"
"\d?\d%"

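The first should match one or two digits followed by a comma, another digit, and the word "giorni"; the second matches one or two digits followed by %. You can also use the "+" or "*" operators, depending on how variable the input is.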
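
As a rough illustration of the patterns above, here is a small sketch using Python's re module. The sample text is an assumption standing in for whatever the parser actually outputs, and "+" is used instead of "?" so that values such as "100%" also match.

```python
import re

# Hypothetical text, standing in for the data printed by the parser.
text = "Età della Luna: 10,6 giorni Percentuale visibile: 82%"

# One or more digits, a comma, one or more digits, then " giorni".
age_match = re.search(r"\d+,\d+ giorni", text)
# One or more digits followed by a percent sign.
percent_match = re.search(r"\d+%", text)

# re.search returns None when nothing matches, so guard before calling group().
age = age_match.group(0) if age_match else None
percent = percent_match.group(0) if percent_match else None
print(age, percent)
```

If the text contains other percentages or other "giorni" values, the patterns would need more context (for example the preceding label) to pick the right match.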

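【Discussion】:

This is not an answer... If you want more information from the OP, post it as a comment; if you can't comment, don't post it as an answer instead.
These two elements change every day; for example, "82%" could later be "54%" or "100%". Maybe two (or three) digits before the "%"? Or maybe three digits before "giorni"? I don't know whether there is a real standard :)
@Iron Fist: My bad, I'm new here. This is not an answer. It seems I'm not allowed to comment because my reputation is too low...
What? I don't understand.
When I try to comment on the original post, it tells me I need 50 reputation to comment. But I have now edited my question into an answer.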

【Solution 3】:

Try this:

# -*- coding: utf-8 -*-
import urllib2  # Python 2 only; on Python 3 the equivalent is urllib.request
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.calendario-365.it/js/365.php?page=moon').read()

soup = BeautifulSoup(page)

print(soup.find(text='Età della Luna:').findNext('div').text)

print(soup.find(text='Percentuale visibile:').findNext('div').text)

Output:

10,6 giorni
82%

【Discussion】:

SyntaxError: print soup.find(text='Età della Luna:').findNext('div').text
You are probably using Python 3; see the updated answer. Also make sure you have bs4 installed.
Can you give me a download link for BeautifulSoup? I don't have those modules.
return super(HTMLParserTreeBuilder, self).__init__(*args, **kwargs) TypeError: __init__() got an unexpected keyword argument 'strict' What?
Here is the code (I use urllib.request because I don't have urllib2): pastebin.com/bRsAavCd