How can I retrieve the page title of a webpage using Python? Answers

【Question title】: How can I retrieve the page title of a webpage using Python?
【Posted】: 2010-09-08 06:01:36
【Question description】:

How can I retrieve the page title (the HTML title tag) of a webpage using Python?

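【Question discussion】:

Since this question was asked, many web pages have started using the og:title meta tag, which holds the original title, whereas the title tag often carries other data as a prefix or suffix. Originally only Facebook used it, as part of OpenGraph, but many sites now provide OpenGraph metadata. og:title has become a common source for page titles, especially for news articles.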
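
To illustrate the comment above, here is a minimal sketch that prefers og:title and falls back to the title tag. It uses only the standard library's html.parser. The class name TitleAndOgTitleParser, the function best_title, and the sample markup are made up for this example and do not come from any answer in the thread.

```python
from html.parser import HTMLParser


class TitleAndOgTitleParser(HTMLParser):
    """Collects both the og:title meta content and the <title> text."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            # Keep only the first og:title encountered.
            if attributes.get("property") == "og:title" and self.og_title is None:
                self.og_title = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # The parser may deliver the title text in several chunks.
        if self._in_title:
            self.title_parts.append(data)


def best_title(html_text):
    """Return og:title if present, else the stripped <title> text, else None."""
    parser = TitleAndOgTitleParser()
    parser.feed(html_text)
    parser.close()
    if parser.og_title:
        return parser.og_title
    title = "".join(parser.title_parts).strip()
    return title or None


# Hypothetical markup: the <title> carries site decoration, og:title does not.
sample = (
    '<html><head><title>Site | Some Article | News</title>'
    '<meta property="og:title" content="Some Article"></head></html>'
)
print(best_title(sample))  # Some Article
```

Whether og:title or the title tag is the better choice depends on the site, so treat the fallback order here as one reasonable default rather than a rule.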

Tags: python html

【Solution 1】:

This is a simplified version of @Vinko Vrsalovic's answer:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string

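Notes:

soup.title finds the first title element anywhere in the HTML document
title.string assumes the element has only one child node, and that this child node is a string

For BeautifulSoup 4.x, use a different import: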

from bs4 import BeautifulSoup

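【Discussion】:

Thanks! In case anyone runs into a similar problem: in my Python 3 environment I had to use urllib.request instead of urllib2. Not sure why. To avoid the BeautifulSoup warning about my parser, I had to do soup = BeautifulSoup(urllib.request.urlopen(url), "lxml").
For Python 3, use import urllib.request as urllib instead of import urllib2
Note that for an empty title such as <title></title>, soup.title.string returns None; if the title tag is missing altogether, soup.title itself is None.
@Eitanmg: indeed, repl.it/@zed1/beautifulsoup-empty-title-is-none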

【Solution 2】:

For tasks like this I always use lxml. You could use BeautifulSoup as well.

import lxml.html
t = lxml.html.parse(url)
print(t.find(".//title").text)

Edit based on the comments:

from urllib2 import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
from lxml.html import parse

url = "https://www.google.com"
page = urlopen(url)
p = parse(page)
print(p.find(".//title").text)

【Discussion】:

In case the code above raises an IOError: stackoverflow.com/questions/3116269/…
lxml may have issues with Unicode; you can use bs4.UnicodeDammit to help it find the correct character encoding

【Solution 3】:

No libraries beyond requests are needed: the title can be sliced straight out of the response text.

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
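
One caveat with the slicing approach: str.find returns -1 when the tag is absent, so the slice quietly returns the wrong text instead of failing. The sketch below adds that check. The helper name title_by_slicing is made up for this example, and it still assumes a bare <title> tag with no attributes.

```python
def title_by_slicing(text):
    """Return the text between <title> and </title>, or None if either is missing."""
    open_tag, close_tag = "<title>", "</title>"
    start = text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    # Search for the closing tag only after the opening tag.
    end = text.find(close_tag, start)
    if end == -1:
        return None
    return text[start:end].strip()


print(title_by_slicing("<html><title>Friends - IMDb</title></html>"))  # Friends - IMDb
print(title_by_slicing("<html><body>no title here</body></html>"))     # None
```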

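【Discussion】:

"Importing other libraries" often seems to lead to more work. Thanks for helping us avoid that!

【Solution 4】:

The mechanize Browser object has a title() method. So the code from this post can be rewritten as: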

from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print(br.title())

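【Discussion】:

【Solution 5】:

This is probably overkill for such a simple task, but if you plan to do more than that, it is saner to start with these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to get the content, and regular expressions or some other parser to parse the HTML).

Links: BeautifulSoup mechanize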

#!/usr/bin/env python
#coding:utf-8

from bs4 import BeautifulSoup
from mechanize import Browser

#This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data() 

#This parses the content (naming the parser avoids the bs4 parser warning)
soup = BeautifulSoup(data, "html.parser")
title = soup.find('title')

#This outputs the content :)
print(title.get_text())

【Discussion】:

【Solution 6】:

Using HTMLParser:

from urllib.request import urlopen
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

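url = "http://example.com/"
# Decode the bytes (UTF-8 assumed here); str() on bytes would give the "b'...'" repr instead of the text
html_string = urlopen(url).read().decode("utf-8", errors="replace")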

parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain

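【Discussion】:

It is worth noting that this script is for Python 3. The HTMLParser module was renamed to html.parser in Python 3.x. Similarly, urllib.request was added in Python 3.
It is better to convert the bytes to a string explicitly: r = urlopen(url), encoding = r.info().get_content_charset(), and html_string = r.read().decode(encoding).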
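
The suggestion in the last comment can be sketched as follows. One detail to watch: get_content_charset() returns None when the server declares no charset, and bytes.decode(None) raises a TypeError, so this sketch falls back to UTF-8. The helper names decode_body and fetch_html and the UTF-8 fallback are assumptions made for this example.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode response bytes using the declared charset, or a fallback."""
    return raw.decode(charset or fallback, errors="replace")


def fetch_html(url):
    """Download a page and return its body as str."""
    with urlopen(url) as response:
        # get_content_charset() returns None when the header declares no charset.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Why str() on bytes is not a conversion: it returns the repr of the bytes object.
raw = "<title>Café</title>".encode("utf-8")
print(str(raw))          # b'<title>Caf\xc3\xa9</title>'
print(decode_body(raw))  # <title>Café</title>
```

The result of fetch_html(url) can then be passed to parser.feed() in place of html_string.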

【Solution 7】:

Use soup.select_one to locate the title tag

import requests
from bs4 import BeautifulSoup as bs

url = "http://example.com/"  # placeholder; replace with the page you want
r = requests.get(url)
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)

【Discussion】:

【Solution 8】:

Use a regular expression

import re
# raw_html is assumed to already hold the page source as a str
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'

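【Discussion】:

What exactly is .group(1)? Is there a reference?
Hi, group(0) returns the entire match, while group(1) returns the first parenthesized group, here the text between the tags. See match-objects for reference.
This will miss any case where the title tag is not written exactly like that (uppercase, mixed case, spacing)
I would also match the opening tag loosely, in case there is additional data inside the title tag.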
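
The objections raised in these comments (case, spacing, extra data in the opening tag) can be handled with a slightly looser pattern. This is only a sketch: the name title_by_regex and the exact pattern are choices made for this example, and a regular expression remains less robust than a real HTML parser.

```python
import re

# IGNORECASE handles <TITLE> and <Title>; DOTALL lets the title span several lines;
# [^>]* tolerates attributes inside the opening tag.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def title_by_regex(raw_html):
    """Return the title text with whitespace collapsed, or None if there is no match."""
    match = TITLE_RE.search(raw_html)
    if match is None:
        return None
    # group(0) would be the whole match including the tags; group(1) is the captured text.
    return " ".join(match.group(1).split())


print(title_by_regex('<TITLE lang="en">\n  Hello\n  World </Title >'))  # Hello World
print(title_by_regex("<p>no title</p>"))                                # None
```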

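【Solution 9】:

soup.title.string actually returns a unicode string. To convert it to a normal string on Python 2, you need to do string = string.encode('ascii', 'ignore')

【Discussion】:

This simply drops any non-ASCII characters, which may not be what you want. If you really want bytes (which is what encode gives you) rather than a string, encode with the correct charset, for example string.encode('utf-8').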

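【Solution 10】:

Here is a fault-tolerant HTMLParser implementation.
You can throw almost anything at get_title() without it breaking; if anything unexpected happens, get_title() returns None.
When Parser() downloads the page, it converts it to ASCII regardless of the charset used in the page, ignoring any errors. It would be trivial to change to_ascii() to convert the data to UTF-8 or any other encoding: just add an encoding parameter and rename the function to something like to_encoding().
By default, HTMLParser() breaks on broken HTML; it even breaks on trivial things like mismatched tags. To prevent this behavior, I replaced HTMLParser()'s error method with a function that ignores the errors.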

#-*-coding:utf8;-*-
#qpy:3
#qpy:console

''' 
Extract the title from a web page using
the standard lib.
'''

from html.parser import HTMLParser
from urllib.request import urlopen
import urllib.error

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)

def is_bytes(data):
    return isinstance(data, bytes)

def to_ascii(data):
    if is_string(data):
        data = data.encode('ascii', errors='ignore')
    elif is_bytes(data):
        data = data.decode('ascii', errors='ignore')
    else:
        data = str(data).encode('ascii', errors='ignore')
    return data


class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Install the no-op error handler before feed() so it is in place while parsing
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return

        self.rec = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True

    def handle_data(self, data):
        if self.rec:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False


def get_title(url):
    return Parser(url).title

print(get_title('http://www.google.com'))
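
The to_encoding() variant mentioned in the answer's text might look roughly like the sketch below. It is one interpretation and not the answerer's code: unlike to_ascii(), it always returns str (which is what HTMLParser.feed() expects), and the UTF-8 default is an assumption.

```python
def to_encoding(data, encoding="utf-8"):
    """Return data as str, decoding bytes with the given encoding and ignoring errors."""
    if isinstance(data, bytes):
        return data.decode(encoding, errors="ignore")
    if isinstance(data, str):
        return data
    # Anything else (numbers, None, ...) is converted with str().
    return str(data)


print(to_encoding(b"caf\xc3\xa9"))           # café
print(to_encoding(b"caf\xc3\xa9", "ascii"))  # caf
```

In Parser.__init__ the call would then become self.feed(to_encoding(urlopen(url).read())).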

【Discussion】:

【Solution 11】:

In Python 3, we can call urlopen from urllib.request and BeautifulSoup from the bs4 library to get the page title.

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.google.com")
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)

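Here we use the "lxml" parser, which is generally the fastest of the parsers BeautifulSoup supports.

【Discussion】:

【Solution 12】:

Using lxml...

Getting it from the page's meta tags, marked up according to the Facebook OpenGraph protocol: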

import lxml.html  # parse is a function in lxml.html, not a submodule
html_doc = lxml.html.parse(some_url)

t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]

Or use .xpath with lxml to read the title tag:

t = html_doc.xpath(".//title")[0].text

【Discussion】: