BeautifulSoup unable to parse Goa University site

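[Question title]: BeautifulSoup unable to parse Goa University site
[Posted]: 2017-02-14 08:05:12
[Question description]:

I am working on a parsing project that requires me to parse educational websites. My code fails on the University of Goa site: it does not return what I expect. My code: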

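from bs4 import BeautifulSoup
import requests

# Assumed target: the question never defines url; this value is taken from the answer's code.
url = "https://www.unigoa.ac.in"
hdrs = {'User-Agent': 'Mozilla/5.0 (X11 Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}

r = requests.get(url, verify=True, headers=hdrs)
result = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly
print(result)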

This prints:

<html><head><script type="text/javascript">
    document.location="https://www.unigoa.ac.in/result_redirect.php";
</script>
</head></html>

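instead of the original HTML parse tree. I tried passing the explicit parsers lxml and html5lib to BeautifulSoup, but that did not work as expected either. Please help. Thanks in advance.

[Comments]:

That is the raw parsed HTML tree. Try saving it to an HTML file and opening it in a browser... (just to see what it does)

Tags: python python-2.7 python-3.x parsing beautifulsoup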

[Solution 1]:

You need to create a session, then parse out the redirect URL and request it:

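with requests.Session() as s:
    s.headers.update(hdrs)  # hdrs is the User-Agent dict from the question
    r = s.get("https://www.unigoa.ac.in")
    result = BeautifulSoup(r.content, "html.parser")
    # The script body is document.location="<url>"; keep what follows the first "="
    redirect = result.find("script").text.split("=", 1)[1].strip('";\r\n')
    r2 = s.get(redirect)
    print(r2.text)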

r2.text will give you the HTML you see on the home page.
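The split-and-strip step relies on the script containing exactly one assignment in exactly that form. As a slightly more tolerant sketch, the redirect target could be pulled out with a regular expression and resolved against the page URL. The helper name extract_js_redirect and the regex are illustrative assumptions, not part of the original answer, and the pattern only covers simple document.location / window.location assignments with a quoted string literal.

```python
import re
from urllib.parse import urljoin

# Hypothetical helper (not from the original answer): finds a simple
# JavaScript redirect such as document.location="..." in an HTML string.
_JS_REDIRECT = re.compile(
    r"""(?:document|window)\.location(?:\.href)?\s*=\s*["']([^"']+)["']"""
)


def extract_js_redirect(html, base_url):
    """Return the absolute redirect URL found in html, or None if there is none."""
    match = _JS_REDIRECT.search(html)
    if match is None:
        return None
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    return urljoin(base_url, match.group(1))


# The page body shown in the question.
SAMPLE_HTML = """<html><head><script type="text/javascript">
    document.location="https://www.unigoa.ac.in/result_redirect.php";
</script>
</head></html>"""

print(extract_js_redirect(SAMPLE_HTML, "https://www.unigoa.ac.in"))
```

In the session code above, redirect = extract_js_redirect(r.text, r.url) could stand in for the split line; a None result would mean the page did not contain a redirect of this shape.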

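[Comments]:

I guess you edited your answer. By the way, the previous version, with r2.content instead of r2.text, also solved my problem. Thanks a lot for the quick reply... :)
@OmPrakash, no worries. I only changed it to .text because in Python 3 you get nicely formatted text, whereas with .content you see one single bytes string.
Oh, I see. I did not know about this difference between .text and .content. Thanks again.
@OmPrakash: see How does accepting an answer work?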
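The .text / .content distinction raised in the comments can be illustrated without touching the network. In requests, r.content is the raw body as bytes, and r.text is that body decoded to str using the response's encoding. The snippet below only mimics that with a hand-made byte string; the variable names and the sample markup are made up for illustration.

```python
# Bytes as they might arrive over the wire; r.content holds a value of this kind.
raw_body = "<p>Café, Goa</p>".encode("utf-8")

# Decoding with the response's character set is, in essence, what r.text returns.
decoded_body = raw_body.decode("utf-8")

# On Python 3 the first line shows a b'...' literal with escaped bytes,
# the second shows readable text.
print(type(raw_body).__name__, raw_body)
print(type(decoded_body).__name__, decoded_body)
```

BeautifulSoup accepts either form, which is why both versions of the answer worked; the difference only shows up when the value is printed directly.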