Parse iframe with blank src using bs4

【Question title】: Parse iframe with blank src using bs4
【Posted】: 2016-06-27 06:27:47
【Question】:

Good day, SO community. Here is a problem I ran into recently:

I have this HTML source on the main page:

  <div id="contents_layout">

  <iframe name="contentsFrame" id="contentsFrameID" src="" 
  width="100%" height="100%" scrolling="no" frameborder="0" 
  marginheight="0" marginwidth="0"></iframe>

  </div>

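I have read a lot of material about parsing iframes, but all of it simply takes the src attribute from the iframe and then makes another request. I can't use the same trick here, because the src attribute is blank and the content is driven by web logic underneath.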
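For reference, the usual approach described above can be sketched as follows. This is only an illustration, not code from the original post: `iframe_url` is a hypothetical helper and `http://example.com/main` is a placeholder base URL. It returns None when src is blank, which is the situation in this question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_url(html, base_url, frame_id):
    """Return the absolute URL an iframe points to, or None if src is blank."""
    soup = BeautifulSoup(html, "html.parser")
    frame = soup.find("iframe", id=frame_id)
    if frame is None:
        return None
    src = (frame.get("src") or "").strip()
    if not src:
        # Nothing to request: the frame is filled in later by JavaScript.
        return None
    # Resolve relative src values against the page URL.
    return urljoin(base_url, src)


page = '<div id="contents_layout"><iframe id="contentsFrameID" src=""></iframe></div>'
print(iframe_url(page, "http://example.com/main", "contentsFrameID"))
```

With a non-empty src the returned URL could be passed to requests.get for the second request; with the markup from this page there is no URL to follow.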

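I am using Python 3.5, bs4 and requests.

Page source: http://collabedit.com/kqp88 Frame source: http://collabedit.com/hwuj7

I'm not sure whether I can share the original web page...

【Comments】:

What do you want to do once you have the iframe?
@PadraicCunningham Parse its content, of course.
I can't see contents_layout or contentsFrameID in either of the sources.
Right, neither can I. That is the problem: I just can't tell which page the iframe loads. When I look at the source, it is all blank and contains only variable names such as contentFrameID or contentTextID.
The iframe in the second link has the id vis_frame; is that the one you want? And how does the first link fit into this?

Tags: python parsing iframe beautifulsoup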

【Solution 1】:

The iframe has an id, so just use that:

from bs4 import BeautifulSoup

h = """<div id="contents_layout">

  <iframe name="contentsFrame" id="contentsFrameID" src=""
  width="100%" height="100%" scrolling="no" frameborder="0"
  marginheight="0" marginwidth="0"></iframe>

  </div>
"""

# Name the parser explicitly; omitting it makes bs4 emit a warning.
soup = BeautifulSoup(h, "html.parser")

# Select the iframe by its id.
iframe = soup.select_one("#contentsFrameID")

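That gives you the tag itself (the attribute order may differ depending on the parser and Python version):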

<iframe frameborder="0" height="100%" id="contentsFrameID" marginheight="0" marginwidth="0" name="contentsFrame" scrolling="no" src="" width="100%"></iframe>

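You can also select on the empty src attribute. Note the single quotes around the selector: written as "iframe[src=""]", Python concatenates two adjacent string literals and the selector becomes the invalid iframe[src=].

ifr = soup.select_one('iframe[src=""]')

Or use the name:

ifr = soup.select_one("iframe[name=contentsFrame]")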
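As a quick check, the three selectors can be run side by side on the snippet from the question. This is a minimal sketch using only the markup quoted above; it also shows that the tag has no text or children, because the static HTML contains nothing inside the iframe.

```python
from bs4 import BeautifulSoup

html = """<div id="contents_layout">
  <iframe name="contentsFrame" id="contentsFrameID" src=""
  width="100%" height="100%" scrolling="no" frameborder="0"
  marginheight="0" marginwidth="0"></iframe>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Three ways to reach the same element.
by_id = soup.select_one("#contentsFrameID")
by_src = soup.select_one('iframe[src=""]')
by_name = soup.select_one('iframe[name="contentsFrame"]')

# All three selectors return the same Tag object.
print(by_id is by_src is by_name)

# The src attribute and the tag body are both empty strings.
print(repr(by_id["src"]), repr(by_id.get_text()))
```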

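On the actual site you are scraping, the content inside contentsFrameID is created dynamically, so you will need something like selenium. Here is an example that retrieves a dynamically created form: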

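from selenium import webdriver
from bs4 import BeautifulSoup

# PhantomJS support was removed in Selenium 4; on current versions
# substitute a headless browser such as webdriver.Chrome() or webdriver.Firefox().
dr = webdriver.PhantomJS()

dr.get("http://encykorea.aks.ac.kr/Contents/Index?contents_id=E0000089")

# Parse the rendered page, including elements created by JavaScript.
soup = BeautifulSoup(dr.page_source, "html.parser")
print(soup.select_one("#contentFrameForm"))
dr.quit()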
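If the content you need is rendered inside the iframe rather than in the top-level page, one option is to switch the driver into the frame before reading the source. The following is a rough sketch under several assumptions, not something confirmed for this site: `fetch_frame_html` and `frame_text` are hypothetical helper names, headless Chrome stands in for PhantomJS, and the frame is assumed to be reachable by its id.

```python
from bs4 import BeautifulSoup


def fetch_frame_html(url, frame_id, timeout=10):
    """Load url in headless Chrome and return the rendered HTML inside the iframe."""
    # Imported here so that frame_text below also works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome is installed
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the iframe exists, then make it the current browsing context.
        WebDriverWait(driver, timeout).until(
            EC.frame_to_be_available_and_switch_to_it((By.ID, frame_id))
        )
        # page_source now refers to the document inside the frame.
        return driver.page_source
    finally:
        driver.quit()


def frame_text(frame_html):
    """Parse the frame's HTML and return its visible text."""
    soup = BeautifulSoup(frame_html, "html.parser")
    return soup.get_text(" ", strip=True)
```

Calling frame_text(fetch_frame_html(url, "contentsFrameID")) with the real page URL would then give the text that the browser rendered inside the frame, provided the site actually loads its content there.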

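【Discussion】:

Thank you very much, but if I print(iframe), it gives me exactly <iframe> ... </iframe>. I have added the source of the page and of the iframe to the question; could you take a look at those too? Thanks a lot.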