How can I scrape data that does not appear in the page source? Answers

【Question title】: How can I scrape data that does not appear in the page source?
【Posted】: 2019-01-05 11:04:50
【Question description】:

scrape.py

# Scrape the links from a saved copy of the HTML page

from bs4 import BeautifulSoup

# Read the saved page; the with-block closes the file automatically
with open('scrapeFile', 'r') as data:
    html = data.read()
soup = BeautifulSoup(html, features="html.parser")

# Extract the link from each result card
links = []
for div in soup.find_all('div', {'class': 'main-bar z-depth-1'}):
    links.append('https://godamwale.com' + str(div.a.get('href')))

print(links)

# Save one link per line
with open("links.txt", "w") as file:
    for link in links:
        file.write(link + '\n')
        print(link)

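I used this code to get the list of links successfully. But when I try to scrape the HTML pages behind those links, the page source does not contain any of the data, which makes extraction difficult. I tried the Selenium driver, but it did not work well for me. I want to scrape the data from the link below. The rendered HTML holds sections for customer details, licences and automation, commercial details, floor-wise details, and operational details. I want to extract this data along with the name, location, contact number, and type.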

https://godamwale.com/list/result/591359c0d6b269eecc1d8933

That is the link. If anyone finds a solution, please share it.

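【Comments】:

Has anyone done this before?
"Which is not having any of the source code": I don't understand. Please explain in detail what you mean.
When I view the source with Ctrl+U, it only shows code with no data in it, but I want to scrape the data, and I can see the markup that holds the data when I use Inspect.
You say you have the links, but you don't mention what you want to do next.
I want to scrape the data from these links, one by one, and put it into an Excel file.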

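Tags: python-3.x web-scraping beautifulsoup

【Solution 1】:

Using the developer tools in your browser, you will notice that whenever you visit that link, a request is made to https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933, which returns a JSON response that probably contains the data you are looking for.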

Python 2.x:

import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents

Python 3.x:

import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
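
To apply the same request to every link saved in links.txt and end up with a file Excel can open, something along these lines might work. It is only a sketch under several assumptions: every saved link has the form https://godamwale.com/list/result/<id>, the matching JSON endpoint is https://godamwale.com/public/warehouse/<id> as in the single example above, and each response is a JSON object at the top level. The helper names (to_api_url, fetch_warehouse, flatten, save_rows, scrape_links) and the output file name warehouses.csv are hypothetical.

```python
import csv
import json
import urllib.request

# Assumed URL shapes, taken from the single example in this answer
LISTING_PREFIX = "https://godamwale.com/list/result/"
API_PREFIX = "https://godamwale.com/public/warehouse/"


def to_api_url(listing_url):
    """Map a listing page URL to the JSON endpoint seen in the developer tools."""
    listing_url = listing_url.strip()
    if not listing_url.startswith(LISTING_PREFIX):
        raise ValueError("not a listing URL: " + listing_url)
    warehouse_id = listing_url[len(LISTING_PREFIX):].strip("/")
    return API_PREFIX + warehouse_id


def fetch_warehouse(listing_url):
    """Download and parse the JSON document for one listing page."""
    with urllib.request.urlopen(to_api_url(listing_url)) as response:
        return json.loads(response.read().decode("utf-8"))


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        name = prefix + str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value  # lists and scalars are kept as they are
    return flat


def save_rows(rows, path):
    """Write flattened records to a CSV file; missing fields become empty cells."""
    columns = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)


def scrape_links(links_path="links.txt", out_path="warehouses.csv"):
    """Fetch every link listed in links_path and save one CSV row per link."""
    with open(links_path) as handle:
        links = [line.strip() for line in handle if line.strip()]
    rows = [flatten(fetch_warehouse(link)) for link in links]
    save_rows(rows, out_path)
```

Calling scrape_links() after running scrape.py would then produce one row per warehouse. Which columns actually hold the name, location, contact number, and type depends on the structure of the JSON, so it is worth printing one response first.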

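【Comments】:

Traceback (most recent call last): File "scrapFile.py", line 48, in contents = json.loads(urllib.request.urlopen("godamwale.com/public/warehouse/…) File "/usr/lib/python3.5/json/__init__.py", line 312, in loads s.__class__.__name__)) TypeError: the JSON object must be str, not 'bytes'
It shows this error when I run your code.
It works for me on Python 3.6.6 (which Python version are you using?), but I took a guess at why it might not work for you and updated my answer. You may want to check the following for a more robust solution: stackoverflow.com/questions/32795460/…
@TrevorIanPeacock, could you quickly explain where you saw/found that it makes that request and returns a JSON response?
@chitown88 I haven't dug into the JavaScript to see where/how that particular site makes the request, but by simply inspecting the requests made in the developer console I can see the request and the response. See the following link on opening the developer tools in your browser. If you want to pin down exactly where/how the call is made, that is also where I would start. developer.mozilla.org/en-US/docs/Learn/Common_questions/…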
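
The TypeError reported in this thread is consistent with json.loads receiving bytes on Python 3.5: json.loads only accepts bytes from Python 3.6 onward, which would explain why the undecoded version worked on 3.6.6. A minimal sketch of a decoding step that behaves the same on both versions is shown here. The helper names parse_json_body and fetch_json are hypothetical, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
import json
import urllib.request


def parse_json_body(raw, charset=None):
    """Decode a response body to str if it is bytes, then parse it as JSON."""
    if isinstance(raw, bytes):
        # Assumption: UTF-8 when the server does not declare a charset
        raw = raw.decode(charset or "utf-8")
    return json.loads(raw)


def fetch_json(url):
    """Fetch a URL and parse its body using the charset from the response headers."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset()
        return parse_json_body(response.read(), charset)
```

For example, fetch_json("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933") would replace the one-line call in the answer.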

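【Solution 2】:

Here you go. The main problem with the site seems to be that it takes time to load, which is why it returns an incomplete page source. You have to wait until the page has loaded completely. Note the time.sleep(8) line in the code below: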

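from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Raw string, so the backslashes in the Windows path are not read as escape sequences
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads/Chromedriver.exe"

# Selenium 3 call style; newer Selenium 4 releases expect
# webdriver.Chrome(service=Service(CHROMEDRIVER_PATH)) instead
wd = webdriver.Chrome(CHROMEDRIVER_PATH)

wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")

time.sleep(8)  # wait until the page loads completely

soup = BeautifulSoup(wd.page_source, 'lxml')
wd.quit()  # the rendered source is captured, so the browser can be closed

props_list = []
propvalues_list = []

div = soup.find_all('div', {'class': 'row'})
# div[6] is the seventh 'row' div, which this answer uses for the construction details
for childtags in div[6].find_all('div', {'class': 'col s12 m4 info-col'}):
    props = childtags.find("span").contents  # label of the field
    props_list.append(props)

    propvalue = childtags.find("p").contents  # value of the field
    propvalues_list.append(propvalue)

print(props_list)
print(propvalues_list)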

Note: the code returns the construction details in two separate lists.
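
Since each entry in the two lists is a .contents list (the pieces inside a tag), the labels and values can be paired into a single dictionary before saving them. The following is a rough sketch that works on plain lists of strings as well as on BeautifulSoup contents. The helper names contents_to_text and pair_details are hypothetical, and so are the sample labels in the usage line.

```python
def contents_to_text(contents):
    """Join the pieces of a `.contents` list into one whitespace-trimmed string."""
    pieces = [str(piece).strip() for piece in contents]
    return " ".join(piece for piece in pieces if piece)


def pair_details(props_list, propvalues_list):
    """Zip labels with values into a dict; the first value wins for a repeated label."""
    details = {}
    for props, values in zip(props_list, propvalues_list):
        label = contents_to_text(props)
        details.setdefault(label, contents_to_text(values))
    return details
```

With made-up input such as pair_details([["Roof Type "]], [[" RCC"]]), the call returns {"Roof Type": "RCC"}. Any nested tags inside a value are kept as their HTML text, because str() is applied to every piece.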

【Comments】: