wangchenghua

1. Connect to MongoDB and import the data scraped from the Douban book pages into the database. The code is as follows:

# Fetch the pages with requests and parse them with BeautifulSoup
import re
import requests
import pymongo
from bs4 import BeautifulSoup

# Connect to MongoDB first, so that the collection exists before the
# scraping loop inserts into it. This creates a connection to the local
# server, a database named 豆瓣数据采集, and a collection named test.
myclient = pymongo.MongoClient("mongodb://localhost:27017")
db = myclient['豆瓣数据采集']
datatable = db['test']

# Build the list of paginated tag URLs (20 books per page, 7 pages)
urlist = []
for i in range(7):
    urlist.append('https://book.douban.com/tag/哲学?start=' + str(20 * i) + '&type=T')

n = 0
for u in urlist:
    r = requests.get(url=u)
    soup = BeautifulSoup(r.text, 'lxml')
    lis = soup.find('ul', class_='subject-list').find_all('li')
    for li in lis:
        dic = {}      # empty dict to hold one book's data
        dic['书名'] = li.h2.text.replace(' ', '').replace('\n', '')
        dic['其他信息'] = li.find('div', class_="pub").text.replace(' ', '').replace('\n', '')
        dic['评分'] = li.find('span', class_="rating_nums").text
        dic['评价人数'] = re.search(r'(\d*)人', li.find('span', class_="pl").text.replace(' ', '').replace('\n', '')).group(1)
        datatable.insert_one(dic)     # insert each record into MongoDB as it is scraped
        n += 1
        print("成功采集%i条数据" % n)
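Once the records are in MongoDB, a natural next step is to pull them back out for analysis with pandas. Below is a minimal sketch of the record-handling side only: the sample dicts are made up (they mimic the shape of the dicts built in the scraping loop, with hypothetical titles and numbers), so it runs without a live MongoDB instance. In a real session you would replace `records` with `list(datatable.find({}, {'_id': 0}))`.

```python
import re
import pandas as pd

# Hypothetical records shaped like the dicts built in the scraping loop
records = [
    {'书名': '理想国', '其他信息': '柏拉图/商务印书馆', '评分': '9.0', '评价人数': '12345'},
    {'书名': '沉思录', '其他信息': '奥勒留/中央编译出版社', '评分': '8.3', '评价人数': '6789'},
]

df = pd.DataFrame(records)
df['评分'] = df['评分'].astype(float)       # cast rating to float so it sorts numerically
df['评价人数'] = df['评价人数'].astype(int)  # cast review count to int
print(df.sort_values('评分', ascending=False))

# The same regex used in the loop above, shown on one sample string:
m = re.search(r'(\d*)人', '(12345人评价)')
print(m.group(1))   # → '12345'
```

Casting the string fields up front avoids lexicographic surprises later (e.g. '10.0' sorting before '9.0' as text).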

2. MongoDB installation and configuration: https://www.cnblogs.com/zhoulifeng/p/9429597.html#4242074

3. Robo 3T installation: https://www.cnblogs.com/tugenhua0707/p/9250673.html
