This post practices BeautifulSoup and pymysql by crawling some of the highly-starred Python projects on GitHub.
A Python crawler for GitHub
The crawler's requirement: fetch high-quality Python-related projects from GitHub. The code below is only a test case and does not crawl a large amount of data.
1. A crawler version with the basic functionality
This example demonstrates batch inserts with pymysql, parsing HTML with BeautifulSoup, and issuing GET requests with the requests library. For more on using pymysql itself, see the earlier post "python框架---->pymysql的使用" (on using pymysql).
import requests
import pymysql.cursors
from bs4 import BeautifulSoup

def get_effect_data(data):
    """Parse one search-result page and collect a tuple per project."""
    results = list()
    soup = BeautifulSoup(data, 'html.parser')
    projects = soup.find_all('div', class_='repo-list-item')
    for project in projects:
        writer_project = project.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
        project_language = project.find('div', attrs={'class': 'd-table-cell col-2 text-gray pt-2'}).get_text().strip()
        project_starts = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
        update_desc = project.find('p', attrs={'class': 'f6 text-gray mb-0 mt-2'}).get_text().strip()
        # href has the form '/writer/project', so indexes 1 and 2 are the writer and the project name
        result = (writer_project.split('/')[1], writer_project.split('/')[2], project_language, project_starts, update_desc)
        results.append(result)
    return results
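To see the parsing logic in isolation, the sketch below feeds BeautifulSoup a hand-written HTML fragment that imitates the 2017-era GitHub search-result markup targeted by the selectors above. The fragment and its values are invented for illustration; GitHub's real markup has since changed.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the old (2017) GitHub search-result markup
# that the selectors in get_effect_data() target; the values are invented.
html = '''
<div class="repo-list-item">
  <a class="v-align-middle" href="/vinta/awesome-python">awesome-python</a>
  <div class="d-table-cell col-2 text-gray pt-2">Python</div>
  <a class="muted-link">41.4k</a>
  <p class="f6 text-gray mb-0 mt-2">Updated Nov 20, 2017</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
project = soup.find('div', class_='repo-list-item')
href = project.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
stars = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
# '/vinta/awesome-python'.split('/') -> ['', 'vinta', 'awesome-python']
print(href.split('/')[1], href.split('/')[2], stars)  # -> vinta awesome-python 41.4k
```

Note that `find('a', attrs={'class': 'muted-link'})` skips the first anchor because its class list does not contain `muted-link`, which is why the two anchors in the same item can be told apart.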
def get_response_data(page):
    request_url = 'https://github.com/search'
    params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'Repositories', 'p': page}
    resp = requests.get(request_url, params=params)
    return resp.text
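Before sending anything, you can check how requests encodes the params dict into the final query string by preparing the request offline, without any network access:

```python
import requests

# Build the request object without sending it, then inspect the final URL.
params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'Repositories', 'p': 1}
req = requests.Request('GET', 'https://github.com/search', params=params)
prepared = req.prepare()
print(prepared.url)
# -> https://github.com/search?o=desc&q=python&s=stars&type=Repositories&p=1
```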
def insert_datas(data):
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='root',
                                 db='test',
                                 charset='utf8mb4',
                                 cursorclass=pymysql.cursors.DictCursor)
    try:
        with connection.cursor() as cursor:
            sql = 'insert into project_info(project_writer, project_name, project_language, project_starts, update_desc) VALUES (%s, %s, %s, %s, %s)'
            cursor.executemany(sql, data)
        connection.commit()
    finally:
        # close the connection whether or not the insert succeeded
        connection.close()
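executemany is what turns the insert into one batched call instead of a round trip per row. Trying it against pymysql needs a live MySQL server, so the sketch below demonstrates the same batch-insert pattern with the standard-library sqlite3 module instead; note that sqlite3 uses '?' placeholders where pymysql uses '%s'.

```python
import sqlite3

# Same batch-insert pattern as insert_datas(), but against an in-memory
# SQLite database so it runs without a MySQL server. Sample rows are taken
# from the results table below.
rows = [
    ('vinta', 'awesome-python', 'Python', '41.4k', 'Updated Nov 20, 2017'),
    ('pallets', 'flask', 'Python', '31.1k', 'Updated Nov 15, 2017'),
]
conn = sqlite3.connect(':memory:')
conn.execute('''create table project_info(
    project_writer text, project_name text, project_language text,
    project_starts text, update_desc text)''')
conn.executemany('insert into project_info values (?, ?, ?, ?, ?)', rows)
conn.commit()
count = conn.execute('select count(*) from project_info').fetchone()[0]
print(count)  # -> 2
conn.close()
```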
if __name__ == '__main__':
    total_page = 2  # number of search-result pages to crawl
    datas = list()
    for page in range(total_page):
        res_data = get_response_data(page + 1)
        data = get_effect_data(res_data)
        datas += data
    insert_datas(datas)
After the script runs, the database contains rows like the following:
| id | project_writer | project_name | project_language | project_starts | update_desc |
| --- | --- | --- | --- | --- | --- |
| 11 | tensorflow | tensorflow | C++ | 78.7k | Updated Nov 22, 2017 |
| 12 | robbyrussell | oh-my-zsh | Shell | 62.2k | Updated Nov 21, 2017 |
| 13 | vinta | awesome-python | Python | 41.4k | Updated Nov 20, 2017 |
| 14 | jakubroztocil | httpie | Python | 32.7k | Updated Nov 18, 2017 |
| 15 | nvbn | thefuck | Python | 32.2k | Updated Nov 17, 2017 |
| 16 | pallets | flask | Python | 31.1k | Updated Nov 15, 2017 |
| 17 | django | django | Python | 29.8k | Updated Nov 22, 2017 |
| 18 | requests | requests | Python | 28.7k | Updated Nov 21, 2017 |
| 19 | blueimp | jQuery-File-Upload | JavaScript | 27.9k | Updated Nov 20, 2017 |
| 20 | ansible | ansible | Python | 26.8k | Updated Nov 22, 2017 |
| 21 | justjavac | free-programming-books-zh_CN | JavaScript | 24.7k | Updated Nov 16, 2017 |
| 22 | scrapy | scrapy | Python | 24k | Updated Nov 22, 2017 |
| 23 | scikit-learn | scikit-learn | Python | 23.1k | Updated Nov 22, 2017 |
| 24 | fchollet | keras | Python | 22k | Updated Nov 21, 2017 |
| 25 | donnemartin | system-design-primer | Python | 21k | Updated Nov 20, 2017 |
| 26 | certbot | certbot | Python | 20.1k | Updated Nov 20, 2017 |
| 27 | aymericdamien | TensorFlow-Examples | Jupyter Notebook | 18.1k | Updated Nov 8, 2017 |
| 28 | tornadoweb | tornado | Python | 14.6k | Updated Nov 17, 2017 |
| 29 | python | cpython | Python | 14.4k | Updated Nov 22, 2017 |
| 30 | | | Python | 14.2k | Updated Oct 17, 2017 |
Reference: https://www.cnblogs.com/huhx/p/usepythongithubspider.html