【Posted】: 2023-03-11 13:37:01
【Question】:
This is the curl command I am using -->
curl "https://api.coursera.org/api/courses.v1?start=1&limit=11?includes=instructorIds,partnerIds,specializations,s12nlds,v1Details,v2Details&fields=instructorIds,partnerIds,specializations,s12nlds,description"
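I used the query parameters start and limit, but the API just keeps returning the same 100 courses out of 2150. Here is the link to the course catalog API documentation -->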
https://docs.google.com/document/d/15gwppUMLp0s1OhbzFZvFSeTbvFkRfSFIkiIKrEP6cUA/edit
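One detail in the curl command that may be worth checking: the URL contains a second `?` before `includes`. Under standard query-string parsing, only the first `?` starts the query, so the text after `limit=` up to the next `&` all becomes the value of `limit`, and `includes` never appears as its own parameter. The sketch below shows this with Python's standard library and builds a URL with `&` separators instead. Whether this is what makes the API fall back to 100 results is an assumption, not something the documentation quoted here confirms; `query_params` and `build_url` are hypothetical helper names.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The URL exactly as it appears in the curl command above.
ORIGINAL_URL = (
    "https://api.coursera.org/api/courses.v1?start=1&limit=11"
    "?includes=instructorIds,partnerIds,specializations,s12nlds,v1Details,v2Details"
    "&fields=instructorIds,partnerIds,specializations,s12nlds,description"
)


def query_params(url):
    """Return the query parameters of url as a dict of single values."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


def build_url(base, params):
    """Join params to base with one '?' and '&' between pairs, leaving commas unescaped."""
    return base + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    parsed = query_params(ORIGINAL_URL)
    print(sorted(parsed))      # parameter names a standard parser sees
    print(parsed["limit"])     # the value that ends up attached to 'limit'
    print(build_url(
        "https://api.coursera.org/api/courses.v1",
        {
            "start": 0,
            "limit": 100,
            "includes": "instructorIds,partnerIds",
            "fields": "instructorIds,partnerIds,description",
        },
    ))
```

With curl, the equivalent change would be to replace the second `?` with `&` and keep the whole URL inside quotes so the shell does not interpret the `&` characters.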
Python code:
import requests
import json
from bs4 import BeautifulSoup
import csv

if __name__ == "__main__":
    headers = {
        "x-user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                        "(KHTML, like Gecko) Chrome/53.0.2785.92 Safari/537.36 "
                        "FKUA/website/41/website/Desktop"}
    # Load the previously saved JSON response
    with open('result.json', 'r', encoding='utf-8') as d:
        data = json.load(d)
    print(data)
    with open("coursera.csv", 'a', encoding='utf-8') as f:
        # Write the header row once, then comment it out for later append runs
        f.write('instructorIds' + ',' + 'courseType' + ',' + 'name' + ',' + 'partnerIds' + ',' +
                'slug' + ',' + 'specializations' + ',' + 'course_id' + ',' + 'description' + "\n")
        for i in range(len(data['elements'])):
            # List-valued field: convert to text and strip commas, newlines and brackets
            instructorIds = data['elements'][i]['instructorIds']
            instructorIds = str(instructorIds)
            if instructorIds:
                instructorIds = instructorIds.rstrip().replace(',', '')
                instructorIds = instructorIds.rstrip().replace('\n', '')
                instructorIds = instructorIds.rstrip().replace('[', '')
                instructorIds = instructorIds.rstrip().replace(']', '')
            else:
                instructorIds = ' '
            print(instructorIds)
            courseType = data['elements'][i]['courseType']
            courseType = str(courseType)
            print(courseType)
            name = data['elements'][i]['name']
            name = str(name)
            print(name)
            partnerIds = data['elements'][i]['partnerIds']
            partnerIds = str(partnerIds)
            if partnerIds:
                partnerIds = partnerIds.rstrip().replace(',', '')
                partnerIds = partnerIds.rstrip().replace('\n', '')
                partnerIds = partnerIds.rstrip().replace('[', '')
                partnerIds = partnerIds.rstrip().replace(']', '')
            else:
                partnerIds = ' '
            print(partnerIds)
            slug = data['elements'][i]['slug']
            slug = str(slug)
            print(slug)
            specializations = data['elements'][i]['specializations']
            specializations = str(specializations)
            if specializations:
                specializations = specializations.rstrip().replace(',', '')
                specializations = specializations.rstrip().replace('\n', '')
                specializations = specializations.rstrip().replace('[', '')
                specializations = specializations.rstrip().replace(']', '')
            else:
                specializations = ' '
            print(specializations)
            course_id = data['elements'][i]['id']
            course_id = str(course_id)
            print(course_id)
            description = data['elements'][i]['description']
            description = str(description)
            print(description)
            if description:
                description = description.rstrip().replace(',', '')
                description = description.rstrip().replace('\n', '')
            else:
                description = ' '
            ####################################################################
            ### writing the attributes in a csv file
            f.write(instructorIds + ',' + courseType + ',' + name + ',' + partnerIds + ',' + slug + ',' + specializations + ',' + course_id + ',' + description + "\n")
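As a side note on the CSV step: the script imports `csv` but builds each row by string concatenation, which is why commas and newlines have to be stripped from the values first. A minimal sketch of the same export using `csv.writer`, which quotes such values instead, might look like the following. The helper names `course_to_row` and `write_courses` are hypothetical, and joining list values with spaces is an assumption about the desired output, not something the question specifies.

```python
import csv
import io

# Keys read from each element of data['elements'] in the script above.
FIELDS = ["instructorIds", "courseType", "name", "partnerIds",
          "slug", "specializations", "id", "description"]
# Column titles match the original header, where 'id' is written as 'course_id'.
HEADER = ["course_id" if field == "id" else field for field in FIELDS]


def course_to_row(course):
    """Turn one course dict into a list of strings, one per column."""
    row = []
    for field in FIELDS:
        value = course.get(field, "")
        if isinstance(value, list):
            # Assumption: list values are joined with spaces into one cell.
            value = " ".join(str(item) for item in value)
        row.append(str(value))
    return row


def write_courses(courses, out):
    """Write a header row and one row per course to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    for course in courses:
        writer.writerow(course_to_row(course))


if __name__ == "__main__":
    sample = [{"id": "abc123", "name": "Intro, Part 1",
               "instructorIds": ["1", "2"], "description": "First line\nSecond line"}]
    buffer = io.StringIO()
    write_courses(sample, buffer)
    print(buffer.getvalue())
```

When writing to a real file, the `csv` module documentation recommends opening it with `newline=''`, for example `open("coursera.csv", "w", newline="", encoding="utf-8")`.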
Please suggest a way for me to scrape all of the courses.
【Discussion】:
-
It might help if you added more details about your implementation and the desired output.
-
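Sure, thanks. I want to scrape all the courses from Coursera using their API, so I run the curl command against the API to get the JSON, which returns 100 courses by default. I hope that helps. Let me know if you need anything more specific.
Tags: python-3.x curl web-scraping beautifulsoup web-crawler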
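For readers with the same goal, one way to walk through a catalog that returns a fixed-size page per request is to advance `start` by the number of items received until nothing new comes back. The sketch below rests on several assumptions that are not confirmed in this thread: that `start` is a zero-based offset, that each response carries its courses under `elements` (as in the script above), and that 100 is an accepted `limit`. `fetch_page` and `fetch_all` are hypothetical names. The loop also stops when a page contains only already-seen ids, so it cannot spin forever if the server keeps repeating the same 100 courses.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.coursera.org/api/courses.v1"  # endpoint from the curl command


def fetch_page(start, limit):
    """Request one page of the catalog and return the decoded JSON object."""
    query = urlencode(
        {"start": start, "limit": limit,
         "fields": "instructorIds,partnerIds,specializations,description"},
        safe=",",
    )
    with urlopen(BASE_URL + "?" + query) as response:
        return json.load(response)


def fetch_all(fetch=fetch_page, limit=100):
    """Collect courses page by page until a page adds no new course ids."""
    courses = {}
    start = 0
    while True:
        elements = fetch(start, limit).get("elements", [])
        seen_before = len(courses)
        for course in elements:
            courses[course["id"]] = course  # de-duplicate by course id
        if len(courses) == seen_before:
            break  # empty page, or a page made up entirely of repeats
        start += len(elements)
    return list(courses.values())


if __name__ == "__main__":
    all_courses = fetch_all()
    print(len(all_courses))
    with open("result.json", "w", encoding="utf-8") as out:
        json.dump({"elements": all_courses}, out)
```

Passing the page-fetching function as a parameter keeps the loop testable without network access; the combined result is saved in the same `{"elements": [...]}` shape that the CSV script reads from `result.json`.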