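【Posted】: 2018-02-02 16:56:06
【Question】:
I am trying to scrape data from this website. I need to click on each company name and then extract the data shown on the right-hand side. I could not do this with plain requests and had to use a session to manage cookies. With requests and BeautifulSoup, I do it like this: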
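import re

import requests
from bs4 import BeautifulSoup

start_url = r"http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/SancionadosN.jsp?cmdsan=ALL&tipoqry=ALL&mostrar_msg=SI"
detail_url = "http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/FichaSinTabla.jsp"
# Assumption: `pattern` was not defined in the original snippet. This regex
# selects links whose onclick attribute contains a value of the form
# digits/digits, the same form the loop below searches for.
pattern = re.compile(r"\d+/\d+")

s = requests.Session()  # the session keeps cookies between requests
response = s.post(start_url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find_all("a", {"onclick": pattern})
onclicks = [link["onclick"] for link in links]
for element in onclicks[:10]:
    # Pull the expe value out of the onclick handler
    expe = pattern.search(element).group(0)
    # Same session, so the cookies set by the first request are sent again
    r = s.post(url=detail_url, data={"expe": expe, "tipo": "1", "persona": "3"}).text
    soup = BeautifulSoup(r, "html.parser")
    something = soup.find("p", {"class": "normal"})
    print(something)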
Now I would like to know whether the same thing is possible in Scrapy:
class Spider:
    def get_expe(self):
        # get the list of expe
        pass

    def make_requests(self):
        # use the same session and make a POST request for each expe
        pass

    def parse(self):
        # extract the data
        pass
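Of course, I don't expect you to write the spider for me. Any help figuring out how to reuse the same cookies within a session would be greatly appreciated.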
【Discussion】:
Tags: python web-scraping beautifulsoup scrapy python-requests