【Question title】: Using Python to scrape an ASP.NET site with an ID in the URL
【Posted】: 2016-01-01 17:32:34
【Question】:

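I am trying to scrape the search results of this ASP.NET website by sending a POST request with Python requests. Even though I use a GET request to obtain the __RequestVerificationToken and include it in my headers, I only get this reply: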

{"Token":"Y2VgsmEAAwA","Link":"/search/Y2VgsmEAAwA/"}

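This is not a valid link. It points to the total search results, with none of the arrival date or area that I included in my POST request. What am I missing? How do I scrape a website like this, which generates a (session?) ID for the URL?

Thanks in advance, everyone!

My Python script: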

import json
import requests
from bs4 import BeautifulSoup

r = requests.Session()

# GET request  
gr = r.get("http://www.feline.dk")
bsObj = BeautifulSoup(gr.text,"html.parser")
auth_string = bsObj.find("input", {"name": "__RequestVerificationToken"})['value']
#print(auth_string)
#print(gr.url)

# POST request
search_request = {
    "Geography.Geography":"Danmark",
    "Geography.GeographyLong=":"Danmark (Ferieområde)",
    "Geography.Id":"da509992-0830-44bd-869d-0270ba74ff62",
    "Geography.SuggestionId": "",
    "Period.Arrival":"16-1-2016",
    "Period.Duration":7,
    "Period.ArrivalCorrection":"false",
    "Price.MinPrice":None,
    "Price.MaxPrice":None,
    "Price.MinDiscountPercentage":None,
    "Accommodation.MinPersonNumber":None,
    "Accommodation.MinBedrooms":None,
    "Accommodation.NumberOfPets":None,
    "Accommodation.MaxDistanceWater":None,
    "Accommodation.MaxDistanceShopping":None,
    "Facilities.SwimmingPool":"false",
    "Facilities.Whirlpool":"false",
    "Facilities.Sauna":"false",
    "Facilities.InternetAccess":"false",
    "Facilities.SatelliteCableTV":"false",
    "Facilities.FireplaceStove":"false",
    "Facilities.Dishwasher":"false",
    "Facilities.WashingMachine":"false",
    "Facilities.TumblerDryer":"false",
    "update":"true"
    }


payload = { 
    "searchRequestJson": json.dumps(search_request),
    }


header ={
"Accept":"application/json, text/html, */*; q=0.01",
"Accept-Encoding":"gzip, deflate",
"Accept-Language":"da-DK,da;q=0.8,en-US;q=0.6,en;q=0.4",
"Connection":"keep-alive",
"Content-Length":"720",
"Content-Type":"application/x-www-form-urlencoded; charset=UTF-8",
"Cookie":"ASP.NET_SessionId=ebkmy3bzorzm2145iwj3bxnq; __RequestVerificationToken=" + auth_string + "; aid=382a95aab250435192664e80f4d44e0f; cid=google-dk; popout=hidden; __utmt=1; __utma=1.637664197.1451565630.1451638089.1451643956.3; __utmb=1.7.10.1451643956; __utmc=1; __utmz=1.1451565630.1.1.utmgclid=CMWOra2PhsoCFQkMcwod4KALDQ|utmccn=(not%20set)|utmcmd=(not%20set)|utmctr=(not%20provided); BNI_Feline.Web.FelineHolidays=0000000000000000000000009b84f30a00000000",
"Host":"www.feline.dk",
"Origin":"http://www.feline.dk",
#"Referer":"http://www.feline.dk/search/Y2WZNDPglgHHXpe2uUwFu0r-JzExMYi6yif5KNswMDBwMDAAAA/",
"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
"X-Requested-With":"XMLHttpRequest"
 }

gr = r.post(
    url = 'http://www.feline.dk/search',
    data = payload,
    headers = header
    )

#print(gr.url)
bsObj = BeautifulSoup(gr.text,"html.parser")
print(bsObj)

【Comments】:

  • Any help? Thanks!

Tags: python asp.net python-3.x web-scraping python-requests


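【Solution 1】:

After several attempts, I found that your search request is malformed (it needs to be URL-encoded, not JSON) and that the cookie information is being overwritten in the headers (just let the session do the work).

I simplified your code as follows and got the desired result: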

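import requests
from bs4 import BeautifulSoup

# A Session keeps the cookies set by the server, so no manual Cookie header is needed
r = requests.Session()

# GET request
gr = r.get("http://www.feline.dk")
bsObj = BeautifulSoup(gr.text,"html.parser")
auth_string = bsObj.find("input", {"name": "__RequestVerificationToken"})['value']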

# POST request
search_request = "Geography.Geography=Hou&Geography.GeographyLong=Hou%2C+Danmark+(Ferieomr%C3%A5de)&Geography.Id=847fcbc5-0795-4396-9318-01e638f3b0f6&Geography.SuggestionId=&Period.Arrival=&Period.Duration=7&Period.ArrivalCorrection=False&Price.MinPrice=&Price.MaxPrice=&Price.MinDiscountPercentage=&Accommodation.MinPersonNumber=&Accommodation.MinBedrooms=&Accommodation.NumberOfPets=&Accommodation.MaxDistanceWater=&Accommodation.MaxDistanceShopping=&Facilities.SwimmingPool=false&Facilities.Whirlpool=false&Facilities.Sauna=false&Facilities.InternetAccess=false&Facilities.SatelliteCableTV=false&Facilities.FireplaceStove=false&Facilities.Dishwasher=false&Facilities.WashingMachine=false&Facilities.TumblerDryer=false"

gr = r.post(
    url = 'http://www.feline.dk/search/',
    data = search_request,
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
)

print(gr.url)

Result:

http://www.feline.dk/search/Y2U5erq-ZSr7NOfJEozPLD5v-MZkw8DAwMHAAAA/
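
The hand-written query string above can also be generated from a dictionary. The following is a minimal sketch using the standard library's urllib.parse.urlencode; the helper name build_search_body is hypothetical, and the field list is shortened from the string in the answer. Note that urlencode also percent-encodes the parentheses, which decode to the same value.

```python
from urllib.parse import urlencode, parse_qs


def build_search_body(fields):
    """Encode form fields as application/x-www-form-urlencoded text.

    None is sent as an empty value, matching the empty fields
    (e.g. "Period.Arrival=") in the answer's query string.
    """
    cleaned = {key: "" if value is None else value for key, value in fields.items()}
    return urlencode(cleaned)


# Shortened, illustrative subset of the search fields used in the answer.
search_fields = {
    "Geography.Geography": "Hou",
    "Geography.GeographyLong": "Hou, Danmark (Ferieområde)",
    "Geography.Id": "847fcbc5-0795-4396-9318-01e638f3b0f6",
    "Period.Arrival": None,
    "Period.Duration": 7,
    "Facilities.Sauna": "false",
}

body = build_search_body(search_fields)
print(body)

# Round trip: decoding the body gives back the original field values.
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["Geography.GeographyLong"][0])
```

The resulting string could be passed as data in the r.post call above, together with the same Content-Type header.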

【Comments】:

  • Thank you so much, @Gaetan. I feel stupid; I thought the problem was far more complex. Thanks again.
【Solution 2】:

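Thanks to Kantium for the answer. In my case, I found that the RequestVerificationToken was actually generated in a JS script inside the page.

1 - Call the first page, which generates the code. In my case it returned something like this in the HTML: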

<script>
    Sys.Net.WebRequestManager.add_invokingRequest(function (sender, networkRequestEventArgs) {
        var request = networkRequestEventArgs.get_webRequest();
        var headers = request.get_headers();
        headers['RequestVerificationToken'] = '546bd932b91b4cdba97335574a263e47';
    });
  
    $.ajaxSetup({
        beforeSend: function (xhr) {
            xhr.setRequestHeader("RequestVerificationToken", '546bd932b91b4cdba97335574a263e47');
        },
        complete: function (result) {
            console.log(result);
        },
    });

</script>

2 - Extract the RequestVerificationToken code, then add it to your request together with the cookie from set-cookie.

 let resp_setcookie = response.headers["set-cookie"];
 let rege = new RegExp(/(?:RequestVerificationToken", ')(\S*)'/);
 let token = rege.exec(response.body)[1];

I actually store them in a global variable, and later in my Node.js request I add them to the request object:

headers.Cookie = gCookies.cookie;
headers.RequestVerificationToken = gCookies.token;
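
For readers following the Python code in the question, steps 1 and 2 can be sketched in Python as well. This is only an illustration, not the answerer's code: the helper name extract_token and the regular expression are assumptions based on the script shown above, and the sample HTML is a trimmed copy of it.

```python
import re

# Matches: xhr.setRequestHeader("RequestVerificationToken", '<token>')
TOKEN_PATTERN = re.compile(
    r"""setRequestHeader\(\s*["']RequestVerificationToken["']\s*,\s*["']([^"']+)["']"""
)


def extract_token(html):
    """Return the token embedded in the page's script, or None if absent."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else None


# Trimmed copy of the script block returned by the first page.
sample_html = """
<script>
    $.ajaxSetup({
        beforeSend: function (xhr) {
            xhr.setRequestHeader("RequestVerificationToken", '546bd932b91b4cdba97335574a263e47');
        },
    });
</script>
"""

token = extract_token(sample_html)
# Header to send along with the follow-up request.
headers = {"RequestVerificationToken": token}
print(headers)
```

With a requests.Session, the cookies from set-cookie are kept automatically, so only this header would need to be passed to the follow-up request.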

So the final request looks like this:

Keep in mind that you can monitor the requests being sent with:

require("request-debug")(requestpromise);

Good luck!
