【Question title】: Shifting Python code written in Selenium to Scrapy or requests
【Posted】: 2020-10-27 07:07:07
【Question description】:

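I have working code in Selenium. It runs fine: it scrapes the portal and extracts the data from the table. Now I am trying to move to either Scrapy or requests. I tried learning both and failed. The Selenium structure fits the way I think, and it would take me a long time to learn the basics of requests or Scrapy before I could use them. The shortcut would be some tips that relate directly to my current code.

Why am I shifting? I posted the code to ask for refactoring suggestions (here). Two of the comments suggested that I move to requests, which started this effort. After some initial searching I realized that I can avoid Selenium, and that requests or Scrapy could save me a lot of time.

I checked here, but it does not solve my problem.

Can someone help? Thanks in advance.

The code (including the URL):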

from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, \
    TimeoutException, StaleElementReferenceException, WebDriverException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from FIR_logging import logger
import os
import time
import pandas as pd


# base function

def get_url(some_url):
    while True:
        try:
            driver.get(some_url)
            break
        except WebDriverException:
            time.sleep(60)
            continue
    driver.refresh()



# Some constants:

URL = r'https://www.mhpolice.maharashtra.gov.in/Citizen/MH/PublishedFIRs.aspx'
options = FirefoxOptions()
options.add_argument("--headless")
options.add_argument("--private-window")
driver = webdriver.Firefox(options=options)
get_url(URL)
time.sleep(10)

Download_Directory = r'/some_directory/raw_footage7'

COLUMNS = ['Sr.No.', 'State', 'District', 'Police Station', 'Year', 'FIR No.', 'Registration Date', 'FIR No',
           'Sections']

ALL_Districts = ['AKOLA', 'AMRAVATI CITY', 'AMRAVATI RURAL', 'AURANGABAD CITY',
                 'AURANGABAD RURAL', 'BEED', 'BHANDARA', 'BRIHAN MUMBAI CITY', 'BULDHANA',
                 'CHANDRAPUR', 'DHULE', 'GADCHIROLI', 'GONDIA', 'HINGOLI', 'JALGAON', 'JALNA',
                 'KOLHAPUR', 'LATUR', 'NAGPUR CITY', 'NAGPUR RURAL', 'NANDED', 'NANDURBAR',
                 'NASHIK CITY', 'NASHIK RURAL', 'NAVI MUMBAI', 'OSMANABAD', 'PALGHAR', 'PARBHANI',
                 'PIMPRI-CHINCHWAD', 'PUNE CITY', 'PUNE RURAL', 'RAIGAD', 'RAILWAY AURANGABAD',
                 'RAILWAY MUMBAI', 'RAILWAY NAGPUR', 'RAILWAY PUNE', 'RATNAGIRI', 'SANGLI', 'SATARA',
                 'SINDHUDURG', 'SOLAPUR CITY', 'SOLAPUR RURAL', 'THANE CITY', 'THANE RURAL', 'WARDHA',
                 'WASHIM', 'YAVATMAL']


# other functions


def district_selection(name):
    dist_list = Select(driver.find_element_by_css_selector(
        "#ContentPlaceHolder1_ddlDistrict"))
    dist_list_options = dist_list.options
    # collect the visible text of every option except the 'Select' placeholder
    names = [o.get_attribute("text")
             for o in dist_list_options if o.get_attribute("text") != 'Select']
    if name not in names:
        logger.info(f"{name} is not in list")
        return False
    dist_list.select_by_visible_text(name)
    time.sleep(8)


def enter_date(date):
    # enters start as well as end dates with "action chains."
    WebDriverWait(driver, 160).until(
        EC.presence_of_element_located((By.CSS_SELECTOR,
                                        '#ContentPlaceHolder1_txtDateOfRegistrationFrom')))
    from_date_field = driver.find_element_by_css_selector(
        '#ContentPlaceHolder1_txtDateOfRegistrationFrom')

    to_date_field = driver.find_element_by_css_selector(
        '#ContentPlaceHolder1_txtDateOfRegistrationTo')

    ActionChains(driver).click(from_date_field).send_keys(
        date).move_to_element(to_date_field).click().send_keys(
        date).perform()

    logger.info(f'date entered: {date}')


def search():
    driver.find_element_by_css_selector('#ContentPlaceHolder1_btnSearch').click()


def number_of_records():
    """Read the record-count label and convert it to an integer.
    Return the count, or False when it is 0.
    Retry while the element is missing or stale (page not loaded yet)."""
    time_counter = 1
    while time_counter < 19:
        try:
            records_number = driver.find_element_by_css_selector(
                '#ContentPlaceHolder1_lbltotalrecord').text
            if records_number == '':
                # count empty reads too, so a label that never fills cannot loop forever
                time_counter += 1
                time.sleep(1)
                continue
            else:
                records_number = int(records_number)
            if records_number != 0:
                logger.info(f"{district}: {records_number}")

                return records_number
            else:
                logger.info(f"no records @ {district}")
                return False
        except (NoSuchElementException, TimeoutException, StaleElementReferenceException):
            logger.info("page is not loaded")
            time_counter += 1
            continue


def extract_table_current(name, single):
    # entire table of record to be taken to the list.
    soup = BS(driver.page_source, 'html.parser')
    main_table = soup.find("table", {"id": "ContentPlaceHolder1_gdvDeadBody"})
    time_counter = 1
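    while main_table is None:
        if time_counter < 16:
            logger.info(f"the table did not load @ {name}")
            time_counter += 1
            # wait, then re-read the page so a table that loads late is picked up
            time.sleep(1)
            soup = BS(driver.page_source, 'html.parser')
            main_table = soup.find("table", {"id": "ContentPlaceHolder1_gdvDeadBody"})
        else:
            logger.info(f"the table did not load @ {name}."
                        f"stopped trying")
            return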
    links_for_pages = driver.find_elements_by_css_selector('.gridPager a')
    rows = main_table.find_all("tr")
    if not links_for_pages:  # find_elements returns a list (possibly empty), never None

        for row in rows:
            time.sleep(8)
            if '...' not in row.text:
                cells = row.find_all('td')
                cells = cells[0:9]  # drop the last column
                # store data in list
                single.append([cell.text for cell in cells])
    else:
        for row in rows[0:(len(rows)) - 2]:
            time.sleep(8)
            cells = row.find_all('td')
            cells = cells[0:9]  # drop the last column

            # store data in list
            single.append([cell.text for cell in cells])


def next_page(name, data):
    # check if any link to next page is available
    # iterate every page.
    try:
        driver.find_element_by_css_selector('.gridPager a')
    except NoSuchElementException:
        return False
    links_for_pages = driver.find_elements_by_css_selector('.gridPager a')
    for page in range(len(links_for_pages)):
        # fetch the links again on every pass to avoid a stale element exception
        links_for_pages_new = driver.find_elements_by_css_selector('.gridPager a')
        # do not click on link for new page slot
        if links_for_pages_new[page].text != '...':
            links_for_pages_new[page].click()
            # if this can be replaced with some other wait method to save the time
            time.sleep(8)
            extract_table_current(name, data)


def second_page_slot():
    # find specific link for going to page 11 and click.
    try:
        link_for_page_slot = driver.find_element_by_link_text('...')
        link_for_page_slot.click()
    except NoSuchElementException:
        return False


# main code

page_data = []

time.sleep(5)
view = Select(driver.find_element_by_css_selector(
    '#ContentPlaceHolder1_ucRecordView_ddlPageSize'))
view.select_by_value('50')
driver.close()
for district in ALL_Districts:

    b = "06"
    c = "2020"
    district_directory = os.path.join(Download_Directory, f'{district}{b}{c}')
    if not os.path.exists(district_directory):
        os.mkdir(district_directory)
    for i in range(1, 30):
        # reopening the page to wipe out the cache.
        options = FirefoxOptions()
        options.add_argument("--headless")
        options.add_argument("--private-window")
        driver = webdriver.Firefox(options=options)
        get_url(URL)
        # entering date and assuring that 01 to 09 is entered correctly
        if i < 10:
            i = f'0{i}'
        date_from = str(i) + b + c
        enter_date(date_from)
        # select district
        district_selection(district)
        time.sleep(3)
        # start the search
        search()
        time.sleep(7)
        if not number_of_records():
            continue
        extract_table_current(district, page_data)
        time.sleep(3)
        if not next_page(district, page_data):
            district_data = pd.DataFrame(page_data, columns=COLUMNS)
            district_data.to_csv(os.path.join(district_directory, f'{district}{i}{b}{c}.csv'))
            continue
        extract_table_current(district, page_data)
        district_data = pd.DataFrame(page_data, columns=COLUMNS)
        district_data.to_csv(os.path.join(district_directory, f'{district}{i}{b}{c}.csv'))
        driver.close()
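
A note on the date loop above: `range(1, 30)` stops at day 29, while June 2020 has 30 days. If the whole month is intended, the date strings can be generated from the calendar instead of a hard-coded range. This is a minimal sketch; `month_dates` is a hypothetical helper name, and the `ddmmyyyy` format is the one the loop already builds.

```python
import calendar


def month_dates(year, month):
    """Return every day of the month formatted as ddmmyyyy, e.g. '01062020'."""
    days_in_month = calendar.monthrange(year, month)[1]
    return [f"{day:02d}{month:02d}{year}" for day in range(1, days_in_month + 1)]


dates = month_dates(2020, 6)
print(dates[0], dates[-1], len(dates))
```

Each string can be passed to `enter_date` as is, which also removes the need for the manual zero padding.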

【Question discussion】:

    Tags: python python-3.x selenium web-scraping scrapy


    【Solution 1】:

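    Requests is a very nice, simple yet powerful package. Once you have learned it, you will be glad you did :) You can use requests to navigate pages, and sometimes even to log in or send messages.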

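    I don't know Scrapy, but I have been using BeautifulSoup, and it is also fairly simple to learn: you get the "soup" of data from the response that requests returns, then use BS to filter out the data you need.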
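
As a rough illustration of "filtering with BS", the sketch below pulls the cell text out of a results table. Only the table id `ContentPlaceHolder1_gdvDeadBody` comes from the question's code; the sample markup and the helper name `table_rows` are made up for the example, and the real page may nest a pager table inside the grid that would need extra handling.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the table id is taken from the question's code.
SAMPLE_HTML = """
<table id="ContentPlaceHolder1_gdvDeadBody">
  <tr><th>Sr.No.</th><th>State</th><th>District</th></tr>
  <tr><td>1</td><td>MAHARASHTRA</td><td>AKOLA</td></tr>
  <tr><td>2</td><td>MAHARASHTRA</td><td>BEED</td></tr>
</table>
"""


def table_rows(html, table_id="ContentPlaceHolder1_gdvDeadBody"):
    """Return the text of every <td> cell, one list per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # a header row holds only <th> cells, so it yields nothing
            rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML)
print(rows)
```

The resulting list of lists can be handed to `pd.DataFrame(rows, columns=...)` in the same way the Selenium version does.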

    My advice is to start from scratch and go one step at a time.

    Start by fetching your page, then extract your data bit by bit :)

    import requests
    from bs4 import BeautifulSoup
    page = requests.get('https://www.mhpolice.maharashtra.gov.in/Citizen/MH/PublishedFIRs.aspx')
    soup = BeautifulSoup(page.text, 'lxml')  # needs the lxml package; 'html.parser' also works
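
A plain GET only returns the empty search form. The URL ends in `.aspx`, and ASP.NET Web Forms pages usually keep their state in hidden inputs such as `__VIEWSTATE` and `__EVENTVALIDATION` that have to be sent back with the search. Whether this particular page works that way is an assumption, and so are the three `ctl00$...` field names, the district value and the date format below: the real ones have to be read from the page source. The sketch uses only the standard library and a made-up sample form.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML."""
    parser = HiddenInputCollector()
    parser.feed(html)
    return parser.fields


def build_search_payload(html, district_value, date_text):
    """Merge the hidden state fields with the visible search fields.

    The ctl00$... names are assumptions, not taken from the real page.
    """
    payload = hidden_fields(html)
    payload.update({
        "ctl00$ContentPlaceHolder1$ddlDistrict": district_value,
        "ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom": date_text,
        "ctl00$ContentPlaceHolder1$txtDateOfRegistrationTo": date_text,
    })
    return payload


# Hypothetical sample form standing in for the downloaded page.
SAMPLE_HTML = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom" />
</form>
"""

payload = build_search_payload(SAMPLE_HTML, "19372", "01/06/2020")
print(sorted(payload))
```

With requests, the usual pattern would be to fetch the page through a `requests.Session()`, build the payload from the response text, and send it with `session.post(url, data=payload)`. The search button's own name/value pair typically has to be part of the payload too.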
    

    【Comments】:

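    • Tried it here and ran into a problem: requests.exceptions.SSLError: HTTPSConnectionPool(host='www.mhpolice.maharashtra.gov.in', port=443): Max retries exceeded with url: /Citizen/MH/PublishedFIRs.aspx (Caused by SSLError(SSLCertVerificationError("hostname 'www.mhpolice.maharashtra.gov.in' doesn't match 'citizen.mahapolice.gov.in'")))
    • It looks like they are monitoring the number of requests an IP address sends to the server and blocking you after too many. You will probably need to wait a while until the block on your IP is lifted.
    • Should I try something like changing the IP address to get around the block? Is there such an option?
    • Yes, of course you can, but I don't think it is necessary. Try not to send 100 requests within 5 minutes (that is too many) :) If you need a proxy, there are plenty available on the internet. But as I said, I think that goes beyond what you need.
    • I only ran the code, the simple code you gave, once, and I got this error!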
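
The text of the SSLError quoted in the first comment points to a certificate hostname mismatch (the certificate names `citizen.mahapolice.gov.in`, not the host in the URL), which is a different failure from being rate-limited and would explain why it appears on the very first request. One thing to try is the same path on the hostname the certificate names. That the page is actually served there is an untested assumption, and `with_host` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

URL = "https://www.mhpolice.maharashtra.gov.in/Citizen/MH/PublishedFIRs.aspx"
CERT_HOST = "citizen.mahapolice.gov.in"  # hostname quoted in the SSLError message


def with_host(url, new_host):
    """Return url with its host replaced by new_host, keeping path and query."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=new_host))


candidate_url = with_host(URL, CERT_HOST)
print(candidate_url)
```

`requests.get(candidate_url)` can then be tried in place of the original call. Passing `verify=False` to `requests.get` also silences the error, but it turns certificate checking off entirely, so it is best kept to a quick experiment.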