【问题标题】:How to speed up parsing using BeautifulSoup?如何使用 BeautifulSoup 加快解析速度?
【发布时间】:2020-03-22 01:13:37
【问题描述】:

我想制作韩国音乐节的列表,所以我尝试爬取一个销售音乐节门票的网站:

import requests
from bs4 import BeautifulSoup

INTERPARK_BASE_URL = 'http://ticket.interpark.com'

# Festival List Page
req = requests.get('http://ticket.interpark.com/TPGoodsList.asp?Ca=Liv&SubCa=Fes')
html = req.text
soup = BeautifulSoup(html, 'lxml')
for title_raw in soup.find_all('span', class_='fw_bold'):
    title = str(title_raw.find('a').text)
    url_raw = str(title_raw.find('a').get('href'))
    url = INTERPARK_BASE_URL + url_raw

    # Detail Page
    req_detail = requests.get(url)
    html_detail = req_detail.text
    soup_detail = BeautifulSoup(html_detail, 'lxml')
    details_1 = soup_detail.find('table', class_='table_goods_info')
    details_2 = soup_detail.find('ul', class_='info_Lst')
    image = soup_detail.find('div', class_='poster')

    singers = str(details_1.find_all('td')[4].text)
    place = str(details_1.find_all('td')[5].text)
    date_text = str(details_2.find('span').text)
    image_url = str(image.find('img').get('src'))

    print(title)
    print(url)
    print(singers)
    print(place)
    print(date_text)
    print(image_url)

我用for循环浏览了列表页中的所有详情页,但是加载每个详情页太慢了。

如何加快我的代码速度?

【问题讨论】:

    标签: python beautifulsoup


    【解决方案1】:
    import requests
    from bs4 import BeautifulSoup
    import json
    from datetime import datetime as dt
    import csv
    
    
    def Soup(content):
        soup = BeautifulSoup(content, 'html.parser')
        return soup
    
    
    def Main(url):
        r = requests.get(url)
        soup = Soup(r.content)
        spans = soup.findAll('span', class_='fw_bold')
        links = [f"{url[:27]}{span.a['href']}" for span in spans]
        return links
    
    
    def Parent():
        links = Main(
            "http://ticket.interpark.com/TPGoodsList.asp?Ca=Liv&SubCa=Fes")
        with open("result.csv", 'w', newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Singers", "Location", "Date", "ImageUrl"])
            with requests.Session() as req:
                for link in links:
                    r = req.get(link)
                    soup = Soup(r.content)
                    script = json.loads(
                        soup.find("script", type="application/ld+json").text)
                    name = script["name"]
                    print(f"Extracting: {name}")
                    singers = script["performer"]["name"]
                    location = script["location"]["name"]
                    datelist = list(script.values())[3:5]
                    datest = []
                    image = script["image"]
                    for date in datelist:
                        date = dt.strptime(date,
                                           '%Y%m%d').strftime('%d-%m-%Y')
                        datest.append(date)
                    writer.writerow(
                        [name, singers, location, " : ".join(datest), *image])
    
    
    Parent()
    

    Run&Check-Output-Online

    View-Output

    【讨论】:

      猜你喜欢
      • 2012-12-16
      • 1970-01-01
      • 2017-04-24
      • 2015-05-23
      • 1970-01-01
      • 2019-07-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      相关资源
      最近更新 更多