【Question title】:Can't arrange and print some fields from a webpage in some customized way
【Posted】:2020-07-26 18:57:07
【Question】:

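I've created a script to parse the movie name, all cast, Produced by and Casting By fields from this webpage. I can parse all of those fields from the page. What I can't do, once all four items are involved, is arrange and print them in a customized way. With only the movie name and cast included, the script I've written so far prints the items exactly the way I want. I'd like to include Produced by and Casting By as well, as shown in this image.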

What I've tried so far:

import requests
from bs4 import BeautifulSoup

link = 'https://www.imdb.com/title/tt0068646/fullcredits?ref_=tt_cl_sm#cast'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    movie_name = soup.select_one("h3[itemprop='name'] > a").get_text(strip=True)
    for item in soup.select("h4#cast + table.cast_list tr:has(:not(.castlist_label))"):
        performer = item.select_one("td:not(.primary_photo) > a[href^='/name/']").get_text(strip=True)
        character = ' '.join(item.select_one("td.character").text.split())
        print(movie_name,performer,character)

Output I get (movie name and cast):

The Godfather Marlon Brando Don Vito Corleone
The Godfather Al Pacino Michael Corleone
The Godfather James Caan Sonny Corleone
The Godfather Richard S. Castellano Clemenza (as Richard Castellano)
The Godfather Robert Duvall Tom Hagen
The Godfather Sterling Hayden Capt. McCluskey
The Godfather John Marley Jack Woltz
and so on----------------------

I'd like to add the following results at the bottom of the output above (taken from the two fields you see in the image, Produced by and Casting By):

The Godfather Gray Frederickson associate producer
The Godfather Al Ruddy producer (as Albert S. Ruddy) (produced by)
The Godfather Robert Evans studio executive (uncredited)
The Godfather Louis DiGiaimo (casting)
The Godfather Andrea Eastman (casting)
The Godfather Fred Roos (casting)

How can I make the script print the fields the way I've shown above?

【Comments】:

    Tags: python python-3.x web-scraping beautifulsoup


    【Solution 1】:
    import requests
    from bs4 import BeautifulSoup
    
    link = 'https://www.imdb.com/title/tt0068646/fullcredits?ref_=tt_cl_sm#cast'
    
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
        r = s.get(link)
        soup = BeautifulSoup(r.text,"lxml")
        movie_name = soup.select_one("h3[itemprop='name'] > a").get_text(strip=True)
        for item in soup.select("h4#cast + table.cast_list tr:has(:not(.castlist_label))"):
            performer = item.select_one("td:not(.primary_photo) > a[href^='/name/']").get_text(strip=True)
            character = ' '.join(item.select_one("td.character").text.split())
            print(movie_name,performer,character)
        # "Produced by" and "Casting By" use the same table layout, so one loop covers both.
        # soupsieve 2.1+ deprecates :contains in favor of :-soup-contains;
        # on older versions use :contains instead.
        for heading in ("Produced by", "Casting By"):
            for row in soup.select(f'h4:-soup-contains("{heading}") + table tr'):
                name = row.select_one('.name').get_text(strip=True)
                credit = row.select_one('.credit').get_text(strip=True)
                print(movie_name, name, credit)
    

    Prints:

    ...
    Krstný Otec Matthew Vlahakis Clemenza's Son (uncredited)
    Krstný Otec Conrad Yama Fruit Vendor (uncredited)
    Krstný Otec Gray Frederickson associate producer
    Krstný Otec Al Ruddy producer (as Albert S. Ruddy) (produced by)
    Krstný Otec Robert Evans studio executive (uncredited)
    Krstný Otec Louis DiGiaimo (casting)
    Krstný Otec Andrea Eastman (casting)
    Krstný Otec Fred Roos (casting)
    

    Note: Krstný Otec means "The Godfather" in Slovak (I received the Slovak version of the HTML because of my country's IP address).
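The localized title appears because the server picks a language for the response. A minimal sketch of one thing that may help, assuming the site honors the standard `Accept-Language` request header (an assumption, not verified against IMDb): send that header along with the `User-Agent`. The helper name `build_headers` is hypothetical and only used for illustration.

```python
# Sketch only: build_headers is a hypothetical helper, not part of requests.
# Assumption: the server respects Accept-Language when choosing the page language.

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
)


def build_headers(language="en-US,en;q=0.9"):
    """Return request headers that ask the server for content in `language`."""
    return {
        "User-Agent": USER_AGENT,
        "Accept-Language": language,
    }


headers = build_headers()
print(headers["Accept-Language"])
```

The resulting dictionary can be applied to the session in the script with `s.headers.update(build_headers())` before calling `s.get(link)`.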

    【Discussion】:

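    • Thanks for the answer, Andrej Kesely. It certainly helps, but the problem is that I'd like to print them all at once rather than calling print separately in different loops. Thanks.
    • @robots.txt Maybe I'm misunderstanding, but how do you want to print them all at once? In your question you stated that you want these results at the bottom of the output...
    • I've sorted it out. Thanks for your help.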
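The asker did not share the final approach, so the following is only one possible reading of "print them all at once": have each scraping loop append `(name, role)` pairs to a list instead of printing, then format and print everything in a single step. The function name `format_credits` and the three list names are hypothetical; the sample rows are taken from the output shown in the question.

```python
# Sketch only: format_credits and the list names below are illustrative.
# In the real script, each loop over soup.select(...) would append
# (name, role) tuples to these lists instead of calling print().


def format_credits(movie_name, *sections):
    """Flatten several lists of (name, role) pairs into printable lines."""
    lines = []
    for section in sections:
        for name, role in section:
            lines.append(f"{movie_name} {name} {role}")
    return lines


cast = [
    ("Marlon Brando", "Don Vito Corleone"),
    ("Al Pacino", "Michael Corleone"),
]
produced_by = [("Gray Frederickson", "associate producer")]
casting_by = [("Louis DiGiaimo", "(casting)")]

# A single print call emits every section in order.
print("\n".join(format_credits("The Godfather", cast, produced_by, casting_by)))
```

Keeping the collection and the output separate also makes it easy to write the same rows to a file or CSV later without touching the selectors.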