BeautifulSoup scrape the first title tag in each <li>: answers

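【Question title】: BeautifulSoup scrape the first title tag in each <li>
【Posted】: 2021-05-10 19:42:59
【Question】:

I have some code that goes through the cast list of a show or movie on Wikipedia, scrapes all the actors' names, and stores them. My current code finds every <a> in the list and stores its title attribute. At the moment it looks like this: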

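import requests
from bs4 import BeautifulSoup

URL = input()
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')  # parse the downloaded page
# The "Cast" heading contains <span id="Cast">; start from its parent element
section = soup.find('span', id='Cast').parent

Stars = []
for x in section.find_next('ul').find_all('a'):
    title = x.get('title')
    print(title)
    if title is not None:
        Stars.append(title)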

While this partly works, it has two drawbacks:

It does not work if an actor has no Wikipedia page hyperlink.
It also scrapes the title of every other hyperlink it finds. For example, https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull returns ['Harrison Ford', 'Indiana Jones (character)', 'Bullwhip', 'Cate Blanchett', 'Irina Spalko', 'Bob cut', 'Rosa Klebb', 'From Russia with Love (film)', 'Karen Allen', 'Marion Ravenwood', 'Ray Winstone', 'Sallah', 'List of characters in the Indiana Jones series', 'Sexy Beast', 'Hamstring', 'Double agent', 'John Hurt', 'Ben Gunn (Treasure Island)', 'Treasure Island', 'Courier', 'Jim Broadbent', 'Marcus Brody', 'Denholm Elliott', 'Shia LaBeouf', 'List of Indiana Jones characters', 'The Young Indiana Jones Chronicles', 'Frank Darabont', 'The Lost World: Jurassic Park', 'Jeff Nathanson', 'Marlon Brando', 'The Wild One', 'Holes (film)', 'Blackboard Jungle', 'Rebel Without a Cause', 'Switchblade', 'American Graffiti', 'Rotator cuff']

Is there a way to have BeautifulSoup scrape only the first two words after each <li>? Or is there an even better solution for what I am trying to do?

【Question comments】:

x.get('title') returns a string, so you can simply split() it, keep only the first two "words", and join() them again. For example, title = ' '.join(title.split(' ')[:2]).
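
The comment's suggestion can be sketched as a small helper. The function name first_two_words is made up for illustration, and it uses split() without an argument (a small variation on the comment's split(' ')) so that repeated whitespace is ignored. Note that this only shortens each title; it does not remove titles that belong to non-actor links.

```python
def first_two_words(title):
    """Return the first two whitespace-separated words of a title string."""
    # split() with no argument collapses runs of whitespace
    return " ".join(title.split()[:2])


print(first_two_words("Harrison Ford"))
print(first_two_words("Indiana Jones (character)"))
print(first_two_words("Bullwhip"))
```

A one-word title such as 'Bullwhip' comes back unchanged, so the list would still need filtering by some other means.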

Tags: python beautifulsoup

【Solution 1】:

You can use a CSS selector to grab only the first <a> in each <li>:

for x in section.find_next('ul').select('li > a:nth-of-type(1)'):

Example

import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull#Cast'
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')
section = soup.find('span', id='Cast').parent

Stars = []
# Keep only the first <a> that is a direct child of each <li>
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
    Stars.append(x.get('title'))
Stars  # in a notebook or REPL this displays the list

Output

['Harrison Ford',
 'Cate Blanchett',
 'Karen Allen',
 'Ray Winstone',
 'John Hurt',
 'Jim Broadbent',
 'Shia LaBeouf']
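
To try the selector without a network request, here is a minimal sketch on a hand-written HTML snippet. The markup, the name 'Jane Doe', and the fallback loop are assumptions for illustration and are not part of the answer; the fallback shows one possible way to cover list items that have no link at all, by taking the first two words of the item's text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a cast list; not copied from Wikipedia.
html = """
<ul>
  <li><a href="/wiki/A" title="Harrison Ford">Harrison Ford</a> as
      <a href="/wiki/B" title="Indiana Jones (character)">Indiana Jones</a></li>
  <li><a href="/wiki/C" title="Cate Blanchett">Cate Blanchett</a> as
      <a href="/wiki/D" title="Irina Spalko">Irina Spalko</a></li>
  <li>Jane Doe as a bystander</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Selector from the answer: the first <a> directly inside each <li>.
linked = [a.get("title") for a in soup.select("li > a:nth-of-type(1)")]
print(linked)  # ['Harrison Ford', 'Cate Blanchett']

# Assumed variation: fall back to the item's text when there is no link.
stars = []
for li in soup.select("li"):
    link = li.find("a", recursive=False)  # first direct-child <a>, or None
    if link is not None and link.get("title"):
        stars.append(link["title"])
    else:
        stars.append(" ".join(li.get_text().split()[:2]))
print(stars)  # ['Harrison Ford', 'Cate Blanchett', 'Jane Doe']
```

The selector alone silently skips the unlinked item. The fallback is still fragile: if an actor has no link but the character does, the first <a> in that item would be the character's.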

【Comments】:

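【Solution 2】:

You can use a regular expression to pull all the names out of the text content of each <li/> and then take only the first two names. This also fixes the case where the actor has no Wikipedia page hyperlink.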

import re
# li_text stands for the text content of one <li>
re.findall(r"([A-Z][a-z]+) ([A-Z][a-z]+)", li_text)

Example:

text = "Cate Blanchett as Irina Spalko, a villainous Soviet agent. Screenwriter David Koepp created the character."
re.findall(r"([A-Z][a-z]+) ([A-Z][a-z]+)", text)

Output:
[('Cate', 'Blanchett'), ('Irina', 'Spalko'), ('Screenwriter', 'David')]
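
Since only the name at the start of each item is wanted, one possible refinement is to keep just the first match per <li>. The sketch below uses the same pattern; the helper name leading_name and the sample strings are assumptions for illustration.

```python
import re

# Same pattern as in the answer: two capitalized words separated by a space.
NAME_PAIR = re.compile(r"([A-Z][a-z]+) ([A-Z][a-z]+)")


def leading_name(li_text):
    """Return the first 'Capitalized Capitalized' pair in li_text, or None."""
    match = NAME_PAIR.search(li_text)
    return " ".join(match.groups()) if match else None


items = [
    "Cate Blanchett as Irina Spalko, a villainous Soviet agent.",
    "Jane Doe as a bystander",
    "an uncredited extra",
]
names = [leading_name(text) for text in items]
print(names)  # ['Cate Blanchett', 'Jane Doe', None]
```

The pattern is deliberately simple, so names with internal capitals, hyphens, initials, or more than two words are truncated or missed. For instance, 'Shia LaBeouf' would come back as 'Shia La'.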

【Comments】:

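【Solution 3】:

The HTML of the cast list varies quite a lot across film pages on Wikipedia. Perhaps you could get this information through an API instead?

For example, imdb8 allows a reasonable number of calls, and you can use it with the following endpoint

https://imdb8.p.rapidapi.com/title/get-top-cast

There also seems to be a Python IMDb API

Or choose more regular HTML. For example, if you have the IMDb movie IDs in a list, you can extract the full cast and the main cast from IMDb as shown below. To get the shorter cast list, I filter out the rows that appear at or after the text "Rest" in "Rest of cast listed alphabetically":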

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

movie_ids = ['tt0367882', 'tt7126948']
base = 'https://www.imdb.com'

with requests.Session() as s:
    for movie_id in movie_ids:
        link = f'{base}/title/{movie_id}/fullcredits?ref_=tt_cl_sm'
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
        # Full cast: every link to a name page that wraps a headshot <img>
        full_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list [href*=name]:has(img)')]
        # Main cast: skip the "Rest of cast..." label row and every row after it
        # (recent soupsieve releases deprecate :contains in favour of :-soup-contains)
        main_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list tr:not(:has(.castlist_label:contains(cast)) ~ tr, :has(.castlist_label:contains(cast))) [href*=name]:has(img)')]
        df_full = pd.DataFrame(full_cast, columns=['Actor', 'Link'])
        df_main = pd.DataFrame(main_cast, columns=['Actor', 'Link'])
        # print(df_full)
        print(df_main)

【Comments】: