How to scrape this table using Python and Beautiful Soup? Answers

【Question title】: How to scrape this table using python and beautiful soup?
【Posted】: 2020-02-24 05:08:19
【Question description】:

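I'm trying to scrape https://m.the-numbers.com/market/2018/top-grossing-movies, specifically the table, into a CSV. I'm using Python and Beautiful Soup, but I'm very new to this and would love any tips toward a solution. What are some simple ways to approach this?

Thanks

Here is my latest experiment below...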

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://m.the-numbers.com/market/2018/top-grossing-movies').text

soup = BeautifulSoup(source, 'lxml')

csv_file = open('cms_scrape.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['filmTitle', 'releasDate', 'distributor', 'genre', 'gross', 'ticketsSold'])

for tbody in soup.find_all('a', class_='table-responsive'):

    filmTitle = tbody.tr.td.b.a.text
    print(filmTitle)

    csv_writer.writerow([filmTitle])

csv_file.close()

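【Comments on the question】:

Can you share your script with us?
Sure, this is my current experiment at an intermediate stage. I've tried 4 or 5 different approaches, and I feel like I'm just misunderstanding something very basic or missing something simple. Attaching it now.

Tags: python web-scraping beautifulsoup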

【Solution 1】:

Something like the code below will do the job.


import requests
from bs4 import BeautifulSoup
import csv

# Making get request
r = requests.get('https://m.the-numbers.com/market/2018/top-grossing-movies')

# Creating BeautifulSoup object
soup = BeautifulSoup(r.text, 'lxml')

# Locating the table in the BeautifulSoup object
table_soup = soup.find('div', id='page_filling_chart').find('div', class_='table-responsive').find('table')

# Iterating through all trs in the table except the first (header) row and the last two (summary) rows
movies = []
for tr in table_soup.find_all('tr')[1:-2]:
    tds = tr.find_all('td')

    # Creating dict for each row and appending it to the movies list
    movies.append({
        'rank': tds[0].text.strip(),
        'movie': tds[1].text.strip(),
        'release_date': tds[2].text.strip(),
        'distributor': tds[3].text.strip(),
        'genre': tds[4].text.strip(),
        'gross': tds[5].text.strip(),
        'tickets_sold': tds[6].text.strip(),
    })

# Writing movies list of dicts to file using csv.DictWriter
with open('movies.csv', 'w', encoding='utf-8', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=movies[0].keys())
    writer.writeheader()
    writer.writerows(movies)

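【Discussion】:

I like how clean this is. I understand a lot of what you did, but what is the best way to collect the movie URL in this format?
You can get it by finding the a tag in the second column and accessing its href attribute: movie_link = tds[1].find('a')['href']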
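
The href lookup suggested in this discussion can be sketched on a small inline table, so it runs without network access. The HTML fragment, the /movie/... paths, and the helper name extract_movie_links are made up for illustration and are not taken from the site. The sketch also assumes the hrefs are site-relative, which is why it passes them through urljoin, and it guards against cells that have no link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the table rows in Solution 1;
# the real page's markup and link paths may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Movie</th></tr>
  <tr><td>1</td><td><b><a href="/movie/Example-One">Example One</a></b></td></tr>
  <tr><td>2</td><td><b>Example Two</b></td></tr>
</table>
"""

BASE_URL = 'https://m.the-numbers.com'  # assumed base for relative hrefs


def extract_movie_links(html, base_url=BASE_URL):
    """Return a list of (title, absolute_url_or_None), one per data row."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for tr in soup.find_all('tr')[1:]:  # skip the header row
        tds = tr.find_all('td')
        link = tds[1].find('a')
        # find() returns None when the cell has no <a>, so guard before indexing
        url = urljoin(base_url, link['href']) if link is not None else None
        results.append((tds[1].text.strip(), url))
    return results


links = extract_movie_links(SAMPLE_HTML)
```

In the loop from Solution 1, the same guarded lookup could populate an extra 'url' key in each row dict, which csv.DictWriter would then write as one more column.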

【Solution 2】:

Assuming you already have the value of source, you can do this:

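from io import StringIO
import pandas as pd
# pandas 2.1+ deprecates passing a raw HTML string, so wrap it in StringIO
df = pd.read_html(StringIO(source))[0]  # first table on the page
df.to_csv('cms_scrape.csv', index=False)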

【Discussion】: