I'm having trouble scraping multiple URLs

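【Question title】: I'm having trouble scraping multiple URLs
【Posted】: 2021-03-15 21:19:42
【Question description】:

I'm having trouble scraping multiple URLs. Basically, I can only run it for one genre, but the second I include other links it stops working.

The goal is to get the data and put it into a CSV file containing the movie title, URL, and genre. Any help would be greatly appreciated!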

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html,"html.parser")

containers = page_soup.findAll("li",{"class":"nm-content-horizontal-row-item"})


# name the output file to write to local disk
out_filename = "netflixaction2.csv"
# header of csv file to be written
headers = "Movie_Name, Movie_ID \n"

# opens file, and writes headers
f = open(out_filename, "w")
f.write(headers)



for container in containers:
    
    title_container = container.findAll("a",{"class":"nm-collections-title nm-collections-link"})
    title_container = title_container[0].text

    movieid = container.findAll("a",{"class":"nm-collections-title nm-collections-link"})
    movieid = movieid[0].attrs['href']

    print("Movie Name: " + title_container, "\n")
    print("Movie ID: " , movieid, "\n")

    f.write(title_container + ", " + movieid + "\n")
f.close()  # Close the file

【Comments】:

Because you're using open(out_filename, "w"), it overwrites what you've written to the file. Could you try changing it to "a" instead of "w" and see whether that solves it? If so, I'll post it as a solution :)
Unfortunately it did not. I got this error: Traceback (most recent call last): File "C:\Users\tghan\autoscrape3.py", line 6, in <module> uClient = uReq(my_url) File "C:\Users\tghan\Anaconda3\lib\urllib\request.py", line 222, in urlopen return opener.open(url, data, timeout) File "C:\Users\tghan\Anaconda3\lib\urllib\request.py", line 516, in open req.timeout = timeout AttributeError: 'list' object has no attribute 'timeout' [Finished in 0.3s]

Tags: python beautifulsoup screen-scraping

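【Solution 1】:

The reason you're getting the error is that you're trying to perform a GET request on a list rather than on a single URL string.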

my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

uClient = uReq(my_url)
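To make the cause concrete, here is a minimal sketch that reproduces the failure without touching the network. urlopen accepts either a URL string or a Request object; a list is neither, so the call fails before any request is sent. The helper name try_open is hypothetical and exists only for this illustration.

```python
from urllib.request import urlopen

# Same shape as the question's my_url: a list, not a single URL string.
my_url = [
    "https://www.netflix.com/browse/genre/1365",
    "https://www.netflix.com/browse/genre/7424",
]


def try_open(target):
    """Return the message of the AttributeError urlopen raises, or None.

    Hypothetical helper for illustration. Only call it with a list here:
    passing a real URL string would perform an actual request.
    """
    try:
        urlopen(target)
    except AttributeError as exc:
        return str(exc)
    return None


# The list is treated as if it were a Request object, which it is not.
error_message = try_open(my_url)
print(error_message)
```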

What I'd suggest here is looping through each link, for example:

my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

for link in my_url:
    uClient = uReq(link)
    page_html = uClient.read()
    ....
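The loop above can be sketched as a small pair of functions. This is only an illustration under assumptions: the names fetch_html, fetch_all, and fake_fetch are made up here, and the demonstration uses a stand-in fetcher so it runs without network access. In real use you would leave the fetch argument at its default.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Download one page and return its body as bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_html):
    """Call `fetch` once per URL and return (url, html) pairs in order."""
    pages = []
    for link in urls:
        pages.append((link, fetch(link)))
    return pages


my_url = [
    "https://www.netflix.com/browse/genre/1365",
    "https://www.netflix.com/browse/genre/7424",
]


def fake_fetch(url):
    """Stand-in fetcher (assumption) so the sketch runs offline."""
    return b"<html>" + url.encode("utf-8") + b"</html>"


pages = fetch_all(my_url, fetch=fake_fetch)
for link, html in pages:
    print(link, len(html))
```

Each string in the list is passed to the fetcher on its own, which is the part the original single urlopen(my_url) call was missing.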

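Note that if you simply put your existing code inside the for loop, the file is reopened in "w" mode on every iteration, so each URL overwrites what f.write wrote for the previous one. What you need to do is:

New edit: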

import csv

import requests
from bs4 import BeautifulSoup as soup

# All given URLS
my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

# Create and open the CSV file once, before the loop
# (newline='' lets the csv module control the line endings)
with open("netflixaction2.csv", 'w', encoding='utf-8', newline='') as csv_file:
    # Headers for CSV
    headers_for_csv = ['Movie Name', 'Movie Link']

    # DictWriter maps each row dict onto the header columns
    csv_writer = csv.DictWriter(csv_file, delimiter=',', lineterminator='\n', fieldnames=headers_for_csv)
    csv_writer.writeheader()

    # We need to loop through each URL from the list
    for link in my_url:

        # Do a simple GET requests with the URL
        response = requests.get(link)

        page_soup = soup(response.text, "html.parser")

        # Find all nm-content-horizontal-row-item
        containers = page_soup.findAll("li", {"class": "nm-content-horizontal-row-item"})

        # Loop through each found "li"
        for container in containers:
            movie_name = container.text.strip()
            movie_link = container.find("a")['href']

            print(f"Movie Name: {movie_name} | Movie link: {movie_link}")

            # Write to CSV
            csv_writer.writerow({
                'Movie Name': movie_name,
                'Movie Link': movie_link,
            })

# No explicit close() is needed: the with block closes csv_file on exit

This should be your solution :) Feel free to comment if I've missed anything!
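On the overwriting point raised in this answer, here is a small offline sketch of the difference between reopening the file with "w" inside the loop and opening it once outside the loop. The movie names and the rows_per_url dictionary are placeholder data, not real scrape results.

```python
import os
import tempfile

# Placeholder rows standing in for what each URL would yield (assumption).
rows_per_url = {
    "genre/1365": ["Movie A", "Movie B"],
    "genre/7424": ["Movie C"],
}


def write_reopening_each_time(path):
    """Open with "w" inside the loop: every iteration truncates the file."""
    for rows in rows_per_url.values():
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(row + "\n")


def write_opening_once(path):
    """Open with "w" once, outside the loop: rows from all URLs are kept."""
    with open(path, "w", encoding="utf-8") as f:
        for rows in rows_per_url.values():
            for row in rows:
                f.write(row + "\n")


def read_lines(path):
    """Return the file's lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    write_reopening_each_time(path)
    reopened = read_lines(path)
    write_opening_once(path)
    opened_once = read_lines(path)

print(reopened)     # ['Movie C']
print(opened_once)  # ['Movie A', 'Movie B', 'Movie C']
```

Opening with "a" inside the loop would also keep every row, but it would keep appending across repeated runs of the script as well, which is why opening once with "w" is usually the simpler choice.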

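【Discussion】:

Thanks! However, the resulting problem is that it only gets data from the first URL in my_url and ignores any subsequent URLs.
Hi, since you haven't provided the expected result, it's hard for me to know what you expect. But I think your problem may be related to what you're trying to scrape. If you could provide more information, such as what you're trying to scrape and the expected result, it would help all of us who are trying to help :)
Sure. Currently, with only one link, the result is that I scrape the content of netflix.com/genre/browse/1365 (the action genre) and put it into a CSV file with the headers, in this case "Movie Name" and "Movie ID"; the ID would be the URL. What I'm looking to do is that same thing, however instead of having to change the (1365) and run separately, I'd like to run every URL in one go by including all of the URLs from the start, i.e. www.netflix.com/browse/genre/1365, www.netflix.com/browse/genre/7424, etc.
@AnthonyGhanime I've now updated the answer with new code that is simpler than your previous code; if there's anything you don't understand, feel free to ask. I'm still not sure what you're looking for, especially where you say: "What I'm looking to do is that same thing however instead of having to change the (1365) and run separately" - if you mean running the URLs at the same time, there is more work to do: you'd need to add threading and lock the CSV so that the writes don't overwrite each other, and so on. If this is what you were looking for, please don't forget to mark it as the answer :)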
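The threading and locking idea mentioned in the last comment could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: scrape_to_csv and fake_scrape are hypothetical names, and fake_scrape stands in for the real request and parsing step so the example runs offline. A plain sequential loop already handles several URLs; threads only matter if the pages should be fetched concurrently.

```python
import csv
import io
import threading
from concurrent.futures import ThreadPoolExecutor


def scrape_to_csv(urls, scrape_one, csv_file, max_workers=4):
    """Scrape each URL in a thread pool and write all rows to one CSV.

    `scrape_one` is a hypothetical callable that returns a list of
    (movie_name, movie_link) tuples for a single URL.
    Returns the total number of rows written (header excluded).
    """
    writer = csv.writer(csv_file, lineterminator="\n")
    writer.writerow(["Movie Name", "Movie Link"])
    write_lock = threading.Lock()

    def worker(url):
        rows = scrape_one(url)    # slow work runs outside the lock
        with write_lock:          # only one thread writes at a time
            writer.writerows(rows)
        return len(rows)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(worker, urls))


def fake_scrape(url):
    """Stand-in scraper (assumption): three made-up rows per genre URL."""
    genre = url.rsplit("/", 1)[-1]
    return [(f"Movie {genre}-{i}", f"/title/{genre}{i}") for i in range(3)]


my_url = [
    "https://www.netflix.com/browse/genre/1365",
    "https://www.netflix.com/browse/genre/7424",
]

buffer = io.StringIO()
total = scrape_to_csv(my_url, fake_scrape, buffer)
lines = buffer.getvalue().splitlines()
print(total, len(lines))
```

The lock keeps each URL's rows together, but the order in which the URLs appear in the file depends on which thread finishes first.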