Follow link in forum to scrape thread (comments) using BS4: answers

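【Question title】: Follow link in forum to scrape thread (comments) using BS4
【Posted】: 2020-05-07 23:12:02
【Question description】:

I have a forum with 3 threads. I am trying to scrape the data from all three threads, so I need to follow each thread's href link and scrape the data. This gives me an error, and I'm not sure what I'm doing wrong...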

import csv
import time
from bs4 import BeautifulSoup
import requests

source = requests.get('https://mainforum.com').text

soup = BeautifulSoup(source, 'lxml')

#get the thread href (thread_link)
for threads in soup.find_all('p', class_= 'small'):
    thread_name = threads.text
    thread_link = threads.a.get('href')# there are three threads and this gets all 3 links
    print (thread_link)

The rest of the code is where I run into trouble:

# request the individual thread links
for follow_link in thread_link:
    response = requests.get(follow_link)

    #parse thread link
    soup= BeautifulSoup(response, 'lxml')

    #print Data
    for p in soup.find_all('p'):
        print(p)

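【Comments on the question】:

Dear Blake - if you posted the complete code, it would help in understanding the problem fully. It might also help everyone learning here (me especially) to broaden their understanding. - thanks in advance - zero
@zero What do you mean? Did I miss something?
Does it successfully navigate to the other links? What happens if you print the whole HTML document?
@TenaciousB No, not to any of the links... do it once and you've got it... I can print the hrefs just fine (the top part of the code), and that's it... I'm pretty much overwriting each link with that loop, which may be a problem, but that's something I can deal with later... What I need right now is for it to navigate to at least one of the links... The error I get is: requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
You may be missing .text in response = requests.get(follow_link)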

Tags: python beautifulsoup

【Solution 1】:

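As for your schema error...

You get the schema error because you overwrite a single link over and over, and then try to loop over that link as if it were a list of links. At that point it is a string, so you are just iterating over its characters (starting with 'h'), hence the error.

See here: requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied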
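To make the cause concrete, here is a minimal sketch of what the second loop in the question actually iterates over. The thread URLs are hypothetical placeholders, not taken from the question, and no network request is made.

```python
from urllib.parse import urlparse

# After the first loop finishes, thread_link holds only the LAST href,
# as a plain string. (Hypothetical URL, used purely for illustration.)
thread_link = "https://mainforum.com/thread-3"

# Looping over a string yields its characters one at a time.
pieces = [piece for piece in thread_link]
print(pieces[:5])

# The first "URL" handed to requests.get() is therefore just "h",
# which has no scheme, hence MissingSchema.
print(repr(urlparse(pieces[0]).scheme))

# Collecting the hrefs in a list keeps every link intact.
thread_links = []
for href in ["https://mainforum.com/thread-1", "https://mainforum.com/thread-2"]:
    thread_links.append(href)
print(thread_links)
```

Looping over `thread_links` then yields whole URLs rather than single characters.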

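As for the general question and how to approach this kind of problem...

If I were doing this, the flow would look like this:

1. Get the three hrefs (similar to what you have already done)
2. Use a function that scrapes a single thread href and returns whatever you want it to return
3. Save/append the returned information wherever you like
4. Repeat

Something like this: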

import csv
import time
from bs4 import BeautifulSoup
import requests

source = requests.get('https://mainforum.com')

soup = BeautifulSoup(source.content, 'lxml')

all_thread_info = []

def scrape_thread_link(href):
    response = requests.get(href)

    #parse thread link
    soup= BeautifulSoup(response.content, 'lxml')

    #return data
    return [p.text for p in soup.find_all('p')]

#get the thread href (thread_link)
for threads in soup.find_all('p', class_= 'small'):
    this_thread_info = {}
    this_thread_info["thread_name"] = threads.text
    this_thread_info["thread_link"] = threads.a.get('href')
    this_thread_info["thread_data"] = scrape_thread_link(this_thread_info["thread_link"])
    all_thread_info.append(this_thread_info)

print(all_thread_info)

A lot is left unspecified in the original question, so I made some assumptions. Hopefully you get the gist.

Also note that I prefer to use the response's .content rather than .text.
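The answer does not say why .content is preferred. One common reason, offered here as an assumption rather than the author's stated motive, is encoding: .text is decoded by requests using the charset from the HTTP headers, and for a text/* response without a charset it falls back to ISO-8859-1, whereas .content hands the raw bytes to BeautifulSoup so it can detect the encoding itself. The sketch below imitates that mismatch with plain bytes and makes no network call.

```python
# Bytes as a server might send them: UTF-8 encoded HTML (illustrative sample).
raw = "<p>café</p>".encode("utf-8")

# What a wrong charset guess produces: UTF-8 bytes read as ISO-8859-1.
wrong = raw.decode("iso-8859-1")

# What the page actually contains once decoded correctly.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

When the server declares its charset correctly, both attributes lead to the same parsed result.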

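【Comments】:

Hey, thanks, it seems to work fine after a few tweaks. It works for 1 link and for all three. Would you be willing to give feedback on what I changed?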

【Solution 2】:

@Darien Schettler I made some changes/tweaks to the code; if I messed up somewhere, I'd love to hear about it.

all_thread_info = []

def scrape_thread_link(href):
    response = requests.get(href)
    soup= BeautifulSoup(response.content, 'lxml')

    for Thread in soup.find_all(id= 'discussionReplies'):
        Thread_Name = Thread.find_all('div', class_='xg_user_generated')
        for Posts in Thread_Name:
            print(Posts.text)


for threads in soup.find_all('p', class_= 'small'):
    thread_name = threads.text
    thread_link = threads.a.get('href')
    thread_data = scrape_thread_link(thread_link)
    all_thread_info.append(thread_data)

【Comments】:

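You are missing a return statement in the scrape_thread_link function. I think you want to create a list and append Posts.text to it; you can then return that list. You would then use all_thread_info.extend(thread_data) instead of append. Also note that you should not name variables with a capitalized first letter. For more on naming conventions, see realpython.com/python-pep8. If this doesn't make sense, please ask another question and post a link to it as a reply to this comment. I'll answer the new question.
Also, you shouldn't post further questions as an "answer" to the original question. They will end up being deleted by moderators. Just create a new question; you can reference it in a comment.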
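The suggestions in the first comment above (build a list, return it, use extend, lowercase names) could look roughly like the sketch below. It is an illustration under assumptions: the function name extract_posts and the sample HTML are made up, the function takes HTML directly so it can run without a network request, and it uses the built-in 'html.parser' instead of the thread's 'lxml' to avoid an extra dependency.

```python
from bs4 import BeautifulSoup


def extract_posts(html):
    """Return the text of every post found in one thread page."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    # Same selectors as the code above, with lowercase variable names.
    for thread in soup.find_all(id="discussionReplies"):
        for post in thread.find_all("div", class_="xg_user_generated"):
            posts.append(post.get_text(strip=True))
    return posts  # the return statement the original function lacked


# Made-up page standing in for a real thread.
sample_html = """
<div id="discussionReplies">
  <div class="xg_user_generated">First reply</div>
  <div class="xg_user_generated">Second reply</div>
</div>
"""

all_thread_info = []
# extend() adds each post individually; append() would nest the whole list.
all_thread_info.extend(extract_posts(sample_html))
print(all_thread_info)
```

In the real scraper the same function could be given requests.get(thread_link).content for each thread link.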