golang multithreaded web crawler runs into deadlock: answers

【Question title】: golang multithreaded web crawler runs into deadlock
【Posted】: 2022-01-09 14:09:04
【Question】:

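I have just started learning multithreaded programming in golang and am trying to write a multithreaded web crawler using BFS traversal, but I cannot get the code to work. The error I get is fatal error: all goroutines are asleep - deadlock!

I will paste the code below, but let me first explain conceptually how it works:

I have one main thread (the main function itself) and N worker threads. I deliberately chose a BFS approach with a fixed number of worker threads, because with a DFS approach it seems I would have to spawn a new thread for every new URL to crawl, which could become a huge context-switching burden.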

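I am using two channels:

urlsToCrawl: the main thread sends URLs to be crawled to the worker threads.
urlsDiscovered: the worker threads send the URLs they discover back to the main thread.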

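Here is the implementation. I have removed some irrelevant details (for example, how the HTML pages are parsed).

The trick I am attempting is this: I use the channels as the BFS queue, and when the queue is empty there is no way to tell whether that is because "A. there really are no more URLs to crawl" or because "B. some worker threads are still working, so more URLs to crawl may arrive soon". So I introduced the count variable: count is incremented whenever a new URL is sent to a worker for crawling, and decremented whenever the main thread receives a worker's result. When count == 0 and the channel is empty, that means "A. there really are no more URLs to crawl"; otherwise, when count > 0 and the channel is empty, that means "B. some worker threads are still working, so more URLs to crawl may arrive soon".

However, as I mentioned, this does not seem to work and I run into a deadlock. Can anyone shed some light on this? Thanks!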

package main

import (
    "fmt"
)

var (
    count = 0   // This tracks how many worker threads are actively working right now
)

func crawlUrl(urlsToCrawl chan string, urlsDiscovered chan []string) {
    for url := range urlsToCrawl {
        urls := getUrls(url)  // Returns a []string (empty if no URL is found); its definition is omitted from the post
        urlsDiscovered <- urls
    }
}

func main() {
    urlsToCrawl := make(chan string)
    urlsDiscovered := make(chan []string)
    i := 0
    for i < 8 {
        go crawlUrl(urlsToCrawl, urlsDiscovered)
        i++
    }
    visited := map[string]bool{"some_seed_url": true}
    count++
    urlsToCrawl <- "some_seed_url"
    for urls := range urlsDiscovered {
        count--  // One message is received by master, meaning one worker thread has finished a job item, therefore decrementing count
        for _, url := range urls {
            _, ok := visited[url]
            if ok {
                continue  // This URL has been crawled before
            }
            visited[url] = true
            count ++  // One more work item will be sent to worker, therefore first increment count
            urlsToCrawl <- url
        }
        if count == 0 {
            close(urlsDiscovered)
            close(urlsToCrawl)
            break
        }  // else some worker must be working so let's wait to see if there is new msg coming through the channel
    }
}

【Question comments】:

See stackoverflow.com/search?tab=relevance&q=%5bgo%5d%20crawl

Tags: multithreading go channel

【Answer 1】:

Try using sync.WaitGroup.
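The answer gives no code, so the following is only one possible reading of it: a minimal sketch in which a sync.WaitGroup replaces the hand-maintained count. The links map, the crawl and enqueue names, and the idea of handing each URL to the workers from a short-lived goroutine are all assumptions made for illustration, not part of the original post.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// links is a hypothetical in-memory link graph standing in for getUrls.
var links = map[string][]string{
	"a": {"b", "c"},
	"b": {"c", "d"},
	"c": {"a"},
	"d": {},
}

// crawl visits every URL reachable from seed using a fixed number of workers.
func crawl(seed string, workers int) []string {
	var mu sync.Mutex          // guards visited
	var pending sync.WaitGroup // one count per URL queued but not yet processed
	visited := map[string]bool{seed: true}
	jobs := make(chan string)

	// enqueue registers the URL and sends it from its own goroutine, so the
	// caller (possibly a worker) never blocks on the unbuffered channel.
	enqueue := func(url string) {
		pending.Add(1)
		go func() { jobs <- url }()
	}

	for i := 0; i < workers; i++ {
		go func() {
			for url := range jobs {
				for _, next := range links[url] {
					mu.Lock()
					seen := visited[next]
					visited[next] = true
					mu.Unlock()
					if !seen {
						enqueue(next)
					}
				}
				pending.Done() // this URL is fully processed
			}
		}()
	}

	enqueue(seed)
	pending.Wait() // returns once every queued URL has been processed
	close(jobs)    // lets the workers leave their range loops

	result := make([]string, 0, len(visited))
	for url := range visited {
		result = append(result, url)
	}
	sort.Strings(result)
	return result
}

func main() {
	fmt.Println(crawl("a", 3)) // [a b c d]
}
```

Each new URL is added to the WaitGroup before the URL that discovered it calls Done, so the counter cannot reach zero while work is still outstanding. The cost is one short-lived goroutine per queued URL, which only performs a send.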

【Comments】:

【Answer 2】:

Your channels are unbuffered:

urlsToCrawl := make(chan string)
urlsDiscovered := make(chan []string)

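As a result, a goroutine that reads from or writes to one of these channels blocks until a goroutine on the other side performs the opposite operation.

So you start 8 crawlUrl goroutines, all of which block reading from urlsToCrawl, which means main can send 8 URLs before it blocks. A crawlUrl goroutine that has a result then blocks until main reads from urlsDiscovered. So if you have more than 8 URLs, all goroutines end up waiting for each other (deadlock).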

The way to fix this is to use buffered channels with a capacity you are unlikely to exceed:

urlsToCrawl := make(chan string, 1000)
urlsDiscovered := make(chan []string, 100)
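As a rough sketch of how the question's count-based loop could look with buffered channels: the links map stands in for getUrls, and the worker and crawl names and the capacity parameter are assumptions for illustration. The sketch relies on capacity being at least the total number of URLs, so that no send ever blocks.

```go
package main

import (
	"fmt"
	"sort"
)

// links is a hypothetical in-memory link graph standing in for getUrls.
var links = map[string][]string{
	"a": {"b", "c"},
	"b": {"c", "d"},
	"c": {"a"},
	"d": {},
}

// worker crawls each URL it receives and reports the links it found.
func worker(urlsToCrawl <-chan string, urlsDiscovered chan<- []string) {
	for url := range urlsToCrawl {
		urlsDiscovered <- links[url]
	}
}

// crawl runs the BFS with buffered channels. capacity is assumed to be at
// least the total number of URLs, so neither side blocks on a send.
func crawl(seed string, workers, capacity int) []string {
	urlsToCrawl := make(chan string, capacity)
	urlsDiscovered := make(chan []string, capacity)
	for i := 0; i < workers; i++ {
		go worker(urlsToCrawl, urlsDiscovered)
	}

	visited := map[string]bool{seed: true}
	count := 1 // URLs sent to workers whose results have not been received yet
	urlsToCrawl <- seed
	for count > 0 {
		urls := <-urlsDiscovered
		count--
		for _, url := range urls {
			if visited[url] {
				continue // already queued or crawled
			}
			visited[url] = true
			count++
			urlsToCrawl <- url
		}
	}
	close(urlsToCrawl) // nothing is outstanding, so the workers can exit

	result := make([]string, 0, len(visited))
	for url := range visited {
		result = append(result, url)
	}
	sort.Strings(result)
	return result
}

func main() {
	fmt.Println(crawl("a", 8, 1000)) // [a b c d]
}
```

If the number of URLs can exceed capacity, this version can still block in the same way as the unbuffered one.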

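If you expect that you might still exceed the channels' capacity in extreme cases, you can perform non-blocking operations; for example, if the channel is full, you can drop the discovered URLs instead of blocking.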

select {
case urlsDiscovered <- urls:
  // on success (url written)
default:
  // channel is full, can't write without blocking
}
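A small runnable sketch of that non-blocking send, wrapped in a helper so its effect can be observed. The offer name and the capacity of 1 are assumptions chosen to keep the example short.

```go
package main

import "fmt"

// offer tries to send urls without blocking and reports whether it was sent.
func offer(urlsDiscovered chan<- []string, urls []string) bool {
	select {
	case urlsDiscovered <- urls:
		return true // the batch was written to the channel
	default:
		return false // the channel is full; the caller drops this batch
	}
}

func main() {
	urlsDiscovered := make(chan []string, 1)

	fmt.Println(offer(urlsDiscovered, []string{"a"})) // true: the buffer had room
	fmt.Println(offer(urlsDiscovered, []string{"b"})) // false: the buffer is full
	<-urlsDiscovered                                  // a receiver frees the slot
	fmt.Println(offer(urlsDiscovered, []string{"c"})) // true again
}
```

Note that with the count bookkeeping from the question, a dropped batch means main never receives that worker's result, so count would not reach zero unless the drop is accounted for separately.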

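【Comments】:

Thank you very much! So basically main is blocked waiting for some worker thread to receive a message, while all the worker threads are blocked waiting for the main thread to receive the crawl results.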