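【Posted】: 2015-12-20 21:10:49
【Question】:
I'm building a web scraper in C# that works with proxies and a large number of requests. Pages are loaded through a ConnectionManager class, which grabs a proxy and keeps retrying the page with random proxies until it loads correctly.
On average, a single task takes 100 to 300 requests, so to speed things up I wrote a method that downloads the pages concurrently on multiple threads.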
public Review[] getReviewsMultithreaded(int reviewCount)
{
    ArrayList reviewList = new ArrayList();
    int currentIndex = 0;
    int currentPage = 1;
    int totalPages = (reviewCount / 10) + 1;
    bool threadHasMoreWork = true;
    Object pageLock = new Object();
    Thread[] threads = new Thread[Program.maxScraperThreads];
    for(int i = 0; i < Program.maxScraperThreads; i++)
    {
        threads[i] = (new Thread(() =>
        {
            while (threadHasMoreWork)
            {
                HtmlDocument doc;
                lock(pageLock)
                {
                    if (currentPage <= totalPages)
                    {
                        string builtString = "http://www.example.com/reviews/" + _ID + "?pageNumber=" + currentPage;
                        //Log.WriteLine(builtString);
                        currentPage++;
                        doc = Program.conManager.loadDocument(builtString);
                    }
                    else
                    {
                        threadHasMoreWork = false;
                        continue;
                    }
                }
                try
                {
                    //Get info from page and add to list
                    reviewList.Add(cRev);
                    Log.WriteLine(_asin + " reviews scraped: " + reviewList.Count);
                }
                catch (Exception ex) { continue; }
            }
        }));
        threads[i].Start();
    }
    bool threadsAreRunning = true;
    while(threadsAreRunning) //this is in a separate thread itself, so as not to interrupt the GUI
    {
        threadsAreRunning = false;
        foreach (Thread t in threads)
            if (t.IsAlive)
            {
                threadsAreRunning = true;
                Thread.Sleep(2000);
            }
    }
    //flatten the arraylist to a primitive
    return reviewArray;
}
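However, I've noticed that the requests are still mostly processed one at a time, so the method isn't much faster than before. Could the lock be causing the problem? Or is it because the ConnectionManager is instantiated as a single object and every thread calls loadDocument on that same object?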
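One way to check whether the lock is the culprit is to hold it only while claiming a page number and make the request after releasing it. The following is a minimal sketch under stated assumptions, not the real scraper: `PageClaimSketch`, `LoadPages`, and the `LoadDocument` stand-in (which just sleeps to simulate a slow request) are hypothetical names, and `ConcurrentBag` is swapped in for the `ArrayList` so that adding results from several threads is safe.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class PageClaimSketch
{
    // Hypothetical stand-in for Program.conManager.loadDocument:
    // sleeps briefly to simulate a slow network request.
    public static string LoadDocument(string url)
    {
        Thread.Sleep(100);
        return "<html>" + url + "</html>";
    }

    public static string[] LoadPages(string id, int totalPages, int threadCount)
    {
        // Thread-safe collection, so no extra lock is needed when adding results.
        var results = new ConcurrentBag<string>();
        int currentPage = 1;
        object pageLock = new object();
        var threads = new Thread[threadCount];

        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                while (true)
                {
                    int page;
                    // Hold the lock only long enough to claim the next page number.
                    lock (pageLock)
                    {
                        if (currentPage > totalPages) return;
                        page = currentPage++;
                    }
                    // The slow request runs outside the lock, so other
                    // threads can claim and download pages at the same time.
                    string url = "http://www.example.com/reviews/" + id + "?pageNumber=" + page;
                    results.Add(LoadDocument(url));
                }
            });
            threads[i].Start();
        }

        // Join blocks until each worker finishes, replacing the IsAlive polling loop.
        foreach (Thread t in threads) t.Join();
        return results.ToArray();
    }
}
```

If the requests still run one at a time with this shape, the serialization is more likely inside the shared ConnectionManager (for example, a lock or a single connection within loadDocument) than in this method.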
【Comments】:
Tags: c# multithreading