TCLAP makes multithreaded program slower

【Question title】: TCLAP makes multithreaded program slower
【Posted】: 2014-09-05 21:11:12
【Question】:

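TCLAP is a templatized, header-only C++ library for parsing command-line arguments.

I'm using TCLAP to handle command-line arguments in a multithreaded program: the arguments are read in the main function, and then multiple threads are launched to work on the task defined by those arguments (some parametric tasks for NLP).

I've started displaying the number of words processed per second by the threads, and I found that if I hard-code the arguments in main instead of using TCLAP to read them from the command line, throughput is 6 times higher!

I'm compiling with gcc and -O2. Without TCLAP, the optimized build is about 10x faster than an unoptimized one, so it seems that using TCLAP is somehow negating part of the benefit of the compiler optimizations.

Here is the main function, which is the only place I use TCLAP: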

int main(int argc, char** argv)
{
    uint32_t mincount;
    uint32_t dim;
    uint32_t contexthalfwidth;
    uint32_t negsamples;
    uint32_t numthreads;
    uint32_t randomseed;
    string corpus_fname;
    string output_basefname;
    string vocab_fname;

    Eigen::initParallel();

    try {
        TCLAP::CmdLine cmd("Driver for various word embedding models", ' ', "0.1");
        TCLAP::ValueArg<uint32_t> dimArg("d","dimension","dimension of word representations",false,300,"uint32_t");
        TCLAP::ValueArg<uint32_t> mincountArg("m", "mincount", "required minimum occurrence count to be added to vocabulary",false,5,"uint32_t");
        TCLAP::ValueArg<uint32_t> contexthalfwidthArg("c", "contexthalfwidth", "half window size of a context frame",false,15,"uint32_t");
        TCLAP::ValueArg<uint32_t> numthreadsArg("t", "numthreads", "number of threads",false,12,"uint32_t");
        TCLAP::ValueArg<uint32_t> negsamplesArg("n", "negsamples", "number of negative samples for skipgram model",false,15,"uint32_t");
        TCLAP::ValueArg<uint32_t> randomseedArg("s", "randomseed", "seed for random number generator",false,2014,"uint32_t");
        TCLAP::UnlabeledValueArg<string> corpus_fnameArg("corpusfname", "file containing the training corpus, one paragraph or sentence per line", true, "corpus", "corpusfname");
        TCLAP::UnlabeledValueArg<string> output_basefnameArg("outputbasefname", "base filename for the learnt word embeddings", true, "wordreps-", "outputbasefname");
        TCLAP::ValueArg<string> vocab_fnameArg("v", "vocabfname", "filename for the vocabulary and word counts", false, "wordsandcounts.txt", "filename");
        // Note: negsamplesArg is never passed to cmd.add(), so it always keeps its default (15).
        cmd.add(dimArg);
        cmd.add(mincountArg);
        cmd.add(contexthalfwidthArg);
        cmd.add(numthreadsArg);
        cmd.add(randomseedArg);
        cmd.add(corpus_fnameArg);
        cmd.add(output_basefnameArg);
        cmd.add(vocab_fnameArg);
        cmd.parse(argc, argv);

        mincount = mincountArg.getValue();
        dim = dimArg.getValue();
        contexthalfwidth = contexthalfwidthArg.getValue();
        negsamples = negsamplesArg.getValue();
        numthreads = numthreadsArg.getValue();
        randomseed = randomseedArg.getValue();
        corpus_fname = corpus_fnameArg.getValue();
        output_basefname = output_basefnameArg.getValue();
        vocab_fname = vocab_fnameArg.getValue();
    }
    catch (TCLAP::ArgException &e) {};

    /*
    uint32_t mincount = 5;
    uint32_t dim = 50;
    uint32_t contexthalfwidth = 15;
    uint32_t negsamples = 15;
    uint32_t numthreads = 10;
    uint32_t randomseed = 2014;
    string corpus_fname = "imdbtrain.txt";
    string output_basefname = "wordreps-";
    string vocab_fname = "wordsandcounts.txt";
    */

    string test_fname = "imdbtest.txt";
    string output_fname = "parreps.txt";
    string countmat_fname = "counts.hdf5";
    Vocabulary * vocab;

    vocab = determineVocabulary(corpus_fname, mincount);
    vocab->dump(vocab_fname);

    Par2VecModel p2vm = Par2VecModel(corpus_fname, vocab, dim, contexthalfwidth, negsamples, randomseed);
    p2vm.learn(numthreads);
    p2vm.save(output_basefname);
    p2vm.learnparreps(test_fname, output_fname, numthreads);

}
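
One detail can be read directly off the posted main(): the TCLAP defaults are 300 for `dimension` and 12 for `numthreads`, while the commented-out hard-coded block uses dim = 50 and numthreads = 10. If the TCLAP build was run without -d and -t, the two runs were not training the same model, and 300/50 happens to equal the reported factor of 6. The question does not say which flags were passed, so this is only a possibility to rule out, not a confirmed cause. A minimal sketch of how the effective parameters could be printed and compared before timing anything; `RunConfig`, `describe`, `tclap_defaults` and `hardcoded_values` are hypothetical names, not part of the original program:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical container for the numeric run parameters used in main().
struct RunConfig {
    uint32_t mincount;
    uint32_t dim;
    uint32_t contexthalfwidth;
    uint32_t negsamples;
    uint32_t numthreads;
    uint32_t randomseed;
};

// Render every parameter on one line so two runs can be diffed by eye.
std::string describe(const RunConfig& c) {
    std::ostringstream out;
    out << "mincount=" << c.mincount
        << " dim=" << c.dim
        << " contexthalfwidth=" << c.contexthalfwidth
        << " negsamples=" << c.negsamples
        << " numthreads=" << c.numthreads
        << " randomseed=" << c.randomseed;
    return out.str();
}

// Values the TCLAP version ends up with when no optional flags are passed
// (copied from the ValueArg defaults in the posted code).
RunConfig tclap_defaults() { return {5, 300, 15, 15, 12, 2014}; }

// Values from the commented-out hard-coded block in the posted code.
RunConfig hardcoded_values() { return {5, 50, 15, 15, 10, 2014}; }
```

Printing `describe(...)` to stderr at the start of each run would make it obvious whether the two builds are being compared on identical settings.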

The only place multithreading is used is the Par2VecModel::learn function:

void Par2VecModel::learn(uint32_t numthreads) {
    thread* workers;
    workers = new thread[numthreads];
    uint64_t numwords = 0;
    bool killflag = 0;
    uint32_t randseed;

    ifstream filein(corpus_fname.c_str(), ifstream::ate | ifstream::binary);
    uint64_t filesize = filein.tellg();

    fprintf(stderr, "Total number of in vocab words to train over: %u\n", vocab->gettotalinvocabwords());

    for(uint32_t idx = 0; idx < numthreads; idx++) {
        randseed = eng();
        workers[idx] = thread(skipgram_training_thread, this, numthreads, idx, filesize, randseed, std::ref(numwords));
    }

    thread monitor(monitor_training_thread, this, numthreads, std::ref(numwords), std::ref(killflag));

    for(uint32_t idx = 0; idx < numthreads; idx++)
        workers[idx].join();

    killflag = true;
    monitor.join();
}

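This part doesn't involve TCLAP at all, so what's going on? (I'm also using C++11 features, so I compile with the -std=c++11 flag, in case that makes a difference.)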
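
Separately from the TCLAP question, the learn function above shares a plain `uint64_t` counter and a plain `bool` flag between threads through `std::ref`. Unless the worker and monitor functions (not shown in the post) add their own synchronization, that is a data race under the C++11 memory model. The following is a sketch of the same launch, monitor and join pattern with `std::atomic` and `std::vector<std::thread>`. The `worker`, `monitor` and `run` functions are hypothetical stand-ins, not the asker's code, and this is not offered as the explanation for the slowdown:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for skipgram_training_thread: "processes" a fixed number of words.
void worker(uint64_t words_per_worker, std::atomic<uint64_t>& numwords) {
    for (uint64_t i = 0; i < words_per_worker; ++i)
        numwords.fetch_add(1, std::memory_order_relaxed);
}

// Stand-in for monitor_training_thread: polls the counter until told to stop,
// then records the final value it observed.
void monitor(const std::atomic<uint64_t>& numwords,
             const std::atomic<bool>& killflag,
             uint64_t& last_seen) {
    while (!killflag.load()) {
        last_seen = numwords.load();
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    last_seen = numwords.load();
}

// Launch the workers and the monitor, join everything, return the final count.
uint64_t run(uint32_t numthreads, uint64_t words_per_worker) {
    std::atomic<uint64_t> numwords{0};
    std::atomic<bool> killflag{false};
    uint64_t last_seen = 0;

    std::vector<std::thread> workers;  // joined below, no manual new/delete
    workers.reserve(numthreads);
    for (uint32_t idx = 0; idx < numthreads; ++idx)
        workers.emplace_back(worker, words_per_worker, std::ref(numwords));

    std::thread mon(monitor, std::cref(numwords), std::cref(killflag),
                    std::ref(last_seen));

    for (auto& w : workers)
        w.join();

    killflag = true;  // all workers are done, so the counter is final
    mon.join();
    return last_seen;
}
```

In a real training loop, each worker would more likely accumulate a local count and add it to the shared counter in batches, since a per-word atomic increment from many threads contends on one cache line.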

【Question comments】:

Without seeing any of your code, it's impossible to tell.

Tags: c++ multithreading compiler-optimization

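【Answer 1】:

This has been open for quite a while, so the advice may no longer be useful, but I would first check what happens if you replace TCLAP with a "simple" parser (i.e. just pass the arguments on the command line in a specific fixed order and convert them to the right types). It is very unlikely that the problem is caused by TCLAP (i.e. I can't imagine any mechanism for this behavior). It is conceivable, however, that with hard-coded values the compiler can perform some compile-time optimizations that are not possible when the values have to be variables. Still, the size of the performance difference seems somewhat pathological, so I suspect that something else is going on.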
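
The "simple" parser suggested above could look roughly like the sketch below: arguments are read in a fixed order and converted with `std::stoul`, so the values stay runtime variables but TCLAP is out of the picture. `Params`, `to_u32` and `parse_positional` are hypothetical names, and only three of the program's parameters are shown for brevity:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical subset of the program's parameters.
struct Params {
    uint32_t dim;
    uint32_t numthreads;
    std::string corpus_fname;
};

// Convert a C string to uint32_t, rejecting signs, trailing junk and overflow.
uint32_t to_u32(const char* s) {
    if (s[0] == '-' || s[0] == '+')
        throw std::invalid_argument("expected an unsigned integer");
    std::size_t pos = 0;
    unsigned long v = std::stoul(s, &pos);  // throws on non-numeric input
    if (s[pos] != '\0' || v > UINT32_MAX)
        throw std::invalid_argument("expected a 32-bit unsigned integer");
    return static_cast<uint32_t>(v);
}

// Expects exactly: prog <dim> <numthreads> <corpus_fname>
Params parse_positional(int argc, const char* const* argv) {
    if (argc != 4)
        throw std::invalid_argument("usage: prog <dim> <numthreads> <corpus_fname>");
    return Params{to_u32(argv[1]), to_u32(argv[2]), argv[3]};
}
```

If throughput with this parser matches the TCLAP build, the library is ruled out and the difference comes from the values being runtime variables or from the values themselves differing between runs. If it matches the hard-coded build, TCLAP is worth a closer look.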

【Comments】: