【Question title】: Parallel processed sentence generation creates garbled results
【Posted】: 2018-08-21 06:10:55
【Question】:

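I am trying to create a dataset for some neural network training. Previously I used for loops to concatenate the pieces and build the sentences, but since that took a long time, I reimplemented the sentence generation with foreach. The process is fast and finishes within 50 seconds. I am only doing slot filling on a template and pasting the pieces together into a sentence, yet the output comes out garbled (misspelled words, unexpected spaces between words, missing words, and so on).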

library(foreach)
library(doParallel)
library(tictoc)

tic("Data preparation - parallel mode")
cl <- makeCluster(3)
registerDoParallel(cl)

f_sentences<-c();sentences<-c()
hr=38:180;fl=1:5;month=1:5
strt<-Sys.time()
a<-foreach(hr=38:180,.packages = c('foreach','doParallel')) %dopar% {
  foreach(fl=1:5,.packages = c('foreach','doParallel')) %dopar%{
    foreach(month=1:5,.packages = c('foreach','doParallel')) %dopar% {
      if(hr>=35 & hr<=44){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being severly_low).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=45 & hr<=59){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being low).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=60 & hr<=100){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being medium).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=101 & hr<=150){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being high).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=151 & hr<=180){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being severly_high).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      return(outfile)
    }
    write.table(outfile,file="/home/outfile.txt",append = T,row.names = F,col.names = F)
    gc()
  }
}
stopCluster(cl)
toc()

Statistics for the file created this way:

  • Lines: 427,975
  • Split used: whitespace (" ")
  • Vocabulary: 567

    library(data.table)  # fread
    library(magrittr)    # %>%
    path<-"/home/outfile.txt"
    splitting<-" "       # the split reported in the statistics above
    File<-(fread(path,sep = "\n",header = F))[[1]]
    corpus<-tolower(File) %>%
    #removePunctuation() %>%
    strsplit(splitting) %>%
    unlist()
    vocab<-unique(corpus)

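    Sentences this simple should have a very small vocabulary, since the numbers are the only parameter that changes. When I checked the vocab output and searched the file with grep, I found many garbled words (and some missing words as well), such as wentt and crpply, inside the sentences. These should not occur, because the template is fixed.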
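    The vocabulary check described above can be narrowed down to just the tokens that the template cannot produce. The following is a minimal sketch in base R, not the asker's code: the helper `find_unexpected`, the `allowed` vector, and the two sample lines are assumptions made for illustration, and the quote stripping assumes the lines were written by `write.table` with its default quoting.

```r
# Tokens the fixed template can produce, split on a single space.
# Numbers are handled separately below.
template <- paste(
  "About soldiers died in the battle (count being severly_low).",
  "Around soldiers and civilians went missing.",
  "We only have about crates which lasts for months as food supply")
level_tokens <- c("severly_low).", "low).", "medium).", "high).", "severly_high).")
allowed <- unique(tolower(c(strsplit(template, " ", fixed = TRUE)[[1]], level_tokens)))

# Hypothetical helper: return every token that is neither a number
# nor part of the template vocabulary.
find_unexpected <- function(lines) {
  tokens <- unlist(strsplit(tolower(lines), " ", fixed = TRUE))
  tokens <- gsub('"', "", tokens, fixed = TRUE)  # drop write.table quoting
  tokens <- tokens[nzchar(tokens)]               # drop empties from double spaces
  is_number <- grepl("^[0-9]+$", tokens)
  sort(unique(tokens[!is_number & !(tokens %in% allowed)]))
}

# Illustrative input: one well-formed line and one truncated line.
sample_lines <- c(
  "\"About 40 soldiers died in the battle (count being severly_low). Around 1 soldiers and civilians went missing. We only have about 146 crates which lasts for 1 months as food supply\"",
  "\"About 73 soldiers died in the battle (count being medium). Around 1 soldiers and civilians went missing. We only have about 133 crpply\"")

unexpected <- find_unexpected(sample_lines)
print(unexpected)
```

    Applied to the lines read from the output file, the returned vector lists the garbled tokens directly, which is quicker than grepping for each one.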

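    Expected sentence
    "About 40 soldiers died in the battle (count being severly_low). Around 1 soldiers and civilians went missing. We only have about 146 crates which lasts for 1 months as food supply"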

    grep -rnw 'outfile.txt' -e 'wentt'
    24105:"About 62 soldiers died in the battle (count being medium). Around 2 soldiers and civilians wentt 117 crates which lasts for 1 months as food supply"

    grep -rnw 'outfile.txt' -e 'crpply'
    76450:"About 73 soldiers died in the battle (count being medium). Around 1 soldiers and civilians went missing. We only have about 133 crpply"

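    The first few sentences are generated correctly; the problem appears after that. What is the reason for this? I am only doing an ordinary paste with slot filling. Any help would be appreciated!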

【Comments】:

    Tags: r machine-learning foreach nlp doparallel


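    【Solution 1】:

    The code runs properly now, with no more errors. I assume the errors were caused by an earlier glitch. I also tested this on other machines with different R versions and still saw no problems.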
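    Independently of the glitch explanation above, one pattern that sidesteps the risk altogether is to build all sentences in a single process and write the file once. The thread does not confirm it, but several workers appending to the same file at the same time may interleave their output. The following is a minimal sketch under those assumptions, using only base R. The `grid`, `label`, and `out_path` names are hypothetical, and the temporary file stands in for the real output path.

```r
set.seed(1)  # only so the sampled crate counts are reproducible

# Every combination of the three slots: 143 * 5 * 5 = 3575 rows.
grid <- expand.grid(month = 1:5, fl = 1:5, hr = 38:180)

# Map hr to the same labels as the if-blocks in the question:
# 35-44, 45-59, 60-100, 101-150, 151-180.
label <- cut(grid$hr,
             breaks = c(34, 44, 59, 100, 150, 180),
             labels = c("severly_low", "low", "medium", "high", "severly_high"))

# Vectorised slot filling: one paste0 call builds all sentences at once.
sentences <- paste0(
  "About ", grid$hr, " soldiers died in the battle (count being ", label, "). ",
  "Around ", grid$fl, " soldiers and civilians went missing. ",
  "We only have about ", sample(38:180, nrow(grid), replace = TRUE),
  " crates which lasts for ", grid$month, " months as food supply")

out_path <- tempfile(fileext = ".txt")  # hypothetical output location
writeLines(sentences, out_path)         # a single write from a single process
```

    Because `paste0` is vectorised, generating the sentences this way needs neither loops nor a cluster. If foreach is still preferred, the same idea applies: let each iteration return its sentences and write the combined result once from the main process.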

    【Comments】:
