Match error when using semi_join in sparklyr

【Question title】: Match error when using semi_join in sparklyr
【Posted】: 2019-12-10 19:08:40
【Question】:

I am trying to join two tables held as Spark data frames, matching the n-grams generated from one against a keyword list in the other.

List of articles (df_sparklyr):

id  description
1   In order to investigate the role of calcium pathway in myeloid  differentiation, the expression level of genes related to calcium pathway in all trans retinoic acid (ATRA) induced NB4 cell differentiation was detected by cDNA microarray, some of which were further confirmed by quantitative real time RT PCR. At the same time, the expressions of these genes in NB4 R1 cells treated with ATRA and 8 CPT cAM P alone or in combination, and in differentiation of primary cells from ATRA induced newly diagnosed APL patients were detected by real time RT PCR. The results showed that during differentiation of ATRA induced NB4 cells, the expressions of genes related to calcium concentration had changed, the expression of downstream effectors in calcium pathway was up regulated and confirmed by real time RT PCR assay. The expression of genes related to calcium concentration did not change significantly when NB4 R1 cells were treated by ATRA or 8 CPT cAMP alone, but expression changes of those genes were similar to the changes in ATRA induced NB4 cell differentiation when NB4 R1 cells were treated by ATRA combined with 8 CPT cAMP. In addition, the expression changes of those genes in ATRA induced primary cells of patients with APL were also similar to changes in ATRA induced NB4 cell differentiation. It is concluded that calcium pathway may be involved in ATRA induced differentiation in APL cell.
2   This study was aimed to investigate the inhibitory effect of flavonoids of puerarin (PR) in different concentrations on proliferation of 4 kinds of acute myeloid leukemia (AML) cell lines (Kasumi 1, HL 60, NB4 and U937), and to explore its possible mechanism. The MTT method was used to detected the inhibitory effect of PR on proliferation of AML cell lines. The flow cytometry was adopted to determine the change of cell cycle in vitro. The results showed that a certain concentration of PR could inhibit the proliferation of these 4 cell lines effectively in time and dose dependent manners, and the intensity of inhibition on 4 kinds of AML cell lines was from high to low as follows: NB4>Kasumi 1>U937>HL 60. Meanwhile, PR could also change cycle process, cell proportion in G1 G0 phase decreased, cells in S phase increased and Sub diploid peak also appeared. It is concluded that PR can selectively inhibit the proliferation of 4 AML cell lines and block cell cycle process, especially for NB4 cells.
3   This study was aimed to investigate the effects of flavonoids of puerarin (PR) on apoptosis of acute promyelocytic leukemia (APL) cell line NB4 cells and its mechanism. The NB4 were treated with PR in vitro, the MTT assay was used to detect the inhibitory effect of PR on cell proliferation. The apoptosis of NB4 cells were detected by flow cytometry labelled with Annexin V PI. The expressions of pml rar alpha, bcl 2 and survivin were detected by real time reverse transcription polymerase chain reaction (real time RT PCR), the expressions of JNK, p38 MAPK, FasL, caspase 3, caspase 8 were detected by Western blot. The results showed that with the increasing of PR concentrations, the apoptosis rates of NB4 cells were gradually elevated. Simultaneously, the mRNA expression of pml rar alpha, bcl 2 and survivin decreased, while the protein expression of JNK, FasL, caspase 3 and caspase 8 increased, which presented the positive correlation to PR concentrations. When PR combined with arsenic trioxide (ATO), the expression levels of above mentioned mRNA and protein decreased or increased more significantly. It is concluded that PR can effectively induce the apoptosis of NB4 cells. PR combined with ATO displays synergistic effect. It may be triggered by the activation of JNK signal pathway.

List of keywords (dict_tbl):

   [1] "3 M SYNDROME"
   [2] "3-M SYNDROME"                                                                
   [3] "3-M SYNDROME 1"                                                              
   [4] "3M SYNDROME"                                                                 
   [5] "DOLICHOSPONDYLIC DYSPLASIA"                                                  
   [6] "GLOOMY FACE SYNDROME"                                                        
   [7] "LE MERRER SYNDROME"                                                          
   [8] "THREE M SYNDROME"                                                            
   [9] "YAKUT SHORT STATURE SYNDROME"                                                
  [10] "ABDOMINAL AORTIC ANEURYSM"                                                   
  [11] "ANEURYSM ABDOMINAL AORTIC"                                                   
  [12] "AORTIC ANEURYSM ABDOMINAL"                                                   
  [13] "AORTIC ANEURYSM FAMILIAL ABDOMINAL 1"                                        
  [14] "ABSENCE EPILEPSY"                                                            
  [15] "ABSENCE SEIZURE"                                                             
  [16] "CHILDHOOD ABSENCE EPILEPSY"                                                  
  [17] "JUVENILE ABSENCE EPILEPSY"                                                   
  [18] "PETIT MAL SEIZURE"                                                           
  [19] "PYKNOLEPSY"                                                                  
  [20] "ACANTHAMOEBA INFECTION"                                                      
  [21] "ACANTHAMOEBA INFECTIONS"                                                     
  [22] "ACANTHAMOEBA KERATITIS"                                                      
  [23] "ACCOMMODATIVE SPASM"

Using the following code:

s_2 = df_sparklyr %>%
  ft_tokenizer("description", "words")%>%
  ft_ngram(input_col = "words", output_col = "ngrams")%>%
  semi_join(y = dict_tbl, by = c("ngrams" = "Keywords"))

I get the following error:

Error: org.apache.spark.sql.AnalysisException: cannot resolve '(outer() = RHS.Keywords)' due to data type mismatch: differing types in '(outer() = RHS.Keywords)' (array<string> and string).;

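【Comments】:

Please make your post more reproducible by adding sample data, for example with dput. dput(head(df, n)) with a suitable n can be handy, since a small sample is often enough to reproduce the problem.
I have added a sample.
Is dict_tbl also a Spark data frame?
Yes, both are Spark data frames.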

Tags: r sparkr sparklyr

【Solution 1】:

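You seem to be missing a few things. ft_ngram returns an array column, and Spark cannot compare an array with the string column Keywords, which is what the error reports.

1. The parameter n sets how many tokens go into each n-gram.
2. The function explode turns each row's array of n-grams into one row per n-gram.
3. For the join, it is easier to rename the column you are joining on.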
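
To see what points 1 and 2 do without needing a Spark connection, here is a minimal local sketch in plain dplyr and tidyr. The toy rows, the helper make_ngrams, and the names docs and dict are hypothetical and only mimic what ft_tokenizer, ft_ngram, and explode do in Spark; this is an illustration, not the sparklyr implementation.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for df_sparklyr and dict_tbl
docs <- tibble(
  id = 1:2,
  description = c("Childhood absence epilepsy was observed",
                  "No petit mal seizure occurred")
)
dict <- tibble(Keywords = tolower(c("ABSENCE EPILEPSY", "PETIT MAL SEIZURE")))

# Hypothetical helper: all runs of n consecutive tokens, joined by a space
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  vapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngram_rows <- docs %>%
  mutate(words  = strsplit(tolower(description), "\\s+"),   # tokenize, lowercased
         ngrams = lapply(words, make_ngrams, n = 2)) %>%    # list column: one vector per row
  select(id, ngrams) %>%
  unnest(cols = ngrams)                                     # "explode": one n-gram per row

# Now both join columns are plain strings, so semi_join works
hits <- semi_join(ngram_rows, dict, by = c("ngrams" = "Keywords"))
print(hits)
```

With n = 2 only the two-token keyword can match; the three-token keyword "petit mal seizure" is never produced as a 2-gram, which is why the choice of n matters.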

Here is the approach in detail; I hope it helps.

Step 1: create the Spark data frame

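library(sparklyr)
library(dplyr)
library(tibble)  # as_tibble(), rownames_to_column()

# Assumption: a local Spark installation; point master at your own cluster if needed
sc <- spark_connect(master = "local")

my_text = 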
'In order to investigate the role of calcium pathway in myeloid  differentiation, the expression level of genes related to calcium pathway in all trans retinoic acid (ATRA) induced NB4 cell differentiation was detected by cDNA microarray, some of which were further confirmed by quantitative real time RT PCR. At the same time, the expressions of these genes in NB4 R1 cells treated with ATRA and 8 CPT cAM P alone or in combination, and in differentiation of primary cells from ATRA induced newly diagnosed APL patients were detected by real time RT PCR. The results showed that during differentiation of ATRA induced NB4 cells, the expressions of genes related to calcium concentration had changed, the expression of downstream effectors in calcium pathway was up regulated and confirmed by real time RT PCR assay. The expression of genes related to calcium concentration did not change significantly when NB4 R1 cells were treated by ATRA or 8 CPT cAMP alone, but expression changes of those genes were similar to the changes in ATRA induced NB4 cell differentiation when NB4 R1 cells were treated by ATRA combined with 8 CPT cAMP. In addition, the expression changes of those genes in ATRA induced primary cells of patients with APL were also similar to changes in ATRA induced NB4 cell differentiation. It is concluded that calcium pathway may be involved in ATRA induced differentiation in APL cell.
This study was aimed to investigate the inhibitory effect of flavonoids of puerarin (PR) in different concentrations on proliferation of 4 kinds of acute myeloid leukemia (AML) cell lines (Kasumi 1, HL 60, NB4 and U937), and to explore its possible mechanism. The MTT method was used to detected the inhibitory effect of PR on proliferation of AML cell lines. The flow cytometry was adopted to determine the change of cell cycle in vitro. The results showed that a certain concentration of PR could inhibit the proliferation of these 4 cell lines effectively in time and dose dependent manners, and the intensity of inhibition on 4 kinds of AML cell lines was from high to low as follows: NB4>Kasumi 1>U937>HL 60. Meanwhile, PR could also change cycle process, cell proportion in G1 G0 phase decreased, cells in S phase increased and Sub diploid peak also appeared. It is concluded that PR can selectively inhibit the proliferation of 4 AML cell lines and block cell cycle process, especially for NB4 cells.
This study was aimed to investigate the effects of flavonoids of puerarin (PR) on apoptosis of acute promyelocytic leukemia (APL) cell line NB4 cells and its mechanism. The NB4 were treated with PR in vitro, the MTT assay was used to detect the inhibitory effect of PR on cell proliferation. The apoptosis of NB4 cells were detected by flow cytometry labelled with Annexin V PI. The expressions of pml rar alpha, bcl 2 and survivin were detected by real time reverse transcription polymerase chain reaction (real time RT PCR), the expressions of JNK, p38 MAPK, FasL, caspase 3, caspase 8 were detected by Western blot. The results showed that with the increasing of PR concentrations, the apoptosis rates of NB4 cells were gradually elevated. Simultaneously, the mRNA expression of pml rar alpha, bcl 2 and survivin decreased, while the protein expression of JNK, FasL, caspase 3 and caspase 8 increased, which presented the positive correlation to PR concentrations. When PR combined with arsenic trioxide (ATO), the expression levels of above mentioned mRNA and protein decreased or increased more significantly. It is concluded that PR can effectively induce the apoptosis of NB4 cells. PR combined with ATO displays synergistic effect. It may be triggered by the activation of JNK signal pathway.'


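# Split the text into one abstract per line
my_col <- my_text %>% strsplit(split = '\n') %>% unlist()

# Build a local tibble with an id column and a description column
my_df <- as.data.frame(my_col, stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  rownames_to_column('id') %>%
  rename(description = my_col)

# Copy the local data to Spark (sc is an open spark_connect() connection)
my_spark_df <- my_df %>% copy_to(sc, ., 'my_spark_df')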

Step 2: create the keyword list

key_words <- c(
"3-M SYNDROME"                                                                
,"3-M SYNDROME 1"                                                              
,"3M SYNDROME"                                                                 
,"DOLICHOSPONDYLIC DYSPLASIA"                                                  
,"GLOOMY FACE SYNDROME"                                                        
,"LE MERRER SYNDROME"                                                          
,"THREE M SYNDROME"                                                            
,"YAKUT SHORT STATURE SYNDROME"                                                
,"ABDOMINAL AORTIC ANEURYSM"                                                   
,"ANEURYSM ABDOMINAL AORTIC"                                                   
,"AORTIC ANEURYSM ABDOMINAL"                                                   
,"AORTIC ANEURYSM FAMILIAL ABDOMINAL 1"                                        
,"ABSENCE EPILEPSY"                                                            
,"ABSENCE SEIZURE"                                                             
,"CHILDHOOD ABSENCE EPILEPSY"                                                  
,"JUVENILE ABSENCE EPILEPSY"                                                   
,"PETIT MAL SEIZURE"                                                           
,"PYKNOLEPSY"                                                                  
,"ACANTHAMOEBA INFECTION"                                                      
,"ACANTHAMOEBA INFECTIONS"                                                     
,"ACANTHAMOEBA KERATITIS"                                                      
,"ACCOMMODATIVE SPASM")



# ft_tokenizer lowercases its output, so lowercase the keywords to match
key_words_spark_df <- as.data.frame(key_words, stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  mutate(key_words = tolower(key_words)) %>%
  copy_to(sc, ., 'keywords_spark')

Step 3: join

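my_spark_df %>%
  ft_tokenizer("description", "words") %>%                          # lowercase and split into tokens
  ft_ngram(input_col = "words", output_col = "ngrams", n = 2) %>%   # array of 2-token n-grams per row
  mutate(ngrams = explode(ngrams)) %>%                              # one n-gram per row
  select(id, ngrams) %>%
  rename(key_words = ngrams) %>%                                    # match the keyword column name
  inner_join(key_words_spark_df, by = "key_words")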
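
Two gaps remain in the pipeline above. With n = 2 only two-token keywords can match, while the keyword list holds entries of one to five tokens. The question also asked for semi_join, which keeps the original rows (id included) instead of returning one row per match. The following is a possible sketch under stated assumptions, not a tested solution for the asker's data: it assumes a local Spark installation, and the tables docs_demo and dict_demo, the helper explode_ngrams, and the toy rows are all hypothetical.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available
sc <- spark_connect(master = "local")

# Hypothetical toy stand-ins for df_sparklyr and dict_tbl
docs <- copy_to(sc, tibble::tibble(
  id = c(1L, 2L),
  description = c("childhood absence epilepsy was observed",
                  "no seizure occurred")
), "docs_demo", overwrite = TRUE)

dict <- copy_to(sc, tibble::tibble(
  Keywords = tolower(c("ABSENCE EPILEPSY", "CHILDHOOD ABSENCE EPILEPSY", "PYKNOLEPSY"))
), "dict_demo", overwrite = TRUE)

# Hypothetical helper: one n-gram per row for a given n, keeping the id
explode_ngrams <- function(tbl, n) {
  tbl %>%
    ft_tokenizer("description", "words") %>%
    ft_ngram(input_col = "words", output_col = "ngrams", n = n) %>%
    mutate(ngram = explode(ngrams)) %>%
    select(id, ngram)
}

# The toy keywords span 1 to 3 tokens, so stack the results for n = 1, 2, 3
all_ngrams <- Reduce(union_all, lapply(1:3, function(n) explode_ngrams(docs, n)))

# ids of documents with at least one dictionary hit
matched_ids <- all_ngrams %>%
  semi_join(dict, by = c("ngram" = "Keywords")) %>%
  distinct(id)

# Original rows with every column intact, restricted to matching documents
result <- docs %>%
  semi_join(matched_ids, by = "id") %>%
  collect()

print(result)
spark_disconnect(sc)
```

For the real keyword list the range would be 1:5. Running one tokenizer pass per n is simple but repeats work, so on large data it may be worth caching the tokenized table first.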

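【Comments】:

In step 1, my_text (that is, df_sparklyr) is already a Spark DataFrame, not a list.
Yes, but to reproduce your example I had to recreate the Spark data frame.
But I don't want to lose my id column.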