How to count the occurrences of "c(\" in a string in a data frame in R?

【Question title】: How to count the occurrences of "c(\" in a string in a data frame in R?
【Posted】: 2021-12-29 23:34:31
【Question description】:

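I have a data frame in which some columns contain error and warning messages from Mplus. The text is saved in an odd format, so rather than trying to parse each message I would like to count the messages by counting the occurrences of c(\ in a cell, since that is the only character combination that appears before every warning or error.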

For example, one cell contains the messages:

[[1]]
[1] "c(\"All variables are uncorrelated with all other variables within class.\""
[2] " \"Check that this is what is intended.\""                                  
[3] " \"1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS\")"                         
[4] " c(\"WARNING:  THE BEST LOGLIKELIHOOD VALUE WAS NOT REPLICATED.  THE\""     
[5] " \"SOLUTION MAY NOT BE TRUSTWORTHY DUE TO LOCAL MAXIMA.  INCREASE THE\""    
[6] " \"NUMBER OF RANDOM STARTS.\")"

while another contains a shorter message like this:

[[1]]
[1] "c(\"All variables are uncorrelated with all other variables within class.\""
[2] " \"Check that this is what is intended.\""                                  
[3] " \"1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS\")"

I have tried using str_count in a few different ways, including my most recent attempt:

    str_count(test#, '//c(\//')

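But I get the error: Error: '\/' is an unrecognized escape in character string starting "'//c(\/". Ideally, the first example would return 2 and the second would return 1.

How can I count the occurrences of this unique string when it contains characters that cannot be quoted or escaped?

Here is some easy-to-use test code to try it out! (Note that here test1 is the shorter string, with one message, and test2 is the longer one, with two.)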

test1 <- '"c(\"All variables are uncorrelated with all other variables within class.\"" " \"Check that this is what is intended.\"" " \"1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS\")"'

test2 <- '"c(\"All variables are uncorrelated with all other variables within class.\"" " \"Check that this is what is intended.\"" " \"1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS\")" " c(\"WARNING:  THE BEST LOGLIKELIHOOD VALUE WAS NOT REPLICATED.  THE\"" " \"SOLUTION MAY NOT BE TRUSTWORTHY DUE TO LOCAL MAXIMA.  INCREASE THE\"" " \"NUMBER OF RANDOM STARTS.\")"'

【Question comments】:

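Not a solution to your problem, but have you considered using lavaan to do SEM directly in R?
It seems to me it might be easier to reduce the problem to just finding c(, which you can do with: str_count(test1, "c\\(")
This looks like a poorly constructed data.frame; it would be better to keep the original "list of character vectors" format (or is it more complicated than that?) and use something like lengths(), along the lines of df = data.frame(x = 1:2); df$y = list(c("a", "b"), "d"); lengths(df$y).
We looked at lavaan, but something about the estimators or the overall input options made my advisor think Mplus was the best choice, so at this point it is out of my control. @deschen
@D.J That actually works nicely. I guess I did not fully understand how escaping works; both ( and \ were giving me a lot of trouble.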
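On the escaping confusion raised in the comments: a minimal base R sketch, assuming the cell text looks like the test strings in the question. It suggests that the backslash shown by print() is only display escaping, so the data itself holds c(" with no backslash, and that the ( is the character that needs attention in a regular expression. The names msg and regex_pattern are made up for illustration.

```r
# A single-quoted literal with \" stores a plain double quote, not a backslash.
msg <- 'c(\"All variables are uncorrelated\")'

# The first three characters are c, ( and " (print() would display the " as \").
first_three <- substr(msg, 1, 3)
nchar(first_three)

# No backslash is actually stored in the string.
has_backslash <- grepl("\\", msg, fixed = TRUE)

# In a regex, "(" opens a group, so it is escaped as \( ;
# inside an R string literal that backslash is written as \\.
regex_pattern <- "c\\("
nchar(regex_pattern)  # c, \, ( are three characters

# Two equivalent ways to look for a literal c( :
found_regex <- grepl(regex_pattern, msg)     # escaped regular expression
found_fixed <- grepl("c(", msg, fixed = TRUE)  # literal match, no escaping
```

With fixed = TRUE the pattern is taken literally, which avoids the escaping question entirely; the regex form is only needed if other regex features are wanted.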

Tags: r string count

【Solution 1】:

As in my comment, you could shorten the part to be counted:

library(stringr)  # provides str_count() and fixed()
str_count(test1, "c\\(")

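Or you can lengthen the pattern to match c(\" and wrap it in fixed(), so it is treated as a literal string rather than a regular expression: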

str_count(test1, fixed('c(\"'))

As you can see, both approaches give the correct answer:

string1 <- 'c(\"All variables are uncorrelated with all other variables within class.\"" 
             " \"Check that this is what is intended.\"" 
             " \"1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS\")" 
             " c(\"WARNING:  THE BEST LOGLIKELIHOOD VALUE WAS NOT REPLICATED. 
             THE\"" " \"SOLUTION MAY NOT BE TRUSTWORTHY DUE TO LOCAL MAXIMA.  INCREASE THE\""
             " \"NUMBER OF RANDOM STARTS.\")'

> str_count(string1, fixed('c(\"'))
[1] 2
> str_count(string1, "c\\(")
[1] 2
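Since the question is about a data frame column rather than a single string, here is a minimal sketch of the same call applied column-wise. The data frame msgs and its columns model, warnings and n_messages are hypothetical names, and the message text is shortened from the question's examples; it assumes the stringr package is installed.

```r
library(stringr)  # provides str_count() and fixed()

# Hypothetical data frame; the column names are made up for illustration.
msgs <- data.frame(
  model = c("m1", "m2", "m3"),
  warnings = c(
    'c("All variables are uncorrelated." "Check that this is what is intended.")',
    'c("1 WARNING(S) FOUND") c("WARNING:  THE BEST LOGLIKELIHOOD VALUE WAS NOT REPLICATED.")',
    NA  # a model that produced no message text
  ),
  stringsAsFactors = FALSE
)

# str_count() is vectorised, so each row gets its own count.
# Rows whose text is NA stay NA rather than becoming 0.
msgs$n_messages <- str_count(msgs$warnings, fixed('c("'))
msgs$n_messages
```

If a missing cell should count as zero messages, the NA values can be replaced afterwards; whether that is appropriate depends on how the Mplus output was collected.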

【Comments】:

【Solution 2】:

You could try gregexpr().

test1 <- '"c(\" foo bar baz'
test2 <- '"c(\" foo bar baz "c(\" baz bar foo'

length(unlist(gregexpr('c\\(', test1)))
# [1] 1
length(unlist(gregexpr('c\\(', test2)))
# [1] 2
length(unlist(gregexpr('c\\(', list(test1, test2))))
# [1] 3
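One caveat with the length(unlist(gregexpr(...))) idiom: when a string contains no match, gregexpr() returns -1 for it, which still has length 1, and unlist() pools the counts of all strings into one total. A base R sketch of a per-string count that handles both cases follows; count_matches is a hypothetical helper name and the sample strings are invented.

```r
x <- c(
  'c("foo") bar',
  'c("foo") c("bar")',
  'no messages here'
)

# Pitfall: no match yields -1, so the length is 1 even though there are 0 matches.
naive_count <- length(unlist(gregexpr("c\\(", x[3])))

# regmatches() drops the -1 placeholder, and lengths() counts per string.
# fixed = TRUE treats "c(" literally, so no regex escaping is needed.
count_matches <- function(text, pattern = "c(") {
  lengths(regmatches(text, gregexpr(pattern, text, fixed = TRUE)))
}

counts <- count_matches(x)
counts
```

The result has one element per input string, so it can be assigned straight to a new data frame column.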

【Comments】: