Splitting a string by text language: answers

【Question title】: Splitting a string by text language
【Posted】: 2018-02-22 22:43:23
【Question】:

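I am working with a corpus of text documents (each stored as a single string) that are mostly in English, but the corpus also includes some Spanish-language documents and some documents that repeat the same information in both English and Spanish. I have used the cld2 and cld3 packages (which bring Chrome's language detection to R) to estimate which languages each string in the corpus contains. My goal is to process every string that contains both English and Spanish text so that the English part is kept and the Spanish part is removed.

Here is an example of three of the strings I am working with: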

mixed.language.strings <- c("Department of Cultural Affairs and Special Events: Today will be the First Annual Mariachi and Folklorico Festival! Local groups begin at 1:00pm and world renowned headliners start at 3:00pm. It will be located in Millennium Park. Invite your friends, family, and neighbors to participate in this FREE event! \nEnjoy the weather on this beautiful Sunday! \n ************** \n Departamento de Asuntos Culturales y Eventos Especiales: Hoy será el Primer Festival Anual de Mariachi y Balet Folklórico! Los grupos locales comienzan a las 1:00 pm y los grupos de renombre mundial empiezan a las 3:00 pm. Será en el Millennium Park. Inviten a su familia, amigo@s, y vecin@s a este evento completamente GRATIS!", 
"Call or walk into our office for information on the Emergency Heating Repair Program which provides eligible low-income, owner-occupied homes grants for a new heating system.\n\nLlame o visite nuestra oficina para más información sobre un programa de la Ciudad ofreciendo dinero hacía la reparación o instalación de sistemas de calefacción. Dueños de casa de ingresos bajos son elegibles.   \n\n 3476 S. Archer Ave. \n (773) 523-8250", 
"Join me and other local elected officials for a workshop on appealing your property taxes. Homes in West & South townships of Cook County are currently eligible to appeal. See flier for more info, or call my office at 773-523-8250.\n\nLos invito a un taller sobre el proceso de apalear sus impuestos de propiedad. Hogares en los West y East “townships” del Condado de Cook son elegibles ahora para apalear sus impuestos. Por favor refiéranse al volante añadido a este mensaje, o llame mi oficina al 773-523-8250, para más información."
)

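As far as I can tell, cld2 and cld3 can estimate which languages a string contains, but they cannot extract the parts of a string that belong to each language.

Is there a different R package I could use to identify which parts of each string are in which language, and to split the string into two parts accordingly?

Thanks! Apologies if anything is unclear; this is my first post.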

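【Comments】:

In all three example strings, the switch between languages appears to be marked by a newline. Is that always the case? If so, you could split each string into substrings on newlines and keep only the substrings that are detected as English.

Tags: r text nlp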

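【Solution 1】:

Here is an implementation of the approach I suggested in the comments. It works as long as the languages within a string are separated by a newline. (That is the case in all of your examples. If it does not hold in general, you could try splitting on newlines, periods, exclamation marks, and question marks.)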
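For the more general case mentioned in parentheses, the following is a minimal base-R sketch of splitting on newlines and sentence-ending punctuation. The helper name `SplitIntoChunks` is hypothetical and not part of cld2. The resulting pieces could then be passed to `detect_language` the same way the newline-separated pieces are. Two caveats: abbreviations such as "S." or "Ave." will also trigger a split, and shorter pieces are generally harder for a language detector to classify.

```r
# Hypothetical helper: split one string into sentence-like pieces.
# A break happens at whitespace that follows '.', '!' or '?', or at any newline.
SplitIntoChunks <- function(string) {
  pieces <- unlist(strsplit(string, "(?<=[.!?])\\s+|\\n+", perl = TRUE))
  pieces <- trimws(pieces)   # drop stray leading/trailing spaces
  pieces[nzchar(pieces)]     # drop empty pieces left by consecutive separators
}

chunks <- SplitIntoChunks("Call our office. Llame a nuestra oficina!\nThanks")
print(chunks)
```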

library('cld2')

list.of.strings <- strsplit(mixed.language.strings, '\n')
ExtractEnglishSubstrings <- function(string.vector) {
  return(string.vector[which(detect_language(string.vector) == 'en')])
}

lapply(list.of.strings, ExtractEnglishSubstrings)

This outputs:

[[1]]
[1] "Department of Cultural Affairs and Special Events: Today will be the First Annual Mariachi and Folklorico Festival! Local groups begin at 1:00pm and world renowned headliners start at 3:00pm. It will be located in Millennium Park. Invite your friends, family, and neighbors to participate in this FREE event! "
[2] "Enjoy the weather on this beautiful Sunday! "                                                                                                                                                                                                                                                                         

[[2]]
[1] "Call or walk into our office for information on the Emergency Heating Repair Program which provides eligible low-income, owner-occupied homes grants for a new heating system."

[[3]]
[1] "Join me and other local elected officials for a workshop on appealing your property taxes. Homes in West & South townships of Cook County are currently eligible to appeal. See flier for more info, or call my office at 773-523-8250."

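If you would rather stitch the substrings back together and get a character vector back instead of a list of vectors, this modification should do it: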

ExtractEnglishSubstrings <- function(string.vector) {
  english.vector <- string.vector[which(detect_language(string.vector) == 'en')]
  reassembled.string <- paste0(english.vector, collapse=' ')
  return(reassembled.string)
}

unlist(lapply(list.of.strings, ExtractEnglishSubstrings))

[1] "Department of Cultural Affairs and Special Events: Today will be the First Annual Mariachi and Folklorico Festival! Local groups begin at 1:00pm and world renowned headliners start at 3:00pm. It will be located in Millennium Park. Invite your friends, family, and neighbors to participate in this FREE event!  Enjoy the weather on this beautiful Sunday! "
[2] "Call or walk into our office for information on the Emergency Heating Repair Program which provides eligible low-income, owner-occupied homes grants for a new heating system."                                                                                                                                                                                    
[3] "Join me and other local elected officials for a workshop on appealing your property taxes. Homes in West & South townships of Cook County are currently eligible to appeal. See flier for more info, or call my office at 773-523-8250."

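【Discussion】:

Thank you!! This is a great solution. I had not noticed how prevalent the newlines were. A quick check shows that of the 95 mixed-language strings in my corpus, 93 contain a newline. So I can use this approach on those and fix the remaining two by hand.

【Solution 2】:

For anyone else working on a similar problem, I made a few modifications to the excellent code provided by HarlandMason.

The KeepSubstrings.bylanguage code below is similar to the code HarlandMason provided, with two changes: (1) instead of taking a vector of strings as its input, it takes a single string and returns a single string; (2) it lets you specify the language of the substrings to keep (in the format used by the cld2::detect_language function).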

KeepSubstrings.bylanguage <- function(string, language) {
  string.vector <- unlist(strsplit(string, '\n'))
  cut.vector <- string.vector[which(cld2::detect_language(string.vector) == language)]
  reassembled.string <- paste0(cut.vector, collapse=' ')
  return(reassembled.string)
}

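The code below is similar, but it removes substrings in a given language rather than keeping only substrings in a given language. This is useful if some substrings are very short, which causes problems for the language-detection function, and you would rather err on the side of keeping substrings the function is unsure about: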

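RemoveSubstrings.bylanguage <- function(string, language) {
  string.vector <- unlist(strsplit(string, '\n'))
  detected <- cld2::detect_language(string.vector)
  # Keep substrings whose language is undetermined (NA) or differs from `language`.
  # which(detected != language) would silently drop the NA cases.
  cut.vector <- string.vector[is.na(detected) | detected != language]
  reassembled.string <- paste0(cut.vector, collapse=' ')
  return(reassembled.string)
}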
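One detail is worth checking when the aim is to keep substrings the detector is unsure about. In R, comparing NA with a string yields NA, and `which()` drops NA positions. The sketch below uses a hand-written vector as a stand-in for detector output, on the assumption that undetermined text is reported as NA (which, as far as I know, is how cld2::detect_language behaves). All values are made up for illustration.

```r
# Stand-in for detector output on four substrings (illustrative values only);
# NA marks a substring whose language could not be determined.
substrings <- c("Call our office.", "Llame a nuestra oficina.", "(773) 523-8250", "Thanks!")
detected   <- c("en", "es", NA, "en")

# which() discards positions where the comparison is NA,
# so the undetermined phone number is lost along with the Spanish text.
dropped.na <- substrings[which(detected != "es")]

# Treating NA explicitly as "keep" preserves the undetermined substring.
kept.na <- substrings[is.na(detected) | detected != "es"]

print(dropped.na)
print(kept.na)
```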

Both functions can be applied to a vector of strings with lapply or sapply, as suggested in the previous answer.

【Discussion】:
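As a rough illustration of applying a single-string function across a vector, here is a self-contained sketch. `KeepByLanguage` and `ToyDetector` are hypothetical names: the detector is passed in as an argument so the example runs without cld2 installed, and the toy detector is a crude placeholder, not a real language classifier. In real use you would pass cld2::detect_language instead. `vapply` is used here in place of sapply because it guarantees a character vector of the stated shape.

```r
# Hypothetical variant of the keep-by-language idea with a pluggable detector.
KeepByLanguage <- function(string, language, detector) {
  string.vector <- unlist(strsplit(string, '\n'))
  detected <- detector(string.vector)
  # Keep only substrings positively identified as `language`.
  paste0(string.vector[!is.na(detected) & detected == language], collapse = ' ')
}

# Toy placeholder detector for illustration only: flags two known Spanish openers.
ToyDetector <- function(x) ifelse(grepl("^(Llame|Los) ", x), "es", "en")

docs <- c("Call our office.\nLlame a nuestra oficina.",
          "Join me for a workshop.\nLos invito a un taller.")

# Extra arguments after FUN.VALUE are forwarded to KeepByLanguage for each element.
english.only <- vapply(docs, KeepByLanguage, character(1),
                       language = "en", detector = ToyDetector,
                       USE.NAMES = FALSE)
print(english.only)
```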