【Question title】: tidyr separate function delimiters to avoid repetition
【Posted】: 2018-07-09 09:56:59
【Question】:

I have been using tidyr to split some text into several columns.

Input

Mypathcolon <- data.frame(c("1 Hospital: Random NHS Foundation Trust\nHospital Number: H2890235\nPatient Name:  al-Bilal, Widdad\nDOB:  1922-05-04\nGeneral Practitioner: Dr. Mondragon, Amber\nDate received:  2002-11-10\nClinical Details:  Previous had serrated lesions ?,If looks more like UC, please provide Nancy severity index\n3 specimen. Nature of specimen:  Nature of specimen as stated on pot = 'Ascending colon x2 '|,Nature of specimen as stated on request form = 'rectum'|,Nature of specimen as stated on pot = '4X LOWER, 4X UPPER OESOPHAGUS '|,Nature of specimen as stated on pot = 'rectal polyp '|\nMacroscopic description: 1 specimens collected the largest measuring 3 x 5 x 2 mm and the smallest 3 x 5 x 5 mm\nHistology:  The appearances are of a hyperplastic polyp.,8 pieces of tissue, the largest measuring 4."))

names(Mypathcolon)<-c("PathReportWhole")

Function

Histoltree  <- c("Hospital Number:","Patient Name:",
   "DOB:","General Practitioner:","Date received:",
   "Clinical Details","Nature of specimen",
   "Macroscopic description:","Histology","Diagnosis")

Mypathcolon %>% 
   tidyr::separate(PathReportWhole, 
   into = c("added_name",Histoltree), 
     sep = paste(Histoltree, collapse = "|"))

Current output

This gives me the column names

 [1] "added_name"               "Hospital Number:"         "Patient Name:"            "DOB:"                    
 [5] "General Practitioner:"    "Date received:"           "Clinical Details"         "Nature of specimen"      
 [9] "Macroscopic description:" "Histology"                "Diagnosis"

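However, the columns from "Nature of specimen" through "Diagnosis" hold the text between one occurrence of "Nature of specimen" and the next occurrence of "Nature of specimen" in the body, rather than, for example, the text between "Nature of specimen" and "Macroscopic description" as they should. The actual output is shown below: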

structure(list(added_name = "1 Hospital: Random NHS Foundation Trust\n", 
    `Hospital Number:` = " H2890235\n", `Patient Name:` = "  al-Bilal, Widdad\n", 
    `DOB:` = "  1922-05-04\n", `General Practitioner:` = " Dr. Mondragon, Amber\n", 
    `Date received:` = "  2002-11-10\n", `Clinical Details` = ":  Previous had serrated lesions ?,If looks more like UC, please provide Nancy severity index\n3 specimen. ", 
    `Nature of specimen` = ":  ", `Macroscopic description:` = " as stated on pot = 'Ascending colon x2 '|,", 
    Histology = " as stated on request form = 'rectum'|,", Diagnosis = " as stated on pot = '4X LOWER, 4X UPPER OESOPHAGUS '|,"), .Names = c("added_name", 
"Hospital Number:", "Patient Name:", "DOB:", "General Practitioner:", 
"Date received:", "Clinical Details", "Nature of specimen", "Macroscopic description:", 
"Histology", "Diagnosis"), row.names = 1L, class = "data.frame")
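
The over-splitting can be reproduced on a much shorter string. The sketch below uses base R `strsplit()` as a stand-in for `separate()` (an assumption made for illustration; the string and the labels are made up). Because the pattern is an alternation of all labels, every occurrence of a label is a split point, including a label that is repeated inside the text.

```r
# Made-up miniature of the report: the label "B:" occurs twice
txt <- "id 1 A: x B: see B: y C: z"
labels <- c("A:", "B:", "C:")

# Same pattern construction as in the call to separate() above
pattern <- paste(labels, collapse = "|")

# Splits at EVERY match of any label, not once per label
pieces <- strsplit(txt, pattern)[[1]]
print(pieces)

# More pieces than target columns means the later columns shift
length(pieces) > length(labels) + 1
```

The repeated "B:" yields an extra piece, so the values after it land one column too late, which is the pattern seen in the output above.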

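How can I force the function to extract the columns between the listed delimiters, rather than the repeated extraction it seems to be doing?

Desired output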

Hospital: Random NHS Foundation Trust\n
Hospital Number: H2890235\n
Patient Name:  al-Bilal, Widdad\n
DOB:  1922-05-04\n
General Practitioner: Dr. Mondragon, Amber\n
Date received:  2002-11-10\n
Clinical Details:  Previous had serrated lesions ?,If looks more like UC, please provide Nancy severity index\n3 specimen. 
Nature of specimen:  Nature of specimen as stated on pot = 'Ascending colon x2 '|,Nature of specimen as stated on request form = 'rectum'|,Nature of specimen as stated on pot = '4X LOWER, 4X UPPER OESOPHAGUS '|,Nature of specimen as stated on pot = 'rectal polyp '|\n
Macroscopic description: 1 specimens collected the largest measuring 3 x 5 x 2 mm and the smallest 3 x 5 x 5 mm\n
Histology:  The appearances are of a hyperplastic polyp.,8 pieces of tissue, the largest measuring 4.

【Comments】:

    Tags: r tidyr


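    【Solution 1】:

    After your revision, I can see what you want. The key here is to split the string on two patterns: a newline followed by an uppercase letter, and a period plus whitespace that is followed later on the same line by a colon. Then you want to create a data frame. cSplit() from the splitstackshape package can split each string on a delimiter (here, :).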

    library(dplyr)
    library(tidyr)
    library(stringi)
    library(splitstackshape)
    
    # Convert factor to character
    
    Mypathcolon$PathReportWhole <- as.character(Mypathcolon$PathReportWhole)
    
    # Split the string at two specific points, create a data frame,
    # assign a column name, split strings
    
    temp <- stri_split_regex(str = Mypathcolon$PathReportWhole, pattern = "\\n(?=[A-Z])|\\.\\s(?=.*:)") %>%
            as.data.frame %>%
            setNames("foo") %>%
            cSplit("foo", sep = ":", direction = "wide", type.convert = FALSE) 
    
    
                          foo_1
     1:              1 Hospital
     2:         Hospital Number
     3:            Patient Name
     4:                     DOB
     5:    General Practitioner
     6:           Date received
     7:        Clinical Details
     8:      Nature of specimen
     9: Macroscopic description
    10:               Histology
    
    foo_2
     1:                                                                                                                                                                                                                          Random NHS Foundation Trust
     2:                                                                                                                                                                                                                                             H2890235
     3:                                                                                                                                                                                                                                     al-Bilal, Widdad
     4:                                                                                                                                                                                                                                           1922-05-04
     5:                                                                                                                                                                                                                                 Dr. Mondragon, Amber
     6:                                                                                                                                                                                                                                           2002-11-10
     7:                                                                                                                                               Previous had serrated lesions ?,If looks more like UC, please provide Nancy severity index\n3 specimen
     8: Nature of specimen as stated on pot = 'Ascending colon x2 '|,Nature of specimen as stated on request form = 'rectum'|,Nature of specimen as stated on pot = '4X LOWER, 4X UPPER OESOPHAGUS '|,Nature of specimen as stated on pot = 'rectal polyp '|
     9:                                                                                                                                                               1 specimens collected the largest measuring 3 x 5 x 2 mm and the smallest 3 x 5 x 5 mm
    10:                                                                                                                                                            The appearances are of a hyperplastic polyp.,8 pieces of tissue, the largest measuring 4.
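
If only the first occurrence of each label should act as a delimiter, another option is to locate the labels and cut the text between them. This is a minimal base R sketch, not part of the answer above: `extract_sections` is a hypothetical helper, the sample string is made up, and the approach assumes that no label appears in the text before its real heading.

```r
# Hypothetical helper: cut `text` at the FIRST occurrence of each label only
extract_sections <- function(text, labels) {
  # Position of the first literal match of each label (-1 when absent)
  starts <- vapply(
    labels,
    function(l) as.integer(regexpr(l, text, fixed = TRUE)),
    integer(1)
  )

  # Drop labels that do not occur, then order the rest by position
  found <- starts > 0
  labels <- labels[found]
  starts <- starts[found]
  ord <- order(starts)
  labels <- labels[ord]
  starts <- starts[ord]

  # Each section runs from the end of its label to the start of the next one
  ends <- c(starts[-1] - 1L, nchar(text))
  values <- substring(text, starts + nchar(labels), ends)

  # Remove a leading colon (for labels listed without one) and outer whitespace
  setNames(trimws(sub("^:", "", values)), labels)
}

# Made-up example in which "Nature of specimen" repeats inside its own section
txt <- "Hospital Number: H1\nNature of specimen: Nature of specimen as stated on pot = 'x'\nHistology: polyp"
labels <- c("Hospital Number:", "Nature of specimen", "Histology", "Diagnosis")
res <- extract_sections(txt, labels)
print(res)
```

With the data from the question, the analogous call would be `extract_sections(as.character(Mypathcolon$PathReportWhole[1]), Histoltree)`. Labels that are absent from the text (here "Diagnosis") are dropped from the result, and the text before the first label is not returned.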
    

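    【Discussion】:

    • Hi @jazzurro. The question has been amended to show the desired output. I would like the Histology column to show the text between Histology and Diagnosis, the Macroscopic description column to show the text between 'Macroscopic description' and 'Histology', and so on. At the moment the Histology column shows the text between one 'Nature of specimen' and the next 'Nature of specimen'.
    • @SebastianZeki Thanks for your message. I looked at your expected result, but that input is not clear to me. Could you state clearly what you expect? I am having a hard time working out what you want for Macroscopic description, Histology and Diagnosis. If I am not mistaken, there is no Diagnosis in your original data. Could you clarify whether that is the case?
    • @SebastianZeki I converted your expected result to long format and checked it against what you expect. I have Nature of specimen in Macroscopic description, Histology and Diagnosis. But if I am not mistaken, the rest of the text matches your desired output. Could you double-check your desired output? I will not revise my code until I hear back from you.
    • Hi @jazzurro. Yes, the rest of the text is fine. It is only the Macroscopic description, Histology and Diagnosis columns, which contain Nature of specimen text in your example, that are incorrect.
    • @SebastianZeki I revised my code slightly. This should be enough to produce the output. Remove the first row of the output if necessary.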