【Question title】:How do you parse a sysmon file to extract certain information using R?
【Posted】:2016-12-09 17:44:18
【Question】:

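I am trying to use R to read a large number of files like this one, parse out some information, and put the data into a data frame:

Here is the content of the file: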

    last_run                        current_run                     seconds     
 ------------------------------- ------------------------------- ----------- 
             Jul  4 2016  7:17AM             Jul  4 2016  7:21AM         226 


Engine Utilization (Tick %)   User Busy   System Busy    I/O Busy        Idle
  -------------------------  ------------  ------------  ----------  ---------- 
  ThreadPool : syb_default_pool                                                 
   Engine 0                         5.0 %         0.4 %      22.4 %      72.1 % 
   Engine 1                         3.9 %         0.5 %      22.8 %      72.8 % 
   Engine 2                         5.6 %         0.3 %      22.5 %      71.6 % 
   Engine 3                         5.1 %         0.4 %      22.7 %      71.8 % 

     -------------------------  ------------  ------------  ----------  ---------- 
  Pool Summary        Total       336.1 %        25.6 %    1834.6 %    5803.8 % 
                    Average         4.2 %         0.3 %      22.9 %      72.5 % 

  -------------------------  ------------  ------------  ----------  ---------- 
  Server Summary      Total       336.1 %        25.6 %    1834.6 %    5803.8 % 
                    Average         4.2 %         0.3 %      22.9 %      72.5 % 

Transaction Profile
-------------------

  Transaction Summary             per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Committed Xacts                 137.3           n/a       41198     n/a     

     Average Runnable Tasks            1 min         5 min      15 min  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  ThreadPool : syb_default_pool                                                 
   Global Queue                       0.0           0.0         0.0       0.0 %
   Engine 0                           0.0           0.1         0.1       0.6 %
   Engine 1                           0.0           0.0         0.0       0.0 %
   Engine 2                           0.2           0.1         0.1       2.6 %

  -------------------------  ------------  ------------  ----------             
  Pool Summary        Total           7.2           5.9         6.1             
                    Average           0.1           0.1         0.1             

  -------------------------  ------------  ------------  ----------             
  Server Summary      Total           7.2           5.9         6.1             
                    Average           0.1           0.1         0.1 

Device Activity Detail
  ----------------------

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_125                                         
    datadev_125                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       n/a   
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       0.0 %


  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_126                                         
    datadev_126                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       n/a   
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       0.0 %


  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_127                                         
    datadev_127                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Reads                                                                       
      APF                             0.0           0.0           5       0.4 %
      Non-APF                         0.0           0.0           1       0.1 %
    Writes                            3.8           0.0        1128      99.5 %
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          3.8           0.0        1134       0.1 %

  Mirror Semaphore Granted            3.8           0.0        1134     100.0 %
  Mirror Semaphore Waited             0.0           0.0           0       0.0 %

  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /sybaser/database/sybaseR/dev/sybaseR.datadev_000                                    
    GPS_datadev_000               per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Reads                                                                       
      APF                             7.9           0.0        2372      55.9 %
      Non-APF                         5.5           0.0        1635      38.6 %
    Writes                            0.8           0.0         233       5.5 %
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                         14.1           0.0        4240       0.3 %

  Mirror Semaphore Granted           14.1           0.0        4239     100.0 %
  Mirror Semaphore Waited             0.0           0.0           2       0.0 %

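I need to capture "Jul 4 2016 7:21AM" as the date, and, from the "Engine Utilization (Tick %)" section, Server Summary -> Average "4.2 %".

From the "Transaction Profile" section -> the "count" entry (41198 in this example).

So my data frame should look like this: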

Date                     Cpu   Count
Jul  4 2016  7:21AM      4.2   41198 

Can anyone help me parse this file to get this output?

I tried something like this:

read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])

which gives me this line:

Average         4.2 %         0.3 %      22.9 %      72.5 % 

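But I want to be able to extract only the Average line that comes right after Engine Utilization (Tick %), because there may be many lines that start with Average. The Average line that appears right after Engine Utilization (Tick %) is the one I want.

How can I add that to this line, to extract this information from the file: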

read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])

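Can I use grep inside the read.table line to search for certain characters?

【Comments】:

  • Note that your problem description is ambiguous. At one point you ask for "from the Engine Utilization (Tick %) line, Server Summary -> Average 4.2 %". Later you say "the Average line that appears right after Engine Utilization (Tick %) is the one I want." The latter refers to the Pool Summary average. The way my solution below is currently written, it assumes the second of these criteria. Could you clarify in the question which line you want?
  • If you can read the whole file into a single string variable, you can run a regular expression on it to extract what you need. I don't know R, so I don't know whether it can read a file into a variable or do regular expressions.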
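
The whole-file regex idea from the last comment can be sketched in base R. This is only an illustration under assumptions: the helper name `server_avg_user_busy` is made up, `txt` stands in for `readLines("file.txt")` with a few lines copied from the sample above, and the pattern assumes the Average line directly follows the Server Summary line, as it does in the sample.

```r
## Hypothetical helper: returns the "User Busy" average listed under the first
## "Server Summary" that follows the "Engine Utilization (Tick %)" header.
server_avg_user_busy <- function(lines) {
  whole <- paste(lines, collapse = "\n")       # whole file as one string
  pattern <- paste0(
    "Engine Utilization \\(Tick %\\)",         # section header
    "[\\s\\S]*?Server Summary[^\\n]*\\n",      # lazily skip to the Server Summary line
    "\\s*Average\\s+([0-9.]+)"                 # capture the first number on the Average line
  )
  m <- regmatches(whole, regexec(pattern, whole, perl = TRUE))[[1]]
  if (length(m) < 2) return(NA_real_)          # no match: section is missing
  as.numeric(m[2])
}

## A few lines from the sample file, standing in for readLines("file.txt")
txt <- c(
  "Engine Utilization (Tick %)   User Busy   System Busy    I/O Busy        Idle",
  "  Pool Summary        Total       336.1 %        25.6 %    1834.6 %    5803.8 % ",
  "                    Average         4.2 %         0.3 %      22.9 %      72.5 % ",
  "  Server Summary      Total       336.1 %        25.6 %    1834.6 %    5803.8 % ",
  "                    Average         4.2 %         0.3 %      22.9 %      72.5 % "
)
print(server_avg_user_busy(txt))
```

`regexec()` accepts `perl = TRUE` from R 3.3.0 onward; on older versions the pattern would need rewriting without `\s`.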

Tags: regex r parsing pcre


【Solution 1】:

%%%% Shot 1 -- got something working

extract <- function(filenam="file.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]

    data.frame(Date=date_current_run, Cpu=Cpu, Count=count, stringsAsFactors=FALSE)
}

print(extract("file.txt"))

##file.list <- dir("./")
file.list <- rep("file.txt",3)
merged <- do.call("rbind", lapply(file.list, extract))

print(merged)

file.list <- rep("file.txt",2000)
print(system.time(merged <- do.call("rbind", lapply(file.list, extract))))
## runs in about 2.5 secs on my laptop
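
Note that `extract()` returns every column as a character string. A minimal sketch of converting one such row to typed columns follows; the `raw` row is written out by hand from the sample values, and the month-name and AM/PM parsing assumes a C/English time locale, which is why the locale is set explicitly.

```r
## Month names and AM/PM markers are locale dependent; use the C locale here
invisible(Sys.setlocale("LC_TIME", "C"))

## One row shaped like the output of extract(): all columns are character
raw <- data.frame(Date = "Jul 4 2016 7:21AM", Cpu = "4.2", Count = "41198",
                  stringsAsFactors = FALSE)

typed <- data.frame(
  ## %I (12-hour clock) is the specifier that pairs with %p (AM/PM)
  Date  = as.POSIXct(strptime(raw$Date, "%b %d %Y %I:%M%p", tz = "UTC")),
  Cpu   = as.numeric(raw$Cpu),
  Count = as.integer(raw$Count)
)
str(typed)
```

The same conversion can be applied to the `merged` data frame after the `rbind` step, since all three calls are vectorized.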

%%% Shot 2: first attempt at extracting the (possibly variable) number of device columns

extractv2 <- function(filenam="file2.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]


    ## Total I/Os
    ## 1. Each line "Device:" is assumed to be the header of a block of lines
    ##    containing info about a single device (there are 4 such blocks
    ##    in your example);
    ## 2. each block is assumed to contain one or more lines matching
    ##    "Total I/Os";
    ## 3. the relevant count data is assumed to be contained in the last
    ##    of such lines (the "count" column, i.e. the 5th token), for each block.
    ## Approach: loop on the line numbers of those lines matching "Device:"
    ## to get: A. counts; B. device names
    ii_block_dev <- grep("Device:", txt, fixed=TRUE)
    ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
    nblocks <- length(ii_block_dev)
    ## A. get counts for each device
    ## for each block, select *last* line matching "Total I/Os"
    ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
    ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
        ## select matching lines to "Total I/Os" within each block
        IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block  ] &
                                       ii_lines_IOs < ii_block_dev_aux[block+1]
                                   ]
        tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
    })
    lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
        Filter(function(v) v!="", strsplit(strng," ")[[1]])
    })
    IOs_counts <- sapply(lines_IOs, function(v) v[5])
    ## B. get device names:
    ## assumed to be on lines following each "Device:" match
    ii_devices <- 1 + ii_block_dev
    device_names <- sapply(ii_devices, function(ii){
        Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    })
    ## Create a data.frame with "device_names" as column names and "IOs_counts" as
    ## the values of a single row.
    ## Sorting the device names by order() will help produce the same column names
    ## if different sysmon files list the devices in different order
    ord <- order(device_names)
    devices <- as.data.frame(structure(as.list(IOs_counts[ord]), names=device_names[ord]),
                             check.names=FALSE) ## Prevent R from messing with our device names

    data.frame(stringsAsFactors=FALSE, check.names=FALSE,
               Date=date_current_run, Cpu=Cpu, Count=count, devices)
}
print(extractv2("file2.txt"))


## WATCH OUT:
## merging will ONLY work if all devices have the same names across sysmon files!!
file.list <- rep("file2.txt",3)
merged <- do.call("rbind", lapply(file.list, extractv2))
print(merged)
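
If the device names do differ between files, one possible workaround is to pad each one-row data frame with the columns it lacks before binding. The sketch below is an assumption-laden illustration, not part of the code above: `rbind_fill` is a made-up helper name, and the two input rows are written by hand with short device names.

```r
## Hypothetical helper: row-bind data frames whose column sets differ,
## filling the columns a data frame lacks with NA.
rbind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))   # union of all column names
  padded <- lapply(dfs, function(d) {
    for (m in setdiff(all_cols, names(d))) d[[m]] <- NA_character_
    d[, all_cols, drop = FALSE]                    # same column order everywhere
  })
  do.call(rbind, padded)
}

## Two hand-written rows with different device columns
a <- data.frame(Date = "Jul 4 2016 7:21AM", datadev_126 = "0",
                stringsAsFactors = FALSE)
b <- data.frame(Date = "Jul 4 2016 7:21AM", datadev_127 = "1134",
                stringsAsFactors = FALSE)
print(rbind_fill(list(a, b)))
```

With thousands of files this produces a wide table with many NA cells, which is one reason the next shot switches to a long layout instead.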

%%%%%%% Shot 3: extract two tables, one with a single row and another with a variable number of rows (depending on the devices listed in each sysmon file).

extractv3 <- function(filenam="file2.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]

    ## first part of output: fixed three-column structure
    fixed <-  data.frame(stringsAsFactors=FALSE,
                         Date=date_current_run, Cpu=Cpu, Count=count)

    ## Total I/Os
    ## 1. Each line "Device:" is assumed to be the header of a block of lines
    ##    containing info about a single device (there are 4 such blocks
    ##    in your example);
    ## 2. each block is assumed to contain one or more lines matching
    ##    "Total I/Os";
    ## 3. the relevant count data is assumed to be contained in the last
    ##    of such lines (the "count" column, i.e. the 5th token), for each block.
    ## Approach: loop on the line numbers of those lines matching "Device:"
    ## to get: A. counts; B. device names
    ii_block_dev <- grep("Device:", txt, fixed=TRUE)
    if(length(ii_block_dev)==0){
        variable <- data.frame(stringsAsFactors=FALSE,
                               date_current_run=date_current_run,
                               device_names=NA, IOs_counts=NA)
    }else{
        ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
        nblocks <- length(ii_block_dev)
        if(length(ii_lines_IOs)==0){
            warning(sprintf("WEIRD datapoint at date %s: I have %d devices but 0 I/O lines??",
                            date_current_run, nblocks))
            ##stop()
        }
        ## A. get counts for each device
        ## for each block, select *last* line matching "Total I/Os"
        ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
        ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
            ## select matching lines to "Total I/Os" within each block
            IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block  ] &
                                           ii_lines_IOs < ii_block_dev_aux[block+1]
                                          ]
            tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
        })
        lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
            Filter(function(v) v!="", strsplit(strng," ")[[1]])
        })
        IOs_counts <- sapply(lines_IOs, function(v) v[5])
        ## B. get device names:
        ## assumed to be on lines following each "Device:" match
        ii_devices <- 1 + ii_block_dev
        device_names <- sapply(ii_devices, function(ii){
            Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
        })
        ## Create a data.frame with three columns: date, device, counts
        variable <- data.frame(stringsAsFactors=FALSE,
                               date_current_run=rep(date_current_run, length(IOs_counts)),
                               device_names=device_names, IOs_counts=IOs_counts)
    }
    list(fixed=fixed, variable=variable)
}
print(extractv3("file2.txt"))


file.list <- c("file.txt","file2.txt","file3.txt")
res <- lapply(file.list, extractv3)

fixed.merged <- do.call("rbind", lapply(res, function(r) r$fixed))
print(fixed.merged)

variable.merged <- do.call("rbind", lapply(res, function(r) r$variable))
print(variable.merged)
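
If one column per device is still wanted later, the long table can be pivoted once, after all files have been read. A rough sketch, assuming the column names produced by `extractv3` and using a hand-written two-row stand-in for `variable.merged`; `devices_wide` is a made-up helper name.

```r
## Hypothetical helper: one row per run date, one column per device.
devices_wide <- function(v) {
  counts <- as.integer(v$IOs_counts)     # counts arrive as character strings
  ## cells with no data for a (date, device) pair come out as NA
  tapply(counts, list(v$date_current_run, v$device_names), sum)
}

## Hand-written stand-in for variable.merged, using values from the sample file
v <- data.frame(
  date_current_run = c("Jul 4 2016 7:21AM", "Jul 4 2016 7:21AM"),
  device_names = c("/dev/vx/rdsk/sybaserdatadg/datadev_127",
                   "/sybaser/database/sybaseR/dev/sybaseR.datadev_000"),
  IOs_counts = c("1134", "4240"),
  stringsAsFactors = FALSE
)
print(devices_wide(v))
```

The result is a matrix with the dates as row names; `as.data.frame()` turns it into a data frame if needed.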

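【Comments】:

  • Thanks, it works like a charm. Your answer gave me good ideas about parsing different files.
  • I know you have already answered the question. If you don't mind, I have one more addition, which I have added to the end of the original post. I would also like to get this value from the file: "Device: /sybaser/database/sybaseR/dev/sybaseR.temp10_logdev_001" as a column name, and the 262 from the line "Total I/Os 0.9 0.0 262 0.0 %". There are many of these entries, and I would like to create a column for each device entry.
  • Can you chat?
  • Yes, I can chat. Actually, I did not understand your last question, and I cannot find in the original post the new entries you mention in your comment.
  • Can you join this chat: chat.stackoverflow.com/rooms/121644/sysmon
【Solution 2】:

Text files are sometimes easier to process with a dedicated program. gawk, for example, is designed specifically for finding patterns in text files and outputting data from them. We can use a short gawk script to get the data needed for loading into R. Note that each line of the script consists of a pattern to look for, followed by an action enclosed in {}. NR is a counter holding the number of lines read so far.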

BEGIN                          {OFS = ""; ORS = ""}
/current_run/                  {dat_line = NR+2; cpu_done = 0}
/Server Summary/               {cpu_line = NR+1}
/Transaction Summary/          {cnt_line = NR+2}
NR == dat_line                 {print "'",$5," ",$6," ",$7," ",$8,"' "}
NR == cpu_line && cpu_done==0  {print $2," "; cpu_done = 1}
NR == cnt_line                 {print $5,"\n"}

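Save this script as "ext.awk", then extract all the data files into one R data frame (assuming they are all in one folder and have the extension .txt):

df <- read.table(text=system("gawk -f ext.awk *.txt", intern = TRUE), col.names = c("Date","Cpu","Count"))

Note that gawk is already installed on most Linux distributions. On Windows, you may need to install it from http://gnuwin32.sourceforge.net/packages/gawk.htm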
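
To see what `read.table` does with the script's output without running gawk, here is a small sketch. The `awk_output` vector is an assumption: it is the line the script should print for the sample file, written out by hand and repeated twice to mimic two files.

```r
## Hand-written stand-in for system("gawk -f ext.awk *.txt", intern = TRUE):
## one line per file, with the date wrapped in single quotes
awk_output <- rep("'Jul 4 2016 7:21AM' 4.2 41198", 2)

## read.table treats ' as a quote character by default, so the quoted date
## stays in one field even though it contains spaces
df <- read.table(text = awk_output, col.names = c("Date", "Cpu", "Count"),
                 stringsAsFactors = FALSE)
str(df)
```

Because the date is quoted, the numeric fields are parsed as numbers directly and no extra conversion is needed for Cpu and Count.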

【Comments】:

    【Solution 3】:

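    To read the files. Here I assume CSV as the file type. For other types, see http://www.r-tutor.com/r-introduction/data-frame/data-import

    >utilization <- read.csv(file="",head=TRUE)
    >serverSummary <-read.csv(file="",head=TRUE)
    >transcProfile <- read.csv(file="",head=TRUE)
    

    ==> merge only takes two data frames at a time

    >data <- merge(utilization,serverSummary)
    >dataframe <-merge(data,transcProfile)
    

    Now you will have all the columns in one data frame

    >dataframe
    

    You can see all the columns in the data frame

    Extract the columns you need ==> the subset() function is the easiest way to select variables and observations

    >subset(dataframe,select=c("last_run","Average","Transaction Profile"))
    

    Now you can write it to CSV or any other file type

    >write.csv(dataframe, file = "MyData.csv")
    

    To merge all the files together

    multmerge = function(mypath){
    filenames=list.files(path=mypath, full.names=TRUE)
    datalist = lapply(filenames, function(x){read.csv(file=x,header=T)})
    Reduce(function(x,y) {merge(x,y)}, datalist)
    }

    Once you have run the code that defines the function, you are ready to use it. The function takes a path. This path should be the name of a folder that contains all the files you want to read and merge, and only those files. With that in mind, I have two tips:

    Before you use this function, my suggestion is to create a new folder in a short directory (for example, the path to this folder could be "C://R//mergeme") and save all the files you want to merge into that folder. Also, make sure that the column that will do the matching is formatted the same way (and has the same name) in each file. Suppose you saved your 20 files into the mergeme folder at "C://R//mergeme" and you want to read and merge them. To use my function, use the following syntax:

    mymergeddata = multmerge("C://R//mergeme")
    

    After running this command, you will have a fully merged data frame with all of your variables matched to each other.

    Now you can subset the data frame by the columns you need.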
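
For readers unfamiliar with the `Reduce` plus `merge` idiom used in `multmerge`, here is a tiny in-memory sketch. The three data frames and the key column `id` are made up for illustration and have nothing to do with the sysmon files.

```r
## Three made-up tables that share the key column "id"
a  <- data.frame(id = 1:3, x = c(10, 20, 30))
b  <- data.frame(id = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
c3 <- data.frame(id = 1:3, z = c(TRUE, FALSE, TRUE))

## Reduce folds merge() over the list: merge(merge(a, b), c3)
merged_all <- Reduce(function(x, y) merge(x, y, by = "id"), list(a, b, c3))
print(merged_all)
```

Each `merge` call re-matches the accumulated result against the next table, so the cost grows with the number of files.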

    【Comments】:

    • This takes a very long time to finish, because I keep merging files and there are thousands of files to process.
    【Solution 4】:

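    Use readLines or stringi::stri_read_lines to read the file contents as a character vector. The latter is usually faster, but is less mature and occasionally breaks on unusual content.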

    lines <- readLines("the file name")
    

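    For fast regular-expression matching, stringi is usually the best choice. rebus.datetimes lets you generate a regular expression from a strptime date format string.

    Finding the current run date

    The line on which current_run appears is: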

    library(stringi)
    library(rebus.datetimes)
    
    i_current_run <- which(stri_detect_fixed(lines, "current_run"))
    

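    To extract the date, this code looks only at the second line after the one where current_run was found, but the code is vectorizable, so you can easily look at all lines if you have files where that assumption does not hold.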

    date_format <- "%b%t%d%t%Y%t%H:%M%p"
    rx_date <- rebus.datetimes::datetime(date_format, io = "input")
    extracted_dates <- stri_extract_all_regex(lines[i_current_run + 2], rx_date)
    current_run_date <- strptime(
      extracted_dates[[1]][2], date_format, tz = "UTC"
    )
    ## [1] "2016-07-04 07:21:00 UTC"
    

    Finding the % user busy

    The "Engine Utilization" section can be found with

    i_engine_util <- which(
      stri_detect_fixed(lines, "Engine Utilization (Tick %)")
    )
    

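    We want the first instance of "Server Summary" that appears after this line.

    n_lines <- length(lines)  ## total number of lines; needed for the range below
    i_server_summary <- i_engine_util + 
      min(which(
        stri_detect_fixed(lines[(i_engine_util + 1):n_lines], "Server Summary")
      ))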
    

    Use a regular expression to extract the number from the next line.

    user_busy <- as.numeric(
      stri_extract_first_regex(lines[i_server_summary + 1], "[0-9]+(?:\\.[0-9])")
    )
    ## [1] 4.2
    

    Finding the count of committed xacts

    The "Committed Xacts" line is

    i_comm_xacts <- which(stri_detect_fixed(lines, "Committed Xacts"))
    

    The count value is a run of digits surrounded by spaces.

    xacts_count <- as.integer(
      stri_extract_all_regex(lines[i_comm_xacts], "(?<= )[0-9]+(?= )")
    )
    ## [1] 41198
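
For anyone without stringi installed, the same lookaround pattern should also work with base R's `regexpr` and `regmatches` when `perl = TRUE` is set. A minimal sketch, with the "Committed Xacts" line from the sample file pasted in as a literal string:

```r
## The "Committed Xacts" line from the sample file, pasted as a literal
line <- "    Committed Xacts                 137.3           n/a       41198     n/a     "

## Lookbehind/lookahead need the PCRE engine, hence perl = TRUE.
## "137" is skipped because it is followed by "." rather than a space.
m <- regexpr("(?<= )[0-9]+(?= )", line, perl = TRUE)
xacts_count_base <- as.integer(regmatches(line, m))
print(xacts_count_base)
```

`regexpr` returns only the first match per string, which is enough here because the count is the only pure-integer field on the line.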
    

    Combining the results

    data.frame(
      Date = current_run_date,
      CPU = user_busy,
      Count = xacts_count
    )
    

    【Comments】:
