How do you parse a sysmon file to extract certain information using R? (Answers)

【Question title】: How do you parse a sysmon file to extract certain information using R?
【Posted】: 2016-12-09 17:44:18
【Question description】:

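I am trying to use R to read a large number of files like this one, parse out some information, and put the data into a data-frame-like format:

Here is the content of the file: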

    last_run                        current_run                     seconds     
 ------------------------------- ------------------------------- ----------- 
             Jul  4 2016  7:17AM             Jul  4 2016  7:21AM         226 


Engine Utilization (Tick %)   User Busy   System Busy    I/O Busy        Idle
  -------------------------  ------------  ------------  ----------  ---------- 
  ThreadPool : syb_default_pool                                                 
   Engine 0                         5.0 %         0.4 %      22.4 %      72.1 % 
   Engine 1                         3.9 %         0.5 %      22.8 %      72.8 % 
   Engine 2                         5.6 %         0.3 %      22.5 %      71.6 % 
   Engine 3                         5.1 %         0.4 %      22.7 %      71.8 % 

     -------------------------  ------------  ------------  ----------  ---------- 
  Pool Summary        Total       336.1 %        25.6 %    1834.6 %    5803.8 % 
                    Average         4.2 %         0.3 %      22.9 %      72.5 % 

  -------------------------  ------------  ------------  ----------  ---------- 
  Server Summary      Total       336.1 %        25.6 %    1834.6 %    5803.8 % 
                    Average         4.2 %         0.3 %      22.9 %      72.5 % 

Transaction Profile
-------------------

  Transaction Summary             per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Committed Xacts                 137.3           n/a       41198     n/a     

     Average Runnable Tasks            1 min         5 min      15 min  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  ThreadPool : syb_default_pool                                                 
   Global Queue                       0.0           0.0         0.0       0.0 %
   Engine 0                           0.0           0.1         0.1       0.6 %
   Engine 1                           0.0           0.0         0.0       0.0 %
   Engine 2                           0.2           0.1         0.1       2.6 %

  -------------------------  ------------  ------------  ----------             
  Pool Summary        Total           7.2           5.9         6.1             
                    Average           0.1           0.1         0.1             

  -------------------------  ------------  ------------  ----------             
  Server Summary      Total           7.2           5.9         6.1             
                    Average           0.1           0.1         0.1 

Device Activity Detail
  ----------------------

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_125                                         
    datadev_125                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       n/a   
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       0.0 %


  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_126                                         
    datadev_126                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       n/a   
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       0.0 %


  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_127                                         
    datadev_127                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Reads                                                                       
      APF                             0.0           0.0           5       0.4 %
      Non-APF                         0.0           0.0           1       0.1 %
    Writes                            3.8           0.0        1128      99.5 %
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          3.8           0.0        1134       0.1 %

  Mirror Semaphore Granted            3.8           0.0        1134     100.0 %
  Mirror Semaphore Waited             0.0           0.0           0       0.0 %

  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /sybaser/database/sybaseR/dev/sybaseR.datadev_000                                    
    GPS_datadev_000               per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Reads                                                                       
      APF                             7.9           0.0        2372      55.9 %
      Non-APF                         5.5           0.0        1635      38.6 %
    Writes                            0.8           0.0         233       5.5 %
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                         14.1           0.0        4240       0.3 %

  Mirror Semaphore Granted           14.1           0.0        4239     100.0 %
  Mirror Semaphore Waited             0.0           0.0           2       0.0 %

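I need to capture "Jul 4 2016 7:21AM" as the date and, from the "Engine Utilization (Tick %)" section, the Server Summary -> Average value "4.2 %".

From the "Transaction Profile" section I need the "count" entry.

So my data frame should look like this: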

Date                     Cpu   Count
Jul  4 2016  7:21AM      4.2   41198

Can anyone help me parse this file to get this output?

I tried something like this:

read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])

to get this line:

Average         4.2 %         0.3 %      22.9 %      72.5 %

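But I want to be able to extract only the Average that comes right after Engine Utilization (Tick %), because there may be many lines that start with Average. The Average line that shows up right after Engine Utilization (Tick %) is the one I want.

How can I put that into this line to extract this information from the file: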

read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])

Can I use grep inside the read.table line to search for certain characters?

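【Comments】:

Note that your problem description is ambiguous. At one point you ask for "from the Engine Utilization (Tick %) line, Server Summary -> Average 4.2 %". Later you say "The Average line that shows up right after Engine Utilization (Tick %) is the one I want." The latter refers to the Pool Summary average. The way my solution below is currently written, it assumes the second of these criteria. Could you clarify in the question which line you want?
If you can read the whole file into a string variable, you can run a regular expression on it to extract what you need. I don't know R, so I don't know whether it can read into a variable or do regular expressions.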
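
The whole-file idea from the last comment can be tried in R roughly as follows. This is only a minimal sketch, assuming the section layout shown in the question. The sample lines are a shortened, hypothetical excerpt (the Server Summary average is changed to 9.9 so that the two Average lines differ), and the names `txt` and `rx` are illustrative.

```r
## Hypothetical excerpt standing in for readLines("file.txt");
## the Server Summary average is altered to 9.9 for illustration only
lines <- c(
  "Engine Utilization (Tick %)   User Busy   System Busy    I/O Busy        Idle",
  "  Pool Summary        Total       336.1 %        25.6 %    1834.6 %    5803.8 % ",
  "                    Average         4.2 %         0.3 %      22.9 %      72.5 % ",
  "  Server Summary      Total       336.1 %        25.6 %    1834.6 %    5803.8 % ",
  "                    Average         9.9 %         0.3 %      22.9 %      72.5 % "
)

## Collapse the file into a single string so one regex can span several lines
txt <- paste(lines, collapse = "\n")

## Lazily skip ahead to the first "Server Summary" after the section header,
## then capture the first number on the following "Average" line
rx <- "Engine Utilization \\(Tick %\\)[\\s\\S]*?Server Summary[^\\n]*\\n\\s*Average\\s+([0-9.]+)"
m <- regmatches(txt, regexec(rx, txt, perl = TRUE))[[1]]

cpu <- as.numeric(m[2])  # m[1] is the full match, m[2] the captured group
print(cpu)
```

To target the Pool Summary average instead, the literal "Server Summary" in the pattern would be swapped for "Pool Summary".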

Tags: regex r parsing pcre

【Solution 1】:

%%%% Shot 1 -- got it working

extract <- function(filenam="file.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]

    data.frame(Date=date_current_run, Cpu=Cpu, Count=count, stringsAsFactors=FALSE)
}

print(extract("file.txt"))

##file.list <- dir("./")
file.list <- rep("file.txt",3)
merged <- do.call("rbind", lapply(file.list, extract))

print(merged)

file.list <- rep("file.txt",2000)
print(system.time(merged <- do.call("rbind", lapply(file.list, extract))))
## runs in about 2.5 secs on my laptop
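
The three steps in `extract` repeat the same split-and-drop-empty idiom, and they return every value as a character string. A possible tidy-up, sketched here under the assumption that fields are separated by runs of spaces, is a small helper. The name `tokens` is hypothetical and not part of the original answer.

```r
## Hypothetical helper: split a fixed-width line into its non-empty fields
tokens <- function(line) strsplit(trimws(line), " +")[[1]]

## Two lines copied from the example file in the question
line_date <- "             Jul  4 2016  7:17AM             Jul  4 2016  7:21AM         226 "
line_cpu  <- "                    Average         4.2 %         0.3 %      22.9 %      72.5 % "

## Fields 5 to 8 of the date line hold the current_run timestamp
date_current_run <- paste(tokens(line_date)[5:8], collapse = " ")

## Field 2 of the Average line is the "User Busy" value; convert it to numeric
Cpu <- as.numeric(tokens(line_cpu)[2])

print(date_current_run)
print(Cpu)
```

Converting `Cpu` (and likewise `Count`) to numeric at this point keeps the merged data frame ready for plotting or arithmetic.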

%%% Shot 2: first attempt at extracting a (possibly variable) number of device columns

extractv2 <- function(filenam="file2.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]


    ## Total I/Os
    ## 1. Each line "Device:" is assumed to be the header of a block of lines
    ##    containing info about a single device (there are 4 such blocks
    ##    in your example);
    ## 2. each block is assumed to contain one or more lines matching
    ##    "Total I/Os";
    ## 3. the relevant count data is assumed to be contained in the last
    ##    of such lines (at column 4), for each block.
    ## Approach: loop on the line numbers of those lines matching "Device:"
    ## to get: A. counts; B. device names
    ii_block_dev <- grep("Device:", txt, fixed=TRUE)
    ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
    nblocks <- length(ii_block_dev)
    ## A. get counts for each device
    ## for each block, select *last* line matching "Total I/Os"
    ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
    ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
        ## select matching lines to "Total I/Os" within each block
        IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block  ] &
                                       ii_lines_IOs < ii_block_dev_aux[block+1]
                                   ]
        tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
    })
    lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
        Filter(function(v) v!="", strsplit(strng," ")[[1]])
    })
    IOs_counts <- sapply(lines_IOs, function(v) v[5])
    ## B. get device names:
    ## assumed to be on lines following each "Device:" match
    ii_devices <- 1 + ii_block_dev
    device_names <- sapply(ii_devices, function(ii){
        Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    })
    ## Create a data.frame with "device_names" as column names and "IOs_counts" as
    ## the values of a single row.
    ## Sorting the device names by order() will help produce the same column names
    ## if different sysmon files list the devices in different order
    ord <- order(device_names)
    devices <- as.data.frame(structure(as.list(IOs_counts[ord]), names=device_names[ord]),
                             check.names=FALSE) ## Prevent R from messing with our device names

    data.frame(stringsAsFactors=FALSE, check.names=FALSE,
               Date=date_current_run, Cpu=Cpu, Count=count, devices)
}
print(extractv2("file2.txt"))


## WATCH OUT:
## merging will ONLY work if all devices have the same names across sysmon files!!
file.list <- rep("file2.txt",3)
merged <- do.call("rbind", lapply(file.list, extractv2))
print(merged)
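
If the device names do differ between files, one possible workaround is to pad each one-row data frame with the columns it lacks before binding. The sketch below is an illustration in base R only; `rbind_fill` is a hypothetical helper, and the two toy data frames (dates and counts made up) stand in for the output of `extractv2` on two files.

```r
## Hypothetical helper: row-bind data frames whose columns differ,
## filling the columns a given file does not have with NA
rbind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (m in setdiff(all_cols, names(d))) d[[m]] <- NA_character_
    d[all_cols]  # same column order in every piece
  })
  do.call("rbind", padded)
}

## Toy stand-ins for two extractv2() results with different devices
a <- data.frame(Date = "Jul  4 2016  7:21AM", datadev_125 = "0",
                stringsAsFactors = FALSE, check.names = FALSE)
b <- data.frame(Date = "Jul  5 2016  7:21AM", datadev_126 = "12",
                stringsAsFactors = FALSE, check.names = FALSE)

merged <- rbind_fill(list(a, b))
print(merged)
```

With many files this would be called as `rbind_fill(lapply(file.list, extractv2))` in place of the `do.call("rbind", ...)` line.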

%%%%%%% Shot 3: extract two tables, one with a single row and another with a variable number of rows (depending on the devices listed in each sysmon file).

extractv3 <- function(filenam="file2.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]

    ## first part of output: fixed three-column structure
    fixed <-  data.frame(stringsAsFactors=FALSE,
                         Date=date_current_run, Cpu=Cpu, Count=count)

    ## Total I/Os
    ## 1. Each line "Device:" is assumed to be the header of a block of lines
    ##    containing info about a single device (there are 4 such blocks
    ##    in your example);
    ## 2. each block is assumed to contain one or more lines matching
    ##    "Total I/Os";
    ## 3. the relevant count data is assumed to be contained in the last
    ##    of such lines (at column 4), for each block.
    ## Approach: loop on the line numbers of those lines matching "Device:"
    ## to get: A. counts; B. device names
    ii_block_dev <- grep("Device:", txt, fixed=TRUE)
    if(length(ii_block_dev)==0){
        variable <- data.frame(stringsAsFactors=FALSE,
                               date_current_run=date_current_run,
                               device_names=NA, IOs_counts=NA)
    }else{
        ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
        nblocks <- length(ii_block_dev)
        if(length(ii_lines_IOs)==0){
            ## devices were found but no "Total I/Os" lines: report it
            warning(sprintf("WEIRD datapoint at date %s: I have %d devices but 0 I/O lines??",
                            date_current_run, nblocks))
            ##stop()
        }
        ## A. get counts for each device
        ## for each block, select *last* line matching "Total I/Os"
        ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
        ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
            ## select matching lines to "Total I/Os" within each block
            IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block  ] &
                                           ii_lines_IOs < ii_block_dev_aux[block+1]
                                          ]
            tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
        })
        lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
            Filter(function(v) v!="", strsplit(strng," ")[[1]])
        })
        IOs_counts <- sapply(lines_IOs, function(v) v[5])
        ## B. get device names:
        ## assumed to be on lines following each "Device:" match
        ii_devices <- 1 + ii_block_dev
        device_names <- sapply(ii_devices, function(ii){
            Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
        })
        ## Create a data.frame with three columns: date, device, counts
        variable <- data.frame(stringsAsFactors=FALSE,
                               date_current_run=rep(date_current_run, length(IOs_counts)),
                               device_names=device_names, IOs_counts=IOs_counts)
    }
    list(fixed=fixed, variable=variable)
}
print(extractv3("file2.txt"))


file.list <- c("file.txt","file2.txt","file3.txt")
res <- lapply(file.list, extractv3)

fixed.merged <- do.call("rbind", lapply(res, function(r) r$fixed))
print(fixed.merged)

variable.merged <- do.call("rbind", lapply(res, function(r) r$variable))
print(variable.merged)
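
The two merged tables share the run date, so they can be joined into one long table when needed. The following is a minimal sketch with toy stand-ins for `fixed.merged` and `variable.merged` (values copied from the example file); the name `long` is illustrative.

```r
## Toy stand-ins for the two tables produced above
fixed.merged <- data.frame(Date = "Jul 4 2016 7:21AM", Cpu = "4.2", Count = "41198",
                           stringsAsFactors = FALSE)
variable.merged <- data.frame(
  date_current_run = rep("Jul 4 2016 7:21AM", 2),
  device_names = c("/dev/vx/rdsk/sybaserdatadg/datadev_127",
                   "/sybaser/database/sybaseR/dev/sybaseR.datadev_000"),
  IOs_counts = c("1134", "4240"),
  stringsAsFactors = FALSE
)

## Join on the run date: one row per (run, device) pair,
## with Cpu and Count repeated for every device of that run
long <- merge(fixed.merged, variable.merged,
              by.x = "Date", by.y = "date_current_run")
print(long)
```

This relies on the run date being unique per file; if two files share a timestamp, their rows would be cross-joined.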

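【Discussion】:

Thanks, it works like a charm. Your answer gave me good ideas about parsing different files.
I know you have already answered this question. If you don't mind, I have one more addition, which I have added to the end of the original post. I would also like to get this value from the file: "Device: /sybaser/database/sybaseR/dev/sybaseR.temp10_logdev_001" as the column name, and the 262 from the line "Total I/Os 0.9 0.0 262 0.0 %". There are many of these entries, and I would like to create a column for each device entry.
Can you chat?
Yes, I can chat. In fact, I did not understand your last question, and I cannot find in the original post the new entries you mention in your comment.
Can you join this chat: chat.stackoverflow.com/rooms/121644/sysmon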

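【Solution 2】:

Text files are sometimes easier to process with a dedicated program. For example, gawk is designed specifically for finding patterns in text files and printing data from them. We can use a short gawk script to get the data needed for loading into R. Note that each line of the script consists of a pattern to look for, followed by an action enclosed in {}. NR is a counter that holds the number of lines read so far.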

BEGIN                          {OFS = ""; ORS = ""}
/current_run/                  {dat_line = NR+2; cpu_done = 0}
/Server Summary/               {cpu_line = NR+1}
/Transaction Summary/          {cnt_line = NR+2}
NR == dat_line                 {print "'",$5," ",$6," ",$7," ",$8,"' "}
NR == cpu_line && cpu_done==0  {print $2," "; cpu_done = 1}
NR == cnt_line                 {print $5,"\n"}

Save this script as "ext.awk", then extract all the data files into one R data frame (assuming they are all in one folder and have the extension .txt):

df <- read.table(text=system("gawk -f ext.awk *.txt", T), col.names = c("Date","Cpu","Count"))

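Note that gawk is already installed on most Linux distributions. On Windows, you may need to install it from http://gnuwin32.sourceforge.net/packages/gawk.htm

【Discussion】:

【Solution 3】:

To read the files: here I assume CSV as the file type. For other types, see http://www.r-tutor.com/r-introduction/data-frame/data-import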

>utilization <- read.csv(file="",head=TRUE)
>serverSummary <-read.csv(file="",head=TRUE)
>transcProfile <- read.csv(file="",head=TRUE)

==> merge accepts only two data frames at a time

>data <- merge(utilization,serverSummary)
>dataframe <-merge(data,transcProfile)

Now you have all the columns in one data frame

>dataframe

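You can see all the columns in the data frame

Extract the columns you need ==> the subset() function is the easiest way to select variables and observations

>subset(dataframe,select=c("last_run","Average","Transaction Profile"))

Now you can write it to CSV or any other file type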

>write.csv(dataframe, file = "MyData.csv")

To merge all the files together

multmerge = function(mypath){
filenames=list.files(path=mypath, full.names=TRUE)
datalist = lapply(filenames, function(x){read.csv(file=x,header=T)})
Reduce(function(x,y) {merge(x,y)}, datalist)
}

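Once you have run the code that defines the function, you are ready to use it. The function takes a path. This path should be the name of a folder that contains all the files you want to read and merge, and only those files. With that in mind, I have two tips:

Before you use this function, my suggestion is to create a new folder in a short directory (for example, the path could be "C://R//mergeme") and save all the files you want to merge into that folder. Also, make sure that the column used for matching is formatted the same way (and has the same name) in every file. Suppose you saved 20 files into the mergeme folder at "C://R//mergeme" and you want to read and merge them. To use my function, use the following syntax:

mymergeddata = multmerge("C://R//mergeme")

After running this command, you have one fully merged data frame in which all the variables are matched to each other.

Now you can subset the data frame to the columns you need.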

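【Discussion】:

This takes a very long time to finish, because it keeps merging files and I have thousands of files to process.

【Solution 4】:

Use readLines or stringi::stri_read_lines to read the file contents as a character vector. The latter is usually faster, but it is less mature and occasionally breaks on unusual content.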

lines <- readLines("the file name")

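For fast regular-expression matching, stringi is usually the best choice. rebus.datetimes lets you generate a regular expression from a strptime date format string.

Finding the current run date

The line where current_run appears is: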

library(stringi)
library(rebus.datetimes)

i_current_run <- which(stri_detect_fixed(lines, "current_run"))

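To extract the dates, this code only looks at the second line after the one where current_run was found. The code is vectorized, though, so if you have files where that assumption does not hold, you can easily look at all the lines instead.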

date_format <- "%b%t%d%t%Y%t%H:%M%p"
rx_date <- rebus.datetimes::datetime(date_format, io = "input")
extracted_dates <- stri_extract_all_regex(lines[i_current_run + 2], rx_date)
current_run_date <- strptime(
  extracted_dates[[1]][2], date_format, tz = "UTC"
)
## [1] "2016-07-04 07:21:00 UTC"
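
For readers without stringi or rebus.datetimes, the same extraction can be approximated in base R. This is a sketch under stated assumptions, not the answer's method: the regular expression `rx_date` is hand-written for the timestamp layout in the example file, and `%b` and `%p` are locale-dependent, so an English (or C) locale is assumed. The format uses `%I` (12-hour clock) because the timestamps carry an AM/PM suffix.

```r
## The line two below "current_run" in the example file
line <- "             Jul  4 2016  7:17AM             Jul  4 2016  7:21AM         226 "

## Hand-written pattern for timestamps such as "Jul  4 2016  7:21AM"
rx_date <- "[A-Z][a-z]{2} +[0-9]{1,2} +[0-9]{4} +[0-9]{1,2}:[0-9]{2}[AP]M"
found <- regmatches(line, gregexpr(rx_date, line))[[1]]

## The second match is current_run; whitespace in the format string
## matches any run of whitespace in the input
current_run_date <- as.POSIXct(
  strptime(found[2], "%b %d %Y %I:%M%p", tz = "UTC")
)
print(current_run_date)
```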

Finding the % user busy

The "Engine Utilization" section can be found with

i_engine_util <- which(
  stri_detect_fixed(lines, "Engine Utilization (Tick %)")
)

We want the first instance of "Server Summary" that comes after this line.

n_lines <- length(lines)
i_server_summary <- i_engine_util + 
  min(which(
    stri_detect_fixed(lines[(i_engine_util + 1):n_lines], "Server Summary")
  ))

Use a regular expression to extract the number from the next line.

user_busy <- as.numeric(
  stri_extract_first_regex(lines[i_server_summary + 1], "[0-9]+(?:\\.[0-9])")
)
## [1] 4.2

Finding the count of committed xacts

The "Committed Xacts" line is

i_comm_xacts <- which(stri_detect_fixed(lines, "Committed Xacts"))

The count value is a run of digits surrounded by spaces.

xacts_count <- as.integer(
  stri_extract_all_regex(lines[i_comm_xacts], "(?<= )[0-9]+(?= )")
)
## [1] 41198

Combining the results

data.frame(
  Date = current_run_date,
  CPU = user_busy,
  Count = xacts_count
)

【Discussion】: