Core OS Events in Windows 7, Part 1

Dr. Insung Park and Alex Bendetov

Today's computer software constantly breaks new ground. Consumer software applications offer a sophisticated set of features that enable rich new experiences. Powerful server applications are setting new records in throughput, speed and scale. These improvements have been made possible by rapid progress in hardware technologies and continuous adoption of software advancements in optimization, virtualization, and distributed and parallel computing. However, as a result, software applications have become larger and more complicated. At the same time, users' expectations about software quality are higher than ever. Fundamental characteristics such as performance, reliability and manageability have proved essential to the long-term success of software products, and they are often celebrated as primary features.

Increasing software complexity and higher user expectations on quality thus present a difficult challenge in software development. When an unexpected problem occurs, predicting internal states of all relevant components is nearly impossible. Retracing the history of execution flows is cumbersome and tricky, but often necessary in finding out the root cause of software problems. When users report problems after deployment, they expect the root cause of the problem to be quickly identified and addressed. The overwhelming number of hardware and software combinations, different workload characteristics, and usage patterns of end users make such tasks even tougher. The ability to use a mechanism that enables you to understand system execution in a transparent manner, with minimal overhead, is invaluable.

Event Instrumentation

Instrumentation is one effective way to measure and improve software quality. Software performance counters have provided a convenient way to monitor application execution status and resource usage at an aggregate level. Event instrumentation has also been popular over the years. Events raised by a software component at different stages of execution can significantly reduce the time it takes to diagnose various problems. In addition to scanning for certain events or patterns of events, one can apply data mining and correlation techniques to the events to produce meaningful statistics and reports on program execution and problematic behavior. The ability to collect events on production systems in real time helps avoid the need for an unwieldy debugger setup on customer machines.

Introduced in the Windows 2000 operating system, Event Tracing for Windows (ETW) is a general-purpose event-tracing platform on Windows operating systems. Using an efficient buffering and logging mechanism implemented in the kernel, ETW provides a mechanism to persist events raised by both user-mode applications and kernel-mode device drivers. Additionally, ETW gives users the ability to enable and disable logging dynamically, making it easy to perform detailed tracing in production environments without requiring reboots or application restarts.

The operating system itself has been heavily instrumented with ETW events. The ability to analyze and simulate core OS activities based on ETW events in development, as well as on production-mode systems, has been valuable to developers in solving many quality problems. With each subsequent Windows release, the number of ETW events raised by the operating system has increased; Windows 7 is the most instrumented operating system to date. In addition, Windows 7 contains tools that can utilize these operating system ETW events to analyze system performance and reliability, as well as uncover quality problems in software applications.

Many application problems surface as anomalies in OS resource usage, such as unexpected patterns or spikes in the consumption of CPU, memory, network bandwidth, I/O and so on. Because OS events for most system activities can be traced to the originating process and thread, one can make considerable progress in narrowing down possible root causes of many application problems, even without ETW instrumentation in applications. Of course, ETW instrumentation in the application would make further diagnosis significantly more efficient.

In the first article of our two-part series, we present a high-level overview of the ETW technology and core OS instrumentation. Then, we discuss tool support to obtain and consume OS events. Next, we provide more details on the events from various subcomponents in the core OS. We also explain how the different system events can be combined to produce a comprehensive picture of system behavior, which we demonstrate by using a set of Windows PowerShell scripts.

Event Tracing for Windows

As mentioned earlier, ETW is a logging platform that efficiently records the events sent by software applications or kernel-mode components. Using ETW provider APIs, any application, DLL or driver can become an event provider (a component that raises events) for ETW. A provider first registers with ETW and sends events from various points in the code by inserting ETW logging API calls. Any recordable activity of importance can be an event, and it is represented by a piece of data written by ETW at the time of logging. These logging API calls are ignored when the provider is not enabled. An ETW controller application starts an ETW session and enables certain providers to it. When an enabled event provider makes a logging API call, the event is then directed to the session designated by the controller. Events sent to a session may be stored in a log file, consumed programmatically in real time, or kept in memory until the controller requests a flush of that data to a file. A previous article, "Improve Debugging And Performance Tuning with ETW" (msdn.microsoft.com/en-us/magazine/dvdarchive/cc163437.aspx), has more details about the ETW technology and how to add ETW instrumentation into an application. Over the years, ETW has come to support many different logging modes and features, which are documented on MSDN.
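The relationship between providers, controllers and sessions described above can be illustrated with a toy model. The following Python sketch is not the ETW API: the Provider, Controller and Session classes and all of their methods are hypothetical names invented for illustration. It only mimics two behaviors from the text, namely that logging calls are ignored while a provider is disabled and that events from an enabled provider are directed to the session chosen by the controller.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Session:
    """Hypothetical stand-in for an ETW session; it only buffers events in memory."""
    name: str
    events: list = field(default_factory=list)


class Provider:
    """Hypothetical stand-in for an event provider (application, DLL or driver)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.session: Optional[Session] = None  # set by the controller when enabled

    def write_event(self, description: str) -> bool:
        # A logging call made while the provider is not enabled is ignored.
        if self.session is None:
            return False
        self.session.events.append((self.name, description))
        return True


class Controller:
    """Hypothetical stand-in for a controller that starts sessions and enables providers."""

    def __init__(self) -> None:
        self.sessions: dict = {}

    def start_session(self, name: str) -> Session:
        session = Session(name)
        self.sessions[name] = session
        return session

    def enable(self, provider: Provider, session_name: str) -> None:
        provider.session = self.sessions[session_name]

    def disable(self, provider: Provider) -> None:
        provider.session = None


provider = Provider("MyApp")
controller = Controller()

provider.write_event("ignored: provider not enabled yet")
controller.start_session("demo-session")
controller.enable(provider, "demo-session")
provider.write_event("request started")
provider.write_event("request finished")
controller.disable(provider)
provider.write_event("ignored: provider disabled again")

logged = controller.sessions["demo-session"].events
print(logged)
```

Only the two events written while the provider was enabled reach the session. The real platform does its buffering in the kernel and offers many more logging modes than this in-memory list suggests.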

An ETW event consists of a fixed header followed by context-specific data. The header identifies the event and the component raising the event, while the context-specific data ("event payload" hereafter) refers to any additional data that the component raising the event wants to record. When an event raised by a provider is written to an ETW session, ETW adds metadata to the header, including thread and process IDs, the current CPU on which the logging thread is running, CPU time usage of the thread, and timestamp. Figure 1 shows an XML representation of an event (a Process event of type Start) as decoded by the tracerpt tool (to be discussed later) in the XML dump file. The <System> section is common to all events and represents the common header that ETW records for each event. It contains the timestamp, process and thread IDs, provider GUID, CPU time usage, CPU ID, and so on. The <EventData> section displays the logged payload of this event. As shown in Figure 1, a Process Start event from the Windows kernel contains the process key (a unique key assigned to each process for identification), process ID, parent process ID, session ID, exit status (valid only for a Process End event), user SID, executable file name of the process, and the command that started the process.
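To make the header and payload split concrete, here is a minimal Python sketch that separates the <System> header from the <EventData> payload of one event in an XML dump. The sample event is made up for illustration and is not the content of Figure 1. The XML namespace, the Execution and TimeCreated attribute names, and the payload field names are assumptions modeled on the Windows event XML schema, so adjust them to match an actual dump.

```python
import xml.etree.ElementTree as ET

# Made-up sample event. The layout is an assumption, not a copy of Figure 1.
SAMPLE = """\
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Guid="{00000000-0000-0000-0000-000000000000}" />
    <TimeCreated SystemTime="2009-01-01T00:00:00.000000000Z" />
    <Execution ProcessID="100" ThreadID="200" ProcessorID="0" KernelTime="30" UserTime="15" />
  </System>
  <EventData>
    <Data Name="ProcessId">0x1A4</Data>
    <Data Name="ParentId">0x64</Data>
    <Data Name="ImageFileName">example.exe</Data>
    <Data Name="CommandLine">example.exe /demo</Data>
  </EventData>
</Event>
"""

# Namespace prefix map used by the ElementTree path expressions below.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}


def parse_event(xml_text):
    """Return (header, payload) dictionaries for a single <Event> element."""
    root = ET.fromstring(xml_text)
    execution = root.find("e:System/e:Execution", NS)
    header = {
        "provider_guid": root.find("e:System/e:Provider", NS).get("Guid"),
        "timestamp": root.find("e:System/e:TimeCreated", NS).get("SystemTime"),
        "process_id": int(execution.get("ProcessID")),
        "thread_id": int(execution.get("ThreadID")),
        "cpu_id": int(execution.get("ProcessorID")),
    }
    # Payload fields are kept as strings; callers decide how to interpret them.
    payload = {
        data.get("Name"): (data.text or "").strip()
        for data in root.findall("e:EventData/e:Data", NS)
    }
    return header, payload


header, payload = parse_event(SAMPLE)
print(header["process_id"], header["thread_id"], payload["ImageFileName"])
# The sample stores the started process ID as a hex string, so convert it explicitly.
print(int(payload["ProcessId"], 16))
```

Keeping the payload values as raw strings is a deliberate choice in this sketch, because the payload layout depends on the event type while the header is the same for every event.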

ETW controllers are the applications that use the ETW control API set to start ETW sessions and enable one or more providers to those sessions. They need to give each session a unique name, and on Windows 7 there can be up to 64 sessions running concurrently.

Figure 1 Process Start Event in XML Dump

Collecting Events Using Tools on Windows

There are a few ETW control tools available on Windows that allow users to collect events. For instance, the Performance Monitor exposes ETW control in the form of a data collection set. The Event Viewer is also capable of starting and stopping ETW sessions and viewing events. For command-line and script interfaces, logman.exe offers options to perform ETW control operations. For event consumption, the command-line tool tracerpt.exe can consume ETW log files and produce dumps in several formats, including CSV and XML. An example of an XML representation of a Process Start event is shown in Figure 1. In this article, we use logman.exe and tracerpt.exe in the samples we present. The "logman query providers" command in Figure 2 lists the different flags that can be used by a controller when enabling the kernel session.
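When tracerpt.exe is driven from a script, it can help to build its command line in one place. The following Python sketch assembles a tracerpt invocation that converts a log file to an XML or CSV dump. The helper tracerpt_command and the file names trace.etl and trace.xml are hypothetical. The sketch assumes tracerpt.exe is on the PATH and uses its -o (output file), -of (output format) and -y (answer yes to prompts) options. The tool is run only on Windows and only if the input file exists.

```python
import os
import subprocess
import sys


def tracerpt_command(etl_path, out_path, out_format="XML"):
    """Build the argument list for a tracerpt dump (hypothetical helper)."""
    if out_format not in ("CSV", "XML"):
        raise ValueError("out_format must be 'CSV' or 'XML'")
    # -o names the dump file, -of selects its format, -y answers yes to prompts.
    return ["tracerpt", etl_path, "-o", out_path, "-of", out_format, "-y"]


command = tracerpt_command("trace.etl", "trace.xml")
print(" ".join(command))

# Run the tool only where it can exist and where there is a log file to read.
if sys.platform == "win32" and os.path.exists("trace.etl"):
    subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when a log file path contains spaces.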

The following command starts the kernel session and enables process, thread, disk, network, image, and registry events. The collected events will be stored in a file called systemevents.etl in the current directory. Controlling the kernel session and collecting core OS events require administrator privileges: