Wednesday, July 11, 2007

SOS from Your Production Environment

Developing large enterprise applications is a complex and difficult undertaking. Writing the code is just one of many tasks you have to do. You worry about requirements, designs, architecture, unit testing, daily builds, release builds to QC, and many more things. All this effort is spent to create a reliable, scalable, well-performing, and functioning application. Then comes the day when you move it into production (if you are lucky, it is hosted by your own organization) or the customer starts installing it on their servers. This is a big day, a day of celebration. You see the fruits of all your labor and you are excited to see users working with the application, to get their feedback, and to improve the application. But too often it comes back to haunt you. The customer reports crashes, instability, or unpredictable behavior. You tell yourself: but it works in our environments. What is different between our test environments and the customer's environments?

This is one of the most difficult challenges a development team can face. Your options are suddenly limited. In your development environment, you fire up the VS .NET debugger, set breakpoints, look at the application state, and so forth. Through that, you are finally able to figure out what is going on and then make your code change. But try telling the customer that you need to install VS .NET to debug this issue. Watch out for the reaction; it might be pretty nasty. Production environments are typically locked down and only approved applications can be installed. Very often, any change applied to production needs to go through a stringent test process, which takes time. All this while, the end users have to bear with the stability, performance, or functional problems. If this goes on for too long, users will abandon the application and the organization has to fight an uphill battle to convince end users to come back. This creates lots of frustration, noise, and problems, and can result in large losses. This is the ultimate hell for every developer. You have no idea what is going on while everyone expects a resolution by yesterday.

Gather Data to Make Informed Decisions

Applications can behave very differently in various environments and under load. First, stop worrying about all the shouting. Concentrate on gathering the right data so you can narrow down what is going on. Start with basic information, like which OS and Windows patches are installed. Look at the event log to find out if there are system or application errors reported. If it is not done automatically, run a virus check to make sure there is no virus infection going on. Enable your custom application logs and comb through them to find out what is happening. If all that does not uncover anything, understand how the application is used: which features are used heavily by users, how many concurrent users are on the system, and so on. Then, replicate a similar environment in house and run a load test against it that simulates the usage scenario as closely as possible (see my article about concurrent users stress testing).

If all that does not bring you closer to a resolution, you need to take a snapshot of the application in production and analyze it. This article will introduce you to the basic approach for this and then point you to more advanced articles. It is easier than most people believe. Microsoft has built a very nice debugging story—in the unmanaged as well as managed world.

The "Debugging Tools for Windows"

Microsoft provides debugging tools for Windows NT 4.0, Windows 2000, Windows XP, and Windows Server 2003. They are available from the "Debugging Tools for Windows" homepage. Follow the "Install Debugging Tools for Windows 32-bit Version" link to download the latest version (this article uses version ...). The tools, by default, are installed in the "c:\program files\debugging tools for windows" folder. The install also adds a "Debugging Tools for Windows" menu group under "All Programs." This includes a "Debugging Help" entry that provides some very good information.

There are a number of debuggers that you can use to debug your application. This article concentrates on how you can take a dump of your application and then analyze that dump on another machine rather than on the production environment itself. You will see how you can take a dump when the application hangs, crashes, or just while it is running. These are complete memory dumps, so you can see all the threads executing, all the objects on the stack, and the like. This is the least intrusive approach to really understanding what is happening in your application while it is used in production. It also does not require any files to be registered; this makes it easier to get permission to use it in production and to remove it again when it is no longer needed (which the customer might request). Install the debugging tools on any machine you want and then copy the following five files from the "c:\program files\debugging tools for windows" folder to the production environment:

  • adplus.vbs
  • cdb.exe
  • dbgeng.dll
  • dbghelp.dll
  • tlist.exe

You don't need to register the DLLs. The cdb.exe file is the "Microsoft Console Debugger" and the adplus.vbs file is a Windows script that automates the CDB debugger. It requires Windows Script Host 5.6 to be installed (run cscript.exe to check the version number). If required, download that version from Microsoft and install it on the production server.
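Before you ask for time on a locked-down server, it helps to confirm that the folder you are about to copy really contains all five files. The following is a minimal sketch of such a check; the function name and the idea of scripting the check are my own illustration, not part of the debugging tools.

```python
from pathlib import Path

# The five files this article says to copy to the production server.
REQUIRED_FILES = ["adplus.vbs", "cdb.exe", "dbgeng.dll", "dbghelp.dll", "tlist.exe"]


def missing_debugger_files(folder):
    """Return the required file names that are not present in `folder`.

    File names are compared case-insensitively, as on Windows.
    """
    present = {p.name.lower() for p in Path(folder).iterdir() if p.is_file()}
    return [name for name in REQUIRED_FILES if name not in present]
```

An empty list means the folder is complete; anything else names the files you still need to copy.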

Always Create the Symbol Files for Your Binaries

A debugger needs symbol files to show you more than just class, method, and object addresses. Symbols enable debuggers to show you the class names, variable names, and so forth. You can debug an application without symbols, but it is much harder and needs a lot of experience. You want to make your life as easy as possible; therefore, always generate the symbol files. When you compile your application in debug mode, you will also see a PDB file in the same folder where the DLL or EXE gets generated. The PDB file is the symbol file that you need for debugging purposes. Of course, you do not want to release the debugging version of your binaries. You can tell the compiler to generate these symbol files when compiling in release mode as well. Open the project settings in your Visual Studio .NET IDE (menu Project | Settings). Select the Build tab, select "Release" in the Configuration drop-down box if not already selected, and then click on the Advanced button. In the "Debug Info" drop-down box, select "PDB-only." Close your project settings and rebuild your project. You need to do that for all project files. Make it a habit that, when you release your application, you release not just the binaries (DLLs and EXEs) but also all their symbols. That way, you have the symbols ready anytime you need them for debugging purposes.
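To turn that habit into something a release script can enforce, you can check that every binary in the release folder has a PDB file next to it. This is a minimal sketch under the assumption that symbols sit in the same folder and share the binary's base name; the function name is hypothetical.

```python
from pathlib import Path


def binaries_without_symbols(release_dir):
    """List DLL/EXE files in `release_dir` that have no matching PDB next to them."""
    missing = []
    for binary in sorted(Path(release_dir).iterdir()):
        if binary.suffix.lower() in (".dll", ".exe"):
            # Assumption: the symbol file shares the binary's base name,
            # for example ThrowException.exe and ThrowException.pdb.
            if not binary.with_suffix(".pdb").exists():
                missing.append(binary.name)
    return missing
```

A release step could fail the build whenever this list is not empty.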

Symbol files contain information such as all the class names, method names, global and local variable names, as well as source line numbers. They are kept separate so that your binaries are smaller and faster when running. Later in the article, I explain how you can load these symbols into the debugger. You can also obtain all the symbols for the Windows OS, the .NET framework, and many other Microsoft products. You can tell the debugger to download them as needed from the Internet or, if you do not have access to the Internet while debugging, you can download them from the Microsoft site (Windows symbols). The article will explain how to set up your debugger to download Microsoft symbol files as needed.

Using ADPlus to Take Application Dumps

Now, you are ready to take dumps. First, start your application. The article has a ThrowException .NET sample application attached; it allows you to generate two unhandled exceptions. You will use this sample application to walk through all the examples in this article. Next, open the task manager and go to the "Process" tab. Select the "Show processes from all users" check box at the bottom so you can see all processes running. Next, find the process named "ThrowException.exe" and note down the process ID (shown in the PID column).
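If you would rather look up the process ID from a script than from Task Manager, the sketch below shows one way. It assumes the standard Windows tasklist command with its CSV output (image name in the first column, PID in the second) as a stand-in for Task Manager; the function names are my own.

```python
import csv
import io
import subprocess


def parse_tasklist_csv(text, image_name):
    """Return the PIDs of all rows whose image name matches (case-insensitive)."""
    pids = []
    for row in csv.reader(io.StringIO(text)):
        # Rows that are not process entries (for example an informational
        # message when nothing matches) have fewer columns and are skipped.
        if len(row) >= 2 and row[0].lower() == image_name.lower() and row[1].isdigit():
            pids.append(int(row[1]))
    return pids


def find_pids(image_name):
    """Run tasklist (Windows only) and return the PIDs for `image_name`."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image_name}", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return parse_tasklist_csv(result.stdout, image_name)
```

On the production server, find_pids("ThrowException.exe") would return the ID you otherwise read from the PID column.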

ADPlus has a number of command line options. First, you need to decide whether you want to take a crash dump or a hang dump. A crash dump is for situations when your application unexpectedly terminates. Hang dumps can be used to take a dump when your application hangs or at any time while it is running. ADPlus cannot be used in scenarios where your application crashes while starting up. It can only be used for applications that are running and then crash. Use the CDB or WinDbg debuggers for scenarios where your application crashes during startup. ADPlus automates the CDB debugger and attaches it to your process. It can also attach CDB to multiple processes; for example, when your application runs under IIS and also uses COM+. When CDB kicks in, it freezes all processes it has been attached to, takes a dump for each asynchronously, and then lets these processes continue to run.

Running ADPlus in Crash Mode

Open a command prompt and go to the folder where you installed or copied the debugging files. You need to provide at a minimum the following command line arguments when running ADPlus:

  • Mode: The mode you want the CDB debugger to run in. Add "-crash" for crash mode or "-hang" for hang mode.
  • Process to monitor: Add "-p <process id>" to tell CDB which process to attach to. You can repeat that option for each process you want to monitor. For each process, it spawns a separate instance of CDB.
  • Quiet mode: When you run ADPlus, it will show a dialog box at the beginning, telling you which mode has been chosen and where the log files will be created. When you run ADPlus on a remote machine, you need to suppress this dialog box; otherwise, ADPlus itself will hang (see later in the article). Add the option "-quiet".
  • Location of log files: With the "-o <log file path>" option, you can specify the path where the log file will be created. The CDB debugger creates a unique folder under that log file path each time it runs. The folder name is a combination of the mode and the date and time CDB was started.
    This guarantees that no dump will be overwritten with another dump. In that folder, you find the actual memory dump as well as a number of log files. The file "ADPlus_report.txt" contains information about the configuration the CDB debugger has been started up with. The "Process_List.txt" file lists information about all the processes running when CDB started. The "PID-<process id>__<process name>__<date>__<time>.log" file contains all the output of the CDB debugger while running. The actual dump generated by CDB gets placed in the "PID-<process id>__<process name>__<...>.dmp" file.
  • Symbol path: The option "-y <path>" specifies the path where the symbol files can be found. The path contains three pieces of information:

    • Symbol server: The symbol server to use. This should always be "srv" unless you have a custom symbol server you utilize.
    • Downstream store: The downstream symbol store; for example, "c:\symbols". CDB will cache symbols from the upstream store to the downstream store, providing a cascading symbol store cache.
    • Upstream store: the upstream symbol store. This can be a local path, a network path, or a URL.

    All three pieces of the path should be separated by a "*". The following example points to the public symbol store from Microsoft and uses a local downstream store:

    -y "srv*c:\local symbols*
    This allows CDB to download the symbols to your local store, which makes any subsequent access to the symbol files much faster. Symbols are copied to the downstream store only as CDB requires them, so it does not just go ahead and copy every symbol file. You also can list multiple symbol stores by separating each with a semicolon. The next example points to the Microsoft public symbol store as well as the symbol files of your application:

-y "srv*c:\local symbols*
srv*c:\local symbols*c:\ThrowException\bin\Release"
You also can use the "_NT_SYMBOL_PATH" environment variable instead of using the "-y" option. As mentioned earlier in the article, you can download all the Microsoft symbols if the production environment does not have Internet access. This also means that all your application symbols should be copied to a folder on the production environment. The following article provides a much more comprehensive explanation of the symbol stores and symbol server.
  • Exception mode: Any exception can be raised to the debugger as a first-chance or second-chance exception. First-chance exceptions are non-fatal exceptions that are handled by the application. If a first-chance exception is not handled by the application, it gets raised as a second-chance exception. Only debuggers can handle second-chance exceptions. Second-chance exceptions normally cause the application to shut down unless a debugger is attached to it. By default, ADPlus takes a mini dump for all first-chance exceptions except unknown and EH exceptions (these are quite common and would generate too much overhead). It pauses the thread, then logs the exception, the thread ID, and the call stack of the thread that raised the exception, as well as the date and time when the exception occurred. Finally, it takes the mini dump and then resumes the process. The following four command-line options control what action is taken when a first-chance or second-chance exception happens:

    • Full dump on first-chance exceptions: The "-FullOnFirst" option tells ADPlus to take a full dump for first-chance exceptions.
    • No dump on first-chance exceptions: The "-NoDumpOnFirst" option tells ADPlus to take no dumps at all for first-chance exceptions.
    • Mini dump for second-chance exceptions: By default, ADPlus takes a full dump for second-chance exceptions. The "-MiniOnSecond" option tells ADPlus to take only mini dumps at second-chance exceptions. This is useful when you need to send the dump to someone to look at. These are small dumps, whereas full dumps can be hundreds of megabytes and are difficult to send around.
    • No dump on second-chance exceptions: The "-NoDumpOnSecond" option tells ADPlus not to generate any dumps on second-chance exceptions.

  • Notification: The "-notify <machine name>" option sends an alert to that machine when a crash dump is taken. This brings up a message box on the machine and is useful because you don't have to sit and wait until a crash happens.

    For a complete list of all the ADPlus command line arguments, please refer to the topic "ADPlus Command-Line Options" in the "Debugging Help" section. It also explains how you can create a configuration file with all these settings and tell ADPlus with the "-c <configuration file path>" option to use the configuration file instead. Assuming that the application ThrowException runs under the process ID 2828, here is how to start ADPlus in crash mode, logging all information in the "c:\crashlogs" folder.

    ADPlus -crash -p 2828 -o c:\crashlogs
    -y "srv* c:\symbols*c:\ThrowException;
    srv* c:\symbols*"
    -quiet -FullOnFirst

    This spawns a new window that shows the CDB debugger attached to your application. You can press Ctrl+C in that window at any time to take a hang dump if no crash happens. Be aware, however, that this terminates the process. ADPlus cannot be run in crash mode through Terminal Server on Windows NT 4.0 and Windows 2000. The following article explains how to run in crash mode remotely. It also contains more detailed information about how to use ADPlus.
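When you run ADPlus repeatedly, or hand the procedure to an operations team, it can help to assemble the command line from a script so that the options stay consistent. The sketch below uses only the options described above. Two things are assumptions of mine: the function name, and launching the script through cscript.exe with adplus.vbs in the current folder.

```python
# Exception-mode options described in this article (crash mode only).
EXCEPTION_OPTIONS = {"-FullOnFirst", "-NoDumpOnFirst", "-MiniOnSecond", "-NoDumpOnSecond"}


def build_adplus_command(mode, pids, output_dir, symbol_path=None,
                         quiet=True, exception_options=()):
    """Return the ADPlus command line as an argument list."""
    if mode not in ("crash", "hang"):
        raise ValueError("mode must be 'crash' or 'hang'")
    if not pids:
        raise ValueError("at least one process ID is required")
    unknown = set(exception_options) - EXCEPTION_OPTIONS
    if unknown:
        raise ValueError(f"unknown exception options: {sorted(unknown)}")
    if mode == "hang" and exception_options:
        raise ValueError("exception options only apply to crash mode")

    # Assumption: adplus.vbs is started through the Windows script host.
    args = ["cscript.exe", "adplus.vbs", f"-{mode}"]
    for pid in pids:                      # "-p" is repeated once per process
        args += ["-p", str(pid)]
    args += ["-o", output_dir]
    if symbol_path:
        args += ["-y", symbol_path]
    if quiet:                             # suppress the startup dialog box
        args.append("-quiet")
    args.extend(exception_options)
    return args
```

The returned list can be passed to subprocess.run on the production server. Keeping the arguments in a list avoids quoting problems with paths that contain spaces.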

  • Downloads

  • - Throw Exception VS 2003
  • - Throw Exception VS 2005


    Running ADPlus in Hang Mode

    A hang dump is taken the moment you run ADPlus in hang mode. The CDB debugger attaches to the process, freezes it, takes a full dump, detaches, and then lets the process resume. This does not terminate the process at all. The hang mode can be run locally or remotely through Terminal Server. All command line options explained in the previous section apply to the hang mode, except the Exception mode and Notification options. Here is a sample of a hang dump:

    ADPlus -hang -p 2828 -o c:\crashlogs
    -y "srv* c:\symbols*c:\ThrowException;
    srv* c:\symbols*
    download/symbols" -quiet

    Analyzing the Dump File

    Taking the dump file is only half the work. Now that you have the dump file, you need to learn how to read it. The remainder of this article assumes that you have taken a full dump and that you are using the .NET application attached to this article. Refer to "Debugging Help" if you need to analyze a dump from an unmanaged application. First, copy the dump off the production environment to another machine so you can analyze it without disturbing the production environment itself. The machine you use for the analysis needs to have the "Debugging Tools for Windows" installed.
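Because ADPlus creates a new folder for every run, finding the right dump to copy can take a moment. The following is a minimal sketch that picks the most recently written dump under the log folder and copies it elsewhere. It relies on the "PID-...dmp" naming described earlier; the function names and the newest-by-modification-time rule are my own assumptions.

```python
import shutil
from pathlib import Path


def find_dump_files(log_root):
    """Return all ADPlus dump files under `log_root`, newest first."""
    dumps = [p for p in Path(log_root).rglob("PID-*.dmp") if p.is_file()]
    return sorted(dumps, key=lambda p: p.stat().st_mtime, reverse=True)


def copy_latest_dump(log_root, destination_dir):
    """Copy the newest dump to `destination_dir`; return the new path or None."""
    dumps = find_dump_files(log_root)
    if not dumps:
        return None
    destination = Path(destination_dir)
    destination.mkdir(parents=True, exist_ok=True)
    # copy2 preserves the timestamps, which helps when correlating with logs.
    return Path(shutil.copy2(dumps[0], destination))
```

Full dumps can be large, so in practice you may want to compress the file before moving it across the network.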

    Through the start menu "All Programs | Debugging Tools for Windows," start the WinDbg debugger. Everything you will walk through can also be done through CDB, which is console based, whereas WinDbg has a Windows UI. First, set the symbol path again. Go to the "File | Symbol File Path" menu. Enter the same path that you passed to ADPlus using the "-y" option.


    This means you should also have the symbol files of your application available on the machine where you analyze the dump file. Next, you open the dump file through the "File | Open Crash Dump" menu. This can be either a crash or hang dump. Select the "DMP" file created by ADPlus. When you load a crash dump, it will show that there was a second-chance exception.
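Since the same symbol path is needed both for ADPlus and for the analysis machine, it can be worth generating it in one place. The sketch below assembles the "srv*<downstream store>*<upstream store>" entries described earlier and joins them with semicolons. The helper names are hypothetical, and the Microsoft symbol server URL is an assumption on my part (it is the commonly documented address); verify it before relying on it.

```python
import os

# Assumption: commonly documented address of Microsoft's public symbol server.
MICROSOFT_SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def symbol_store(downstream, upstream):
    """Build one 'srv*<downstream>*<upstream>' entry."""
    for part in (downstream, upstream):
        if "*" in part or ";" in part:
            raise ValueError(f"store path must not contain '*' or ';': {part!r}")
    return f"srv*{downstream}*{upstream}"


def symbol_path(*stores):
    """Join several store entries with semicolons."""
    return ";".join(stores)


def environment_with_symbols(path, base_env=None):
    """Return a copy of the environment with _NT_SYMBOL_PATH set to `path`."""
    env = dict(os.environ if base_env is None else base_env)
    env["_NT_SYMBOL_PATH"] = path
    return env


# Example: Microsoft symbols plus the application's own release symbols.
example_path = symbol_path(
    symbol_store(r"c:\symbols", MICROSOFT_SYMBOL_SERVER),
    symbol_store(r"c:\symbols", r"c:\ThrowException\bin\Release"),
)
```

The resulting string can be pasted into the WinDbg symbol path dialog, passed with "-y", or placed in the "_NT_SYMBOL_PATH" environment variable of a debugger you start from a script.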

    Loading the SOS Debugger Extension

    Before you can start digging into the dump, you need to load the SOS (Son of Strike) extension for .NET. This extension provides an easy way to analyze managed data structures and look at the managed world. You can find the SOS extension at "%windir%\Microsoft.NET\Framework\<.NET version>\sos.dll". The debugging tools include a newer version of the SOS extension for .NET 1.0 and 1.1 that is located at "c:\program files\debugging tools for windows\clr10\sos.dll."

    At the bottom of the WinDbg window, you see a single text box that allows you to type in commands for the debugger. Commands can be started with an exclamation mark or a dot. This article will use the dot notation.

    Now that you have loaded the SOS extension, you can start using its debugger commands to look at the .NET application dump. All the SOS debugger commands need to be started with the exclamation mark. Please note that the commands and options will differ depending on which version of the SOS you load. Of course, the SOS provided as part of the latest version of the debugging tools will provide the most recent commands. Note that the latest debugging tools do not yet contain an updated version of the .NET 2.0 SOS extension. This one should become available with the Beta 2 of .NET soon.

    Digging into the .NET Dump File Using SOS

    Use the "!help" command to display a list of all the available debugger commands provided by the SOS extension. First, you want to see a list of all threads and their status. You can use the "~" command (without the exclamation mark) to list all the unmanaged threads. Of course, you want to look at the managed threads; therefore, you use the "!threads" command. You will see a listing like this:

        ID ThreadOBJ State Domain   APT Exception
    . 0 1 001530f8 6020 00149ff8 STA System.IO.DirectoryNotFoundException (00c01c3c)
    2 2 00161248 b220 00149ff8 MTA (Finalizer)

    You can see that thread number zero is an STA thread and that it has thrown the exception. Thread number two is an MTA thread that is used by the garbage collector to run the Finalize method, if an object implements one, before the object gets destroyed. Next, you want to find out more about the exception itself. You use the "!PrintException" command and pass along the address of the exception as shown by the "!threads" command. You will see the following information:

    0:000> !PrintException 00c01c3c
    Exception type: System.IO.DirectoryNotFoundException
    Message: Could not find a part of the path 'w:\MyFile.txt'.
    InnerException: <none>
    StackTrace (generated):
    StackTraceString: <none>
    HResult: 80070003

    You see that a System.IO.DirectoryNotFoundException has been thrown. You also see the exception message, any inner exception, any stack trace if present, and any HResult in case there is a COM object involved. Next, you want to find out more about the stack of the current thread. The current thread is marked in the thread list above with a dot at the beginning; it is the thread that threw the exception. You use the "!clrstack" command to get the stack trace, which will look like the following:

    ESP       EIP
    0012f140 77e649d3 [Frame: 12f140]
    0012f180 78cb9da9 System.IO.__Error.WinIOError(Int32, System.String), mdToken: 06003489
    0012f1ac 78a8add4 System.IO.FileStream.Init(System.String, ...), mdToken: 060035d9
    0012f258 78a8aa13 System.IO.FileStream..ctor(System.String, ...), mdToken: 060035d8
    0012f288 78a8d295 System.IO.FileStream..ctor(System.String, ...), mdToken: 060035d5
    0012f2b0 78b66e59 System.IO.File.Create(System.String, Int32, Boolean), mdToken: 0600355e
    0012f2c4 78b66e05 System.IO.File.Create(System.String), mdToken: 0600355c
    0012f2c8 78b68250 System.IO.FileInfo.Create(), mdToken: 0600359b
    0012f2cc 0520039d ThrowException.ThrowException.UnhandledException2_Click(...), mdToken: 0600000d
    0012f2d8 7b3b5f15 System.Windows.Forms.Control.OnClick(System.EventArgs), mdToken: 0600126d
    0012f2e8 7b3ed65d System.Windows.Forms.Button.OnClick(System.EventArgs), mdToken: 06001b69
    0012f2f4 7b3ed7a9 System.Windows.Forms.Button.OnMouseUp(...), mdToken: 06001b6b
    0012f31c 7b3ba2f7 System.Windows.Forms.Control.WmMouseUp(...), mdToken: 06001346
    0012f358 7b363775 System.Windows.Forms.Control.WndProc(...), mdToken: 06001359
    0012f374 7b371345 [Frame: 12f374]
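As mentioned earlier, everything done in WinDbg can also be done through CDB. If you analyze dumps often, you may want to replay the same commands without clicking through the UI. The sketch below builds a CDB command line that opens a dump, loads SOS, and runs the SOS commands used in this walkthrough. The function name is hypothetical. It assumes the CDB options "-z" (dump file), "-y" (symbol path), and "-c" (initial commands), the ".load" command for loading an extension DLL, and that the commands can be chained with semicolons. Treat it as a starting point and check the details against "Debugging Help."

```python
def build_cdb_analysis_command(dump_file, sos_path, symbol_path=None,
                               exception_address=None):
    """Return a CDB argument list that runs the SOS commands from this article."""
    commands = [f".load {sos_path}", "!threads"]
    if exception_address:
        # Address as reported in the Exception column of !threads.
        commands.append(f"!PrintException {exception_address}")
    commands += ["!clrstack", "q"]        # "q" ends the debugging session

    args = ["cdb.exe", "-z", dump_file]
    if symbol_path:
        args += ["-y", symbol_path]
    args += ["-c", "; ".join(commands)]
    return args
```

Running the resulting command with its output redirected to a file gives you a text report that you can attach to a bug entry. The SOS path you pass in depends on the .NET version of the dumped application, as described in the section on loading the SOS extension.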


