Computer Techno: Advanced Crash Dump Analysis

In this section, we’ll address how to troubleshoot systems that for some reason are not recording a crash dump. One reason why a crash dump might not be recorded is if the paging file on the boot volume is too small to hold the dump. This can easily be remedied by increasing the size of the paging file. A second reason why there might not be a crash dump recorded is because the kernel code and data structures needed to write the crash dump have been corrupted at the time of the crash. As described earlier, this data is checksummed when the system boots, and if the checksum made at the time of the crash does not match, the system does not even attempt to save the crash dump (so as not to risk corrupting data on the disk). So in this case, you need to catch the system as it crashes and then try to determine the reason for the crash.

Another reason occurs when the disk subsystem for the system disk is not able to process disk write requests (a condition that might have triggered the system failure itself). One such condition would be a hardware failure in the disk controller or maybe a cabling issue near the hard disk.

Yet another possibility occurs when the system has drivers that have registered to add secondary dump data to the dump file. When the driver callbacks are called, they might incorrectly access data structures located in paged memory (for example), which will lead to a second crash.

One simple option is to turn off the Automatically Restart option in the Startup And Recovery settings so that if the system crashes, you can examine the blue screen on the console. However, only the most straightforward crashes can be solved from just the blue-screen text.

To perform more in-depth analysis, you need to use the kernel debugger to look at the system at the time of the crash. This can be done by booting the system in debugging mode, which is described in the previous section. When a system is booted in debugging mode and crashes, instead of painting the blue screen and attempting to record the dump, it will wait forever until a host kernel debugger is connected. In this way, you can see the reason for the crash and perhaps perform some basic analysis using the kernel debugger commands described earlier. As mentioned in the previous section, you can use the .dump command in the debugger to save a copy of the crashed system’s memory space for later debugging, thus allowing you to reboot the crashed system and debug the problem offline.

The operating system code and data structures that handle processor exceptions can become corrupted such that a series of recursive faults occur. One example of this would be if the operating system trap handler got corrupted and caused a page fault. This would invoke the page fault handler, which would fault again, and so on. If such a situation occurred, the system would be hopelessly stuck. To prevent such a situation from occurring, CPUs have a builtin recursive fault protection mechanism, which sets a hard limit on the depth of a recursive fault. On most x86 processors, a fault can nest to two levels deep. When the third recursive fault occurs, the processor resets itself and the machine reboots. This is called a triple fault. This can happen when there’s a faulty hardware component as well. Even a kernel debugger won’t be invoked in a triple fault situation. However, sometimes the mere fact that the kernel debugger doesn’t activate can confirm that there’s a problem with newly added hardware or drivers.

You can use the kernel debugger to trigger a triple fault on a machine by setting a breakpoint on the kernel debugger dispatch routine KiDispatchException. This happens because the exception dispatcher now causes a breakpoint exception, which invokes the exception dispatcher, and so on.

Source of Information : Microsoft Press Windows Internals 5th Edition

Computer Techno

Advanced Crash Dump Analysis - When There Is No Crash Dump

No comments:

Cloud storage is for blocks too, not just files