Advanced Crash Dump Analysis - When There Is No Crash Dump


In this section, we’ll address how to troubleshoot systems that for some reason are not recording a crash dump. One reason why a crash dump might not be recorded is if the paging file on the boot volume is too small to hold the dump. This can easily be remedied by increasing the size of the paging file. A second reason why there might not be a crash dump recorded is because the kernel code and data structures needed to write the crash dump have been corrupted at the time of the crash. As described earlier, this data is checksummed when the system boots, and if the checksum made at the time of the crash does not match, the system does not even attempt to save the crash dump (so as not to risk corrupting data on the disk). So in this case, you need to catch the system as it crashes and then try to determine the reason for the crash.

Another reason occurs when the disk subsystem for the system disk is not able to process disk write requests (a condition that might have triggered the system failure itself). One such condition would be a hardware failure in the disk controller or maybe a cabling issue near the hard disk.

Yet another possibility occurs when the system has drivers that have registered to add secondary dump data to the dump file. When the driver callbacks are called, they might incorrectly access data structures located in paged memory (for example), which will lead to a second crash.

One simple option is to turn off the Automatically Restart option in the Startup And Recovery settings so that if the system crashes, you can examine the blue screen on the console. However, only the most straightforward crashes can be solved from just the blue-screen text.

To perform more in-depth analysis, you need to use the kernel debugger to look at the system at the time of the crash. This can be done by booting the system in debugging mode, which is described in the previous section. When a system is booted in debugging mode and crashes, instead of painting the blue screen and attempting to record the dump, it will wait forever until a host kernel debugger is connected. In this way, you can see the reason for the crash and perhaps perform some basic analysis using the kernel debugger commands described earlier. As mentioned in the previous section, you can use the .dump command in the debugger to save a copy of the crashed system’s memory space for later debugging, thus allowing you to reboot the crashed system and debug the problem offline.

The operating system code and data structures that handle processor exceptions can become corrupted such that a series of recursive faults occur. One example of this would be if the operating system trap handler got corrupted and caused a page fault. This would invoke the page fault handler, which would fault again, and so on. If such a situation occurred, the system would be hopelessly stuck. To prevent such a situation from occurring, CPUs have a built-in recursive fault protection mechanism, which sets a hard limit on the depth of a recursive fault. On most x86 processors, a fault can nest to two levels deep. When the third recursive fault occurs, the processor resets itself and the machine reboots. This is called a triple fault. A faulty hardware component can cause a triple fault as well. Even a kernel debugger won’t be invoked in a triple fault situation. However, sometimes the mere fact that the kernel debugger doesn’t activate can confirm that there’s a problem with newly added hardware or drivers.

You can use the kernel debugger to trigger a triple fault on a machine by setting a breakpoint on the kernel debugger dispatch routine KiDispatchException. This happens because the exception dispatcher now causes a breakpoint exception, which invokes the exception dispatcher, and so on.

Source of Information : Microsoft Press Windows Internals 5th Edition

Advanced Crash Dump Analysis - Hung or Unresponsive Systems


If a system becomes unresponsive (that is, you are receiving no response to keyboard or mouse input), the mouse freezes, or you can move the mouse but the system doesn’t respond to clicks, the system is said to have hung. A number of things can cause the system to hang:

• A device driver does not return from its interrupt service routine (ISR) or deferred procedure call (DPC) routine

• A high priority real-time thread preempts the windowing system driver’s input threads

• A deadlock (when two threads or processors hold resources each other wants and neither will yield what they have) occurs in kernel mode

You can check for deadlocks by using the Driver Verifier option called deadlock detection. Deadlock detection monitors the use of spinlocks, fast mutexes, and mutexes, looking for patterns that could result in a deadlock. If one is found, the Driver Verifier crashes the system with an indication of which driver causes the deadlock. The simplest form of deadlock occurs when two threads hold resources each other thread wants and neither will yield what they have or give up waiting for the one they want. The first step to troubleshooting hung systems is therefore to enable deadlock detection on suspect drivers, then unsigned drivers, and then all drivers, until you get a crash that pinpoints the driver causing the deadlock.
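The idea of watching lock usage for patterns that could deadlock can be illustrated with a lock-order graph. The following Python sketch is only a toy model, not the Driver Verifier's actual algorithm; the LockOrderTracker class and its methods are hypothetical names invented for this example. It records which locks are acquired while others are held and flags an acquisition that would close a cycle in that ordering, even if no deadlock has actually occurred yet.

```python
from collections import defaultdict


class LockOrderTracker:
    """Toy lock-order checker: records which locks are taken while others are held."""

    def __init__(self):
        self.edges = defaultdict(set)   # lock -> locks acquired while it was held
        self.held = defaultdict(list)   # thread -> locks currently held, in order

    def _reachable(self, start, target):
        # Depth-first search: can `target` be reached from `start` in the order graph?
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def acquire(self, thread, lock):
        """Record an acquisition; return True if it creates a potential deadlock."""
        violation = False
        for owned in self.held[thread]:
            # The new edge owned -> lock closes a cycle if lock already leads to owned.
            if self._reachable(lock, owned):
                violation = True
            self.edges[owned].add(lock)
        self.held[thread].append(lock)
        return violation

    def release(self, thread, lock):
        self.held[thread].remove(lock)


tracker = LockOrderTracker()
tracker.acquire("thread1", "A")
tracker.acquire("thread1", "B")         # establishes the order A before B
tracker.release("thread1", "B")
tracker.release("thread1", "A")
tracker.acquire("thread2", "B")
print(tracker.acquire("thread2", "A"))  # True: B before A conflicts with A before B
```

Note that the two threads in the example never actually block each other; the conflicting order alone is enough to raise the flag, which is what makes this style of checking useful for hangs that are hard to reproduce.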

There are two ways to approach a hanging system so that you can apply the manual crash troubleshooting techniques to determine what driver or component is causing the hang: the first is to crash the hung system and hope that you get a dump that you can analyze, and the second is to break into the system with a kernel debugger and analyze the system’s activity. Both approaches require prior setup and a reboot. You use the same exploration of system state with both approaches to try and determine the cause of the hang.

To manually crash a hung system, you must first add the DWORD registry value HKLM\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters\CrashOnCtrlScroll and set it to 1. After rebooting, the i8042 port driver, which is the port driver for PS/2 keyboard input, monitors keystrokes in its ISR looking for two presses of the Scroll Lock key while the right Ctrl key is held down. When the driver sees that sequence, it calls KeBugCheckEx with the MANUALLY_INITIATED_CRASH (0xE2) stop code. When the system reboots, open the crash dump file and apply the techniques mentioned earlier to try and determine why the system was hung (for example, determining what thread was running when the system hung, what the kernel stack indicates was happening, and so on). Note that this works for most hung system scenarios, but it won’t work if the i8042 port driver’s ISR doesn’t execute. (The i8042 port driver’s ISR won’t execute if all processors are hung as a result of their IRQL being higher than the ISR’s IRQL, or if corruption of system data structures extends to interrupt-related code or data.)

You can also trigger a crash if your hardware has a built-in “crash” button. (Some high-end servers have this.) In this case, the crash is initiated by signaling the nonmaskable interrupt (NMI) pin of the system’s motherboard. To enable this, set the registry DWORD value HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\NMICrashDump to 1. Then, when you press the dump switch, an NMI is delivered to the system and the kernel’s NMI interrupt handler calls KeBugCheckEx. This works in more cases than the i8042 port driver mechanism because the NMI IRQL is always higher than that of the i8042 port driver interrupt. See www.microsoft.com/whdc/system/sysinternals/dmpsw.mspx for more information.
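Both manual-crash settings are single DWORD values, so they can be captured in a .reg file and imported on the target machine before the reboot. The Python sketch below only builds the text of such a file; reg_dword is a hypothetical helper written for this example, and saving the text to disk and importing it (for instance with regedit) is left out.

```python
def reg_dword(key_path, value_name, value):
    """Return .reg file text that sets one DWORD value under key_path."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("DWORD out of range")
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{key_path}]\n"
        f'"{value_name}"=dword:{value:08x}\n'
    )


# .reg files spell out the hive name instead of using the HKLM abbreviation.
I8042_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters"
CRASH_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl"

print(reg_dword(I8042_KEY, "CrashOnCtrlScroll", 1))  # keyboard-triggered crash
print(reg_dword(CRASH_KEY, "NMICrashDump", 1))       # NMI dump switch
```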

If you are unable to manually generate a crash dump, you can attempt to break into the hung system by first making the system boot into debugging mode. You do this in one of two ways. You can press the F8 key during the boot and select Debugging Mode, or you can create a debugging-mode boot option in the BCD by copying an existing boot entry and adding the debug option. When using the F8 approach, the system will use the default connection (Serial Port COM2 and 19200 Baud), but you can use the F10 key to display the Edit Boot Options screen to edit debug-related boot options. With the debug option, you must also configure the connection mechanism to be used between the host system running the kernel debugger and the target system booting in debugging mode and then configure the debugport and baudrate switches appropriately for the connection type. The three connection types are a null modem cable using a serial port, an IEEE 1394 (FireWire) cable using 1394 ports on each system, or a USB 2.0 host-to-host cable using USB ports on each system. For details on configuring the host and target system for kernel debugging, see the Debugging Tools for Windows help file.

When booting in debugging mode, the system loads the kernel debugger at boot time and makes it ready for a connection from a kernel debugger running on a different computer connected through a serial cable, IEEE 1394 cable, or USB 2.0 host-to-host cable. Note that the kernel debugger’s presence does not affect performance. When the system hangs, run the WinDbg or Kd debugger on the connected system, establish a kernel debugging connection, and break into the hung system. This approach will not work if interrupts are disabled or the kernel debugger has become corrupted.

Instead of leaving the system in its halted state while you perform analysis, you can also use the debugger .dump command to create a crash dump file on the host debugger machine. Then you can reboot the hung system and analyze the crash dump offline (or submit it to Microsoft). Note that this can take a long time if you are connected using a serial null modem cable or USB 2.0 connection (versus a higher speed 1394 connection), so you might want to just capture a minidump using the .dump /m command. Alternatively, if the target machine is capable of writing a crash dump, you can force it to do so by issuing the .crash command from the debugger. This will cause the target machine to create a dump on its local hard drive that you can examine after the system reboots.
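A back-of-the-envelope estimate shows why a full dump over a serial cable is slow. The Python sketch below assumes 10 bits on the wire per byte (8 data bits plus start and stop bits), ignores debugger protocol overhead, and uses 1 GB of memory purely as an illustrative figure; serial_transfer_seconds is a name made up for this example.

```python
def serial_transfer_seconds(num_bytes, baud, bits_per_byte=10):
    """Lower bound on the time to push num_bytes through a serial link."""
    return num_bytes * bits_per_byte / baud


one_gib = 1024 ** 3  # illustrative memory size, not taken from the text
for baud in (19200, 115200):
    hours = serial_transfer_seconds(one_gib, baud) / 3600
    print(f"{baud} baud: about {hours:.0f} hours")
```

Even at 115200 baud the estimate is roughly a day for 1 GB, which is why capturing a minidump is often the practical choice over a slow link.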

You can cause a hang by running Notmyfault and selecting the Hang option. This causes the Myfault driver to queue a DPC on each processor of the system that executes an infinite loop. Because the IRQL of the processor while executing DPC functions is DPC/dispatch level, which is lower than the IRQL of device interrupts, the keyboard ISR can still run and will respond to the special keyboard crashing sequence.

Once you’ve broken into a hung system or loaded a manually generated dump from a hung system into a debugger, you should execute the !analyze command with the -hang option. This causes the debugger to examine the locks on the system and try to determine whether there’s a deadlock, and if so, what driver or drivers are involved. However, for a hang like the one that Notmyfault’s Hang option generates, the !analyze command will report nothing useful.

If the !analyze command doesn’t pinpoint the problem, execute !thread and !process in each of the dump’s CPU contexts to see what each processor is doing. (Switch CPU contexts with the ~ command—for example, use ~1 to switch to processor 1’s context.) If a thread has hung the system by executing in an infinite loop at an IRQL of DPC/dispatch level or higher, you’ll see the driver module in which it has become stuck in the stack trace of the !thread command. The stack trace of the crash dump you get when you crash a system experiencing the Notmyfault hang bug looks like this:

f9e66ed8 f9b0d681 000000e2 00000000 00000000 nt!KeBugCheckEx+0x19
f9e66ef4 f9b0cefb 0069b0d8 010000c6 00000000 i8042prt!I8xProcessCrashDump+0x235
f9e66f3c 804ebb04 81797d98 8169b020 00010009 i8042prt!I8042KeyboardInterruptService+0x21c
f9e66f3c fa12e34a 81797d98 8169b020 00010009 nt!KiInterruptDispatch+0x3d
WARNING: Stack unwind information not available. Following frames may be wrong.
ffdff980 8169b288 f9e67000 0000210f 00000004 myfault+0x34a
8054ace4 ffdff980 804ebf58 00000000 0000319c 0x8169b288
8054ace4 ffdff980 804ebf58 00000000 0000319c 0xffdff980
8169ae9c 8054ace4 f9b12b0f 8169ac88 00000000 0xffdff980

The top few lines of the stack trace reference the routines that execute when you type the i8042 port driver’s crash key sequence. The presence of the Myfault driver indicates that it might be responsible for the hang. Another command that might be revealing is !locks, which dumps the status of all executive resource locks. By default, the command lists only resources that are under contention, which means that they are both owned and have at least one thread waiting to acquire them. Examine the thread stacks of the owners with the !thread command to see what driver they might be executing in. Sometimes you will find that the owner of one of the locks is waiting for an IRP to complete (a list of IRPs related to a thread is displayed in the !thread output). In these cases it is very hard to tell why an IRP is not making forward progress. (IRPs are usually queued to privately managed driver queues before they are completed). One thing you can do is examine the IRP with the !irp command and find the driver that pended the IRP (it will have the word “pending” displayed in its stack location from the !irp output). Once you have the driver name, you can use the !stacks command to look for other threads that the driver might be running on, which often provides clues about what the lock-owning driver is doing. Much of the time you will find the driver is deadlocked or waiting on some other resource that is blocked waiting for the driver.


Advanced Crash Dump Analysis - Stack Trashes


Stack overrun or stack trashing typically results from a buffer overrun or underrun or when a driver passes a buffer address located on the stack to a lower driver on the device stack, which then performs the work asynchronously.

In the case of a buffer overrun or underrun, instead of residing in pool, as you saw with Notmyfault’s buffer overrun bug, the target buffer is on the stack of the thread that executes the bug. This type of bug is another one that’s difficult to debug because the stack is the foundation for any crash dump analysis.

In the case of passing buffers on the stack to lower drivers, if the lower driver returns to the caller immediately because it used a completion routine to perform the work, instead of returning synchronously, when the completion routine is called, it will use the stack address that was passed previously, which could now correspond to a different state on the caller’s stack and result in corruption.

When you run Notmyfault and select Stack Trash, the Myfault driver overruns a buffer it allocates on the kernel stack of the thread that executes it. When Myfault tries to return control to the Ntoskrnl function that invoked it, it reads the return address, which is the address at which it should continue executing, from the stack. The address was corrupted by the stack buffer overrun, so the thread continues execution at some different address in memory, an address that might not even contain code. An exception and crash occur when the thread executes an illegal CPU instruction or references invalid memory. The driver that the crash dump analysis of a stack overrun blames will vary from crash to crash, but the stop code will almost always be KMODE_EXCEPTION_NOT_HANDLED. If you execute a verbose analysis, the stack trace looks like this:

881fc744 81c82590 0000008e c0000005 00000000 nt!KeBugCheckEx+0x1e
881fcb14 81ca45da 881fcb30 00000000 881fcb84 nt!KiDispatchException+0x1a9
881fcb7c 81ca458e 881fcc44 00000000 badb0d00 nt!CommonDispatchException+0x4a
881fcc2c 81d07fd3 9762b658 84736e68 84736e68 nt!Kei386EoiHelper+0x186
881fcc44 81e98615 99321810 84736e68 84736ed8 nt!IofCallDriver+0x63
881fcc64 81e98dba 9762b658 99321810 00000000 nt!IopSynchronousServiceTail+0x1d9
881fcd00 81e82a8d 9762b658 84736e68 00000000 nt!IopXxxControlFile+0x6b7
881fcd34 81ca3a1a 0000007c 00000000 00000000 nt!NtDeviceIoControlFile+0x2a
881fcd34 779e9a94 0000007c 00000000 00000000 nt!KiFastCallEntry+0x12a
WARNING: Frame IP not in any known module. Following frames may be wrong.
0012f9f4 00000000 00000000 00000000 00000000 0x779e9a94

Notice how the call to IofCallDriver leads immediately to Kei386EoiHelper and into an exception, instead of a driver’s IRP dispatch routine. This is consistent with the stack having been corrupted and the IRP dispatch routine causing an exception when attempting to return to its caller by referencing a corrupted return address. Unfortunately, mechanisms like special pool and system code write protection can’t catch this type of bug. Instead, you must take some manual analysis steps to determine indirectly which driver was operating at the time of the corruption. One way is to examine the IRPs that are in progress for the thread that was executing at the time of the stack trash. When a thread issues an I/O request, the I/O manager stores a pointer to the outstanding IRP on the IRP list of the ETHREAD structure for the thread. The !thread debugger command dumps the IRP list of the target thread. (If you don’t specify a thread object address, !thread dumps the processor’s current thread.) Then you can look at the IRP with the !irp command:

lkd> !thread
THREAD 858d1aa0 Cid 0248.02c0 Teb: 7ffd9000 Win32Thread: ffad4e90 RUNNING on processor 0
IRP List:
bc5a7f68: (0006,0094) Flags: 00000000 Mdl: 00000000
Not impersonating
Attached Process 84f45d90

lkd> !irp bc5a7f68
Irp is active with 1 stacks 1 is current (= 0x837a7ab8)
No Mdl Thread 858d1aa0: Irp stack trace.
cmd flg cl Device File Completion-Context
>[ e, 0] 0 0 856f6378 8504f290 00000000-00000000
\Driver\MYFAULT Args: 00000000 00000000 83360010 00000000

The output shows that the IRP’s current and only stack location (designated with the “>” prefix) is owned by the Myfault driver. If this were a real crash, the next steps would be to ensure that the driver version installed is the most recent available, install the new version if it isn’t, and if it is, to enable the Driver Verifier on the driver (with all settings except low memory simulation).

Manually analyzing the stack is often the most powerful technique when dealing with crashes such as these. Typically, this involves dumping the memory that the current stack pointer register refers to (esp on x86 and rsp on x64). However, because the code responsible for crashing the system itself might modify the stack in ways that make analysis difficult, the processor responsible for crashing the system provides a backing store for the current data in the stack, called KiPreBugcheckStackSaveArea, which contains a copy of the stack before any code in KeBugCheckEx executes. By using the dps (dump pointer with symbols) command in the debugger, you can dump this area (instead of the CPU’s stack pointer register) and resolve symbols in an attempt to discover any potential stack traces. In this crash, here’s what dumping the stack area eventually revealed on a 32-bit system:

kd> dps KiPreBugcheckStackSaveArea KiPreBugcheckStackSaveArea+3000
81d7dd20 881fcc44
81d7dd24 98fcf406 myfault+0x406
81d7dd28 badb0d00

Although this data was located among many other different functions, it is of special interest because it mentions a function in the Myfault driver, which as we’ve seen was currently executing an IRP, that doesn’t show on the stack.
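What dps does with each stack slot can be approximated in a few lines: read a pointer-sized value and check whether it falls inside a loaded module. The Python sketch below is an illustration only, not the debugger's implementation. The three values come from the listing above, the module base is inferred from myfault+0x406, and the module size of 0x2000 is an assumption made for the example; resolve and scan_pointers are hypothetical names.

```python
import bisect
import struct


def resolve(address, modules):
    """Map an address to 'module+0xoffset', or None if it is outside every module.

    modules: list of (base, size, name) tuples sorted by base address.
    """
    bases = [base for base, _, _ in modules]
    i = bisect.bisect_right(bases, address) - 1
    if i < 0:
        return None
    base, size, name = modules[i]
    if address < base + size:
        return f"{name}+0x{address - base:x}"
    return None


def scan_pointers(raw, start_address, modules):
    """Yield (slot_address, value, symbol) for each 32-bit little-endian word."""
    for offset in range(0, len(raw) - 3, 4):
        (value,) = struct.unpack_from("<I", raw, offset)
        yield start_address + offset, value, resolve(value, modules)


modules = [(0x98FCF000, 0x2000, "myfault")]  # size is assumed for illustration
raw = struct.pack("<3I", 0x881FCC44, 0x98FCF406, 0xBADB0D00)
for slot, value, symbol in scan_pointers(raw, 0x81D7DD20, modules):
    print(f"{slot:08x} {value:08x} {symbol or ''}")
```

Only the middle slot resolves to a module, which mirrors how a single symbolized entry stands out among otherwise anonymous stack values.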


Advanced Crash Dump Analysis


The preceding section leverages the Driver Verifier to create crashes that the debugger’s automated analysis engine can resolve. You might still encounter cases where you cannot get a system to produce easily analyzable crashes and, if so, you will need to execute manual analysis to try and determine what the problem is. Here are some examples of basic commands that can provide clues during crash analysis. The Debugging Tools for Windows help file provides complete documentation on these and other commands as well as examples of how to use them during crash analysis:

• Use the !process 0 0 debugger command to look at the processes running, and make sure that you understand the purpose of each one. Try disabling or uninstalling unnecessary applications and services.

• Use the lm command with the kv option to list the loaded kernel-mode drivers. Make sure that you understand the purpose of any third-party drivers and that you have the most recent versions.

• Use the !vm command to see whether the system has exhausted virtual memory, paged pool, or nonpaged pool. If virtual memory is exhausted, the committed pages will be close to the commit limit, so try to identify a potential memory leak by examining the list of processes to see which one reports high commit usage. If nonpaged pool or paged pool is exhausted (that is, the usage is close to the maximum), a driver is probably leaking pool; the !poolused command can help identify which pool tag accounts for the usage.

There are other debugging commands that can prove useful, but more advanced knowledge is required to apply them. The !irp command is one of them.


Buffer Overrun, Memory Corruptions, and Special Pool


By far the most common source of crashes on Windows is pool corruption. Pool corruption usually occurs when a driver suffers from a buffer overrun or buffer underrun bug that causes it to overwrite data past either the end or start of a buffer it has allocated from paged or nonpaged pool. The Executive’s pool-tracking structures reside on either side of a pool buffer and separate buffers from each other. These bugs, therefore, cause corruption to the pool tracking structures, to buffers owned by other drivers, or to both. You can often catch the culprit of a pool overrun by using the !pool command to examine the surrounding pool tags. Find the address at which the corruption occurred and use !pool address_of_corruption. This command will display all the pool allocations that are on the same page as the corruption. Looking in the left column, find the range of the corrupted address and then look at the allocation just previous to it and find its pool tag. This will likely be the culprit in a buffer overrun. You can use the pooltag.txt file in the Triage folder of the Debugging Tools for Windows installation directory to find the driver that owns the pool tag, or use the Strings utility from Sysinternals.
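The lookup that you perform by eye in the !pool output can be expressed as a small search. In this Python sketch the addresses, sizes, and tags are invented for illustration, and find_suspect is a hypothetical helper, not a debugger command.

```python
def find_suspect(allocations, corrupted_address):
    """Return (containing, previous) pool blocks for a corrupted address.

    allocations: list of (start, size, tag) tuples for one page, sorted by start.
    The previous block is the likely source of an overrun into the containing one.
    """
    for index, (start, size, _tag) in enumerate(allocations):
        if start <= corrupted_address < start + size:
            previous = allocations[index - 1] if index > 0 else None
            return allocations[index], previous
    return None, None


# Made-up layout of one pool page: (start address, size in bytes, pool tag).
page = [
    (0x81234000, 0x040, "Aaaa"),
    (0x81234040, 0x100, "Bbbb"),
    (0x81234140, 0x080, "Cccc"),
]
containing, previous = find_suspect(page, 0x81234148)
print("corrupted block:", containing[2], "suspect tag:", previous[2])
```

Here the corruption sits at the start of the block tagged Cccc, so the owner of the preceding Bbbb allocation is the first driver to investigate.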

Pool corruption can also occur when a driver writes to pool it had previously owned but subsequently freed. This is called a use after free bug and is usually caused by a race condition in a driver. These bugs are particularly hard to debug because the driver that corrupts memory no longer has any traceable ties to the memory, such as a neighboring pool tag as in a buffer overrun. Another fairly common cause of pool corruption is direct memory access (DMA). DMA occurs when hardware writes directly to RAM instead of going through a driver; however, the driver is still responsible for coordinating the whole process by allocating the memory that the hardware will write to and programming the hardware registers of the device with the details of the operation. If a driver has a bug that releases the memory it is using for DMA before the hardware writes to it, the memory can be given to another driver or even to a user-mode application, which will certainly not expect to have hardware writing to it.

The crashes caused by pool corruption are virtually impossible to debug because the system crashes when corrupted data is referenced, not when the corruption occurs. However, sometimes you can take steps to at least obtain a clue about what corrupted the memory. The first step is to try to determine the size of the corruption by looking at the corrupted data. If the corruption is a single bit, it was likely caused by bad RAM. If the corruption is fairly small, it could be caused by hardware or software, and finding a root cause will be nearly impossible. In the case of large corruptions, you can look for patterns in the corruption, like strings (for example, HTTP packet payloads, file contents of text-based files, and so on) or audio/video data (usually patterns of integers less than 1,024). Open an MP3 file in a hex editor to get an idea of what audio data looks like in memory.
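When a known-good copy of the damaged region is available (for example, a code page that can be compared against the image on disk), the size of the corruption can be measured directly. The following Python sketch counts changed bytes and bits; corruption_extent is a hypothetical helper, and the sample bytes are made up.

```python
def corruption_extent(expected, actual):
    """Return (bytes_changed, bits_changed) between two equal-length byte strings."""
    if len(expected) != len(actual):
        raise ValueError("buffers must be the same length")
    diffs = [a ^ b for a, b in zip(expected, actual)]   # XOR exposes flipped bits
    bytes_changed = sum(1 for d in diffs if d)
    bits_changed = sum(bin(d).count("1") for d in diffs)
    return bytes_changed, bits_changed


good = bytes([0x8B, 0xFF, 0x55, 0x8B])
bad = bytes([0x8B, 0xFF, 0x75, 0x8B])    # third byte differs by a single bit
print(corruption_extent(good, bad))      # (1, 1): consistent with a bit flip
```

A result of one changed bit points toward bad RAM, as described above, while many changed bytes make it worth looking for recognizable patterns in the overwritten data.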

You can generate a pool corruption crash by running Notmyfault and selecting the Buffer Overflow bug. This causes Myfault to allocate a buffer and then overwrite the 40 bytes following the buffer. There can be a significant delay between the time you click the Do Bug button and when a crash occurs, and you might even have to generate pool usage by exercising applications before a crash occurs, which highlights the distance between a corruption and its effect on system stability. An analysis of the resultant crash almost always reports Ntoskrnl or another driver as being the likely cause, which demonstrates the usefulness of a verbose analysis with its description of the stop code:

An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is
caused by drivers that have corrupted the system pool. Run the driver
verifier against any new (or suspect) drivers, and if that doesn’t turn up
the culprit, then use gflags to enable special pool.
Arg1: 4f4f4f53, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000001, value 0 = read operation, 1 = write operation
Arg4: 81926886, address which referenced memory

The advice in the description is to run the Driver Verifier against any new or suspect drivers or to use Gflags to enable special pool. Both accomplish the same thing: to have the system detect a potential corruption when it occurs and crash the system in a way that makes the automated analysis point at the driver causing the corruption.

If the Driver Verifier’s special pool option is enabled, verified drivers use special pool, rather than paged or nonpaged pool, for any allocations they make for buffers slightly less than a page in size. A buffer allocated from special pool is sandwiched between two invalid pages and by default is aligned against the top of the page. The special pool routines also fill the unused portions of the page in which the buffer resides with a random pattern.

The system detects any buffer overruns of under a page in size at the time of the overrun because they cause a page fault on the invalid page following the buffer. The fill pattern acts as a signature that catches buffer underruns at the time the driver frees a buffer, because the integrity of the pattern placed there at the time of allocation will have been compromised.
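The two detection paths can be seen in a toy model of a single special pool page. This Python sketch is a simplification under stated assumptions: the page size is 4 KB, the fill byte 0xA5 is arbitrary (the real pattern is chosen by the system), the buffer ends exactly at the end of the page with no alignment padding, and a MemoryError stands in for the page fault on the invalid neighboring page. SpecialPoolPage is a hypothetical name.

```python
PAGE_SIZE = 4096
FILL = 0xA5  # arbitrary stand-in for the system-chosen fill pattern


class SpecialPoolPage:
    """Toy model of one special pool allocation placed at the end of its page."""

    def __init__(self, size):
        if not 0 < size <= PAGE_SIZE:
            raise ValueError("size must fit in one page")
        self.size = size
        self.start = PAGE_SIZE - size              # buffer is pushed to the page end
        self.page = bytearray([FILL]) * PAGE_SIZE  # unused part keeps the pattern

    def write(self, offset, value):
        """Write one byte at a buffer-relative offset (may be out of bounds)."""
        index = self.start + offset
        if index < 0 or index >= PAGE_SIZE:
            # Models the immediate page fault on the invalid page next door.
            raise MemoryError("access to invalid guard page")
        self.page[index] = value

    def free(self):
        """Return True if the pattern in front of the buffer is still intact."""
        return all(b == FILL for b in self.page[: self.start])


block = SpecialPoolPage(100)
block.write(99, 0x01)            # last valid byte: fine
try:
    block.write(100, 0x01)       # one byte past the end: caught at once
except MemoryError as error:
    print("overrun:", error)
block.write(-1, 0x00)            # underrun lands in the patterned area, no fault
print("pattern intact at free:", block.free())
```

The overrun is caught at the moment of the write, while the underrun is only noticed when the block is freed, which matches the timing described above.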

To see how the use of special pool causes a crash that the analysis engine easily diagnoses, run the Driver Verifier Manager. Choose the Create Custom Settings (For Code Developers) option on the first page of the wizard, choose Select Individual Settings From A Full List on the second, and then select Special Pool. Choose the Select Drivers From A List option on the subsequent page, and on the page that lists drivers press the button to add unloaded drivers, and then type myfault.sys into the File Find dialog box. (You do not have to find myfault.sys in the File Find dialog box; just enter its name.) Then check the myfault.sys driver, exit the wizard, and reboot.

When you run Notmyfault and cause a buffer overflow, the system will immediately crash and the analysis of the dump reports this:

Probably caused by : myfault.sys ( myfault+3f1 )

A verbose analysis describes the stop code like this:

N bytes of memory was allocated and more than N bytes are being referenced.
This cannot be protected by try-except.
When possible, the guilty driver’s name (Unicode string) is printed on
the bugcheck screen and saved in KiBugCheckDriver.
Arg1: beb50000, memory referenced
Arg2: 00000001, value 0 = read operation, 1 = write operation
Arg3: ec3473f1, if non-zero, the address which referenced memory.
Arg4: 00000000, (reserved)

Special pool made an elusive bug into one that instantly reveals itself and makes the analysis trivial.


Code Overwrite and System Code Write Protection


A driver with a bug that causes corruption or misinterpretation of its own data structures can reference memory the driver doesn’t own when it interprets corrupted data as a memory pointer value. The target of the pointer can be anything in the virtual address space, including data belonging to other drivers, invalid memory, or the code of other drivers or the kernel. As with buffer overruns, by the time that corruption is detected and the system crashes, it’s usually impossible to identify the driver that caused the corruption. Enabling special pool increases the chance of catching wild-pointer bugs, but it does not catch code corruption.

When you run Notmyfault and select the Code Overwrite option, the Myfault driver corrupts the entry point to the NtReadFile kernel function. One of two things will happen at this point: if your system has 255 MB or less of physical memory, you’ll get a crash for which an analysis points at Myfault.sys. The stop code description that a verbose analysis displays tells you that Myfault attempted to write to read-only memory:

An attempt was made to write to readonly memory. The guilty driver is on the
stack trace (and is typically the current instruction pointer).
When possible, the guilty driver’s name (Unicode string) is printed on
the bugcheck screen and saved in KiBugCheckDriver.
Arg1: 804bb7fd, Virtual address for the attempted write.
Arg2: 004bb121, PTE contents.
Arg3: b804db60, (reserved)
Arg4: 0000000b, (reserved)

However, if you have more than 255 MB of memory, you’ll get a different type of crash because the attempt to corrupt the memory isn’t caught. Because NtReadFile is a commonly executed system service that is used by the Windows subsystem to read keyboard and mouse input, the system will almost immediately crash as a thread attempts to execute the corrupted code and generates an illegal instruction fault. The analysis of crashes generated with this bug is always wrong, but it might vary, with Win32k.sys and Ntoskrnl.exe commonly being the analyzer’s best guess as to what’s responsible. The bugcheck description for these crashes is:

This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Arg1: c0000005, The exception code that was not handled
Arg2: 80461885, The address that the exception occurred at
Arg3: 00000000, Parameter 0 of the exception
Arg4: 00000000, Parameter 1 of the exception

The reason for the different behaviors on different configurations relates to a mechanism called system code write protection. If system code write protection is enabled, the memory manager maps Ntoskrnl.exe, the HAL, and boot drivers using standard physical pages (4 KB on x86 and x64, and 8 KB on IA64). Because the granularity of protection in an image is the standard page size, the memory manager can write-protect code pages so that an attempt to modify them generates an access fault (as seen in the first crash). However, when system code write protection is disabled on systems with more than 255 MB of RAM, the memory manager uses large pages (4 MB on x86, or 2 MB with PAE; 2 MB on x64; and 16 MB on IA64) to map Ntoskrnl.exe and the HAL. A large page can span both code and data, so the entire page must be writable, and a stray write to kernel code goes undetected (as seen in the second crash).

If system code write protection is off and crash analysis reports unlikely causes for a crash or you suspect code corruption, you should enable it. Verifying at least one driver with the Driver Verifier is the easiest way to enable it. You can also enable it manually by adding two registry values under HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management. First, specify the amount of RAM at which the memory manager uses large pages instead of standard pages to map Ntoskrnl.exe as an effectively infinite value. You do this by creating a DWORD value called LargePageMinimum and setting it to 0xFFFFFFFF. Then add another DWORD value named EnforceWriteProtection and set it to 1. You must reboot for the changes to take effect.

When the debugger has access to the image files included in a crash dump, the analysis internally executes the !chkimg debugger command to verify that a copy of an image in a crash dump matches the on-disk image and reports any differences. Note that !chkimg will always report discrepancies in Ntoskrnl.exe if you’ve enabled the Driver Verifier.

Source of Information : Microsoft Press Windows Internals 5th Edition

Using Crash Troubleshooting Tools


The crash generated in the preceding section with Notmyfault’s High IRQL Fault (Kernelmode) option poses no challenge for the debugger’s automated analysis. Unfortunately, most crashes are not so easy and sometimes are impossible to debug. There are several levels of increasing severity in terms of system performance degradation that might help turn system crashes that cannot be analyzed into ones that can be. If the crashes generated after you configure a level and reboot aren’t revealing the cause, try the next level.

1. If there are one or more drivers you consider likely sources of the crashes—because they were introduced into the system relatively recently, they were recently updated, or the circumstances of the crash implicate them—enable them for verification using the Driver Verifier and check all the verification options except for low resources simulation.

2. Enable the same level of verification as in level 1 on all unsigned drivers in the system.

3. Enable the same verification as in level 1 on all drivers in the system. To maintain reasonable performance, you may want to divide the drivers into groups, enabling the Driver Verifier on one group at a time between reboots.

Obviously, before you spend time and energy making system configuration changes and analyzing crashes, you should ensure that your system’s kernel and drivers are the most recent available by using the services of Windows Update and third-party driver support sites.

If your system becomes unbootable because the Driver Verifier detects a driver error and crashes the system, then start in safe mode (where verification is disabled), run the Driver Verifier, and delete verification settings.


Basic Crash Dump Analysis


If OCA fails to identify a resolution or you are unable to submit the crash to OCA, an alternative is analyzing crashes yourself. As mentioned earlier, WinDbg and Kd both execute the same analysis engine used by OCA when you load a crash dump file, and the basic analysis can sometimes pinpoint the problem. So you might be fortunate and have the crash dump solved by the automatic analysis. But if not, there are some straightforward techniques to try to solve the crash.

This section explains how to perform basic crash analysis steps, followed by tips on leveraging the Driver Verifier to catch buggy drivers when they corrupt the system so that a crash dump analysis pinpoints them.

OCA’s automated analysis may occasionally identify a highly likely cause of a crash but not be able to inform you of the suspected driver. This happens because it only reports the cause for crashes that have their bucket ID entry populated in the OCA database, and entries are created only when Microsoft crash-analysis engineers have verified the cause. If there’s no bucket ID entry, OCA reports that the crash was caused by “unknown driver.”

You can use the Notmyfault utility from Windows Sysinternals (www.microsoft.com/technet/sysinternals) to generate the crashes described here. Notmyfault consists of an executable named Notmyfault.exe and a driver named Myfault.sys. When you run the Notmyfault executable, it loads the driver and presents a dialog box that allows you to crash the system in various ways or to cause the driver to leak paged pool. The crash types offered represent the ones most commonly seen by Microsoft’s product support services. Selecting an option and clicking the Do Bug button causes the executable to tell the driver, by using the DeviceIoControl Windows API, which type of bug to trigger.

You should execute Notmyfault crashes on a test system or on a virtual machine because there is a small risk that memory it corrupts will be written to disk and result in file or disk corruption.

The names of the Notmyfault executable and driver highlight the fact that user mode cannot directly cause the system to crash. The Notmyfault executable can cause a crash only by loading a driver to perform an illegal operation for it in kernel mode.

Basic Crash Dump Analysis
The most straightforward Notmyfault crash to debug is the one caused by selecting the High IRQL Fault (Kernelmode) option and clicking the Do Bug button. This causes the driver to allocate a page of paged pool, free the pool, raise the IRQL to above DPC/dispatch level, and then touch the page it has freed. If that doesn’t cause a crash, the process continues by reading memory past the end of the page until it causes a crash by accessing invalid pages. The driver performs several illegal operations as a result:

1. It references memory that doesn’t belong to it.

2. It references paged pool at an IRQL that’s DPC/dispatch level or higher, which is illegal because page faults are not permitted when the processor IRQL is DPC/dispatch level or higher.

3. When it goes past the end of the memory that it had allocated, it tries to reference memory that is potentially invalid.

The reason the first page reference might not cause a crash is that it won’t generate a page fault if the page that the driver frees remains in the system working set.

When you load a crash generated with this bug into WinDbg, the tool’s analysis displays something like this:

Microsoft (R) Windows Debugger Version 6.9.0003.113 X86
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\windows\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available

Symbol search path is: srv*c:\programming\symbols\*http://msdl.microsoft.com/download/
Executable search path is:
Windows Server 2008 Kernel Version 6001 (Service Pack 1) MP (2 procs) Free x86 compatible
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 6001.18063.x86fre.vistasp1_gdr.080425-1930
Kernel base = 0x81804000 PsLoadedModuleList = 0x8191bc70
Debug session time: Sun Sep 21 22:58:19.994 2008 (GMT-4)
System Uptime: 2 days 0:11:17.876
Loading Kernel Symbols
Loading User Symbols
Loading unloaded module list
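*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************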

Use !analyze -v to get detailed debugging information.

BugCheck D1, {a35db800, 1c, 0, 9879c3dd}

*** ERROR: Module load completed but symbols could not be loaded for myfault.sys
Probably caused by : myfault.sys ( myfault+3dd )

Followup: MachineOwner

The first thing to note is that WinDbg reports errors trying to load symbols for Myfault.sys and Notmyfault.exe. These are expected because the symbol files for Myfault.sys and Notmyfault.exe are not on the symbol-file path (which is configured to point at the Microsoft symbol server). You’ll see similar errors for third-party drivers and executables that do not ship with the operating system.

The analysis text itself is terse, showing the numeric stop code and bug-check parameters followed by a “Probably caused by” line that shows the analysis engine’s best guess at the offending driver. In this case it’s on the mark and points directly at Myfault.sys, so there’s no need for manual analysis.
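If you collect summaries from several dumps, the stop code, parameters, and probable cause can be pulled out mechanically. The following is a minimal Python sketch, assuming the summary text has the same shape as the output above. The function name and regular expressions are illustrative and are not part of WinDbg.

```python
import re

# The relevant lines from the analysis output shown above.
SUMMARY = """\
BugCheck D1, {a35db800, 1c, 0, 9879c3dd}

*** ERROR: Module load completed but symbols could not be loaded for myfault.sys
Probably caused by : myfault.sys ( myfault+3dd )
"""

BUGCHECK_RE = re.compile(r"^BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", re.MULTILINE)
CAUSE_RE = re.compile(r"^Probably caused by\s*:\s*(\S+)", re.MULTILINE)


def parse_summary(text: str) -> dict:
    """Extract the stop code, its parameters, and the suspected module."""
    bug = BUGCHECK_RE.search(text)
    cause = CAUSE_RE.search(text)
    return {
        "stop_code": int(bug.group(1), 16) if bug else None,
        "parameters": (
            [int(p.strip(), 16) for p in bug.group(2).split(",")] if bug else []
        ),
        "probable_cause": cause.group(1) if cause else None,
    }


result = parse_summary(SUMMARY)
print(hex(result["stop_code"]), result["probable_cause"])
```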

The “Followup” line is not generally useful except within Microsoft, where the debugger looks for the module name in the Triage.ini file that’s located within the Triage directory of the Debugging Tools for Windows installation directory. The Microsoft-internal version of that file lists the developer or group responsible for handling crashes in a specific driver, and the debugger displays the developer’s or group’s name in the Followup line when appropriate.

Verbose Analysis
Even though the basic analysis of the Notmyfault crash identifies the faulty driver, you should always have the debugger execute a verbose analysis by entering the command:

!analyze -v

The first obvious difference between the verbose and default analysis is the description of the stop code and its parameters. Following is the output of the command when executed on the same dump:

An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arg1: a35db800, memory referenced
Arg2: 0000001c, IRQL
Arg3: 00000000, value 0 = read operation, 1 = write operation
Arg4: 9879c3dd, address which referenced memory

This saves you the trouble of opening the help file to find the same information, and the text sometimes suggests troubleshooting steps, an example of which you’ll see in the next section on advanced crash dump analysis. The other potentially useful information in a verbose analysis is the stack trace of the thread that was executing on the processor that crashed at the time of the crash. Here’s what it looks like for the same dump:

80395b78 9879c3dd badb0d00 8312d054 00000003 nt!KiTrap0E+0x2ac
WARNING: Stack unwind information not available. Following frames may be wrong.
80395c44 81a505e5 855802e0 849e26c0 849e2730 myfault+0x3dd
80395c64 81a50d8a 83746238 855802e0 00000000 nt!IopSynchronousServiceTail+0x1d9
80395d00 81a3aa61 83746238 849e26c0 00000000 nt!IopXxxControlFile+0x6b7
80395d34 8185ba7a 0000007c 00000000 00000000 nt!NtDeviceIoControlFile+0x2a
80395d34 770f9a94 0000007c 00000000 00000000 nt!KiFastCallEntry+0x12a
0012f4a0 77e84c9b 0000007c 00000000 00000000 ntdll!ZwDeviceIoControlFile+0xb
0012f504 004017c3 0000007c 83360018 00000000 KERNEL32!DeviceIoControl+0x100
000200ac 00000000 00000000 00000000 00000000 NotMyfault+0x17c3

The preceding stack shows that the Notmyfault executable image, shown at the bottom, invoked the DeviceIoControl function in Kernel32.dll, which in turn invoked ZwDeviceIoControlFile in Ntdll.dll, and so on, until finally the system crashed with the execution of an instruction in the Myfault image. A stack trace like this can be useful because crashes sometimes occur as the result of one driver passing another one data that is improperly formatted or corrupt or has illegal parameters. The driver that’s passed the invalid data might cause a crash and get the blame in an analysis, when the stack reveals that another driver was involved. In this sample trace, no driver other than Myfault is listed. (The module “nt” is Ntoskrnl.)
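When a stack is long, listing the modules that appear on it makes it easier to see whether more than one driver was involved. The Python sketch below assumes frames formatted like the trace above (five hexadecimal columns followed by a symbol). The helper name `modules_on_stack` is illustrative.

```python
import re

# The stack trace from the verbose analysis above.
STACK = """\
80395b78 9879c3dd badb0d00 8312d054 00000003 nt!KiTrap0E+0x2ac
WARNING: Stack unwind information not available. Following frames may be wrong.
80395c44 81a505e5 855802e0 849e26c0 849e2730 myfault+0x3dd
80395c64 81a50d8a 83746238 855802e0 00000000 nt!IopSynchronousServiceTail+0x1d9
80395d00 81a3aa61 83746238 849e26c0 00000000 nt!IopXxxControlFile+0x6b7
80395d34 8185ba7a 0000007c 00000000 00000000 nt!NtDeviceIoControlFile+0x2a
80395d34 770f9a94 0000007c 00000000 00000000 nt!KiFastCallEntry+0x12a
0012f4a0 77e84c9b 0000007c 00000000 00000000 ntdll!ZwDeviceIoControlFile+0xb
0012f504 004017c3 0000007c 83360018 00000000 KERNEL32!DeviceIoControl+0x100
000200ac 00000000 00000000 00000000 00000000 NotMyfault+0x17c3
"""

# Five 8-digit hex columns, then the symbol (module!function+offset or
# module+offset when no symbols are available).
FRAME_RE = re.compile(r"^(?:[0-9a-fA-F]{8}\s+){5}(\S+)$")


def modules_on_stack(text: str) -> list[str]:
    """Return module names in the order they first appear, top frame first."""
    modules = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # skip warnings and anything that is not a frame
        module = re.split(r"[!+]", match.group(1), maxsplit=1)[0]
        if module not in modules:
            modules.append(module)
    return modules


print(modules_on_stack(STACK))
```

For this trace the only driver other than the kernel ("nt") near the top of the stack is myfault, which agrees with the analysis.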

If the driver singled out by an analysis is unfamiliar to you, use the lm (list modules) command to look at the driver’s version information. Add the k (kernel modules) and v (verbose) options along with the m (match) option followed by the name of the driver and a wildcard:

lkd> lm kv m myfault*
start end module name
a98e1000 a98e1ec0 myfault (deferred)
Image path: \??\C:\Windows\system32\drivers\myfault.sys
Image name: myfault.sys
Timestamp: Sat Oct 14 16:09:18 2006 (453143EE)
CheckSum: 0000295E
ImageSize: 00000EC0
File version:
Product version:
File flags: 0 (Mask 3F)
File OS: 40004 NT Win32
File type: 3.7 Driver
File date: 00000000.00000000
Translations: 0409.04b0
CompanyName: Sysinternals
ProductName: Sysinternals Myfault
InternalName: myfault.sys
OriginalFilename: myfault.sys
ProductVersion: 2.0
FileVersion: 2.0
FileDescription: Crash Test Driver
LegalCopyright: Copyright (C) M. Russinovich 2002-2004

In addition to using the description to identify the purpose of a driver, you can also use the file and product version numbers to see whether the version installed is the most up-to-date version available. (You can do this by checking the vendor Web site, for instance.) If version information isn’t present (because it might have been paged out of physical memory at the time of the crash), look at the driver image file’s properties in Windows Explorer on the system that crashed.
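The Timestamp line shows the driver's link date both as text and as a hexadecimal number of seconds since January 1, 1970 (UTC), and the start and end addresses give the image size. The Python sketch below cross-checks those fields from the lm output above. The GMT-4 offset is an assumption chosen to match the local time the debugger printed.

```python
from datetime import datetime, timedelta, timezone


def link_time(timestamp_hex: str) -> datetime:
    """Interpret the hex value as seconds since 1970-01-01 UTC."""
    return datetime.fromtimestamp(int(timestamp_hex, 16), tz=timezone.utc)


def image_size(start_hex: str, end_hex: str) -> int:
    """Size of the loaded image in bytes, from the lm start/end columns."""
    return int(end_hex, 16) - int(start_hex, 16)


utc = link_time("453143EE")
# Assumed display time zone (GMT-4), matching the debugger's output above.
local = utc.astimezone(timezone(timedelta(hours=-4)))

print(utc.isoformat())
print(local.isoformat())
print(hex(image_size("a98e1000", "a98e1ec0")))
```

The computed size matches the ImageSize field (00000EC0), and the converted time matches the date printed next to the Timestamp value.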


Online Crash Analysis


When the WerFault utility executes during logon, as a result of having configured itself to start, it checks the HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\KernelFaults\Queue key to look for queued reports that may have been added in the previous dump conversion phase. It also checks whether there are previously unsent crash reports from previous sessions. If there are, it launches WerFault.exe with the -k -q flags (the q flag specifies the usage of queued reporting mode) to generate an XML-formatted file containing a basic description of the system, including the operating system version, a list of drivers installed on the machine, and the list of Plug and Play drivers loaded on the system at the time of the crash.

If configured to ask for user input (which is not the default), it then presents a dialog box that asks the user whether he or she wants to send an error report to Microsoft. If the user chooses to send the error report, and unless overridden by Group Policy, WerFault sends the XML file and minidump to http://oca.microsoft.com, which forwards the data to a server farm for automated analysis, described in the next section.

The server farm’s automated analysis uses the same analysis engine that the Microsoft kernel debuggers use when you load a crash dump file into them (described shortly). The analysis generates a bucket ID, which is a signature that identifies a particular crash type.
The server farm queries a database using the bucket ID to see whether a resolution has been found for the crash, and it sends a URL back to WerFault that refers it to the OCA Web site (http://oca.microsoft.com). If configured to do so, WerFault launches the Windows Error Reporting Console, or WerCon (%SystemRoot%\System32\Wercon.exe), which is a program that allows users to interface with WER for receiving problem resolution and tracking information as well as for configuring WER behavior. When browsing for solutions, WerCon contains an Internet browser frame to open the page on the WER Web site that reports the preliminary crash analysis. If a resolution is available, the page instructs the user where to obtain a hotfix, service pack, or third-party driver update.


Windows Error Reporting


Windows includes a facility called Windows Error Reporting (WER), which facilitates the automatic submission of process and system failures (such as crashes and/or hangs) to Microsoft (or an internal error reporting server) for analysis. This feature is enabled by default, but you can change WER’s behavior, including whether the system is configured to send a crash dump to Microsoft for analysis on the reboot following a crash. You configure the system’s error reporting settings in the WER Advanced Settings screen, which you access from the Problem Reports And Solutions screen of the Control Panel’s System applet.

If Wininit.exe finds the HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\MachineCrash key, it executes WerFault.exe with the -k -c flags (the k flag indicates kernel error reporting, and the c flag indicates that the full or kernel dump should be converted to a minidump) to have WerFault.exe check for a kernel crash dump file. WerFault takes the following steps for preparing to send a crash dump report to the Microsoft Online Crash Analysis (OCA) site (or, if configured, an internal error reporting server):

1. If the dump that was generated is not a minidump, it extracts a minidump from the dump file and stores it in the default location of \Windows\Minidump, unless otherwise configured through the MinidumpDir value in the HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\ key.

2. It writes the name of the minidump files to HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\KernelFaults\Queue.

3. It adds a command to execute WerFault.exe (\Windows\System32\WerFault.exe) to HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce so that WerFault is executed one more time during the first user’s logon to the system for purposes of actually sending the error report.


Crash Dump Generation


When the system boots, it checks the crash dump options configured by reading the HKLM\SYSTEM\CurrentControlSet\Control\CrashControl registry key. If a dump is configured, it makes a copy of the disk miniport driver used to write to the boot volume in memory and gives it the same name as the miniport with the prefix “dump_”. It also checksums the components involved with writing a crash dump (including the copied disk miniport driver, the I/O manager functions that write the dump, and the map of where the boot volume’s paging file is on disk) and saves the checksum. When KeBugCheckEx executes, it checksums the components again and compares the new checksum with that obtained at the boot. If there’s not a match, it does not write a crash dump, because doing so would likely fail or corrupt the disk. Upon a successful checksum match, KeBugCheckEx writes the dump information directly to the sectors on disk occupied by the paging file, bypassing the file system driver and storage driver stack (which might be corrupted or even have caused the crash).
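The checksum-then-compare idea can be sketched in a few lines of Python. This is an illustration only: the text does not say which checksum algorithm Windows uses, so CRC-32 from `zlib` is a stand-in, and the component names and byte strings are hypothetical placeholders.

```python
import zlib


def checksum(components: dict[str, bytes]) -> int:
    """Fold every component into one running CRC-32 (stand-in algorithm)."""
    crc = 0
    for name in sorted(components):  # fixed order so results are comparable
        crc = zlib.crc32(components[name], crc)
    return crc


def should_write_dump(boot_checksum: int, components: dict[str, bytes]) -> bool:
    """Write the dump only if the components still match the boot-time checksum."""
    return checksum(components) == boot_checksum


# Hypothetical stand-ins for the pieces listed in the text.
at_boot = {
    "dump_miniport": b"\x55\x8b\xec\x83\xec\x10",
    "io_dump_functions": b"\x8b\xff\x55\x8b\xec",
    "pagefile_map": b"\x00\x10\x00\x00\x00\x20\x00\x00",
}
saved = checksum(at_boot)

# At crash time: unchanged components pass, a corrupted one does not.
at_crash_ok = dict(at_boot)
at_crash_bad = dict(at_boot, io_dump_functions=b"\x8b\xff\x55\x8b\xcc")

print(should_write_dump(saved, at_crash_ok))
print(should_write_dump(saved, at_crash_bad))
```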

When the Session Manager (SMSS) re-initializes the page file during the boot process, it calls the function SmpCheckForCrashDump, which looks in the boot volume’s current paging file (created by the kernel during the boot process) to see whether a crash dump is present. SMSS then checks whether the target dump file is on a different volume than the paging file. If so, it renames the paging file to a temporary dump file name, Dumpxxx.tmp (where xxx is the current low value of the system’s tick count), and truncates the file to the size of the dump data. (This information is stored in the header at the top of each dump file.) It also removes both the hidden and system attributes from the file. SMSS then creates the volatile registry key HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\MachineCrash and stores the temporary dump file name in the value “DumpFile”. It then writes a REG_DWORD to the “TempDestination” value indicating whether the dump file location is only the temporary destination. If the paging file is on the same volume as the destination dump file, a temporary dump file isn’t used, and the paging file is directly renamed to the dump file name. In this case, the DumpFile value will be %SystemRoot%\Memory.dmp and TempDestination will be 0.
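The volume comparison that drives this decision can be sketched as follows. This Python fragment models only the same-volume test described above. The function name and the returned field `uses_temp_file` are illustrative, and treating TempDestination as 1 in the temporary-file case is an assumption (the text only states that it is 0 when no temporary file is used).

```python
import ntpath  # Windows path rules, usable on any platform


def plan_dump_handoff(paging_file: str, target_dump: str) -> dict:
    """Decide whether a temporary dump file is needed for the handoff."""
    paging_volume = ntpath.splitdrive(paging_file)[0].upper()
    target_volume = ntpath.splitdrive(target_dump)[0].upper()
    uses_temp = paging_volume != target_volume
    return {
        "uses_temp_file": uses_temp,
        # Assumption: 1 marks a temporary destination; 0 is stated in the text.
        "TempDestination": 1 if uses_temp else 0,
    }


# Same volume: the paging file is renamed straight to the dump file name.
print(plan_dump_handoff(r"C:\pagefile.sys", r"C:\Windows\Memory.dmp"))
# Different volumes: a temporary Dumpxxx.tmp file is used first.
print(plan_dump_handoff(r"C:\pagefile.sys", r"D:\Dumps\Memory.dmp"))
```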

Later in the boot, Wininit checks for the presence of the MachineCrash key, and if it exists, Wininit launches WerFault, which reads the TempDestination and DumpFile values and either renames or copies the temporary file to its target location (typically %SystemRoot%\Memory.dmp, unless configured otherwise) depending on whether the target is on the same volume as the Windows directory. WerFault then writes the final dump file name to the FinalDumpFileLocation value in the MachineCrash key.

To support machines that have no paging file at all, or none on the boot volume (for example, systems that boot from a SAN or read-only media), Windows also supports the use of a dedicated dump file that is configured in the DedicatedDumpFile and DumpFileSize values under the HKLM\SYSTEM\CurrentControlSet\Control\CrashControl registry key. When a dedicated dump file is specified, the crash dump driver (%SystemRoot%\System32\Drivers\Crashdmp.sys) creates the dump file of the required size and writes the crash data there instead of the paging file. If a full or kernel dump is configured but there is not enough space on the target volume to create the dedicated dump file of the required size, the system falls back to writing a minidump.


Crash Dump Files


By default, all Windows systems are configured to attempt to record information about the state of the system when the system crashes. You can see these settings by opening the System tool in Control Panel, clicking the Advanced tab in the System Properties dialog box, and then clicking the Settings button under Startup And Recovery.

Three levels of information can be recorded on a system crash:

• Complete memory dump. A complete memory dump contains all of physical memory at the time of the crash. This type of dump requires that a page file be at least the size of physical memory plus 1 MB for the header. Device drivers can take advantage of up to 256 MB for device dump data, but the additional space is not required for a header. Because it can require an inordinately large page file on large memory systems, this type of dump file is the least common setting. If the system has more than 2 GB of RAM, this option will be disabled in the UI, but you can manually enable it by setting the CrashDumpEnabled value to 1 in the HKLM\SYSTEM\CurrentControlSet\Control\CrashControl registry key. At initialization time, Windows will check whether the page-file size is large enough for a complete dump and automatically switch to creating a small memory dump if not. Large server systems might not have space for a complete dump but may be able to dump useful information, so you can add the IgnorePagefileSize value to the same registry key to have the system generate a dump file until it runs out of space.

• Kernel memory dump. A kernel memory dump contains only the kernel-mode read/write pages present in physical memory at the time of the crash. This type of dump doesn’t contain pages belonging to user processes. Because only kernel-mode code can directly cause Windows to crash, however, it’s unlikely that user process pages are necessary to debug a crash. In addition, all data structures relevant for crash dump analysis (including the list of running processes, stack of the current thread, and list of loaded drivers) are stored in nonpaged memory that is saved in a kernel memory dump. There is no way to predict the size of a kernel memory dump because its size depends on the amount of kernel-mode memory allocated by the operating system and drivers present on the machine. This is the default setting for both Windows Vista and Windows Server 2008.

• Small memory dump. A small memory dump, which is typically between 128 KB and 1 MB in size and is also called a minidump or triage dump, contains the stop code and parameters, the list of loaded device drivers, the data structures that describe the current process and thread, the kernel stack for the thread that caused the crash, and additional memory considered potentially relevant by crash dump heuristics, such as the pages referenced by processor registers that contain memory addresses and secondary dump data added by drivers. The debugger indicates that it has limited information available to it when it loads a minidump, and basic commands like !process, which lists active processes, don’t have the data they need. Here is an example of !process executed on a minidump:

Microsoft (R) Windows Debugger Version 6.10.0003.233 X86
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\Windows\Minidump\Mini100108-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available
0: kd> !process 0 0
GetPointerFromAddress: unable to read from 81d3a86c
Error in reading nt!_EPROCESS at 00000000

A kernel memory dump includes more information, but switching to a different process’s address space mappings won’t work because required data isn’t in the dump file. Here is an example of the debugger loading a kernel memory dump, followed by an attempt to switch process address spaces:

Microsoft (R) Windows Debugger Version 6.10.0003.233 X86
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\Windows\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available
0: kd> !process 0 0 explorer.exe
PROCESS 867250a8 ...

0: kd> .process 867250a8 ...
Process 867250a8 has invalid page directories

While a complete memory dump is a superset of the other options, it has the drawback that its size tracks the amount of physical memory on a system and can therefore become unwieldy. It’s not unusual for systems today to have several gigabytes of memory, resulting in crash dump files that are too large to be uploaded to an FTP server or burned onto a CD. Because user-mode code and data are not used during the analysis of most crashes (because crashes originate as a result of problems in kernel memory, and system data structures reside in kernel memory) much of the data stored in a complete memory dump is not relevant to analysis and therefore contributes wastefully to the size of a dump file. A final disadvantage is that the paging file on the boot volume (the volume with the \Windows directory) must be at least as large as the amount of physical memory on the system plus up to 356 MB. Because the size of the paging files required, in general, inversely tracks the amount of physical memory present, this requirement can force the paging file to be unnecessarily large. You should therefore consider the advantages offered by the small and kernel memory dump options.

An advantage of a minidump is its small size, which makes it convenient for exchange via e-mail, for example. In addition, each crash generates a file in the directory \Windows\Minidump with a unique file name consisting of the string “Mini” plus the date plus a sequence number that counts the number of minidumps on that day (for example, Mini082608-01.dmp). A disadvantage of minidumps is that to analyze them, you must have access to the exact images used on the system that generated the dump at the time you analyze the dump. (At a minimum, a copy of the matching Ntoskrnl.exe is needed to perform the most basic analysis.) This can be problematic if you want to analyze a dump on a system different from the system that generated the dump. However, the Microsoft symbol server contains images (and symbols) for all recent Windows versions, so you can set the image path in the debugger to point to the symbol server, and the debugger will automatically download the needed images. (Of course, the Microsoft symbol server won’t have images for third-party drivers you have installed.)
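The naming scheme can be reproduced with a short Python helper, which is handy when sorting or matching minidumps by date. The MMDDYY order and two-digit sequence number are inferred from the example name Mini082608-01.dmp, and the function name is illustrative.

```python
from datetime import date


def minidump_name(crash_date: date, sequence: int) -> str:
    """Build a name of the form Mini<MMDDYY>-<NN>.dmp (format inferred from the example)."""
    return f"Mini{crash_date:%m%d%y}-{sequence:02d}.dmp"


print(minidump_name(date(2008, 8, 26), 1))
```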

A more significant disadvantage is that the limited amount of data stored in the dump can hamper effective analysis. You can also get the advantages of minidumps even when you configure a system to generate kernel or complete crash dumps by opening the larger crash with WinDbg and using the .dump /m command to extract a minidump. Note that a minidump is automatically created even if the system is set for full or kernel dumps.

The kernel memory dump option offers a practical middle ground. Because it contains all of kernel-mode-owned physical memory, it has the same level of analysis-related data as a complete memory dump, but it omits the usually irrelevant user-mode data and code, and therefore can be significantly smaller. As an example, on a system running Windows Vista with 4 GB of RAM, a kernel memory dump was 160 MB in size.

When you configure kernel memory dumps, the system checks whether the paging file is large enough, as described earlier. General sizing recommendations exist, but they are only estimates, because the size of a kernel memory dump depends on the amount of kernel-mode memory in use by the operating system and drivers present on the machine at the time of the crash and so cannot be predicted.

Therefore, it is possible that at the time of the crash, the paging file is too small to hold a kernel dump. If you want to see the size of a kernel dump on your system, force a manual crash either by configuring the option to allow you to initiate a manual system crash from the console or by using the Notmyfault tool. When you reboot, you can check to make sure that a kernel dump was generated and check its size to gauge how large to make your boot volume paging file. To be conservative, on 32-bit systems you can choose a page file size of 2 GB plus up to 356 MB, because 2 GB is the maximum kernel-mode address space available (unless you are booting with the 3gb and/or userva boot options, in which case this can be up to 3 GB). If you do not have enough space on the boot volume for saving the memory.dmp file, you can choose a location on any other local hard disk through the dialog box.
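The conservative figure is simple arithmetic, shown here as a Python sketch. The defaults are the values quoted in the paragraph above (2 GB of kernel-mode address space plus up to 356 MB); the function and parameter names are illustrative.

```python
def conservative_kernel_pagefile_mb(kernel_space_gb: int = 2, extra_mb: int = 356) -> int:
    """Upper-bound page file size in MB for a kernel dump on a 32-bit system."""
    return kernel_space_gb * 1024 + extra_mb


print(conservative_kernel_pagefile_mb())
```

A forced test crash, as described above, usually gives a much smaller and more realistic number than this upper bound.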


Troubleshooting Crashes


You often begin seeing blue screens after you install a new software product or piece of hardware. If you’ve just added a driver, rebooted, and gotten a blue screen early in system initialization, you can reset the machine, press the F8 key when instructed, and then select Last Known Good Configuration. Enabling last known good causes Windows to revert to a copy of the registry’s device driver registration key (HKLM\SYSTEM\CurrentControlSet\Services) from the last successful boot (before you installed the driver). From the perspective of last known good, a successful boot is one in which all services and drivers have finished loading and at least one logon has succeeded.

During the reboot after a crash, the Boot Manager (Bootmgr) will automatically detect that Windows did not shut down properly and display a Windows Error Recovery message. This screen gives you the option to attempt booting into safe mode so that you can disable or uninstall the software component that might be broken.

If you keep getting blue screens, an obvious approach is to uninstall the components you added just before the first blue screen appeared. If some time has passed since you added something new or you added several things at about the same time, you need to note the names of the device drivers referenced in any of the parameters. If you recognize any of the names as being related to something you just added (such as Storport.sys if you put on a new SCSI drive), you’ve possibly found your culprit.

Many device drivers have cryptic names, but one approach you can take to figure out which application or hardware device is associated with a name is to find out the name of the service in the registry associated with a device driver by searching for the name of the device driver under the HKLM\SYSTEM\CurrentControlSet\Services key. This branch of the registry is where Windows stores registration information for every device driver in the system. If you find a match, look for values named DisplayName and Description. Some drivers fill in these values to describe the device driver’s purpose. For example, you might find the string “Virus Scanner” in the DisplayName value, which can implicate the antivirus software you have running. The list of drivers can be displayed in the System Information tool (from the Start menu, select Programs, System Tools, System Information). In System Information, expand Software Environment, and then select System Drivers. Process Explorer also lists the currently loaded drivers, including their version numbers and load addresses, in the DLL view of the System process. Another option is to open the Properties dialog box for the driver file and examine the information on the Details tab, which often contains the description and company information for the driver. Keep in mind that the registry information and file description are provided by the driver manufacturer, and there is nothing to guarantee their accuracy.
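The lookup described above amounts to matching a driver file name against each service entry and reading its descriptive values. The Python sketch below runs against an in-memory copy of the Services data rather than the live registry, so it can be tried anywhere. The sample entry (service name, file name, and description) is entirely hypothetical; DisplayName and Description are the value names mentioned in the text, and ImagePath is the value that holds the driver's file path.

```python
import ntpath  # Windows path rules, usable on any platform


def describe_driver(services: dict, driver_file: str):
    """Find the service whose image is `driver_file` and return its descriptive values."""
    wanted = driver_file.lower()
    for service_name, values in services.items():
        image = ntpath.basename(values.get("ImagePath", "")).lower()
        # Fall back to the service name when no ImagePath value is present.
        if image == wanted or (not image and service_name.lower() + ".sys" == wanted):
            return {
                "service": service_name,
                "DisplayName": values.get("DisplayName"),
                "Description": values.get("Description"),
            }
    return None


# Hypothetical snapshot of two entries under the Services key.
SERVICES = {
    "ExampleAv": {
        "ImagePath": r"\SystemRoot\System32\drivers\exav.sys",
        "DisplayName": "Virus Scanner",
    },
    "ExampleNoPath": {"Description": "Driver registered without an ImagePath"},
}

print(describe_driver(SERVICES, "exav.sys"))
```

As the text cautions, these strings are supplied by the driver manufacturer, so treat them as hints rather than proof.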

More often than not, however, the stop code and the four associated parameters aren’t enough information to troubleshoot a system crash. For example, you might need to examine the kernel-mode call stack to pinpoint the driver or system component that triggered the crash. Also, because the default behavior on Windows systems is to automatically reboot after a system crash, it’s unlikely that you would have time to record the information displayed on the blue screen. That is why, by default, Windows attempts to record information about the system crash to the disk for later analysis, which takes us to our next topic, crash dump files.

Source of Information : Microsoft Press Windows Internals 5th Edition

The Blue Screen


Regardless of the reason for a system crash, the function that actually performs the crash is KeBugCheckEx, documented in the Windows Driver Kit (WDK). This function takes a stop code (sometimes called a bugcheck code) and four parameters that are interpreted on a per–stop code basis. After KeBugCheckEx masks out all interrupts on all processors of the system, it switches the display into a low-resolution VGA graphics mode (one implemented by all Windows-supported video cards), paints a blue background, and then displays the stop code, followed by some text suggesting what the user can do. Finally, KeBugCheckEx calls any registered device driver bugcheck callbacks (registered by calling the KeRegisterBugCheckCallback function), allowing drivers an opportunity to stop their devices. It then calls registered reason callbacks (registered with KeRegisterBugCheckReasonCallback), which allow drivers to append data to the crash dump or write crash dump information to alternate devices.

KeBugCheckEx displays the textual representation of the stop code near the top of the blue screen and the numeric stop code and four parameters at the bottom of the blue screen.

The first line in the Technical information section lists the stop code and the four additional parameters passed to KeBugCheckEx. A text line near the top of the screen provides the text equivalent of the stop code’s numeric identifier. For example, stop code 0x000000D1 is a DRIVER_IRQL_NOT_LESS_OR_EQUAL crash. When a parameter contains an address of a piece of operating system or device driver code, Windows displays the base address of the module the address falls in, the date stamp, and the file name of the device driver. This information alone might help you pinpoint the faulty component.
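
If you copy the stop line down by hand, a small script can split it into the stop code and its four parameters. The sketch below is illustrative only. It assumes the line looks like "*** STOP: 0x000000D1 (p1,p2,p3,p4)", the parameter values in the usage note are made up, and the name table holds only the three stop codes mentioned in this article rather than the full list from Bugcodes.h.

```python
import re

# Assumed layout of the stop line: "STOP: <code> (<p1>,<p2>,<p3>,<p4>)", all in hex.
STOP_LINE = re.compile(
    r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(\s*"
    r"(0x[0-9A-Fa-f]+)\s*,\s*(0x[0-9A-Fa-f]+)\s*,\s*"
    r"(0x[0-9A-Fa-f]+)\s*,\s*(0x[0-9A-Fa-f]+)\s*\)"
)

# Only the stop codes named in this article; not a complete table.
KNOWN_NAMES = {
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
    0x8E: "KERNEL_MODE_EXCEPTION_NOT_HANDLED",
    0x101: "CLOCK_WATCHDOG_TIMEOUT",
}


def parse_stop_line(line):
    """Return (stop_code, [p1, p2, p3, p4]) as integers."""
    match = STOP_LINE.search(line)
    if match is None:
        raise ValueError("no stop code found in: " + line)
    code, *params = (int(group, 16) for group in match.groups())
    return code, params


def describe(line):
    """Format the stop code, its name (if known here), and its parameters."""
    code, params = parse_stop_line(line)
    name = KNOWN_NAMES.get(code, "UNKNOWN")
    args = ", ".join(f"0x{p:08X}" for p in params)
    return f"0x{code:08X} {name} ({args})"
```

For example, describe("*** STOP: 0x000000D1 (0x00000004,0x00000002,0x00000000,0xF8B6E0A0)") names the crash as DRIVER_IRQL_NOT_LESS_OR_EQUAL and lists the four parameters in order.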

Although there are more than 300 unique stop codes, most are rarely, if ever, seen on production systems. Instead, just a few common stop codes represent the majority of Windows system crashes. Also, the meaning of the four additional parameters depends on the stop code (and not all stop codes have extended parameter information). Nevertheless, looking up the stop code and the meaning of the parameters (if applicable) might at least assist you in diagnosing the component that is failing (or the hardware device that is causing the crash).

You can find stop code information in the section “Bug Checks (Blue Screens)” in the Debugging Tools for Windows help file. You can also search Microsoft’s Knowledge Base (http://support.microsoft.com) for the stop code and the name of the suspect hardware or application. You might find information about a workaround, an update, or a service pack that fixes the problem you’re having. The Bugcodes.h file in the WDK contains a complete list of the 300 or so stop codes, with some additional details on the reasons for some of them.

Based on data collected from the release of Windows Vista through the release of Windows Vista SP1, the top 30 stop codes account for 96 percent of crashes and can be grouped into a dozen categories:
• Page fault. A page fault on memory backed by data in a paging file or a memory-mapped file occurs at an IRQL of DPC/dispatch level or above, which would require the memory manager to wait for an I/O operation to complete. The kernel cannot wait or reschedule threads at an IRQL of DPC/dispatch level or higher. This category also includes page faults in nonpaged areas. The common stop codes are:

• Power management. A device driver or an operating system function running in kernel mode is in an inconsistent or invalid power state. Most frequently, some component has failed to complete a power management I/O request operation within 10 minutes. This crash category is new in Windows Vista. In previous versions of the Windows operating system, these failures generally resulted in a system hang with no crash. The stop codes are:

• Exceptions and traps. A device driver or an operating system function running in kernel mode incurs an unexpected exception or trap. The common stop codes are:
- 0x8E - KERNEL_MODE_EXCEPTION_NOT_HANDLED with P1 != 0xC0000005

• Access violations. A device driver or an operating system function running in kernel mode incurs a memory access violation, which is caused either by attempting to write to a read-only page or by attempting to read an address that isn’t currently mapped and therefore is not a valid memory location. The common stop codes are:

• Display. The display device driver detects that it can no longer control the graphics processing unit or detects an inconsistency in video memory management. The common stop codes are:

• Pool. The kernel pool manager detects an improper pool reference. The common stop codes are:

• Memory management. The kernel memory manager detects a corruption of memory management data structures or an improper memory management request. The common stop codes are:

• Consistency check. This is a catch-all category for various other consistency checks performed by the kernel or device drivers. The common stop codes are:
- 0x8086 – This is a stop code used by the Intel storage driver iastor.sys

• Hardware. A hardware error, such as a machine check or a nonmaskable interrupt (NMI), occurs. This category also includes disk failures when the memory manager is attempting to read data to satisfy page faults. The common stop codes are:
- 0x101 - CLOCK_WATCHDOG_TIMEOUT (Software bugs can cause these errors too, but they are most common on over-clocked hardware systems.)

• USB. An unrecoverable error occurs in a universal serial bus operation. The common stop code is:

• Critical object. A fatal error occurs in a critical object without which Windows cannot continue to run. The common stop code is:

• NTFS file system. A fatal error is detected by the NTFS file system. The common stop code is:
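
Because the meaning of the parameters depends on the stop code, sorting a crash into one of the categories above sometimes means looking at the first parameter as well. The sketch below is a minimal illustration that covers only the stop codes listed in this article. Treating 0x8E with P1 equal to 0xC0000005 (the NTSTATUS value for an access violation) as an access-violation crash is an assumption inferred from the "P1 != 0xC0000005" condition above, and the function name is made up for the example.

```python
STATUS_ACCESS_VIOLATION = 0xC0000005  # NTSTATUS code for an access violation


def categorize(stop_code, p1=0):
    """Map a stop code (and its first parameter) to a category from the list above."""
    if stop_code == 0x8E:  # KERNEL_MODE_EXCEPTION_NOT_HANDLED
        if p1 == STATUS_ACCESS_VIOLATION:
            # Assumption: inferred from the "P1 != 0xC0000005" condition.
            return "Access violations"
        return "Exceptions and traps"
    if stop_code == 0x8086:  # used by the Intel storage driver iastor.sys
        return "Consistency check"
    if stop_code == 0x101:  # CLOCK_WATCHDOG_TIMEOUT
        return "Hardware"
    return "Unclassified"  # every stop code this article does not list
```

A fuller table would need the remaining stop codes from the Debugging Tools for Windows help file.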

Source of Information : Microsoft Press Windows Internals 5th Edition

Why Does Windows Crash?


Windows crashes (stops execution and displays the blue screen) for many possible reasons. A common source is a reference to a memory address that causes an access violation, either a write operation to read-only memory or a read operation on an address that is not mapped. Another common cause is an unexpected exception or trap. Crashes also occur when a kernel subsystem (such as the memory manager or power manager) or a driver (such as a USB or display driver) detects inconsistencies in its operation.

When a kernel-mode device driver or subsystem causes an illegal exception, Windows faces a difficult dilemma. It has detected that a part of the operating system with the ability to access any hardware device and any valid memory has done something it wasn’t supposed to do.

But why does that mean Windows has to crash? Couldn’t it just ignore the exception and let the device driver or subsystem continue as if nothing had happened? The possibility exists that the error was isolated and that the component will somehow recover. But what’s more likely is that the detected exception resulted from deeper problems, for example, from a general corruption of memory or from a hardware device that’s not functioning properly. Permitting the system to continue operating would probably result in more exceptions, and data stored on disk or other peripherals could become corrupt, a risk that’s too high to take. So Windows adopts a fail-fast policy in attempting to prevent the corruption in RAM from spreading to disk.

Source of Information : Microsoft Press Windows Internals 5th Edition

Understanding Exchange Server Messaging Roles


Exchange Server 2010 implementations have three layers in their architecture: a network layer, directory layer, and messaging layer. The messaging layer is where you define and deploy the Exchange Server roles. The Exchange servers at the core of the messaging layer can operate in the following roles:

• Mailbox Server. A back-end server that hosts mailboxes, public folders, and related messaging data, such as address lists, resource scheduling, and meeting items. For high availability of mailbox databases, you can use database availability groups.

• Client Access Server. A middle-tier server that accepts connections to Exchange Server from a variety of clients. This server hosts the protocols used by all clients when checking messages. On the local network, Outlook MAPI clients are connected directly to the Client Access server to check mail. Remote users can check their mail over the Internet by using Outlook Anywhere, Outlook Web App, Exchange ActiveSync, POP3, or IMAP4.

• Unified Messaging Server. A middle-tier server that integrates a private branch exchange (PBX) system with Exchange Server 2010, allowing voice messages and faxes to be stored with e-mail in a user’s mailbox. Unified messaging supports call answering with automated greetings and message recording, fax receiving, and dial-in access. With dial-in access, users can use Outlook Voice Access to check voice mail, e-mail, and calendar information; to review or dial contacts; and to configure preferences and personal options. To receive faxes, you need an integrated solution from a Microsoft partner.

• Hub Transport Server. A mail routing server that handles mail flow, routing, and delivery within the Exchange organization. This server processes all mail that is sent inside the organization before it is delivered to a mailbox in the organization or routed to users outside the organization. Processing ensures that senders and recipients are resolved and filtered as appropriate, content is filtered and has its format converted if necessary, and attachments are screened. To meet any regulatory or organizational compliance requirements, the Hub Transport server can also record, or journal, messages and add disclaimers to them.

• Edge Transport Server. An additional mail routing server that routes mail into and out of the Exchange organization. This server is designed to be deployed in an organization’s perimeter network and is used to establish a secure boundary between the organization and the Internet. This server accepts mail coming into the organization from the Internet and from trusted servers in external organizations, processes the mail to protect against some types of spam messages and viruses, and routes all accepted messages to a Hub Transport server inside the organization.

These five roles are the building blocks of Exchange organizations. Processors can be single core, dual core, or multiple core. A dedicated Mailbox server has a recommended maximum number of processor cores of 12, but a server with the Mailbox and other roles combined has a recommended maximum of 16. Note that although Exchange Server 2010 can support this number of processor cores, it might make more sense to scale out to multiple servers rather than to scale up the processor cores on a single server.

Because you can combine all of the roles except the Edge Transport server role on a single server, one of the most basic Exchange organizations you can create is one that includes a single Exchange server that provides the Mailbox server, Client Access server, and Hub Transport server roles. These three roles are the minimum required for routing and delivering messages to both local and remote messaging clients. For added security and protection, you can deploy the Edge Transport server role in a perimeter network on one or more separate servers.

Although a basic implementation of Exchange Server might include only one server, you’ll likely find investing in multiple servers is more effective in terms of time, money, and resources. Why? High availability is integrated into the core architecture of Exchange Server 2010.

With the Mailbox server role, you can configure automatic failover by making the Mailbox servers members of the same database availability group. Each Mailbox server in the group can then have a copy of the mailbox databases from the other Mailbox servers in the group. Each mailbox database can have up to 16 copies, and this means you can have up to 16 Mailbox servers in a group as well.

With the Client Access role, you can enable load balancing and failover support by making Client Access servers members of the same Client Access array. Each Client Access server in the array will then be able to support all client access features, including Outlook MAPI, POP3, IMAP4, Outlook Anywhere, Outlook Web App, and Exchange ActiveSync. You can use Client Access arrays to build groups of up to 32 load-balanced servers, starting with as few as two servers and incrementally scaling as demand increases. Servers that are members of an array cannot also have the Mailbox role. If you are using the Network Load Balancing service, Microsoft recommends no more than eight load-balanced servers.
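
The role-combination rules and limits described in this section can be written down as a small planning check. The following is a rough sketch, not an Exchange tool: the role identifiers, function names, and constant names are made up for illustration, and the checks cover only the constraints stated above.

```python
ROLES = {"Mailbox", "ClientAccess", "HubTransport", "UnifiedMessaging", "EdgeTransport"}
MINIMUM_ROLES = {"Mailbox", "ClientAccess", "HubTransport"}

MAX_DATABASE_COPIES = 16         # also the maximum Mailbox servers per availability group
MAX_CLIENT_ACCESS_ARRAY = 32     # load-balanced servers per Client Access array
MAX_NLB_SERVERS_RECOMMENDED = 8  # when using the Network Load Balancing service


def check_server_roles(roles, in_client_access_array=False):
    """Return a list of problems with one server's planned role set."""
    roles = set(roles)
    problems = []
    for unknown in sorted(roles - ROLES):
        problems.append("unknown role: " + unknown)
    if "EdgeTransport" in roles and len(roles) > 1:
        problems.append("Edge Transport cannot be combined with other roles")
    if in_client_access_array and "Mailbox" in roles:
        problems.append("Client Access array members cannot hold the Mailbox role")
    return problems


def missing_minimum_roles(servers):
    """Given one role set per server, return the required roles nobody provides."""
    deployed = set().union(*servers)
    return sorted(MINIMUM_ROLES - deployed)


def recommended_max_cores(roles):
    """Recommended maximum processor cores for a server holding the Mailbox role."""
    roles = set(roles)
    if "Mailbox" not in roles:
        return None  # this section gives no figure for servers without the Mailbox role
    return 12 if roles == {"Mailbox"} else 16
```

For instance, a single server holding the Mailbox, Client Access, and Hub Transport roles passes check_server_roles and leaves missing_minimum_roles empty, which matches the basic single-server organization described above.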

Because of the built-in, high-availability features, the hardware you use with Exchange Server 2010 might be very different from the hardware you use with earlier releases of Exchange Server.

Source of Information : Microsoft Press - Exchange Server 2010 Administrators Pocket Consultant

Deploying Exchange Server 2010


Before you deploy Microsoft Exchange Server 2010, you should carefully plan the messaging architecture. Every Exchange implementation has three layers in its architecture:

• Network layer. The network layer provides the foundation for computer-to-computer communications and essential name resolution features. The network layer has both physical and logical components. The physical components include the IP addresses, the IP subnets, local area network (LAN) or wide area network (WAN) links used by messaging systems as well as the routers that connect these links, and firewalls that protect the infrastructure. The logical components are the Domain Name System (DNS) zones that define the naming boundaries and contain the essential resource records required for name resolution.

• Directory layer. The directory layer provides the foundation necessary for authentication, authorization, and replication. The directory layer is built on the Active Directory directory service and has both physical and logical components. The physical components include the domain controllers, Global Catalog servers, and site links used for authentication, authorization, and replication. The logical components include the Active Directory forests, sites, domains, and organizational units that are used to group objects for resource sharing, centralized management, and replication control. The logical components also include the users and groups that are part of the Active Directory infrastructure.

• Messaging layer. The messaging layer provides the foundation for messaging and collaboration. The messaging layer has both physical and logical components. The physical components include individual Exchange servers that determine how messages are delivered and mail connectors that determine how messages are routed outside an Exchange server’s routing boundaries. The logical components specify the organizational boundaries for messaging, mailboxes used for storing messages, public folders used for storing data, and distribution lists used for distributing messages to multiple recipients.

Whether you are deploying Exchange Server for the first time in your organization or upgrading to Exchange Server 2010 from an earlier release of Exchange Server, you need to closely review each layer of this architecture and plan for required changes. As part of your implementation planning, you also need to look closely at the roles your Exchange servers will perform and modify the hardware accordingly to meet the requirements of these roles on a per-server basis. Exchange Server is no longer the simple messaging server that it once was. It is now a complex messaging platform with many components that work together to provide a comprehensive solution for routing, delivering, and accessing e-mail messages, voicemail messages, faxes, contacts, and calendar information.

Source of Information : Microsoft Press - Exchange Server 2010 Administrators Pocket Consultant

Understanding How Exchange Routes Messages


Within the organization, Hub Transport servers use the information about sites stored in Active Directory to determine how to route messages, and they can also route messages across site links. The Hub Transport server does this by querying Active Directory about its site membership and the site membership of other servers, and then it uses the information it discovers to route messages appropriately. Because of this, when you are deploying an Exchange Server 2010 organization, no additional configuration is required to establish routing in the Active Directory forest.

For mail delivery within the organization, additional routing configuration is necessary only in these specific scenarios:

• If you deploy Exchange Server 2010 in an existing Exchange Server 2003 organization, you must configure a two-way routing group connector from the Exchange routing group to each Exchange Server 2003 routing group that communicates with Exchange Server 2010. You must also suppress link state updates for each of those routing groups.

• If you deploy an Exchange Server 2010 organization with multiple forests, you must install Exchange Server 2010 in each forest and then connect the forests using appropriate cross-forest trusts. The trust allows users to see address and availability data across the forests.

• In an Exchange Server 2010 organization, if you want direct mail flow between Exchange servers in different forests, you must configure SMTP send connectors and SMTP receive connectors on the Hub Transport servers that should communicate directly with each other.

The organization’s Mail Transport servers handle mail delivery outside the organization and receipt of mail from outside servers. You can use two types of Mail Transport servers: Hub Transport servers and Edge Transport servers. You deploy Hub Transport servers within the organization. You can optionally deploy Edge Transport servers in the organization’s perimeter network for added security. Typically a perimeter network is a secure network set up outside the organization’s private network.

With Hub Transport servers, no other special configuration is needed for message routing to external destinations. You must configure only the standard mail setup, which includes identifying DNS servers to use for lookups. With Edge Transport servers, you can optimize mail routing and delivery by configuring one-way synchronization from the internal Hub Transport servers to the perimeter network’s Edge Transport servers. Beyond this, no other special configuration is required for mail routing and delivery.

Source of Information : Microsoft Press - Exchange Server 2010 Administrators Pocket Consultant

Understanding How Exchange Stores Information


Exchange stores four types of data in Active Directory: schema data (stored in the Schema partition), configuration data (stored in the Configuration partition), domain data (stored in the Domain partition), and application data (stored in application-specific partitions). In Active Directory, schema rules determine what types of objects are available and what attributes those objects have. When you install the first Exchange server in the forest, the Active Directory preparation process adds many Exchange-specific object classes and attributes to the Schema partition in Active Directory. This allows Exchange-specific objects, such as agents and connectors, to be created. It also allows you to extend existing objects, such as users and groups, with new attributes, such as attributes that allow user objects to be used for sending and receiving e-mail. Every domain controller and global catalog server in the organization has a complete copy of the Schema partition.

During the installation of the first Exchange server in the forest, Exchange configuration information is generated and stored in Active Directory. Exchange configuration information, like other configuration information, is also stored in the Configuration partition. For Active Directory, the configuration information describes the structure of the directory, and the Configuration container includes all of the domains, trees, and forests, as well as the locations of domain controllers and global catalogs. For Exchange, the configuration information is used to describe the structure of the Exchange organization. The Configuration container includes lists of templates, policies, and other global organization-level details. Every domain controller and global catalog server in the organization has a complete copy of the Configuration partition.

In Active Directory, the Domain partition stores domain-specific objects, such as users and groups, and the stored values of attributes associated with those objects. As you create, modify, or delete objects, Exchange stores the details about those objects in the Domain partition. During the installation of the first Exchange server in the forest, Exchange objects are created in the current domain. Whenever you create new recipients or modify Exchange details, the related changes are reflected in the Domain partition as well. Every domain controller has a complete copy of the Domain partition for the domain for which it is authoritative. Every global catalog server in the forest maintains information about a subset of every Domain partition in the forest.

Source of Information : Microsoft Press - Exchange Server 2010 Administrators Pocket Consultant

Exchange Server Security Groups


Like Exchange Server 2007, Exchange Server 2010 uses predefined universal security groups to separate administration of Exchange permissions from administration of other permissions. When you add an administrator to one of these security groups, the administrator inherits the permissions granted to that group.

The predefined security groups have permissions to manage the following types of Exchange data in Active Directory:

• Organization Configuration node. This type of data is not associated with a specific server and is used to manage databases, policies, address lists, and other types of organizational configuration details.

• Server Configuration node. This type of data is associated with a specific server and is used to manage the server’s messaging configuration.

• Recipient Configuration node. This type of data is associated with mailboxes, mail-enabled contacts, and distribution groups.

In Exchange Server 2010, databases have been moved from the Server Configuration node to the Organization Configuration node. This change was necessary because the Exchange schema was flattened and storage groups were removed. As a result of these changes, all storage group functionality has been moved to the database level.

The predefined groups are as follows:
• Delegated Setup. Members of this group have permission to install and uninstall Exchange on provisioned servers.

• Discovery Management. Members of this group can perform mailbox searches for data that meets specific criteria.

• Exchange All Hosted Organizations. Members of this group include hosted organization mailbox groups. This group is used to apply Password Setting objects to all hosted mailboxes.

• Exchange Servers. Members of this group are Exchange servers in the organization. This group allows Exchange servers to work together.

• Exchange Trusted Subsystem. Members of this group are Exchange servers that run Exchange cmdlets using WinRM. Members of this group have permission to read and modify all Exchange configuration settings as well as user accounts and groups.

• Exchange Windows Permissions. Members of this group are Exchange servers that run Exchange cmdlets using WinRM. Members of this group have permission to read and modify user accounts and groups.

• ExchangeLegacyInterop. Members of this group are granted send-to and receive-from permissions, which are necessary for routing group connections between Exchange Server 2010 and Exchange Server 2003. Exchange Server 2003 bridgehead servers must be made members of this group to allow proper mail flow in the organization.

• Help Desk. Members of this group can view any property or object within the Exchange organization and have limited management permissions, including the right to change and reset passwords.

• Hygiene Management. Members of this group can manage the antispam and antivirus features of Exchange.

• Organization Management. Members of this group have full access to all Exchange properties and objects in the Exchange organization.

• Public Folder Management. Members of this group can manage public folders and perform most public folder management operations.

• Recipient Management. Members of this group have permissions to modify Exchange user attributes in Active Directory and perform most mailbox operations.

• Records Management. Members of this group can manage compliance features, including retention policies, message classifications, and transport rules.

• Server Management. Members of this group can manage all Exchange servers in the organization but do not have permission to perform global operations.

• UM Management. Members of this group can manage all aspects of unified messaging, including unified messaging server configuration and unified messaging recipient configuration.

• View-Only Organization Management. Members of this group have read-only access to the entire Exchange organization tree in the Active Directory configuration container and read-only access to all the Windows domain containers that have Exchange recipients.

Source of Information : Microsoft Press - Exchange Server 2010 Administrators Pocket Consultant

Exchange Server Authentication and Security


In Exchange Server 2010, e-mail addresses, distribution groups, and other directory resources are stored in the directory database provided by Active Directory. Active Directory is a directory service running on Windows domain controllers. When there are multiple domain controllers, the controllers automatically replicate directory data with each other using a multimaster replication model. This model allows any domain controller to process directory changes and then replicate those changes to other domain controllers.

The first time you install Exchange Server 2010 in a Windows domain, the installation process updates and extends Active Directory to include objects and attributes used by Exchange Server 2010. Unlike Exchange Server 2003 and earlier releases of Exchange, this process does not include updates for the Active Directory Users And Computers Snap-In for Microsoft Management Console (MMC), and you do not use Active Directory Users And Computers to manage mailboxes, messaging features, messaging options, or e-mail addresses associated with user accounts. You perform these tasks using the Exchange Management tools.

Exchange Server 2010 fully supports the Windows Server security model and relies on this security mechanism to control access to directory resources. This means you can control access to mailboxes and membership in distribution groups and you can perform other Exchange security administration tasks through the standard Windows Server permission set. For example, to add a user to a distribution group, you simply make the user a member of the distribution group in Active Directory Users And Computers.

Because Exchange Server uses Windows Server security, you can’t create a mailbox without first creating a user account that will use the mailbox. Every Exchange mailbox must be associated with a domain account—even those used by Exchange for general messaging tasks. For example, the SMTP and System Attendant mailboxes that Exchange Server uses are associated by default with the built-in System user. In the Exchange Management Console, you can create a new user account as part of the process of creating a new mailbox.

To support coexistence with Exchange Server 2003, all Exchange Server 2010 servers are automatically added to a single administrative group when you install Exchange Server 2010. This administrative group is recognized in the Exchange System Manager in Exchange Server 2003 as “Exchange Administrative Group.” Although Exchange Server 2003 uses administrative groups to gather Exchange objects for the purposes of delegating permission to manage those objects, Exchange Server 2007 and Exchange Server 2010 do not use administrative groups. Instead, you manage Exchange servers according to their roles and the type of information you want to manage using the Exchange Management Console.

Source of Information : Microsoft Press - Exchange Server 2010 Administrators Pocket Consultant

Exchange Server 2010 and Your Hardware


Before you deploy Exchange Server 2010, you should carefully plan the messaging architecture. As part of your implementation planning, you need to look closely at preinstallation requirements and the hardware you will use. Exchange Server is no longer the simple messaging server that it once was. It is now a complex messaging platform with many components that work together to provide a comprehensive solution for routing, delivering, and accessing e-mail messages, voice-mail messages, faxes, contacts, and calendar information.

Successful Exchange Server administration depends on three things:
• Knowledgeable Exchange administrators
• Strong architecture
• Appropriate hardware

The first two ingredients are covered: you’re the administrator, you’re smart enough to buy this book to help you through the rough spots, and you’ve enlisted Exchange Server 2010 to provide your high-performance messaging needs. This brings us to the issue of hardware. Exchange Server 2010 should run on a system with adequate memory, processing speed, and disk space. You also need an appropriate data-protection and system-protection plan at the hardware level. Key guidelines for choosing hardware for Exchange Server are as follows:

• Memory. Exchange Server 2010 has been tested and developed for maximum memory configurations of 64 gigabytes (GB) for Mailbox servers and 16 GB for all other server roles except Unified Messaging. For Unified Messaging, the maximum is 8 GB. For multirole servers, the maximum is 64 GB. The minimum random access memory (RAM) is 2 GB. In most cases, you’ll want to have at least twice the recommended minimum amount of memory. The primary reason for this is performance. Most of the Exchange installations I run use 4 GB of RAM as a starting point, even in small installations. In multiple Exchange server installations, the Mailbox server should have at least 2 GB of RAM plus 5 megabytes (MB) of RAM per mailbox. For all Exchange server configurations, the paging file should be at least equal to the amount of RAM in the server plus 10 MB.

• CPU. Exchange Server 2010 runs on the x64 family of processors from AMD and Intel, including AMD64 and Intel Extended Memory 64 Technology (Intel EM64T). Exchange Server 2010 provides solid benchmark performance with Intel Xeon 3.4 GHz and higher or AMD Opteron 3.1 GHz and higher. Any of these CPUs provide good starting points for the average Exchange Server system. You can achieve significant performance improvements with a high level of processor cache. Look closely at the L1, L2, and L3 cache options available, because a larger cache can yield much better performance overall. Look also at the speed of the front-side bus. The faster the bus speed, the faster the CPU can access memory. Exchange Server 2010 runs only on 64-bit hardware. The primary advantages of 64-bit processors over 32-bit processors are related to memory limitations and data access. Because 64-bit processors can address more than 4 GB of memory at a time without physical address extension, they can store greater amounts of data in main memory, providing direct access to and faster processing of data. In addition, 64-bit processors can process data and execute instruction sets that are twice as large as those handled by 32-bit processors. Accessing 64 bits of data (versus 32 bits) offers a significant advantage when processing complex calculations that require a high level of precision.

Note: At the time of this writing, 64-bit versions do not support Intel Itanium.

• SMP: Exchange Server 2010 supports symmetric multiprocessors, and you’ll see significant performance improvements if you use multiple CPUs. Microsoft tested and developed Exchange Server 2010 for use with dual-core and multicore CPUs as well. The minimum, recommended, and maximum number of CPUs (whether single core, dual core, or multicore) depends on a server’s Exchange roles. Still, if Exchange Server is supporting a small organization with a single domain, one CPU with multiple cores should be enough. If the server supports a medium or large organization or handles mail for multiple domains, you might want to consider adding processors. When it comes to processor cores, I prefer two 4-core processors to a single 8-core processor given current price and performance tradeoffs. An alternative is to distribute the workload across different servers based on where you locate resources.

• Disk drives: The data storage capacity you need depends entirely on the amount and size of the data that will pass through, be journaled on, or be stored on the Exchange server. You need enough disk space to store all data and logs, plus workspace, system files, and virtual memory. Input/output (I/O) throughput is just as important as drive capacity. Rather than use one large drive, you should use several drives, which allow you to configure fault tolerance with RAID.

• Data protection: You can add protection against unexpected drive failures by using RAID. For the boot and system disks, use RAID 1 on internal drives. However, because of the new high-availability features, you might not want to use RAID for Exchange data and logs. You also might not want to use expensive disk storage systems. Instead, you might want to deploy multiple Exchange servers with each of your Exchange roles. If you decide to use RAID, remember that storage arrays typically already have an underlying RAID configuration and you might have to use a tool such as Storage Manager For SANs to help you distinguish between logical unit numbers (LUNs) and physical disks. For data, use RAID 0 or RAID 5. For logs, use RAID 1. RAID 0 (disk striping without parity) offers good read/write performance, but any failed drive means that Exchange Server can’t continue operation on an affected database until the drive is replaced and data is restored from backup. RAID 1 (disk mirroring) creates duplicate copies of data on separate drives; you can continue operations if one of the drives fails and then rebuild the RAID unit to restore full operations. RAID 5 (disk striping with parity) offers good protection against single drive failure, but it has poor write performance. For best performance and fault tolerance, RAID 10 (also referred to as RAID 0 + 1), which consists of disk mirroring and disk striping without parity, is also an option.

• Uninterruptible power supply: Exchange Server 2010 is designed to maintain database integrity at all times and can recover information using transaction logs. This doesn’t protect the server hardware, however, from sudden power loss or power spikes, both of which can seriously damage hardware. To prevent this, connect your server to an uninterruptible power supply (UPS). A UPS gives you time to shut down the server or servers properly in the event of a power outage. Proper shutdown is especially important on servers using write-back caching controllers. These controllers temporarily store data in cache. Without proper shutdown, this data can be lost before it is written to disk. Note that most write-back caching controllers have batteries that help ensure that changes can be written to disk after the system comes back online.
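
The memory guideline above (2 GB of RAM plus 5 MB per mailbox for a Mailbox server, and a paging file of at least RAM plus 10 MB) is simple arithmetic, and a short Python sketch can make it concrete. The function names are hypothetical and exist only for this illustration, and the sketch assumes 1 GB = 1,024 MB; treat the results as a starting estimate rather than a sizing tool.

```python
# Rough sizing sketch based on the guideline figures quoted above.
# Assumption (not from the source): 1 GB is counted as 1,024 MB.
MB_PER_GB = 1024


def mailbox_server_ram_mb(mailboxes: int,
                          base_gb: int = 2,
                          per_mailbox_mb: int = 5) -> int:
    """Minimum RAM in MB for a Mailbox server: base amount plus a per-mailbox share."""
    if mailboxes < 0:
        raise ValueError("mailboxes must be non-negative")
    return base_gb * MB_PER_GB + per_mailbox_mb * mailboxes


def min_paging_file_mb(ram_mb: int, extra_mb: int = 10) -> int:
    """Minimum paging file size in MB: installed RAM plus a small margin."""
    return ram_mb + extra_mb


if __name__ == "__main__":
    for count in (100, 500, 2000):
        ram = mailbox_server_ram_mb(count)
        print(f"{count} mailboxes: RAM >= {ram} MB, "
              f"paging file >= {min_paging_file_mb(ram)} MB")
```

In practice you would round the RAM figure up to the next memory configuration you can actually install and then size the paging file from the installed amount, not from the calculated minimum.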

If you follow these hardware guidelines and modify them for specific messaging roles, as discussed in the next section, you’ll be well on your way to success with Exchange Server 2010.
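
To compare the RAID levels mentioned in the data protection guideline, it can help to see how much usable space each one leaves. The following Python sketch uses the standard capacity rules for RAID 0, 1, 5, and 10 with identical drives; these rules are general RAID arithmetic and are not taken from the Exchange guidance. The function name is hypothetical, RAID 1 is modeled as a simple two-drive mirror, and the second value returned is the number of drive failures the array is guaranteed to survive.

```python
from typing import Tuple


def raid_summary(level: str, drives: int, drive_gb: float) -> Tuple[float, int]:
    """Return (usable capacity in GB, guaranteed drive failures survived).

    Assumes all drives are the same size. RAID 1 is modeled as a
    two-drive mirror; RAID 10 as striped two-drive mirrors.
    """
    if level == "0":
        if drives < 2:
            raise ValueError("RAID 0 needs at least 2 drives")
        return drives * drive_gb, 0          # striping only, no redundancy
    if level == "1":
        if drives != 2:
            raise ValueError("this sketch models RAID 1 as exactly 2 drives")
        return drive_gb, 1                   # one full copy on each drive
    if level == "5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_gb, 1    # one drive's worth of parity
    if level == "10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, 4 or more")
        return drives / 2 * drive_gb, 1      # half the drives hold mirror copies
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    for level, count in (("0", 4), ("1", 2), ("5", 4), ("10", 4)):
        usable, failures = raid_summary(level, count, 500)
        print(f"RAID {level} with {count} x 500 GB: "
              f"{usable:.0f} GB usable, survives {failures} drive failure(s)")
```

Capacity is only one side of the tradeoff; the read/write performance differences described above still apply when you choose a level for data or logs.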

Source of Information : Microsoft Press - Exchange Server 2010 Administrators Pocket Consultant
