Windows Internals

Mark E. Russinovich, David A. Solomon, Alex Ionescu

Mentioned 3

A guide to the architecture and internal structure of Microsoft Windows 7 and Microsoft Windows Server 2008 R2.

Mentioned in questions and answers.

How can I find details of the Windows C++ memory allocator that I am using?

Debugging my C++ application is showing the following in the call stack:

ntdll.dll!RtlEnterCriticalSection()  - 0x4b75 bytes 
ntdll.dll!RtlpAllocateHeap()  - 0x2f860 bytes   
ntdll.dll!RtlAllocateHeap()  + 0x178 bytes  
ntdll.dll!RtlpAllocateUserBlock()  + 0x56c2 bytes   
ntdll.dll!RtlpLowFragHeapAllocFromContext()  - 0x2ec64 bytes    
ntdll.dll!RtlAllocateHeap()  + 0xe8 bytes   
msvcr100.dll!malloc()  + 0x5b bytes 
msvcr100.dll!operator new()  + 0x1f bytes   

My multithreaded code is scaling very poorly, and profiling through random sampling indicates that malloc is currently a bottleneck in my multithreading code. The stack seems to indicate some locking going on during memory allocation. How can I find details of this particular malloc implementation?

I've read that Windows 7 system allocator performance is now competitive with allocators like tcmalloc and jemalloc. I am running on Windows 7 and I'm building with Visual Studio 2010. Is msvcr100.dll the fast/scalable "Windows 7 system allocator" often referenced as "State of the Art"?

On Linux, I've seen dramatic performance gains in multithreaded code by changing the allocator, but I've never experimented with this on Windows -- thanks.

I am simply asking which malloc implementation I am using, ideally with a link to some details about my particular version of it.

The call stack you are seeing indicates that the MSVCRT (more exactly, its default operator new => malloc) is calling into the Win32 Heap functions. (I do not know whether malloc routes all requests directly to the CRT's Win32 heap or whether it does some additional caching, but if you have VS, you should have the CRT source code too, so you should be able to check that.) (The Windows Internals book also talks about the heap.)

General advice I can give is that in my experience (VS 2005, but judging from Hans' answer on the other question VS2010 may be similar) the multithreaded performance of the CRT heap can cause noticeable problems, even if you're not doing insane amounts of allocations.

That RtlEnterCriticalSection is just that, a Win32 Critical Section: cheap to lock under low contention, but under higher contention you will see suboptimal runtime behaviour. (Bah! Ever tried to profile / optimize code that coughs on synchronization performance? It's a mess.)

One solution is to split the heaps: Using different Heaps has given us significant improvements, even though each heap still is MT enabled (no HEAP_NO_SERIALIZE).

Since you're "coming in" via operator new, you might be able to use different allocators for some of the different classes that are allocated often. Or maybe some of your containers could benefit from custom allocators (that then use a separate heap).
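As a rough illustration of that idea, here is a minimal sketch of a frequently allocated class that gets its own Win32 heap through class-specific operator new/delete. The names PrivateHeap, node_heap and Node are hypothetical, and the code assumes a C++11 compiler (newer than the VS2010 mentioned in the question). On non-Windows builds it falls back to malloc/free purely so that the sketch still compiles; the point of the technique is the HeapCreate/HeapAlloc/HeapFree path.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper around one private Win32 heap.
class PrivateHeap {
public:
    PrivateHeap() {
#ifdef _WIN32
        // Options = 0 keeps the heap serialized (no HEAP_NO_SERIALIZE);
        // initial size 0 and maximum size 0 give a growable heap.
        heap_ = HeapCreate(0, 0, 0);
        if (heap_ == nullptr) throw std::bad_alloc();
#endif
    }
    ~PrivateHeap() {
#ifdef _WIN32
        HeapDestroy(heap_);  // releases everything still allocated from it
#endif
    }
    PrivateHeap(const PrivateHeap&) = delete;
    PrivateHeap& operator=(const PrivateHeap&) = delete;

    void* allocate(std::size_t bytes) {
#ifdef _WIN32
        void* p = HeapAlloc(heap_, 0, bytes);
#else
        void* p = std::malloc(bytes);  // fallback so the sketch builds elsewhere
#endif
        if (p == nullptr) throw std::bad_alloc();
        ++live_;
        return p;
    }

    void deallocate(void* p) noexcept {
        if (p == nullptr) return;
#ifdef _WIN32
        HeapFree(heap_, 0, p);
#else
        std::free(p);
#endif
        --live_;
    }

    // Number of blocks currently allocated from this heap.
    std::size_t live() const { return live_.load(); }

private:
#ifdef _WIN32
    HANDLE heap_ = nullptr;
#endif
    std::atomic<std::size_t> live_{0};
};

// One heap dedicated to Node objects, created on first use.
inline PrivateHeap& node_heap() {
    static PrivateHeap heap;
    return heap;
}

// Hypothetical hot class: its allocations no longer contend with
// the lock of the default CRT heap.
struct Node {
    int value;
    explicit Node(int v) : value(v) {}

    static void* operator new(std::size_t bytes) {
        return node_heap().allocate(bytes);
    }
    static void operator delete(void* p) noexcept {
        node_heap().deallocate(p);
    }
};
```

With this in place, `new Node(42)` and `delete` go through the private heap while every other allocation keeps using the CRT heap. Whether this helps in a given program depends on where the contention actually is, so it is worth re-profiling after the change.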

In one case, we were using libxml2 for XML parsing, and while building up the DOM tree it simply swamps the system with malloc calls. Luckily, it uses its own set of memory allocation routines, which can easily be replaced by a thin wrapper over the Win32 Heap functions. This gave us huge improvements, as XML parsing no longer interfered with the rest of the system's allocations.

Let us say there are two processes, A and B. B needs to insert a new frame in its page table. As there are no free frames, we have to swap out one frame and bring in B's frame from disk. Suppose the operating system follows a global page replacement scheme and picks a frame that holds A's data. To swap this frame out, we need to mark the corresponding entry in A's page table as invalid. To do that in general, we need to know which process's data is in a particular frame in memory, so that we can go to its page table and set the bit to invalid. How is this achieved? Does each frame in memory also store the process id of the process whose data it holds?

The page table is just the facility required by the processor hardware. On top of that, the OS maintains its own databases in memory to keep track of each physical page frame. For example, in Windows there is a Page Frame Number (PFN) database that records the state of each physical page, such as Valid, Standby, Modified, Free, etc. And to describe the subset of virtual pages residing in physical memory, there is a Working Set List.
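To tie this back to the question: as far as I understand it, the per-frame record does not need to store a process id, because it can store a back-pointer to the page table entry that currently maps the frame. The following toy model sketches that reverse mapping. All names (Pte, PfnEntry, FrameDatabase) are hypothetical, the round-robin victim choice is an arbitrary stand-in for a real replacement policy, and none of this mirrors the actual kernel structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy page table entry: a valid bit plus the frame it maps.
struct Pte {
    bool valid = false;
    std::size_t frame = 0;
};

enum class FrameState { Free, Valid, Standby, Modified };

// Toy per-frame record: the state of the frame and a back-pointer
// to whichever PTE currently maps it (in any process).
struct PfnEntry {
    FrameState state = FrameState::Free;
    Pte* owner_pte = nullptr;
};

class FrameDatabase {
public:
    explicit FrameDatabase(std::size_t frame_count) : frames_(frame_count) {
        assert(frame_count > 0);
    }

    // Back the page described by `pte` with a physical frame,
    // evicting another page (global replacement) if none is free.
    std::size_t map(Pte& pte) {
        std::size_t f = find_free();
        if (f == frames_.size()) {
            f = next_victim_;  // hypothetical policy: round-robin
            next_victim_ = (next_victim_ + 1) % frames_.size();
            evict(f);
        }
        frames_[f].state = FrameState::Valid;
        frames_[f].owner_pte = &pte;
        pte.valid = true;
        pte.frame = f;
        return f;
    }

    // Invalidate the owner's PTE through the back-pointer.
    // No process id is needed to find the right page table entry.
    void evict(std::size_t f) {
        PfnEntry& entry = frames_[f];
        if (entry.owner_pte != nullptr) entry.owner_pte->valid = false;
        entry.owner_pte = nullptr;
        entry.state = FrameState::Free;
    }

    FrameState state(std::size_t f) const { return frames_[f].state; }

private:
    // Returns frames_.size() when no frame is free.
    std::size_t find_free() const {
        for (std::size_t i = 0; i < frames_.size(); ++i)
            if (frames_[i].state == FrameState::Free) return i;
        return frames_.size();
    }

    std::vector<PfnEntry> frames_;
    std::size_t next_victim_ = 0;
};
```

In this model, if process A maps a page into the only frame and process B then faults, `map` on B's entry evicts the frame and clears the valid bit in A's entry via `owner_pte`. The page tables must outlive the database entries that point into them, for example fixed-size vectors of Pte that are never resized.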

For Windows, if you need to know more about the details of memory management, I suggest this book

Please advise on the below:

1) What is the lightest way to attach to a running native Windows application process, get a list of threads, and see which DLLs are used?

2) What is the lightest way to attach to a running .NET application process, get a list of threads, and see which DLLs are used?

Regards, Ron

Do you use Visual Studio? If so, you can attach VS to any running process using the Debug | Attach To Process menu items. You can then break into the process and start examining stacks, threads, modules, etc.

If you want to delve deeper, you could download the Windows SDK and install the Debugging Tools. This will give you KD and WinDBG: a console debugger and a slightly more friendly multi-pane MDI-style debugging app, respectively. Using these tools you can access most of the core debugging infrastructure built into Windows.

However, note that this is not for the faint of heart and will require considerable time and effort to master. To really become a debugging guru, you'll also need to deeply understand the architecture of the kernel & OS and many core OS data structures.

Thus you might find the following books useful:

For .NET:

For Windows and/or .NET:

For Advanced Windows internals debugging

Enjoy! :)