Windows Internals

Mark E. Russinovich, David A. Solomon, Alex Ionescu


See how the core components of the Windows operating system work behind the scenes--guided by a team of internationally renowned internals experts. Fully updated for Windows Server(R) 2008 and Windows Vista(R), this classic guide delivers key architectural insights on system design, debugging, performance, and support--along with hands-on experiments to experience Windows internal behavior firsthand. Delve inside Windows architecture and internals:

  • Understand how the core system and management mechanisms work--from the object manager to services to the registry
  • Explore internal system data structures using tools like the kernel debugger
  • Grasp the scheduler's priority and CPU placement algorithms
  • Go inside the Windows security model to see how it authorizes access to data
  • Understand how Windows manages physical and virtual memory
  • Tour the Windows networking stack from top to bottom--including APIs, protocol drivers, and network adapter drivers
  • Troubleshoot file-system access problems and system boot problems
  • Learn how to analyze crashes


Mentioned in questions and answers.

So a .exe file is a file that can be executed by Windows, but what exactly does it contain? Assembly language that's processor specific? Or some sort of intermediate representation that Windows recognizes and turns into assembly for a specific processor? What exactly does Windows do with the file when it "executes" it?

MSDN has an article "An In-Depth Look into the Win32 Portable Executable File Format" that describes the structure of an executable file.

Basically, a .exe contains several blobs of data and instructions on how they should be loaded into memory. Some of these sections happen to contain machine code that can be executed (other sections contain program data, resources, relocation information, import information, etc.)

I suggest you get a copy of Windows Internals for a full description of what happens when you run an exe.

For a native executable, the machine code is platform specific. The .exe's header indicates what platform the .exe is for.
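To make that header part concrete, here is a small C++ sketch (mine, not from the book; most error handling omitted, file name passed on the command line) that reads the IMAGE_FILE_HEADER of an .exe and reports which machine type it targets:

// A minimal sketch that reads a PE file's headers to see which machine type it
// targets; most error handling is omitted for brevity.
#include <windows.h>
#include <cstdio>

int main(int argc, char** argv)
{
    if (argc < 2) { printf("usage: pemachine <file.exe>\n"); return 1; }

    HANDLE hFile = CreateFileA(argv[1], GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, 0, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    IMAGE_DOS_HEADER dos = {};
    DWORD read = 0;
    ReadFile(hFile, &dos, sizeof(dos), &read, NULL);            // the old "MZ" header
    SetFilePointer(hFile, dos.e_lfanew, NULL, FILE_BEGIN);      // jump to the "PE\0\0" signature

    DWORD signature = 0;
    IMAGE_FILE_HEADER fileHeader = {};
    ReadFile(hFile, &signature, sizeof(signature), &read, NULL);
    ReadFile(hFile, &fileHeader, sizeof(fileHeader), &read, NULL);

    if (signature == IMAGE_NT_SIGNATURE)
    {
        switch (fileHeader.Machine)
        {
        case IMAGE_FILE_MACHINE_I386:  printf("32-bit x86 executable\n"); break;
        case IMAGE_FILE_MACHINE_AMD64: printf("x64 executable\n");        break;
        default:                       printf("machine type: 0x%x\n", fileHeader.Machine);
        }
    }
    CloseHandle(hFile);
    return 0;
}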

When running a native .exe the following happens (grossly simplified):

  • A process object is created.
  • The exe file is read into that process's memory. Different sections of the .exe (code, data, etc.) are mapped in separately and given different permissions (code is execute, data is read/write, constants are read-only).
  • Relocations occur in the .exe (addresses get patched if the .exe was not loaded at its preferred address.)
  • The import table is walked and dependent DLLs are loaded.
  • DLLs are mapped in a similar way to .exes, with relocations occurring and their dependent DLLs being loaded. Imported functions from DLLs are resolved.
  • The process starts execution at an initial stub in NTDLL.
  • The initial loader stub runs the entry points for each DLL, and then jumps to the entry point of the .exe.

Managed executables contain MSIL (Microsoft Intermediate Language) and may be compiled so they can target any CPU that the CLR supports. I am not that familiar with the inner workings of the CLR loader (what native code initially runs to bootstrap the CLR and start executing the MSIL) - perhaps someone else can elaborate on that.

I need a way to insert some file clusters into the middle of a file to insert some data.

Normally, I would just read the entire file and write it back out again with the changes, but the files are multiple gigabytes in size, and it takes 30 minutes just to read the file and write it back out again.

The cluster size doesn't bother me; I can essentially write out zeroes to the end of my inserted clusters, and it will still work in this file format.

How would I use the Windows File API (or some other mechanism) to modify the File Allocation Table of a file, inserting one or more unused clusters at a specified point in the middle of the file?

Robert, I don't think that what you want to achieve is really possible without actively manipulating the data structures of a file system which, from the sounds of it, is mounted. I don't think I have to tell you how dangerous and unwise this sort of exercise is.

But if you need to do it, I guess I can give you a "sketch on the back of a napkin" to get you started:

You could leverage the "sparse file" support of NTFS to simply add "gaps" by tweaking the LCN/VCN mappings. Once you do, just open the file, seek to the new location and write your data. NTFS will transparently allocate the space and write the data in the middle of the file, where you created a hole.

For more, look at this page about defragmentation support in NTFS for hints on how you can manipulate things a bit to insert clusters in the middle of the file. At least by using the sanctioned API for this sort of thing, you are unlikely to corrupt the filesystem beyond repair, although you can still horribly hose your file, I guess.

Get the retrieval pointers for the file that you want, split them where you need, to add as much extra space as you need, and move the file. There's an interesting chapter on this sort of thing in the Russinovich/Ionescu "Windows Internals" book (http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-Developer/dp/0735625301)
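To make the "retrieval pointers" step a bit more concrete, here is a hedged C++ sketch (the file path is hypothetical, and a real tool would loop on ERROR_MORE_DATA to fetch all extents of a heavily fragmented file) that dumps a file's VCN-to-LCN extents via FSCTL_GET_RETRIEVAL_POINTERS. Splitting and moving extents would build on this output with FSCTL_MOVE_FILE:

// A rough sketch of reading a file's retrieval pointers (VCN -> LCN extents).
#include <windows.h>
#include <winioctl.h>
#include <cstdio>

int main()
{
    HANDLE hFile = CreateFileW(L"C:\\data\\huge.bin", GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING, 0, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    STARTING_VCN_INPUT_BUFFER in = {};          // start at the first VCN of the file
    LONGLONG rawOut[512] = {};                  // 4 KB, 8-byte-aligned output buffer
    DWORD bytes = 0;

    if (DeviceIoControl(hFile, FSCTL_GET_RETRIEVAL_POINTERS,
                        &in, sizeof(in), rawOut, sizeof(rawOut), &bytes, NULL))
    {
        RETRIEVAL_POINTERS_BUFFER* rp = (RETRIEVAL_POINTERS_BUFFER*)rawOut;
        LONGLONG vcn = rp->StartingVcn.QuadPart;
        for (DWORD i = 0; i < rp->ExtentCount; ++i)
        {
            printf("VCN %lld maps to LCN %lld (extent ends at VCN %lld)\n",
                   vcn, rp->Extents[i].Lcn.QuadPart, rp->Extents[i].NextVcn.QuadPart);
            vcn = rp->Extents[i].NextVcn.QuadPart;
        }
    }
    CloseHandle(hFile);
    return 0;
}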

Using C#, I am finding the total size of a directory. The logic is this: get the files inside the folder, sum up their sizes, check whether there are subdirectories, and then do a recursive search.

I tried another way to do this too: using FSO (obj.GetFolder(path).Size). There's not much difference in time between these two approaches.

Now the problem is, I have tens of thousands of files in a particular folder and it takes at least 2 minutes to find the folder size. Also, if I run the program again, it happens very quickly (5 secs). I think Windows is caching the file sizes.

Is there any way I can bring down the time taken when I run the program the first time?

Hard disks are an interesting beast - sequential access (reading a big contiguous file, for example) is super zippy, figure 80 megabytes/sec. However, random access is very slow. This is what you're bumping into - recursing into the folders won't read much data (in terms of quantity), but will require many random reads. The reason you're seeing zippy perf the second time around is because the MFT is still in RAM (you're correct on the caching thought).

The best mechanism I've seen to achieve this is to scan the MFT yourself. The idea is you read and parse the MFT in one linear pass building the information you need as you go. The end result will be something much closer to 15 seconds on a HD that is very full.
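One way to do that single linear pass is the MFT/USN enumeration FSCTL. A rough sketch follows (requires administrator rights; the volume letter is just an example, and older SDKs call the input structure MFT_ENUM_DATA rather than MFT_ENUM_DATA_V0). The records give you names and parent references, so you can rebuild the directory tree in one sequential sweep; file sizes still have to be gathered separately or read from the MFT records directly, as described above:

// A hedged sketch of the single-pass MFT/USN enumeration over an NTFS volume.
#include <windows.h>
#include <winioctl.h>
#include <cstdio>

int main()
{
    HANDLE hVol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              NULL, OPEN_EXISTING, 0, NULL);
    if (hVol == INVALID_HANDLE_VALUE) return 1;

    MFT_ENUM_DATA_V0 med = {};                  // start at file reference 0
    med.HighUsn = MAXLONGLONG;                  // accept every USN
    DWORDLONG rawBuffer[8 * 1024];              // 64 KB, 8-byte aligned
    DWORD bytes = 0;

    while (DeviceIoControl(hVol, FSCTL_ENUM_USN_DATA, &med, sizeof(med),
                           rawBuffer, sizeof(rawBuffer), &bytes, NULL))
    {
        // The first 8 bytes of the output are the reference number to continue from.
        med.StartFileReferenceNumber = rawBuffer[0];

        BYTE* p = (BYTE*)rawBuffer + sizeof(USN);
        while (p < (BYTE*)rawBuffer + bytes)
        {
            USN_RECORD* rec = (USN_RECORD*)p;
            wprintf(L"%.*s (parent %llu)\n",
                    (int)(rec->FileNameLength / sizeof(WCHAR)),
                    (WCHAR*)((BYTE*)rec + rec->FileNameOffset),
                    rec->ParentFileReferenceNumber);
            p += rec->RecordLength;
        }
    }
    CloseHandle(hVol);
    return 0;
}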

some good reading: NTFSInfo.exe - http://technet.microsoft.com/en-us/sysinternals/bb897424.aspx Windows Internals - http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-PRO-Developer/dp/0735625301/ref=sr_1_1?ie=UTF8&s=books&qid=1277085832&sr=8-1

FWIW: this method is very complicated as there really isn't a great way to do this in Windows (or any OS I'm aware of) - the problem is that the act of figuring out which folders/files are needed requires much head movement on the disk. It'd be very tough for Microsoft to build a general solution to the problem you describe.

When say 3 programs (executables) are loaded into memory the layout might look something like this:

alt text

I've following questions:

  1. Is the concept of Virtual Memory limited to user processes? Because I am wondering where the Operating System kernel and drivers live. How is their memory laid out? I want to know more about kernel-side memory. I know it's operating-system specific, so make your choice (Windows/Linux).

  2. Is the concept of Virtual Memory on a per-process basis? I mean, is it correct for me to say 4GB of process1 + 4GB of process2 + 4GB of process3 = 12GB of virtual memory (for all processes)? That doesn't sound right. Or, from a total 4GB space, is 1GB taken by the kernel and the rest (3GB) shared between all processes?

  3. They say that on a 32-bit machine, in a 4GB address space, half of it (or, more recently, 1GB) is occupied by the kernel. I can see in this diagram that "Kernel Virtual memory" occupies 0xc0000000 - 0xffffffff (= 1 GB). Are they talking about this, or is it something else? Just want to confirm.

  4. What exactly does the Kernel Virtual Memory of each of these processes contain? What is its layout?

  5. When we do IPC we talk about shared memory. I don't see any memory shared between these processes. Where does it live?

  6. Resources (files, registries in windows) are global to all processes. So, the resource/file handle table must be in some global space. Which area would that be in?

  7. Where can I learn more about this kernel-side stuff?

  1. When a system uses virtual memory, the kernel uses virtual memory as well. Windows will use the upper 2GB (or 1GB if you've specified the /3GB switch in the Windows bootloader) for its own use. This includes kernel code, data (or at least the data that is paged in -- that's right, Windows can page out portions of the kernel address space to the hard disk), and page tables.

  2. Each process has its own VM address space. When a process switch occurs, the page tables are typically swapped out with another process's page table. This is simple to do on an x86 processor - changing the page table base address in the CR3 control register will suffice. The entire 4GB address space is replaced by tables replacing a completely different 4GB address space. Having said that, typically there will be regions of address space that are shared between processes. Those regions are marked in the page tables with special flags that indicate to the processor that those areas do not need to be invalidated in the processor's translation lookaside buffer.

  3. As I mentioned earlier, the kernel's code, data, and the page tables themselves need to be located somewhere. This information is located in the kernel address space. It is possible that certain parts of the kernel's code, data, and page tables can themselves be swapped out to disk as needed. Some portions are deemed more critical than others and are never swapped out at all.

  4. See (3)

  5. It depends. User-mode shared memory is located in the user-mode address space. Parts of the kernel-mode address space might very well be shared between processes as well. For example, it would not be uncommon for the kernel's code to be shared between all processes in the system. Where that memory is located is not fixed. I'm using arbitrary addresses here, but shared memory located at 0x100000 in one process might be located at 0x101000 inside another process. Two pages in different address spaces, at completely different addresses, can point to the same physical memory (see the sketch after this list).

  6. I'm not sure what you mean here. Open file handles are not global to all processes. The file system stored on the hard disk is global to all processes. Under Windows, file handles are managed by the kernel, and the objects are stored in the kernel address space and managed by the kernel object manager.

  7. For Windows NT based systems, I'd recommend Windows Internals, 5ed by Mark Russinovich and David Solomon
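To illustrate point 5, here is a small C++ sketch (the section name is made up for the example) in which two cooperating processes map the same named section; both see the same physical pages, usually at different virtual addresses:

// Two instances of this program share the same 4 KB of physical memory.
#include <windows.h>
#include <cstdio>
#include <cstring>

int main()
{
    HANDLE hMap = CreateFileMappingW(INVALID_HANDLE_VALUE,      // backed by the page file
                                     NULL, PAGE_READWRITE,
                                     0, 4096, L"Local\\DemoSharedSection");
    if (!hMap) return 1;

    void* view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 4096);
    printf("shared section mapped at %p in this process\n", view);

    // Run the program twice: both instances see the same physical pages, but
    // the printed virtual addresses will normally differ.
    strcpy_s((char*)view, 4096, "hello from one process");

    UnmapViewOfFile(view);
    CloseHandle(hMap);
    return 0;
}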

Response to comment:

And now is this 3GB shared between all processes? Or does each process have its own 4GB space?

It depends on the OS. Some kernels (such as the L4 microkernel) use the same page table for multiple processes and separate the address spaces using segmentation. On Windows each process gets its own page tables. Remember that even though each process might get its own virtual address space, that doesn't mean that the physical memory is always different. For example, the image for kernel32.dll loaded in process A is shared with kernel32.dll in process B. Much of the kernel address space is also shared between processes.

Why does each process have kernel virtual memory?

The best way to think of this is to ask yourself, "How would a kernel work if it didn't execute using virtual memory?" In this hypothetical situation, every time your program caused a context switch into the kernel (let's say you made a system call), virtual memory would have to be disabled while the CPU was executing in kernel space. There's a cost to doing that and there's a cost to turning it back on when you switch back to user space.

Furthermore, let's suppose that the user program passed in a pointer to some data for its system call. This pointer is a virtual address. You've got virtual memory turned off, so that pointer needs to be translated to a physical address before the kernel can do anything with it. If you had virtual memory turned on, you'd get that for free thanks to the memory-management unit on the CPU. Instead you'd have to manually translate the addresses in software. There are all kinds of examples and scenarios that I could describe (some involving hardware, some involving page table maintenance, and so on), but the gist of it is that it's much easier to have a homogeneous memory management scheme. If user space is using virtual memory, it's going to be easier to write a kernel if you maintain that scheme in kernel space. At least that has been my experience.

There will be only one instance of the OS kernel, right? Then why does each process have a separate kernel virtual address space?

As I mentioned above, quite a bit of that address space will be shared across processes. There is per-process data that is in the kernel space that gets swapped out during a context switch between processes, but lots of it is shared because there is only one kernel.

I've been programming for about 11 years by now, and used a lot of different programming languages ranging from Python to C.

However, what I'm ashamed of is that I'm still missing a lot of the lower-level basic knowledge on which all of this is built:

  • How exactly are stack and heap of executables built up and how do they work

  • How does a CPU work

  • What is a clock cycle

  • What is a data bus

  • How do north and southbridge on my motherboard work

  • Low level binary logic / calculations

Those are just examples, what I'm searching for is some good introduction on this, as I feel that this is simply required knowledge to become a good programmer.

I'm sure there are online resources for this type of thing, but this is also pretty nicely covered in a Computer Architecture course like this one. I also rather liked the book for that course.

However, it didn't really cover enough of the practical x86 side of things for my liking (we designed a MIPS processor and wrote assembly code for it and eventually a C compiler for it).

To fill in the gaps for what was different between our contrived example and my actual machine, I suggest the Windows Internals book. And possibly taking an OSR course.

If you're more on the Linux side, there are similar courses and books.

Two suggestions.

Some books:

Windows Internals (though not all info applies to other OS'es, obviously)

Write Great Code: Volume 1 (and perhaps subsequent volumes)

The Art of Assembly Language (ties in with 2nd suggestion)

Learn assembly language:

Assembly language is very low-level. In fact, it's just a human-readable form of machine code (the ones and zeros that CPUs understand). To understand assembly language, you must understand the low-level workings. This is because very little (if anything) is automatically managed for you, unlike in higher-level languages like C# and Java.

In my opinion the best way to learn is by having fun. Learning about compilers, system design, and architecture is a lot of fun when you work with microprocessor interfacing. So my suggestion is to get hands-on with an Atmel AVR kit or a Motorola MSP kit. Another starting point is to write a micro-simulator in any language of your preference and simulate the SRC (Simple RISC Computer) from this material, which is from this book.

This is the project I made in class using an MSP430, again it was a lot of fun.

I wondered if any of you have knowledge of the internal workings of windows (kernel, interrupts, etc) and if you've found that you've become a better developer as a result?

Do you find that the more knowledge the better is a good motto to have as a developer?

I find myself studying a lot of things, thinking with more understanding, I'll be a better developer. Of course practice and experience also comes into play.

This is a no-brainer - absolutely (assuming you're a developer primarily on the Windows platform, of course). A working knowledge of how the car engine works will make a lot of common programming tasks (debugging, performance work, etc.) a lot easier.

Windows Internals is the standard reference.

I have been a .NET developer since I started coding. I would like to learn Win32 programming and need advice on where to start. What are the best resources/books for learning Win32 programming? I know a bit of 'college C++'.

If you are interested in UI development, the best book for direct Win32 development in C or C++ (no MFC) is Programming Windows by Charles Petzold

For other sorts of Win32 development, such as threading, memory, DLL's, etc., Windows via C/C++ by Jeffrey Richter is a great book.

For general Windows architecture, Windows Internals by David Solomon and Mark Russinovich is a great resource.

While you're at it, pick up this book:

C++ Pointers and Dynamic Memory Management

It's old (circa 1995) but it's one of the best books to demystify pointers. If you ever found yourself blindly adding *'s or &'s to get your code to compile you probably need to read this.

What is the purpose of csrss.exe (the Client/Server Runtime Subsystem) on Windows?

Maybe someone could give a good explanation or pointers to documentation? Unfortunately Google results are pretty noisy when searching a core process of Windows.

The reason I'm asking is that I got a BSOD from my service application which seems to be related to the csrss.exe process, at least this is what the analysis of the memory dump shows:

PROCESS_OBJECT: 85eeeb70

IMAGE_NAME:  csrss.exe

DEBUG_FLR_IMAGE_TIMESTAMP:  0
MODULE_NAME: csrss
FAULTING_MODULE: 00000000 
PROCESS_NAME:  PreviewService.
BUGCHECK_STR:  0xF4_PreviewService.
DEFAULT_BUCKET_ID:  DRIVER_FAULT
CURRENT_IRQL:  0
LAST_CONTROL_TRANSFER:  from 80998221 to 80876b40

STACK_TEXT:  
f5175d00 80998221 000000f4 00000003 85eeeb70 nt!KeBugCheckEx+0x1b
f5175d24 8095b1be 8095b1fa 85eeeb70 85eeecd4 nt!PspCatchCriticalBreak+0x75
f5175d54 8082350b 00000494 ffffffff 051bf114 nt!NtTerminateProcess+0x7a
f5175d54 7c8285ec 00000494 ffffffff 051bf114 nt!KiFastCallEntry+0xf8
051bf114 00000000 00000000 00000000 00000000 ntdll!KiFastSystemCallRet

STACK_COMMAND:  kb
FOLLOWUP_NAME:  MachineOwner
FAILURE_BUCKET_ID:  0xF4_PreviewService._IMAGE_csrss.exe
BUCKET_ID:  0xF4_PreviewService._IMAGE_csrss.exe

Followup: MachineOwner

EDIT: Thanks already for the good answers, but I actually don't need help concerning my service, I just would like to get some basic understanding of what the purpose of this service is.

CSRSS hosts the server side of the Win32 subsystem. It is considered a system critical process, and if it is ever terminated you'll get a blue screen. More data is necessary, but you need to find out if some process is terminating csrss, or if it is crashing due to a bug.

Windows Internals is a great book for stuff like this. Wikipedia also has an article on CSRSS.

When my task manager (top, ps, taskmgr.exe, or Finder) says that a process is using XXX KB of memory, what exactly is it counting, and how does it get updated?

In terms of memory allocation, does an application written in C++ "appear" different to an operating system from an application that runs as a virtual machine (managed code like .NET or Java)?

And finally, if memory is so transparent - why is garbage collection not a function-of or service-provided-by the operating system?


As it turns out, what I was really interested in asking is WHY the operating system could not do garbage collection and defrag memory space - which I see as a step above "simply" allocating address space to processes.

These answers help a lot! Thanks!

This is a big topic that I can't hope to adequately answer in a single answer here. I recommend picking up a copy of Windows Internals, it's an invaluable resource. Eric Lippert had a recent blog post that is a good description of how you can view memory allocated by the OS.

Memory that a process is using is basically just address space that is reserved by the operating system that may be backed by physical memory, the page file, or a file. This is the same whether it is a managed application or a native application. When the process exits, the operating system deletes the memory that it had allocated for it - the virtual address space is simply deleted and the page file or physical memory backings are free for other processes to use. This is all the OS really maintains - mappings of address space to some physical resource. The mappings can shift as processes demand more memory or are idle - physical memory contents can be shifted to disk and vice versa by the OS to meet demand.

What a process is using according to those tools can actually mean one of several things - it can be total address space allocated, total memory allocated (page file + physical memory) or memory a process is actually using that is resident in memory. Task Manager has a separate column for each of these possibilities.

The OS can't do garbage collection since it has no insight into what that memory actually contains - it just sees allocated pages of memory, it doesn't see objects which may or may not be referenced.

Whereas the OS handles allocations at the virtual address level, within the process itself there are other memory managers which take these large, page-sized chunks and break them up into something useful for the application. Windows returns memory allocated on 64k boundaries, but then the heap manager breaks it up into smaller chunks for use by each individual allocation done by the program via new. In .NET applications, the CLR hands out new objects from the garbage-collected heap, and when that heap reaches its limits, it performs a garbage collection.
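A small illustration of that split between the OS-level allocator and the in-process heap manager (the sizes are arbitrary examples):

// 64 KB allocation granularity versus the heap manager's finer-grained carving.
#include <windows.h>
#include <cstdio>

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    printf("page size: %lu, allocation granularity: %lu\n",
           si.dwPageSize, si.dwAllocationGranularity);   // typically 4096 and 65536

    // Reserving/committing directly from the OS is rounded to whole pages and
    // the base address is aligned to the 64 KB allocation granularity.
    void* region = VirtualAlloc(NULL, 100, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    printf("VirtualAlloc(100 bytes) -> %p\n", region);

    // The process heap hands out much smaller chunks carved from such regions.
    void* a = HeapAlloc(GetProcessHeap(), 0, 24);
    void* b = HeapAlloc(GetProcessHeap(), 0, 24);
    printf("HeapAlloc(24) -> %p and %p (close together)\n", a, b);

    HeapFree(GetProcessHeap(), 0, a);
    HeapFree(GetProcessHeap(), 0, b);
    VirtualFree(region, 0, MEM_RELEASE);
    return 0;
}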

I have a C++ program which reads files from the hard disk and does some processing on the data in the files. I am using standard Win32 APIs to read the files. My problem is that this program is blazingly fast some times and then suddenly slows down to 1/6th of the previous speed. If I read the same files again and again over multiple runs, then normally the first run will be the slowest one. Then it maintains the speed until I read some other set of files. So my obvious guess was to profile the disk access time. I used perfmon utility and measured the IO Read Bytes/sec for my program. And as expected there was a huge difference (~ 5 times) in the number of bytes read. My questions are:

(1). Does the OS (Windows in my case) cache recently read files somewhere so that subsequent loads are faster?

(2). If I can guarantee that all the files I read reside in the same directory, then is there any way I can place them on the hard disk so that my disk access time is faster?

Is there anything I can do for this?

1) Windows does cache recently read files in memory. The book Windows Internals includes an excellent description of how this works. Modern versions of Windows also use a technology called SuperFetch which will try to preemptively fetch disk contents into memory based on usage history and ReadyBoost which can cache to a flash drive, which allows faster random access. All of these will increase the speed with which data is accessed from disk after the initial run.

2) Directory really doesn't affect layout on disk. Defragmenting your drive will group file data together. Windows Vista on up will automatically defragment your disk. Ideally, you want to do large sequential reads and minimize your writes. Small random accesses and interleaving writes with reads significantly hurts performance. You can use the Windows Performance Toolkit to profile your disk access.
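For point 2, here is a hedged example of the kind of large sequential read loop that keeps the disk doing sequential work, using FILE_FLAG_SEQUENTIAL_SCAN as a read-ahead hint to the cache manager (the path and buffer size are placeholders):

// Large sequential reads with a cache-manager hint; error handling trimmed.
#include <windows.h>

int main()
{
    HANDLE hFile = CreateFileW(L"C:\\data\\big_input.dat", GENERIC_READ,
                               FILE_SHARE_READ, NULL, OPEN_EXISTING,
                               FILE_FLAG_SEQUENTIAL_SCAN,   // hint: read ahead aggressively
                               NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    // Read in large chunks (e.g. 1 MB at a time) to keep the access pattern sequential.
    static BYTE buffer[1024 * 1024];
    DWORD bytesRead = 0;
    while (ReadFile(hFile, buffer, sizeof(buffer), &bytesRead, NULL) && bytesRead > 0)
    {
        // ... process buffer ...
    }

    CloseHandle(hFile);
    return 0;
}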

In C#, when I create a new thread and execute a process on that thread, is there a way to assign it to a particular core? Or does the operating system handle all that automatically? I wrote a multithreaded application and I just want to be sure it's optimized for dual/quad core functionality.

Thanks

You can force your threads to run on specific cores, but in general you should let the OS take care of it. The operating system handles much of this automatically. If you have four threads running on a quad core system, the OS will schedule them on all four cores unless you take actions to prevent it from happening. The OS will also do things like try to keep an individual thread running on the same core rather than shifting it around (for better performance), avoid scheduling two running threads on the same hyperthreaded core if there are idle cores available, and so on.
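For completeness, this is roughly what pinning a thread looks like at the Win32 layer (a sketch only; as said above, letting the OS decide is usually the better choice, and in .NET you would reach for ProcessThread.ProcessorAffinity instead):

// Hard and soft affinity on a single worker thread.
#include <windows.h>

DWORD WINAPI Worker(LPVOID)
{
    // ... do work ...
    return 0;
}

int main()
{
    HANDLE hThread = CreateThread(NULL, 0, Worker, NULL, CREATE_SUSPENDED, NULL);

    SetThreadAffinityMask(hThread, 1 << 2);   // restrict the thread to core 2 (hard affinity)
    SetThreadIdealProcessor(hThread, 2);      // or merely express a preference (soft affinity)

    ResumeThread(hThread);
    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
    return 0;
}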

Also, rather than creating new threads for work you should use the thread pool. The system will scale this to the number of processors available on the system.

Windows Internals is a good book for covering how the Windows scheduler works.

I was recently made aware of this thing called IOCP on Windows, and I began searching for more information on it, but I couldn't find anything up to date (most of the examples on CodeProject are almost 5 years old) and not too many guides or tutorials. Can anyone recommend any up-to-date resources about it, in the form of online tutorials or example projects (that you wrote and can share, or other open-source projects), or even a book about it? Because if it's as good as it sounds, I plan to use it extensively, so I will invest in it.

Thank You.

IOCP is a feature that has been in Windows since the dark ages and has changed little in the years since. As such, any samples etc. from 5+ years ago should still work pretty well today.

MSDN has some documentation on IOCP: http://msdn.microsoft.com/en-us/library/aa365198%28v=VS.85%29.aspx

Mark Russinovich also wrote up a great intro into IOCP: http://sysinternals.d4rk4.ru/Information/IoCompletionPorts.html

Mark also wrote a more thorough description of Windows' IO infrastructure in "Windows Internals" which is essential reading.

I also strongly recommend Jeffrey Richter's "Windows via C/C++" which is also essential reading for anyone embarking on lower-level Windows programming.

HTH.
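To give a flavour of the API itself, here is a minimal, self-contained C++ sketch of the IOCP pattern - one port, a worker calling GetQueuedCompletionStatus, and PostQueuedCompletionStatus standing in for real overlapped I/O:

// Minimal IOCP worker loop; in real code you would associate file/socket
// handles with the port and issue overlapped reads/writes against them.
#include <windows.h>
#include <cstdio>

static DWORD WINAPI Worker(LPVOID param)
{
    HANDLE port = (HANDLE)param;
    DWORD bytes = 0;
    ULONG_PTR key = 0;
    OVERLAPPED* ov = NULL;

    // Each completion (from a device associated with the port, or posted by
    // another thread) wakes exactly one waiting worker.
    while (GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE))
    {
        if (key == 0) break;                    // our shutdown convention
        printf("completion: key=%lu bytes=%lu\n", (unsigned long)key, bytes);
    }
    return 0;
}

int main()
{
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    HANDLE worker = CreateThread(NULL, 0, Worker, port, 0, NULL);

    // Associate handles via CreateIoCompletionPort(handle, port, key, 0) in real code;
    // here we simply post a fake completion and then a shutdown signal.
    PostQueuedCompletionStatus(port, 42, 1, NULL);
    PostQueuedCompletionStatus(port, 0, 0, NULL);

    WaitForSingleObject(worker, INFINITE);
    CloseHandle(worker);
    CloseHandle(port);
    return 0;
}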

If you're looking at IOCP from a Network programming point of view then you probably also want to add Network Programming for Microsoft Windows to your list of resources.

There were lots of basic IOCP tutorials on CodeProject back in 2002 when I wrote my articles on IOCP there, so I took a slightly different approach and wrote some code that was, hopefully, reusable as a simple networking framework. This has since grown into a product that I sell. The latest version of the code that's associated with the original CodeProject articles can be found here: http://www.serverframework.com/products---the-free-framework.html I've changed it considerably over the years but the original code still works fine and provides good scalability and is, perhaps, useful as a working example to learn from.

How do programs made for Windows interact with, or issue commands to, the kernel of Windows NT?

And how does the kernel return any data?

Dude, that's a very broad question.

I recommend the book Windows Internals by Mark Russinovich et al. if you really, really want to understand this. Another good book is the classic Operating Systems by Deitel et al.

Start, however, with Inside Windows NT by Helen Custer (edition 1) - it's a very basic book (note that the last link has a picture of the cover of edition 2, which is way, way more detailed).

Ok in a nutshell.

There are a variety of protocols for communication between Windows components. Most of them will employ passing data via some shared memory (such as buffers, the stack, etc.) at the end of the day. But the protocols can be very involved and are different for different communications.

My suggestion to you is to have a look at the above books and determine how the architecture of the Windows operating system hangs together. From there you'll see how the various components communicate.

(applying nerd face) - Trust me those are great books for learning about Windows and operating systems in general if that's what floats your boat.

I had a leaking handle problem ("Not enough quota available to process this command.") in some inherited C# WinForms code, so I went and used Sysinternals' Handle tool to track it down. It turns out it was event handles that were leaking, so I tried googling it (it took a couple of tries to find a query that didn't return "Did you mean: event handler?"). According to Junfeng Zhang, event handles are generated by the use of Monitor, and there may be some weird rules as far as event handle disposal and the synchronization primitives go.

I'm not entirely sure that the source of my leaking handles is simply long-lived objects calling lots of synchronization stuff, as this code is also dealing with HID interfaces and lots of Win32 marshaling and interop, and was not doing any synchronization that I was aware of. Either way, I'm just going to run this in WinDbg and start tracing down where the handles are originating from, and also spend a lot of time learning this section of the code, but I had a very hard time finding information about what event handles are in the first place.

The msdn page for the event kernel object just links to the generic synchronization overview... so what are event handles, and how are they different from mutexes/semaphores/whatever?

The NT kernel uses event objects to allow signals to be transferred to entities that wait on them. A mutex and a semaphore are also waitable kernel objects (kernel dispatcher objects), but with different semantics. The only time I ever came across them was when waiting for IO to complete in drivers.

So my theory on your problem is possibly a faulty driver, or are you relying on specialised hardware?

Edit: More info (from Windows Internals 5th Edition - Chapter 3, System Mechanisms)

Some kernel dispatcher objects (e.g. mutex, semaphore) have the concept of ownership. When one of these is signalled, a single waiting thread is released and grabs the resource, while the others continue to wait. Events are not owned, and hence are available to be set or reset by any thread.

Also there are three types of events:

  • Notification: when signalled, all waiting threads are released
  • Synchronisation: when signalled, one waiting thread is released and the event is automatically reset
  • Keyed: when signalled, one waiting thread in the same process as the signaller is released

Another interesting thing that I've learned is that critical sections (the lock primitive in C#) are actually not kernel objects; rather, they are implemented out of a keyed event, or a mutex or semaphore as required.
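A quick sketch showing the notification (manual-reset) versus synchronisation (auto-reset) behaviour described above:

// Manual-reset events stay signalled for every waiter; auto-reset events
// release exactly one waiter per SetEvent and then reset themselves.
#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE notification    = CreateEventW(NULL, TRUE,  FALSE, NULL); // manual-reset
    HANDLE synchronization = CreateEventW(NULL, FALSE, FALSE, NULL); // auto-reset

    SetEvent(notification);
    printf("wait 1: %lu\n", WaitForSingleObject(notification, 0));    // WAIT_OBJECT_0
    printf("wait 2: %lu\n", WaitForSingleObject(notification, 0));    // still signalled

    SetEvent(synchronization);
    printf("wait 3: %lu\n", WaitForSingleObject(synchronization, 0)); // WAIT_OBJECT_0
    printf("wait 4: %lu\n", WaitForSingleObject(synchronization, 0)); // WAIT_TIMEOUT: auto-reset consumed it

    CloseHandle(notification);
    CloseHandle(synchronization);
    return 0;
}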

I'm currently learning about the different modes the Windows operating system runs in (kernel mode vs. user mode), device drivers, their respective advantages and disadvantages and computer security in general.

I would like to create a practical example of what a faulty device driver that runs in kernel mode can do to the system, by for example corrupting memory used for critical OS-processes.

  • How can I execute my code in kernel mode instead of user mode, directly?
  • Do I have to write a dummy device driver and install it to do this?

  • Where can I read more about kernel and user mode in Windows?

I know the dangers of this and will do all of the experiments on a virtual machine running Windows XP only

You will need a good understanding of Windows Internals:

http://technet.microsoft.com/en-us/sysinternals

and yes they have a book: Windows Internals

http://technet.microsoft.com/en-us/sysinternals/bb963901

http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-PRO-Developer/dp/0735625301

Basically your questions are all answered in this book (and it even comes with samples and hands-on labs).

In order to find buffer overflows more easily, I am changing our custom memory allocator so that it allocates a full 4KB page instead of only the wanted number of bytes. Then I change the page protection and size so that if the caller writes before or after its allocated piece of memory, the application immediately crashes.

Problem is that although I have enough memory, the application never starts up completely because it runs out of memory. This has two causes:

  • since every allocation needs 4 KB, we probably reach the 2 GB limit very soon. This problem could be solved if I would make a 64-bit executable (didn't try it yet).
  • even when I only need a few hundreds of megabytes, the allocations fail at a certain moment.

The second problem is the biggest one, and I think it's related to the maximum number of PTE's (page table entries, which store information on how Virtual Memory is mapped to physical memory, and whether pages should be read-only or not) you can have in a process.

My questions (or a cry-for-tips):

  • Where can I find information about the maximum number of PTE's in a process?
  • Is this different (higher) for 64-bit systems/applications or not?
  • Can the number of PTE's be configured in the application or in Windows?

Thanks,

Patrick

PS. A note for those who will try to argue that you shouldn't write your own memory manager:

  • My application is rather specific so I really want full control over memory management (can't give any more details)
  • Last week we had a memory overwrite which we couldn't find using the standard C++ allocator and the debugging functionality of the C/C++ run time (it only said "block corrupt" minutes after the actual corruption)
  • We also tried standard Windows utilities (like GFLAGS, ...) but they slowed down the application by a factor of 100, and couldn't find the exact position of the overwrite either
  • We also tried the "Full Page Heap" functionality of Application Verifier, but then the application doesn't start up either (probably also running out of PTE's)

In order to find buffer overflows more easily I am changing our custom memory allocator so that it allocates a full 4KB page instead of only the wanted number of bytes.

This has already been done. Application Verifier with PageHeap.

Info on PTEs and the Memory architecture can be found in Windows Internals, 5th Ed. and the Intel Manuals.
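For reference, the guard-page technique the question describes looks roughly like this (a sketch only; a real allocator also has to handle alignment, under-run detection, and bookkeeping for free):

// Each allocation gets its own pages followed by a PAGE_NOACCESS guard page,
// so a write past the end faults immediately.
#include <windows.h>

void* GuardedAlloc(size_t size)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    size_t pageSize = si.dwPageSize;

    size_t dataPages  = (size + pageSize - 1) / pageSize;
    size_t totalBytes = (dataPages + 1) * pageSize;           // +1 guard page

    BYTE* base = (BYTE*)VirtualAlloc(NULL, totalBytes,
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!base) return NULL;

    DWORD oldProtect;
    VirtualProtect(base + dataPages * pageSize, pageSize,
                   PAGE_NOACCESS, &oldProtect);               // the guard page

    // Place the user block right against the guard page so that overruns fault.
    return base + dataPages * pageSize - size;
}

void GuardedFree(void* p)
{
    MEMORY_BASIC_INFORMATION mbi;
    VirtualQuery(p, &mbi, sizeof(mbi));
    VirtualFree(mbi.AllocationBase, 0, MEM_RELEASE);          // release the whole allocation
}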

Is this different (higher) for 64-bit systems/applications or not?

Of course. 64bit Windows has a much larger address space, so clearly more PTEs are needed to map it.

Where can I find information about the maximum number of PTE's in a process?

This is not as important as the maximum amount of user address space available in a process. (The number of PTEs is this number divided by the page size.)

This is 2GB on 32 bit Windows and much bigger on x64 Windows. (The actual number varies, but it's "big enough").

Problem is that although I have enough memory, the application never starts up completely because it runs out of memory.

Are you a) leaking memory? b) using horribly inefficient algorithms?

I've always been curious about

  1. What exactly does a process look like in memory?
  2. What are the different segments (parts) in it?
  3. How exactly are the program (on disk) and the process (in memory) related?

My previous question: more info on Memory layout of an executable program (process)

In my quest, I finally found an answer. I found this excellent article that cleared up most of my queries: http://www.linuxforums.org/articles/understanding-elf-using-readelf-and-objdump_125.html

In the above article, author shows how to get different segments of the process (LINUX) & he compares it with its corresponding ELF file. I'm quoting this section here:

Curious to see the real layout of process segments? We can use the /proc/<pid>/maps file to reveal it, where <pid> is the PID of the process we want to observe. Before we move on, we have a small problem here: our test program runs so fast that it ends before we can even dump the related /proc entry. I use gdb to solve this. You can use another trick, such as inserting sleep() before it calls return().

In a console (or a terminal emulator such as xterm) do:

$ gdb test
(gdb) b main
Breakpoint 1 at 0x8048376
(gdb) r
Breakpoint 1, 0x08048376 in main ()

Hold right here, open another console and find out the PID of program "test". If you want the quick way, type:

$ cat /proc/`pgrep test`/maps

You will see an output like below (you might get different output):

[1]  0039d000-003b2000 r-xp 00000000 16:41 1080084  /lib/ld-2.3.3.so
[2]  003b2000-003b3000 r--p 00014000 16:41 1080084  /lib/ld-2.3.3.so
[3]  003b3000-003b4000 rw-p 00015000 16:41 1080084  /lib/ld-2.3.3.so
[4]  003b6000-004cb000 r-xp 00000000 16:41 1080085  /lib/tls/libc-2.3.3.so
[5]  004cb000-004cd000 r--p 00115000 16:41 1080085  /lib/tls/libc-2.3.3.so
[6]  004cd000-004cf000 rw-p 00117000 16:41 1080085  /lib/tls/libc-2.3.3.so
[7]  004cf000-004d1000 rw-p 004cf000 00:00 0
[8]  08048000-08049000 r-xp 00000000 16:06 66970    /tmp/test
[9]  08049000-0804a000 rw-p 00000000 16:06 66970    /tmp/test
[10] b7fec000-b7fed000 rw-p b7fec000 00:00 0
[11] bffeb000-c0000000 rw-p bffeb000 00:00 0
[12] ffffe000-fffff000 ---p 00000000 00:00 0

Note: I added numbers on each line for reference.

Back to gdb, type:

(gdb) q

So, in total, we see 12 segments (also known as Virtual Memory Areas--VMAs).

But I want to know about Windows Process & PE file format.

  1. Any tool(s) for getting the layout (segments) of a running process in Windows?
  2. Any other good resources for learning more on this subject?

EDIT:

Are there any good articles which show the mapping between PE file sections & VA segments?

Run "!address" in WinDbg on the running process. You will see every virtual memory segment in the process with some classification - image, memory mapped file, stack, heap, PEB, TEB, etc.

Windows Internals is always a good reference for things like this.

Here's the first few entries for notepad:

        BaseAddress      EndAddress+1        RegionSize     Type       State                 Protect             Usage
----------------------------------------------------------------------------------------------------------------------
*        0`00000000        0`00be0000        0`00be0000             MEM_FREE    PAGE_NOACCESS                      Free 
*        0`00be0000        0`00bf0000        0`00010000 MEM_MAPPED  MEM_COMMIT  PAGE_READWRITE                     MemoryMappedFile "PageFile"
*        0`00bf0000        0`00bf7000        0`00007000 MEM_MAPPED  MEM_COMMIT  PAGE_READONLY                      MemoryMappedFile "PageFile"
*        0`00bf7000        0`00c00000        0`00009000             MEM_FREE    PAGE_NOACCESS                      Free 
*        0`00c00000        0`00c03000        0`00003000 MEM_MAPPED  MEM_COMMIT  PAGE_READONLY                      MemoryMappedFile "PageFile"
*        0`00c03000        0`00c10000        0`0000d000             MEM_FREE    PAGE_NOACCESS                      Free 
*        0`00c10000        0`00c12000        0`00002000 MEM_MAPPED  MEM_COMMIT  PAGE_READONLY                      MemoryMappedFile "PageFile"
*        0`00c12000        0`00c20000        0`0000e000             MEM_FREE    PAGE_NOACCESS                      Free 
*        0`00c20000        0`00c21000        0`00001000 MEM_PRIVATE MEM_COMMIT  PAGE_READWRITE                     <unclassified> 
*        0`00c21000        0`00c30000        0`0000f000             MEM_FREE    PAGE_NOACCESS                      Free 
*        0`00c30000        0`00c97000        0`00067000 MEM_MAPPED  MEM_COMMIT  PAGE_READONLY                      MemoryMappedFile "\Device\HarddiskVolume2\Windows\System32\locale.nls"

Another virtual memory viewer is VMValidator. Visual data of memory layout, plus data on memory pages and memory paragraphs.

As for the layout of PE files, I recommend the book Expert .NET 2.0 IL Assembler, chapter 4. It's principally aimed at managed (.NET) PE files rather than native ones, but it does describe how it's all laid out.

Then if you want to see some source code (C++) that reads a PE file you should take a look at PE File Format DLL. There is also a GUI that shows you how to use the DLL. The license for the source is open source and not restricted by the GPL.

EDIT: Another book recommendation would be Inside Microsoft Windows 2000 (3rd Edition) by David A. Solomon and Mark E. Russinovich (the guys that wrote VMMap, mentioned in a different answer). This book has sections on memory management right from the page table layout through to more macro-scale memory management, and another chapter all about various issues to do with processes, threads, and related data structures.

Regarding PE layout and Virtual Address layout, a DLL is loaded into a memory area that is on a paragraph boundary (64K on x86), allocated by VirtualAlloc(). The memory protection of the various pages (4K on x86, 8K on x64) inside this is set according to how each section is described in the PE file (read only, read/execute, read/write), etc. Thus knowing the PE file layout is useful, which is why I mentioned it.

If you are planning on experimenting with modifying DLLs or performing instrumentation, having a tool to allow you to easily view the DLL contents is very useful. Hence the link to the PE File Format DLL. Its also a good base to start from for your own specific requirements.

Here's another question I came across when reading <Windows via C/C++, 5th Edition>. First, let's see some quotations.

LPVOID WINAPI VirtualAlloc(
  __in_opt  LPVOID lpAddress,
  __in      SIZE_T dwSize,
  __in      DWORD fdwAllocationType,
  __in      DWORD fdwProtect
);

The last parameter, fdwProtect, indicates the protection attribute that should be assigned to the region. The protection attribute associated with the region has no effect on the committed storage mapped to the region.

When reserving a region, assign the protection attribute that will be used most often with the storage committed to the region. For example, if you intend to commit physical storage with a protection attribute of PAGE_READWRITE, you should reserve the region with PAGE_READWRITE. The system's internal record keeping behaves more efficiently when the region's protection attribute matches the committed storage's protection attribute.

(When commiting storage)...you usually pass the same page protection attribute that was used when VirtualAlloc was called to reserve the region, although you can specify a different protection attribute.

The above quotation totally puzzled me.

  • If the protection attribute associated with the region has no effect on the committed storage, why do we need it?

  • Since it is recommended to use the same protection attribute for both reserving and committing, why does Windows still offer us the option to use a different attribute? Isn't it misleading and kind of a paradox?

  • Where exactly is the protection attribute stored for the reserved region and the committed storage, respectively?

Many thanks for your insights.

It's important to read it in context.

The protection attribute associated with the region has no effect on the committed storage mapped to the region.

was referring to reserving, not committing regions.

A reserved page has no backing store, so its protection is always conceptually PAGE_NOACCESS, regardless of what you pass to VirtualAlloc. I.e. if a thread attempts to read/write an address in a reserved region, an access violation is raised.
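A short sketch of that reserve/commit distinction (uses MSVC structured exception handling to catch the access violation; the 4096-byte page size is hard-coded for brevity):

// Touching a merely reserved page faults; committing it makes it usable.
#include <windows.h>
#include <cstdio>

int main()
{
    // Reserve 1 MB of address space. No storage is committed yet, and the
    // protection passed here does not make the pages accessible.
    BYTE* region = (BYTE*)VirtualAlloc(NULL, 1024 * 1024, MEM_RESERVE, PAGE_READWRITE);
    if (!region) return 1;

    __try
    {
        region[0] = 1;                        // access violation: page is only reserved
    }
    __except (EXCEPTION_EXECUTE_HANDLER)
    {
        printf("touching a reserved page raised an access violation\n");
    }

    // Commit the first page; now the protection attribute really applies.
    VirtualAlloc(region, 4096, MEM_COMMIT, PAGE_READWRITE);
    region[0] = 1;                            // fine
    printf("committed page is writable\n");

    VirtualFree(region, 0, MEM_RELEASE);
    return 0;
}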

From linked article:

Reserved addresses are always PAGE_NOACCESS, a default enforced by the system no matter what value is passed to the function. Committed pages can be either read-only, read-write, or no-access.

Re:

  • Where exactly is the protection attribute stored for the reserved region and the committed storage, respectively?

The protection attributes for virtual address regions are stored in the VAD tree, per process. (VAD == Virtual Address Descriptor, see Windows Internals, or linked article)

Since it is recommended to use the same protection attribute for both reserving and committing, why does Windows still offer us the option to use a different attribute? Isn't it misleading and kind of a paradox?

Because the function always accepts a protection parameter, but its behaviour depends on fdwAllocationType. Protection only makes sense for committed storage.

The reason Richter suggests using the same protection setting is presumably because fewer changes in the protection flags in a region mean fewer "blocks" (see your book for definition), and hence a smaller AVL tree for the VADs. I.e. if all pages in a region are committed with the same flags, there'll only be 1 block. Otherwise there could be as many blocks as pages in the region. And you need a VAD for each block (not page).

Block == set of consecutive pages with identical protection/state.

If the protection attribute associated with the region has no effect on the committed storage, why do we need it?

As above.

I'm wondering how long it takes (in milliseconds) to read a registry value from the Windows registry through standard C# libraries. In this case, I'm reading in some proxy settings.

What order of magnitude value should I expect? Are there any good benchmark data available?

I'm running WS2k8 R2 amd64. Bonus points: How impactful is the OS sku/version on this measure?

 using (RegistryKey registryKey = Registry.CurrentUser.OpenSubKey(@"Software\Copium"))
 {
      // Note: OpenSubKey returns null if the key does not exist
      return (string)registryKey.GetValue("BinDir");
 }

I cannot quote numbers as I don't know them. But having just read 30 pages about the registry in the Windows Internals 5 book, the following noteworthy things that I didn't know became clear (a small timing sketch follows the list).

  • The registry is transactional and has fail-safes to prevent it from being corrupted. This can affect performance, but since the transactional level is read committed, reads shouldn't be blocked by writes, so they should be performant.

  • The registry is cached in memory (well, frequently used values anyway), so if you access a set of keys often, the performance should remain stable after the first hit.
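If you want numbers for your own machine, it is easy to measure. This is a Win32-level sketch (the key and value names are the ones from the question; a C# version using Stopwatch would look much the same):

// Time a single registry read with the high-resolution performance counter.
#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);

    HKEY key;
    wchar_t value[256];
    DWORD size = sizeof(value);

    QueryPerformanceCounter(&start);
    if (RegOpenKeyExW(HKEY_CURRENT_USER, L"Software\\Copium", 0, KEY_QUERY_VALUE, &key) != ERROR_SUCCESS)
        return 1;
    RegQueryValueExW(key, L"BinDir", NULL, NULL, (LPBYTE)value, &size);
    RegCloseKey(key);
    QueryPerformanceCounter(&stop);

    printf("registry read took %.3f microseconds\n",
           (stop.QuadPart - start.QuadPart) * 1e6 / freq.QuadPart);
    return 0;
}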

I need to create a driver that presents itself to Windows as a video capture driver. The driver generates the video itself. How would I go about doing this? And please keep in mind that I'm using Visual C++ Express.

I'm not sure you can do this with a UMD (user-mode driver), so you'll likely need to install the WDK. You probably already know this, but writing a driver is a huge undertaking, so you should be prepared for that.

Here's a link on writing Windows drivers from MSDN. I'd also suggest you pick up a copy of the Windows Internals book, and check out OSR (and take a class if you can!).

Hope that helps you get started!

I need to learn the basics of operating systems, kernels, and CPU architectures, since some jobs require that background.

Is there a good book or online resource that I can refer to?

I don't know if you had a specific OS in mind, but one of the best books on how the Windows operating system works "under the hood" is called Windows Internals. It describes in detail how everything from the kernel, to device drivers, and the file system all work.

If you're looking for a good book on how CPUs and processors work in general, I recommend Computer Architecture: A Quantitative Approach. Very good info there!

Also, some good resources on how CPUs work, with perspective to programmers, can be found from the Intel technical library. Everything is free to download there and it makes for some good reading!

I've been assigned the job of testing a small Windows application for the company I work for. I'm a little experienced with testing web applications using the Google Chrome Developer Tools. Apart from that, I don't know much.

For the moment, I test manually, keeping an eye on the Windows Task Manager for memory and CPU usage.

What other basic tools should I be using to do manual (as opposed to unit testing) Windows application testing?

There're a number of tools that can be handy:

Process Explorer from Sysinternals is much more useful than the Task Manager.

Off top of my head, here are a few things that you can do without modifying the code or writing test code:

  • see if there're memory leaks or corruptions (use Application Verifier + WinDbg)
  • inject failures (that is, at some point modify a status/error code/pointer/some other variable in the debugger as if a piece of code failed to open a file or allocate memory or do something else) and see if the app gracefully handles that

Play with SysInternals tools.

Also, it may be a good idea to buy this book to familiarize yourself with Windows: http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-Developer/dp/0735625301/

There're also a few good ones on debugging Windows applications, like this one: http://www.amazon.com/Advanced-Windows-Debugging-Mario-Hewardt/dp/0321374460/
Among the other things it explains how to automatically collect crash dumps from your applications (using Windows Error Reporting AKA WER) and then inspect them in the debugger. I found that useful.

Seriously, I've trawled MSDN and only got half answers - what do the columns in Task Manager mean? Why can't I calculate the VM usage by enumerating threads, modules, heaps, etc.? How can I be sure I am accurately reporting to clients of my memory manager how much address space is left? Are there myriad collisions in the memory glossary namespace?

An online resource would be most useful in the short term, although books would be acceptable in the medium term.

Mark Russinovich has written the excellent book Windows Internals. A new edition that covers the Vista and Server 2008 operating systems is currently in the works with David Solomon, so you may want to pre-order that if your questions are about the new Windows operating systems instead of the old ones.

For academic and task-related purposes I need to know how file-related metadata is associated with files on NTFS and EXT. How does the operating system know a file's name? How do editors know in which encoding to treat the file contents?

Are these details stored in a separate location on NTFS/EXT, or are they included within the file itself?

On NTFS such information is stored not in the file itself but in the master file table (MFT).

You are asking many questions. I suggest you read up on the subject. Here is the short version, and here is everything in full detail.

I'm having quite an issue here. What I'm trying to achieve is an application which won't be seen as a keylogger. I have this application in which I would like to do stuff on key presses while you are inside another game. But the approach I found was scanned and flagged as a keylogger due to the global key registration.

Is there any way to avoid this in order to make such an application?

It's illegal if you are working on harmful software. If so, I'm suggesting > that wiki <

You should search with these terms;

"What is Ring 0"

"What is Kernel Mode"

"Ring 0 Keylogger"

"Take Key Press Events Sys Driver" etc.

Kernel mode keyloggers are hard to detect, but they can be detected too; they are not completely invisible. Here are the source codes that I found: Ring 0 Keylogger and Keyboard Monitoring.

I also suggest that book; it's nice to read and a good way to learn the Windows OS.

Good luck !

I am studying how to use the MiniDumpWriteDump() function to create minidumps. After reading some articles, I got the feeling that all I can do is provide some callback functions and various flags to tell the OS what I want to dump. Then the OS will collect various info, such as call stacks, into a dump file.

But is this all I can do? I don't want to use the so-called APIs; it makes me feel like swimming in the bathtub, not the ocean. Is there any other way to examine the computer's memory freely? Could anyone provide some references on how to achieve that?

Many thanks.

You can, however, see another process's memory, but to do it without the API you need to be in kernel mode. The API makes it easy to do from user mode. Your choice.

Kernel mode stuff and useful links I've grabbed quickly:

I have a fairly complex (approx 200,000 lines of C++ code) application that has decided to crash, although it crashes a little differently on a couple of different systems. The trick is that it doesn't crash or trap out in the debugger. It only crashes when the application .EXE is run independently (either the debug EXE or the release EXE - both behave the same way). When it crashes in the debug EXE and I get it to start debugging, the call stack is buried down in the Windows/MFC part of things and doesn't reflect any of my code. Perhaps I'm seeing a stack corruption of some sort, but I'm just not sure at the moment. My question is more general - it's about tools and techniques.

I'm an old programmer (from the C and assembly language days), and a relative newcomer (a couple/few years) to C++ and Visual Studio (2003 for this project).

Are there tricks or techniques anyone's had success with in tracking down crashing issues when you cannot make the software crash in a debugger session? Stuff like permission issues, for example?

The only thing I've thought of is to start plugging in debug/status messages to a logfile, but that's a long, hard way to go. Been there, done that. Any better suggestions? Am I missing some tools that would help? Is VS 2008 better for this kind of thing?

Thanks for any guidance. Some very smart people here (you know who you are!).

cheers.

I couldn't recommend the blog of Mark Russinovich more. He's an absolutely brilliant guy from whom you can learn a whole bunch of debugging techniques for Windows and much more. Especially try reading some of the "The Case of..." series! Amazing stuff!

For example, take a look at this case he investigated - a crash of IE. He shows how to capture the stack of the failing thread and much more interesting stuff. His main tools are the Windows debugging tools and also his Sysinternals tools!

Enough said. Go read it!

Also I would recommend the book: Windows Internals 5. Again by Mark and company.

I have been coding in C# for about 3-5 years in school. The problem is that I want to learn how to code things such as keyloggers. In school it's mostly problem solving, to teach us to think like programmers.

So how should I learn to code network/security tools? Should I buy a book about network programming in C#? Or do you have any tips on where to start?

If you want to build things like keyloggers or network packet analyzers, you will have to go to a lower layer than what C# usually offers. You will need to learn a bit more about the operating system, its networking stack, how it interacts with the input devices and things like that.

Keep in mind that some of the things might not even be feasible in C#. You will have to do a lot of Win32 interop and even write C++ code on occasions.

I would recommend you look for a book that describes the Windows internals, for example Windows Internals by Mark Russinovich and David Solomon.

I know the cost of a physical Win32 thread context switch is estimated at between 2-8k cycles. Any estimates on the cost of a process switch?

A quote from "Windows Internals 5Ed":

Windows must determine which thread should run next. When Windows selects a new thread to run, it performs a context switch to it. A context switch is the procedure of saving the volatile machine state associated with a running thread, loading another thread’s volatile state, and starting the new thread’s execution.

Windows schedules at the thread granularity. This approach makes sense when you consider that processes don’t run but only provide resources and a context in which their threads run. Because scheduling decisions are made strictly on a thread basis, no consideration is given to what process the thread belongs to. For example, if process A has 10 runnable threads, process B has 2 runnable threads, and all 12 threads are at the same priority, each thread would theoretically receive one-twelfth of the CPU time—Windows wouldn’t give 50 percent of the CPU to process A and 50 percent to process B.

...

A thread’s context and the procedure for context switching vary depending on the processor’s architecture. A typical context switch requires saving and reloading the following data: A. Instruction pointer B. Kernel stack pointer C. A pointer to the address space in which the thread runs (the process’s page table directory). The kernel saves this information from the old thread by pushing it onto the current (old thread’s) kernel-mode stack, updating the stack pointer, and saving the stack pointer in the old thread’s KTHREAD block. The kernel stack pointer is then set to the new thread’s kernel stack, and the new thread’s context is loaded. If the new thread is in a different process, it loads the address of its page table directory into a special processor register so that its address space is available. Control passes to the new thread’s restored instruction pointer and the new thread resumes execution.

So the only additional overhead for a thread context switch into another process is setting the value of one processor register - practically negligible.

I want to suspend or save the state of Windows Vista with all opened applications and close them (to have them reopened again later), and start a game (from Steam, for example). And when I quit the game, I want to resume (resume the saved state, or resume from suspend) all the opened applications so I can pick up work where I left it. I have found these relevant resources:

Can you give me advice on how to create such a feature? I'm just starting with C++, so I'm only asking where to look first - nothing more, just a starting point with resources - to create such a program for Windows Vista x64 Ultimate. I have learned a fair amount of C++.

More about the same: Can I "hibernate" only the active programs and leave the Windows OS running? Would this just require launching another explorer.exe?

This sounds like a difficult project for a beginner. You'll have to learn quite a bit about the internals of the operating system: process control and how to manipulate virtual memory. A few books you might start with are Windows via C/C++, Windows System Programming, and Windows Internals. A new edition of Windows Internals is coming out in March.

Windows 7 has heap randomization and stack randomization features. How can I manage them? How do they affect the performance of my application? Where can I find more information on how they work?

I'm using Visual Studio 2008 for developing C++ programs. I can't find any compiler options for those features.

OK, heap randomization and stack randomization are Windows features, but they have to be explicitly enabled for each process at link time (the /DYNAMICBASE linker option). Mark Russinovich describes how it works in the 5th edition of Windows Internals.

Stack randomization consists of first selecting one of 32 possible stack locations separated by either 64 KB or 256 KB. This base address is selected by finding the first appropriate free memory region and then choosing the xth available region, where x is once again generated based on the current processor's TSC shifted and masked into a 5-bit value.<...>

Finally, ASLR randomizes the location of the initial process heap (and subsequent heaps) when created in user mode. The RtlCreateHeap function uses another pseudo-random, TSC-derived value to determine the base address of the heap. This value, 5 bits this time, is multiplied by 64 KB to generate the final base address, starting at 0, giving a possible range of 0x00000000 to 0x001F0000 for the initial heap. Additionally, the range before the heap base address is manually deallocated in an attempt to force an access violation if an attack is doing a brute-force sweep of the entire possible heap address range.