Tuesday, September 15, 2009

Obama called Kanye West a 'Jackass'...

Obama called Kanye West a 'Jackass' in an informal, off-the-record conversation today; ABC leaked it. Hmm, what to do... I know, Obama should have another get-together. Invite Taylor Swift and Beyonce to the White House, have a beer, and talk about what a jackass West is.

Thursday, September 10, 2009

That's Redundant; In Addition, It Repeats Itself

"However, the audience for the speech appeared to be more Democratic than the U.S. population as a whole, causing the poll organizers to warn the results may favor Obama simply because more Democrats than Republicans tuned in.

"In addition, the pollsters noted, the results don't reflect the the views of all Americans, only those who watched the speech"

- cnn.com

Wednesday, September 2, 2009

32-bit vs 64-bit : Enough Already

There is so much rubbish out there about 32- vs 64-bit machines. And let's be clear, I'm talking about the 32-bit Pentium (also known as X86 or IA-32) compared to the AMD64 machine (also known as X86_64 or EM64T), because, folks, there actually are other 32-bit and 64-bit processors out there.

The 64-bit machine can run 32-bit code, of course, and the usual question is: should I use a 64-bit OS and run 64-bit apps, or just pretend it's a fast 32-bit machine and run 32-bit stuff?

First off, there's a lot of stuff out there about 64-bit Windows not running old drivers or 16-bit apps. I'm not interested in that here; links below if you are.

The question is: what's so good about the 64-bit machine that I'd want to switch to a new OS and new application code?

Well, two answers are usually given. The first is that a 32-bit machine can only address 4GB of memory. The second is that 64-bit calculations allow 'greater precision' or something like that.

So, taking these in order: the memory limit is becoming an issue, if only because 4GB of RAM doesn't cost that much anymore. Vista is a huge RAM hog and needs at least 2GB to get decent behaviour, so that doesn't leave much headroom, especially when you allow for all the data the virus scanner will need to keep in memory. (BTW, if you're wondering what happens on a laptop when you push 'hibernate' and all of that has to be written to the laptop drive -- yes, you're right.) So, yes, this is a good thing; you either need it or you don't, depending on how much RAM you have. And there are some unusual cases where a single application needs to access more than 4GB of RAM at once, which will obviously only work on a 64-bit machine (the limit for a 32-bit process is actually rather less than 4GB, because of the way virtual address space is allocated).
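To put a number on that limit: an n-bit pointer can name 2^n distinct byte addresses, so 32 bits gets you exactly 4GB. A trivial sketch (the function name is mine, just for illustration):

```c
#include <stdint.h>

/* Bytes addressable with an n-bit pointer: 2^n. For 32 bits
   that's 4 GiB; in practice a 32-bit process sees less, since
   the OS reserves part of the virtual address space for itself. */
uint64_t addressable_bytes(int pointer_bits) {
    return 1ULL << pointer_bits;
}
```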

The 'precision' issue is basically rubbish. If an application needs to do 64-bit integer operations, it will do them even when compiled for the 32-bit machine. The difference is that the 64-bit machine can do a 64-bit 'add' in one instruction whereas the 32-bit machine takes several instructions. So don't expect any application to somehow generate more precise results on the 64-bit machine (I have seen people speculate on this). Even applications that do 64-bit calculations generally don't spend much time actually doing them, so the speed difference doesn't matter much. If an app really does spend a lot of time on 64-bit operations, both the 32-bit and 64-bit machines have 128-bit wide vector operations that can be used for this. It's also worth pointing out that 'precision' is usually associated with floating-point operations, and the 64-bit machine does not change the width of those.
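A quick way to convince yourself: the same portable C compiles for both targets and produces bit-identical results; only the instruction count differs. (The function name here is mine, just for illustration.)

```c
#include <stdint.h>

/* 64-bit arithmetic in portable C. Compiled with -m64 this add
   is a single 64-bit ADD instruction; with -m32, gcc emits an
   add/adc pair on the two 32-bit halves. Either way the result
   is exactly the same -- no 'precision' is gained or lost. */
uint64_t add64(uint64_t a, uint64_t b) {
    return a + b;
}
```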


But the real advantage of the 64-bit machine has nothing to do with the width of the data registers; it is the number of them. The X86 machine is equipped with an anemic 6 (yes, six) general purpose registers (count 'em: eax, ebx, ecx, edx, esi, edi). As a result, in typical code relatively little information can be kept in registers for any length of time, so a lot of memory reads and writes are needed -- far more than on other 32-bit processors. By contrast, the 64-bit machine adds 8 additional registers, for a total of 14 (all of them 64 bits wide). Even when the extra data width gives no advantage, the extra registers mean a program can generally perform a given task with far fewer load/store operations. It also means the program may be a little smaller, which improves speed since less time is spent loading code from disk, or from memory into the cache.
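For reference, here are the two encodable register sets side by side (ESP/EBP and RSP/RBP included in both lists; names as in the architecture manuals):

```c
/* The 8 encodable IA-32 general purpose registers (of which ESP
   and EBP are normally tied up as stack and frame pointers) vs
   the AMD64 set of 16, which adds r8 through r15. */
static const char *ia32_gprs[] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};
static const char *amd64_gprs[] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
};
```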

This improvement by itself leads to a significant increase in code speed, and ironically it is a feature that has nothing to do with 64 bits, it is something that Intel could have done to their 32-bit machine if they had had some vision in designing the architecture.

I've seen a 20%-25% speed improvement on the same C++ code running on the same machine, using -m32 and -m64 with the same version of gcc. Virtually none of that improvement comes from using 64-bit wide operations.

More about the extra 8 registers in AMD64: it starts to get technical here; you have been warned.

The 6-register issue with the X86 dates back to the 8086, which had the same set of basic registers and was designed to run 8080 code with little modification. To me, it's astounding what a poor job Intel did 'enhancing' their processor from 16-bit to 32-bit. The key point here -- and for whatever reason, they seem to have completely missed it -- is this: when you step across a major boundary, e.g. from 16-bit to 32-bit, you don't need to keep all of the instructions; you can drop the useless ones. Because no existing 16-bit code will run in 32-bit mode, you don't need to be compatible with it. You don't need to keep, for instance, the DAA instruction, which was included on the 8086 because the 8080 had it, but which is basically useless unless you are writing 8-bit assembler by hand. No compiler would ever make use of this instruction, but Intel's 32-bit machine supports it. There are 3 other useless instructions (DAS, AAS, AAA) which similarly should have been left behind in the 16-bit machine.

Why drop useless instructions? Since these machines need to support those instructions in legacy mode, you won't really save much logic by dropping them. But they are consuming prime real estate in the opcode space which you'd like to use for new, more powerful instructions. Those four instructions are each 1 byte long, so together they consume 1/64 of the entire one-byte opcode space of the machine. To add new instructions, Intel has always taken the approach of scraping up the few remaining 1-byte opcodes not used in the 16-bit processor and using them as 'prefixes' for all the new instructions needed for the 32-bit machine. Thus the new, powerful instructions are put in the suburbs (with needlessly long encodings) while the old, boring ones (some of which are still quite useful) keep hogging the prime real estate. In practical terms, this means it takes too many opcode bytes to get anything done. This issue, plus the impact of having only 6 registers, results in very poor code density on the X86. My experience is this: when you compile C code for X86 and for a 32-bit RISC processor such as ARM, you can expect the ARM version to be about 2/3 the size of the X86 version. This is ironic, since when the RISC/CISC debate was raging in the '80s, one of the claimed disadvantages of RISC was lower code density. The X86 is the only surviving 32-bit CISC architecture (R.I.P. 68K, NS32K) and has far, far worse code density than its cohorts.

AMD didn't make the same mistake when designing the 64-bit machine. They decided to add more registers; but this means that all instructions which select a register need to have an extra bit squeezed in somehow. Also, you want to support basic operations (such as ADD) which operate on 8,16,32 and 64 bits, whereas the existing ones only operate on 8,16 or 32; so again that means adding a bit to a wide class of instructions.

So, they noticed another group of useless instructions. There is a one-byte INC instruction, which adds 1 to a register; in fact there are 8 of these, since you can add 1 to 8 different registers - the six general purpose ones, plus ESP and EBP. Likewise, there are 8 one-byte 'DEC' instructions. That's a total of 16 opcodes - a whopping 1/16 of the entire one-byte opcode space. For every one of these instructions there is an equivalent 2-byte encoding using the general 'INC' or 'DEC' instruction (which can also operate on memory). Due to the way EBP and ESP are used, there's no reason whatsoever to add or subtract 1 to/from those registers in the 32-bit machine. So, of the 16 opcodes, 4 are utterly useless and all are redundant.
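Concretely, those short forms sit in the contiguous opcode range 0x40-0x4F. A sketch of that layout (helper names are mine):

```c
#include <stdint.h>

/* In 32-bit mode, 0x40+r is a one-byte INC on register r and
   0x48+r is a one-byte DEC, for r = 0..7. That's 16 of the 256
   one-byte opcodes: 1/16 of the space. */
int is_short_inc(uint8_t op) { return op >= 0x40 && op <= 0x47; }
int is_short_dec(uint8_t op) { return op >= 0x48 && op <= 0x4F; }
```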

There's a question of why they were there in the first place: the redundant 1-byte encodings have been there since the 8086, which was designed to run 8080 code with little modification. For various reasons it was very common to use INC and DEC on 16-bit registers in the 8086, so providing 1-byte equivalents was probably a reasonable decision. That rationale doesn't extend to supporting all 8 registers, but it's reasonable to do so for symmetry reasons (and, to be sure, when they were designing the 8086, nobody was saying "Hey, leave some space for the 32-bit stuff!").

So you may see where this is going. In the AMD64 machine, these 16 opcodes are stripped of their old meaning and merged into a new 'super-prefix' which provides four additional bits of opcode (whereas an Intel-style prefix only supplies one, by being absent or present). One of these four bits is used to expand the register set from 8 to 16 registers (both counts including SP and BP). When executing 16- or 32-bit code, the opcodes keep their original INC/DEC meaning.
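In the AMD64 documentation this super-prefix is called the REX prefix, with bit layout 0100WRXB: W selects 64-bit operand size, and R, X, B each extend one 3-bit register field in the instruction to 4 bits. A minimal decoder sketch:

```c
#include <stdint.h>

/* REX prefix: 0100WRXB. These are exactly the bytes 0x40-0x4F
   which, in 32-bit mode, are the one-byte INC/DEC instructions. */
typedef struct { int w, r, x, b; } rex;

int is_rex(uint8_t byte) { return (byte & 0xF0) == 0x40; }

rex decode_rex(uint8_t byte) {
    rex p = { (byte >> 3) & 1,   /* W: 64-bit operand size    */
              (byte >> 2) & 1,   /* R: extends reg field      */
              (byte >> 1) & 1,   /* X: extends index field    */
               byte       & 1 }; /* B: extends r/m or base    */
    return p;
}
```

For example, 0x48 (DEC EAX in 32-bit mode) becomes REX.W, the prefix seen on nearly every 64-bit arithmetic instruction.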

Now let's be very clear here: the trick of reclaiming the INC/DEC instructions to add, among other things, 8 extra registers is something Intel could just as easily have done when designing the 32-bit machine. There is no need for, or advantage in, keeping instructions just for backwards compatibility when crossing that Rubicon. Other than it being less work to design and verify, I guess.


Incidentally, the AMD64 also has 16 SSE2 registers (the 128-bit wide vector registers) whereas the 32-bit machine (as designed by Intel) only has 8. The same extra bit is used to select the extra registers. This makes a huge difference in vector code, since you can easily run out of registers when coding a vector operation with only 8.


Note, strictly speaking EBP may be classified as a general purpose register, which would make it 7 registers on the 32-bit machine and 15 on the 64-bit. But in most code the compiler uses EBP as the stack frame pointer.


AMD also 'dropped' the DAA, DAS, AAA, AAS opcodes in 64-bit mode, but did not redefine them; they are thus available for future enhancements. The 'segment register' model which Intel introduced in the 8086, and extended in the 286, is also effectively dropped in the 64-bit machine, reclaiming some more opcode space. Ever since the 386 added a proper VM paging unit, no OS has used the segment registers for their 'VM-like' purpose.


Going back to the original 32- vs 64-bit discussion: another point you sometimes see is that 64-bit machines have wider data buses to move things in and out of memory. Well, no. Memory buses have been 64 bits wide for a long time; while load/store operations between the processor and the cache may be limited to 32 bits on a 32-bit machine, the cache/memory interface is not tied to this. And if you look at the original question, you've already got that 64-bit processor with its memory interface; that won't change if you run it in 32-bit mode.



Other discussions:

http://www.pcstats.com/articleview.cfm?articleID=1665
http://www.infopackets.com/news/hardware/2006/20060824_32_bit_vs_64_bit_systems_whats_the_difference.htm

http://blogs.zdnet.com/hardware/?p=316&tag=rbxccnbzd1

http://www.tbreak.com/reviews/article.php?id=295

References:

http://en.wikipedia.org/wiki/AMD64