The x86 architecture was invented in 1978; it has been extended several times since then, to add new features used by new classes of application. I'm putting easy explanations here; the technical references are at the end of the document.
MMX was invented by Intel, introduced in January 1997, and is available in all the currently-available x86 chips. It reuses the floating-point registers, treating each one as a 64-bit register, for what are called SIMD instructions - 'Single Instruction, Multiple Data'.
For example, you can put eight 8-bit bytes in a single register, and the PADDB instruction will add corresponding bytes from two registers; a normal 64-bit addition cannot do this, since the carry out of one byte's addition would spill over into the next byte.
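To make the carry problem concrete, here is a small portable C sketch. It does not use MMX at all: `packed_add_bytes` is a hypothetical helper that models the wrapping per-byte behaviour in ordinary integer code, and `plain_add` shows what a normal 64-bit add does to the same inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Ordinary 64-bit addition: a carry out of one byte lane leaks into the next. */
static uint64_t plain_add(uint64_t a, uint64_t b) {
    return a + b;
}

/* Portable model of a wrapping packed byte add: each of the eight byte
   lanes is added modulo 256, independently of its neighbours. */
static uint64_t packed_add_bytes(uint64_t a, uint64_t b) {
    const uint64_t high = 0x8080808080808080ULL; /* top bit of every lane */
    /* Add the low 7 bits of each lane; these sums cannot cross a lane boundary. */
    uint64_t low7 = (a & ~high) + (b & ~high);
    /* Fold the top bits back in with XOR, which never produces a carry. */
    return low7 ^ ((a ^ b) & high);
}
```

With the lowest byte holding 0xFF in one operand and 0x01 in the other, `plain_add` gives 0x0100 (the carry has corrupted the second byte), while `packed_add_bytes` gives 0x0000, which is what a per-byte add should produce.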
The technology is useful for 2d graphics (remember that drawing texture-mapped triangles is an example of 2d graphics), particularly in 256-colour modes; with the rise of graphics accelerators, it is more used nowadays for manipulating audio and for producing procedural textures.
EMMX is a series of Cyrix extensions to the MMX instructions, intended to make motion-video work very much faster. They include implied-register instructions, almost a step towards 3-argument instructions (for example, PADDSIW MMa,MMb sets MM(a^1) to the packed saturated add of MMa and MMb), and a series of instructions useful for MPEG distance estimation.
At a conference in October 1997, Cyrix announced plans for its own set of SIMD FP instructions, called MMXfp; this was later abandoned when Cyrix decided to license 3DNow! instead.
3DNow! is a technology invented by AMD in mid-1998, and now available both on their chips and on the IDT Winchip 2. It adds extra instructions which manipulate the MMX registers - so it can be used at the same time as MMX. Some of these are to help with video decoding, but the more interesting set are for SIMD FP.
In each 64-bit MMX register, 3DNow! stores two 32-bit single-precision FP numbers; it provides a medium-sized set of instructions which manipulate both the numbers in the register at once.
This is very useful for 3D graphics, where you might want to compute something like (a*b)+(c*d)+(e*f)+(g*h); 3DNow! can do this in five instructions, taking five clock cycles. It is mostly used in games at the moment, either explicitly or through Direct3D drivers that contain 3DNow! instructions; the speedup can be quite significant.
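Here is a rough model in plain C of how that sum of four products maps onto two-float registers. The `pair2f` type and the helper functions are made up for illustration; they are named after the 3DNow! instructions PFMUL, PFADD and PFACC whose behaviour they imitate. This is just one plausible sequence, not necessarily the five-instruction one counted above, and it ignores the loads from memory.

```c
#include <assert.h>

/* Model of one 64-bit MMX register holding two single-precision floats. */
typedef struct { float lo, hi; } pair2f;

/* PFMUL: multiply both lanes at once. */
static pair2f pfmul(pair2f x, pair2f y) {
    pair2f r = { x.lo * y.lo, x.hi * y.hi };
    return r;
}

/* PFADD: add both lanes at once. */
static pair2f pfadd(pair2f x, pair2f y) {
    pair2f r = { x.lo + y.lo, x.hi + y.hi };
    return r;
}

/* PFACC: horizontal add; the low lane gets the sum of dst's two lanes,
   the high lane gets the sum of src's two lanes. */
static pair2f pfacc(pair2f dst, pair2f src) {
    pair2f r = { dst.lo + dst.hi, src.lo + src.hi };
    return r;
}

/* Computes x[0]*y[0] + x[1]*y[1] + x[2]*y[2] + x[3]*y[3]. */
static float sum_of_four_products(const float x[4], const float y[4]) {
    pair2f x01 = { x[0], x[1] }, x23 = { x[2], x[3] };
    pair2f y01 = { y[0], y[1] }, y23 = { y[2], y[3] };
    pair2f p01 = pfmul(x01, y01);   /* two products */
    pair2f p23 = pfmul(x23, y23);   /* the other two products */
    pair2f s   = pfadd(p01, p23);   /* two partial sums */
    return pfacc(s, s).lo;          /* combine the partial sums */
}
```

The point is that each arithmetic step handles two numbers at once, so four multiplies and three adds collapse into four packed operations.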
Unfortunately, what you often want to do is multiply vectors by 4x4 matrices, and 3DNow! doesn't have enough registers to do this without loads and saves to memory. This is an unavoidable restriction of using the MMX registers.
The Athlon implements the extended-MMX subset of SSE, along with 'PSWAPD', which swaps the two 32-bit halves of a 3DNow! register. It also adds four 'DSP' instructions to 3DNow! for converting floating-point numbers to 16-bit integers and back again, and for speeding up complex-number arithmetic and the innermost loops of Fourier transforms.
Conditional move (CMOV and FCMOV), introduced by Intel on the Pentium Pro and also used in the P2 series and the AMD Athlon, is a very unusual technology. Its aim is to make it possible to write programs which work on unpredictable data without the delays that arise when a branch is mispredicted and the (long) pipeline of the chip has to be emptied.
Conditional branches have been around forever; this extension provides conditional move instructions for both integer (CMOV) and FP (FCMOV) values. That is, you can issue a command like 'CMOVNE EAX,EBX', which checks whether the 'not equal' condition holds (that is, whether the zero flag is clear), does nothing if it doesn't, and sets EAX to EBX if it does. The idea is that it is occasionally quicker to compute both sides of an if statement and throw away the one you don't use, rather than risk a branch misprediction.
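In C, the "compute both sides and throw one away" idea looks roughly like the sketch below. The function names are invented for the example, and `select_i32` spells the selection out with a bit mask so that there is visibly no branch; whether a compiler actually emits a CMOV for code like this is entirely up to the compiler. The mask trick assumes two's-complement integers, which is what every x86 compiler uses.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free select: returns if_true when cond is non-zero, else if_false.
   The mask is all ones or all zeros, so exactly one operand survives. */
static int32_t select_i32(int cond, int32_t if_true, int32_t if_false) {
    int32_t mask = -(int32_t)(cond != 0);
    return (if_true & mask) | (if_false & ~mask);
}

/* The branching version: only one side is ever computed. */
static int32_t double_or_offset_branchy(int32_t x) {
    if (x < 0)
        return x * 2;
    return x + 100;
}

/* The conditional-move style: both sides are computed, one is discarded. */
static int32_t double_or_offset_branchless(int32_t x) {
    int32_t doubled = x * 2;    /* the 'then' side */
    int32_t offset  = x + 100;  /* the 'else' side */
    return select_i32(x < 0, doubled, offset);
}
```

Note that the branchless version needs both intermediate results alive at the same time, which is exactly where register pressure starts to matter.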
The technology is becoming almost mainstream, although the VIA C3 (Cyrix III) chips still don't support it. Unfortunately, the Intel architecture doesn't really have enough registers to store the results of both sides of an even slightly complicated calculation - it's hard enough to do a whole calculation in registers if you can use all seven - so the potential for improvement is not so great.
The Deschutes processors have two instructions, FXSAVE and FXRSTOR, for saving and restoring the FP registers. The task-switching code in Windows 98 uses these, and there is a patch which makes Linux use them too. Their main purpose is to support
SSE, the Streaming SIMD Extensions (otherwise known as KNI), were introduced on the Pentium III around the end of February 1999.
They introduce a new set of register-based (rather than stack-based) floating-point operations, including vector commands which perform four single-precision floating-point operations in parallel. Unlike 3DNow!, they use their own set of registers, so you can use MMX or floating-point code in parallel with them. And, because the register file is larger (8 registers, 4 floats per register using vector operations), you can fit a whole 4x4 matrix in it. There's still not enough space to multiply two matrices entirely in registers, but you can't have everything ...
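As an illustration of the matrix case, here is a sketch of a 4x4 matrix times 4-vector product using the SSE compiler intrinsics from `xmmintrin.h`. The function name and the column-major layout are my own assumptions for the example; the intrinsics themselves (`_mm_loadu_ps`, `_mm_set1_ps`, `_mm_mul_ps`, `_mm_add_ps`, `_mm_storeu_ps`) are the standard ones. A plain scalar fallback is included so that the code still compiles where SSE is not available.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* out = M * v, where M is stored column-major: m[4*c + r] is row r, column c. */
static void mat4_mul_vec4(const float m[16], const float v[4], float out[4]) {
#if defined(__SSE__)
    /* The whole matrix sits in four 4-float registers, one per column. */
    __m128 c0 = _mm_loadu_ps(m + 0);
    __m128 c1 = _mm_loadu_ps(m + 4);
    __m128 c2 = _mm_loadu_ps(m + 8);
    __m128 c3 = _mm_loadu_ps(m + 12);
    /* Scale each column by one vector component and accumulate. */
    __m128 acc = _mm_mul_ps(c0, _mm_set1_ps(v[0]));
    acc = _mm_add_ps(acc, _mm_mul_ps(c1, _mm_set1_ps(v[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(c2, _mm_set1_ps(v[2])));
    acc = _mm_add_ps(acc, _mm_mul_ps(c3, _mm_set1_ps(v[3])));
    _mm_storeu_ps(out, acc);
#else
    /* Scalar fallback with the same result, for non-SSE targets. */
    for (int r = 0; r < 4; ++r)
        out[r] = m[r] * v[0] + m[4 + r] * v[1] + m[8 + r] * v[2] + m[12 + r] * v[3];
#endif
}
```

The matrix takes four of the eight registers, and the accumulator and broadcast vector components take a couple more; a second full matrix would need another four registers on top of that.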
For the P3, Intel skimped somewhat on the implementation, using only a two-wide ALU, so the average performance of SSE and 3DNow! will be about the same - I've constructed sequences of instructions which are faster on 3DNow!. It's possible they'll use a four-wide one on later chips, which would make SSE roughly twice as fast as 3DNow!.
SSE also includes a few new MMX instructions; about half of these are straightforward operations omitted from the MMX standard (some instructions useful in video decoding, and commands for shuffling MMX registers and moving bits of them into normal registers), and the other half are memory operations which make the cache hierarchy rather more explicit than usual. You can prefetch data into caches, and you can store data directly to memory without contaminating the cache along the way; this last makes for a very fast memory-copy routine.
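A minimal sketch of such a copy routine, using the SSE intrinsics `_mm_prefetch`, `_mm_stream_ps` and `_mm_sfence` from `xmmintrin.h`, might look like this. The function name, the prefetch distance of 64 floats and the decision to fall back to `memcpy` for awkward inputs are all my own choices for the example, not tuned values; a real routine would be tuned per chip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Copies n floats from src to dst, bypassing the cache on the store side
   where possible. The buffers must not overlap. */
static void copy_floats_streaming(float *dst, const float *src, size_t n) {
#if defined(__SSE__)
    /* Streaming stores need a 16-byte-aligned destination; this sketch only
       handles whole groups of four floats and leaves the rest to memcpy. */
    if (((uintptr_t)dst & 15u) != 0 || (n % 4) != 0) {
        memcpy(dst, src, n * sizeof(float));
        return;
    }
    for (size_t i = 0; i < n; i += 4) {
        /* Hint that data a little further on will be needed soon. */
        if (i + 64 < n)
            _mm_prefetch((const char *)(src + i + 64), _MM_HINT_NTA);
        /* Load four floats, then store them without filling the cache. */
        _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));
    }
    /* Make the streaming stores visible before any later stores. */
    _mm_sfence();
#else
    memcpy(dst, src, n * sizeof(float));
#endif
}
```

The result is the same as a plain `memcpy`; the difference is only in how the caches are treated, so any benefit shows up on large buffers that would otherwise push useful data out of the cache.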
The Pentium 4 implements SSE2, which extends SSE by providing a range of instructions to manipulate integer and double-precision data stored in the SSE registers; for integer work there is (at last) a parallel 32 x 32 -> 64 multiply command, as well as extended versions of all the MMX instructions. For FP work, you can do all the exact SSE operations in double precision, though the approximate reciprocal and square root commands aren't extended.
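For the integer side, the parallel 32 x 32 -> 64 multiply is exposed to C as the `_mm_mul_epu32` intrinsic in `emmintrin.h`; it multiplies the unsigned 32-bit values in lanes 0 and 2 of each register and produces two 64-bit products. The wrapper function below is a made-up name for illustration, with a scalar fallback for targets without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* out[0] = a[0] * b[0] and out[1] = a[2] * b[2], each as a full 64-bit
   product; lanes 1 and 3 of the inputs are ignored. */
static void mul_32x32_to_64(const uint32_t a[4], const uint32_t b[4], uint64_t out[2]) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i prod = _mm_mul_epu32(va, vb);   /* two 64-bit products at once */
    _mm_storeu_si128((__m128i *)out, prod);
#else
    /* Scalar fallback with the same result. */
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[2] * b[2];
#endif
}
```

Squaring 0xFFFFFFFF this way gives 0xFFFFFFFE00000001, a result that would not fit in a 32-bit lane.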
The Pentium 4 also introduces a selection of extra instructions for moving data to and from memory without intermediate caches and for enforcing load and store ordering in the face of the drastically out-of-order core.
The Prescott core introduces a few additional instructions: one for converting numbers from floating-point to integer format efficiently, a couple for improved hyper-threading, and several to speed up Fourier transforms and complex-number arithmetic.
The Pentium III has a unique CPU ID number - every CPU is uniquely identifiable. This gave rise to a degree of paranoia about tracking people online by looking at their CPU UID. Given that many people can use a single computer, that the CPU UID requires special software to read, and that it gives little more information than can be obtained with cookies, the paranoia seems unjustified. The CPU UID is quite useful as an inventory-tracking tool, though.
Lots of these are in .PDF format, which can be read with Adobe Acrobat.
Intel's SSE reference manual (PDF)
Intel's MMX application notes (useful examples) (HTML)
Intel's MMX reference manual (HTML)
Cyrix's 6x86MX manual, including details of EMMX
AMD's notes on optimising for the K6-2 (PDF)
AMD's manual on optimising for the Athlon (PDF)