Transactional memory on x86 will come in two flavors:
- Hardware Lock Elision (HLE), which consists of XACQUIRE and XRELEASE instruction prefixes. These can optimistically turn lock regions into transactions. What's the advantage? Well, transactions can execute concurrently, as long as they aren't writing to the same memory. Locks are serialized: only one thread can be in a lock region at a time. So this mechanism can (hypothetically) execute more code concurrently. Also - and perhaps more importantly - it should be much cheaper to acquire a lock using XACQUIRE than it is to acquire it without, since with XACQUIRE you don't have to do anything other than tell the system to start looking out for conflicts. Between the two advantages, HLE should be able to provide a nice performance boost. These are backwards compatible, so that software written to use them can be run on earlier hardware.
- Restricted Transactional Memory (RTM), which consists of XBEGIN, XEND and XABORT instructions. These have the traditional transactional memory semantics: you begin a transaction, and if you abort before you end it, your writes are undone. These are very flexible, but are not backwards compatible (i.e., you can't use them on older hardware).
I like the idea of hardware-level transactions / atomicity. Compilers can be written to take advantage of them, and they can be a win for people implementing synchronization primitives and doing very low-level multithreading. They have the potential of making synchronization very cheap when there isn't much contention. AFAICT, Azul has claiming
very large benefits from their HLE-style synchronization for years (edited: They claim <10%: see this presentation on hardware transactional memory from Cliff Click).
There is also likely to be a renaissance in lock-free data structures. Currently, lock-free data structures are pretty much all built on top of one non-blocking atomic instruction: the compare-and-swap (LOCK: CMPXCHG on x86, or AtomicReference.compareAndSet() to Java programmers). Trying to shoehorn everything you might want to do atomically into a single instruction that operates on one word has led to some very interesting data structures over the last 20 years or so, but is really the wrong way to solve the problem.
The flip side of this is that I don't think these will have much of an impact on most user-level synchronization. The HLE is really just an optimization for current lock implementations. The RTM aborts if you execute any of a large number of instructions (like PAUSE or CPUID, or many operations that affect control flow), or if there is a conflicting write to the cache line (which may not have any logical connection to the data the programmer cares about): the average programmer who deals with multithreaded code can't reasonably be expected to think about what is going on at the microarchitectural level.
Also, as with most transactional memory systems, RTM only deals with memory writes to a limited number of cache lines. If you have to deal with, for example, I/O, or if you touch too many cache lines, you have to write an abort handler that deals with rollback correctly. My feeling is that this will probably be far too difficult for most programmers.
Transactions' usefulness also depends enormously on the quality of their implementation. I'd love to see some hardware, but it isn't expected to be released until 2013.