xchg() for a single-byte location assembles to a 4-byte Compare&Swap,
wrapped into a non-trivial amount of retry code that deals with
concurrent modifications to the unaffected bytes.
Change it to a simple byte-store, but preserve the memory ordering
semantics that the CS provided.
This simplifies the generated code for a hot path, and in theory also
allows us to amortize the memory barriers over multiple SLSB updates.