I now have a working implementation of the MELTDOWN attack, which I developed in about one day after reading the Lipp et al. paper. It is not pretty; the one from Pavel Boldin is much more complete.

I got stuck for hours, not understanding why I could not transfer information that depended on secret data through the side channel, until I read Pavel Boldin’s implementation and realized that, just before conducting the attack, he made a system call that caused the kernel itself to read the secret data under attack. As an example, he read from /proc/version, which makes the kernel read from a format string known as linux_proc_banner; he then checked that he could read that format string from the supposedly protected kernel memory.

I suppose the reason is that the system call loads the string into one of the CPU caches, which would imply that the MELTDOWN attack only works for data that is already cached. Presumably, if a cache reload of the secret data is needed, the protection check completes before the speculative loads are issued.

I am unsure whether there is a method to force the cache loading of arbitrary protected memory locations.

I also have a working implementation for breaking kernel address space layout randomization (KASLR). I simply implemented the ideas from Jang et al., Breaking Kernel Address Space Layout Randomization with Intel TSX: it takes fewer CPU cycles to generate a protection fault against a mapped but protected page than against unmapped space. In both cases the access generates a segmentation violation, but no trap is delivered if the illegal access is wrapped in a transaction (Restricted Transactional Memory), so it is sufficient to measure the time taken to complete (abort) the transaction. On my machine it is about 186 cycles when the page is mapped versus 200+ cycles when unmapped.

PS: My idea that the data needs to be in a nearby cache is confirmed by Intel's analysis:

For instance, on some implementations such a speculative operation will only pass data on to subsequent operations if the data is resident in the lowest level data cache (L1). This can allow the data in question to be queried by the application, leading to a side channel that reveals supervisor data.