Category Archives: Tech

Removing Fences

If you’re ever looking to squeeze the performance of lock-free code it’s tempting to try and get rid of unnecessary memory fences. On x86/x64, this can be tantalising, as there’s only a single type of memory fence required at the processor level: the store-load fence.

So on such architectures the question arises: Are there thread-safe data structures where we can do without even a store-load fence?

For coordinating just two threads, the answer is of course yes, as only acquire-release semantics are required for example by a SPSC queue. But what about more generally?

The answer turns out to also be yes, as shown in this ASPLOS 2014 paper:
Fence-Free Work Stealing on Bounded TSO Processors

Doing this requires guessing the “re-order bound” of the processor in question, which is more-or-less just the capacity of its store buffer. In this post, I’m going to describe and reproduce an experiment the paper uses that allows guessing the store buffer capacity of an Intel processor.

A Tiny Bit of Background

So what are store buffers and why do processors have them? One reason is that processors are typically speculative, meaning that they will execute instructions before it is certain those instructions are actually required. This is enabled by so-called branch prediction, which roughly amounts to the processor guessing ahead of time which way an if-statement will go.

One type of instruction that makes speculation tricky is a store to memory. If the store in question turns out not to be required after all, then the processor would need to somehow undo the store in memory. That would be bad enough, but by speculatively storing to memory the visibility of the store to other processors would make programming such a machine hell. The solution to this is to have stores first write into the store buffer, and then only later “drain” the values from the store buffer to cache when the branch prediction has been verified and the other required conditions are met.

Store Buffers Meet Memory Models

In the x86 memory model, it’s the store-buffer then that introduces the possibility for an earlier (in program order) store to be re-ordered with a subsequent load. This famously makes Petersen’s mutex (or “Dekker’s algorithm”) incorrect without a store-load fence. A minimal demonstration from Intel’s memory ordering white paper is here.

Stores can’t be re-ordered past an arbitrary number of loads though – store buffer capacity is finite. In the Intel® 64 and IA-32 Architectures Optimization Reference Manual they report the store buffer as having 42 entries for a Haswell microarchitecture, for example.

An Experiment

Enough background! How can we indirectly observe the store buffer capacity? Well, Intel includes this warning in the Intel® 64 and IA-32 Architectures Optimization Reference Manual:

As long as the store does not complete, its entry remains occupied in the store buffer. When the store buffer becomes full, new micro-ops cannot enter the execution pipe and execution might stall.

So to get a glimpse of the store buffer in action, the start of the idea is to repeatedly execute a sequence of S stores, varying S from say 1 up to 100, and computing the average time per iteration, something like this:

for S = 1 to 100
Record StartTime
for N = 1 to 100000 # Just repeat a fair few times
Perform S stores
Record EndTime
Report (EndTime - StartTime)/100000 # Avg time per iteration


Now, this won’t give a very interesting result:

This isn’t giving much away about the store-buffer. The reason is that there is no time for the store buffer to drain, so all the iterations end up suffering from store-buffer stalls.

A simple way to give the store buffer time to drain is to just execute a stream of other instructions after them, even a simple no-op will do:

for S = 1 to 100
Record StartTime
for N = 1 to 100000 # Just repeat a fair few times
Perform S stores
Perform 500 no-ops # Allow store buffer to drain
Record EndTime
Report (EndTime - StartTime)/100000 # Avg time per iteration


What now happens is quite interesting: while there are remaining store-buffer entries, the latency of the no-ops will be at least partially hidden. Once the stores have been pushed to execution ports, the first few hundred no-ops can then “execute” from the re-order buffer (no-ops don’t really execute, and are eliminated in the “Renamer” portion of the pipeline, but the processor can only eliminate 4 of them per cycle so they do consume some time) until the stores retire and their store buffer entries are freed.

But what about when we attempt a store where no store buffer entry is available?
To get a read on that we can again refer to our trusty Intel® 64 and IA-32 Architectures Optimization Reference Manual:

Firstly, the description of the Renamer includes the following:

The Renamer is the bridge between the in-order part […] and the dataflow world of the Scheduler. It moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine.

As well as performing register renaming, we’re told that the Renamer:

Allocates resources to the micro-ops. For example, load or store buffers.

So it really seems that if a store buffer entry isn’t available, the store in question won’t proceed past renaming, and more importantly, the subsequent no-ops won’t be executed until it does so. As a result, however much of the latency of the stream of no-ops was previously hidden should become visible, as they no longer execute in parallel with the stores.

On my Haswell, this is the result I see:

The jump upwards in latency appears when 43 stores are attempted, and Haswell’s documented store buffer capacity is 42. The steepening is then (I guess) because each subsequent store must wait for a store buffer entry to become free.

So that’s it! We’ve revealed the 42 entry store buffer. Of course, nothing is stopping us from measuring a bit more directly.

Measuring

To get some independent evidence that this behaviour is a result of stalls caused by the store buffer, we can crank out VTune. The relevant event is RESOURCE_STALLS.SB, which Intel document as:

This event counts the number of cycles that a resource related stall will occur due to the number of store instructions reaching the limit of the pipeline, (for example, all store buffers are used). The stall ends when a store instruction commits its data to the cache or memory.

For 42 stores on my Haswell the RESOURCE_STALLS.SB row is clean as a whistle:

Changing things to 43 stores, shows a different story for RESOURCE_STALLS.SB, just as we’d expect:

So that’s a second not-very-surprising (but satisfying!) validation that store buffer entries are indeed being exhausted.

I also checked what IACA had to say about a loop kernel of 42 vs 43 stores followed by a stream of no-ops, it seems not to simulate enough of the microarchitecture to report anything interesting. Just for completeness, I’ve committed its 42 store throughput and pipeline trace, as well as its 43 store throughput and pipeline trace. It does allow you to see the 4 no-ops per cycle running in parallel with the stores though.

Conclusion

If you’d like to experiment with this yourself, you can find the code I used here: https://github.com/nicknash/GuessStoreBuffer. It’s a fairly hilarious / horrible setup, a C# program generates a C++ program with the required assembly instructions inside it!

Lastly, I’ll just note that although the above demonstrates the store buffer capacity it doesn’t nail down the “re-order bound” of the processor. In practice, Intel’s store buffers are coalescing — they can eliminate consecutive stores to the same location. This results in a 1 larger re-order bound than store buffer capacity. You can find more details in this paper

Writing While Not Reading Too Much

Recap

My last post described the concurrency construction called Left-Right at a very high level. Left-Right allows for concurrent reading and writing of a data structure with wait-free reads that can run in parallel with a writer. You can think of it as a neat alternative to copy-on-write or a better reader-writer lock, but it’s also interesting in its own right.

In this post I’ll present an implementation of Left-Right in C# and describe how it works.

Starvation

The main features of the Left-Right construction are the following:

1. It’s generic, works with any data structure, and doesn’t affect the time complexity of any operations.
2. Reads and writes can happen concurrently.
3. Reading is always wait free.
4. No allocation is required, so no garbage is generated, and no garbage collector is required.
5. A writer cannot be starved by readers.

The fifth property — that writers cannot be starved by readers means that a write operation will never be prevented from completing by read operations. It is also the trickiest part of Left-Right.

As a result, for this post, I’m going to describe the so-called “No Version” variation of Left-Right. It satisfies the first four properties, but readers can (in very unlikely scenarios) starve a writer.

Once you’ve got your head around this version of Left-Right, adding in the starvation freedom property is much easier to understand.

We’ll begin by describing how to use Left-Right to read a data structure. This will show the first key idea of Left-Right: It maintains two versions of the data-structure, allowing reads to proceed concurrently with a writer.

To keep the description of reads simple, we’ll assume the existence of a so-called “Read Indicator”, that is used to communicate in-progress reads with potential writers. Here’s the interface a read-indicator exposes:

interface IReadIndicator
{
// A reader calls this to tell writers that it is
void Arrive();
// A reader calls this to tell writers that
// it is no longer reading the data structure
void Depart();
// Returns true if a read in progress (i.e.,
// any calls to Arrive without a matching Depart call)
bool IsOccupied { get; }
}


We’ll ignore the implementation of IReadIndicator for now. Implementations can be very simple — a single atomic counter will do (and be unscalable in the presence of many threads), or they can be quite intricate and more scalable.

With our read-indicator in hand, the implementation of a read looks something like this:

// An array of the two instances of the data
// structure we are coordinating access to. For this example, a List
List[] _instances;
// An index (either 0 or 1) that tells readers
// which instance they can safely read.
// A read indicator, that a reader can use to inform a potential
// writer that a read is in progress.
{
return result;
}


This code is hopefully self-explanatory: readers just record their presence during reading.

Writing

Writing is a bit trickier than reading, and goes something like this: Write to the instance of the data structure that we’re sure there can be no readers reading, wait for readers to finish reading the other instance, and then write to that instance.

A (toy) implementation would look something like this:

void Write()
{
lock(_writersMutex)
{
}
}


The first thing to note about this is that writes are blocking: there can only be a single write in progress at once, hence the _writersMutex.

Next we’ll turn to understanding exactly why this write implementation can proceed even in the presence of readers

Why This Is Correct

The argument for correctness I’m about to give might seem a bit laboured. I think it’s worth it though, because when we switch to the starvation-free version of Left-Right we can use the same tools.

Let’s define the end-state of a reader as the value of _readIndex it has witnessed at the time it calls Depart. So there are two possible reader end-states: 0 and 1. After a reader calls Depart it is said to complete and is not defined to be in any state.

Let’s also divide readers into two categories: pre-arrival readers and post-arrival readers. Pre-arrival readers have not called Arrive on a read-indicator yet. While post-arrival readers have called Arrive on a read-indicator.

Now we can note the key property that makes Left-Right correct: If a writer waits until _readIndicator.IsOccupied returns false while _readIndex is 1 then there cannot be any reader with end-state 0. This is because any pre-arrival reader cannot have yet read _readIndex because it has not yet called _readIndicator.Arrive(), and hence if it does it must see _readIndex as 1. On the other hand, the wait on _readIndicator.IsOccupied ensures that any reader with end-state 0 will complete. Of course, a symmetrical argument applies if a writer changes _readIndex from 1 to 0.

To be completely explicit, mutual exclusion between readers and writers is assured when a write to proceeds on a given _readIndex when the only possible reader end-states are the toggle of _readIndex. Moreover the preceding paragraph shows this is exactly the protocol obeyed by Write, and so this implementation is correct.

Next Time

The Left-Right implementation just described is nearly the real-deal. The unfortunate thing is that under intense read pressure, readers can starve the writer. This is because the wait on _readIndicator.IsOccupied may wait on readers that begin after the call to Write began. I hope to describe how to lift this limitation in my next post.

Lastly, there is a gap in the explanations above: I’ve omitted all reference to the required memory ordering on the operations (e.g. so called ‘volatile’ operations in Java and C# or ‘atomic’ operations in C++), this is also something I hope to give full details of in a future post. In the mean-time, there is a RelaSharp example of this Left-Right implementation, that includes memory fencing here: StarvationLeftRight.cs

It’s fairly common to have a data structure that is occassionally mutated but often read, with the writing and reading occurring on different threads. This type of scenario can arise during intialization or in a loading phase. A bunch of data structures are built, with potentially many mutating threads, and from then on (or for the next long while) are only read, again potentially from many threads. The usual ways of dealing with this are reader-writer locks, or occasionally a lock-free approach, or perhaps even RCU.

A while ago I wondered if there were other generic, efficient ways to deal with this scenario, and I happened upon the very interesting so-called Left-Right construction. I thought it was so nice I’d even resurrect my blog to describe it!

Left-Right to the Rescue

Left-Right is a genuinely new way to deal with concurrent reading and writing of data structures, first described by Pedro Ramalhete and Andreia Craveiro (see here for a paper). After I first read about it, I decided I’d throw together a little C# implementation, as one apparently didn’t exist. I must also say that Pedro Ramalhete was extremely helpful and friendly in answering my queries about Left-Right.

But before describing how the Left-Right construction works, I’ll motivate things a little by revisting a simpler generic solution to the problem of concurrent reading and writing of a data structure, namely lock-free copy-on-write. There are other examples that could also be used to motivate things, like an optimised reader-writer lock. Copy-on-write is handy to describe because it’s fairly self-contained and has similar-ish progress guarantees to Left-Right, and perhaps most importantly it is similarly an extremely general construction.

Copy-on-Write

Making a data structure thread-safe via copy-on-write simply copies the entire data structure when a write occurs. This sounds expensive, but in some situations where the read-path latency is critical, it can be justified.

For a single writer, and many readers, an example of copy-on-write is something like this:

List data;

void Write(int v)
{
var copy = new List(original);
Voltile.Write(ref _data, copy);
}



This is neat in a few ways:

• It’s very simple.
• It’s generic, and works with any data structure.
• Reads and writes can happen concurrently.
• It relies only on a store-release by the writer and a load-acquire by the reader. So, on say x86/x64, the thread synchronisation is free.
• If we are mostly reading (or after an initialisation phase, always reading) the reading path is very efficient, hardly different than when only a single thread is involved.

• For large data structures, writing is very inefficient. In this case, an (amortized) $\Theta(1)$ add to end of list changes to $\Theta(n)$ time.
• It creates garbage. Or maybe worse it requires a garbage collector of some kind. Something has to be responsible for freeing the memory associated with data when it is replaced with the copy, and this must only happen when all readers are finished with the pre-write version.
• In the presence of multiple writers, scaling is very poor – worse than blocking – as described next.

In the presence of multiple writers, things gets worse. To support multiple writers, we need to do something like this:

void Write(int v)
{
while(true)
{
var original = x;
var copy = new List(original);
if (Interlocked.CompareExchange(ref _data, copy, local) == local)
{
break;
}
}
}


For workloads that are not overwhelmingly writers or larger lists, this scales very poorly indeed. In fact, simple blocking completes more quickly as we’ll see next.

A Micro-benchmark of Copy-on-Write

To get a slightly better feel for how copy-on-write behaves, a micro-benchmark can tell us a little. The graphs below compare three simple list implementations:

• MultipleWriterCOWList: This is the copy-on-write list described above, and is the one we’re really talking about.
• LockList: This is a list that is protected by a simple lock, i.e., it’s fully blocking for reads and writes.
• RWLockList: This is a list protected by a reader-writer lock (using the ReaderWriterLockSlim in C#).

The following two graphs show some of the higher percentiles for the latency of an operation (either a read or write) on the lists, for 4 threads, a 100-element list and a workload of 95% reads (left), and 5% reads (right):

Looking at the left plot, the copy-on-write approach fairs quite well. The right plot reminds us how much copy-on-write depends on the number of writes being comparatively small though. Incidentally, the fact that the reader-writer lock is so bad is interesting too, but a topic for another day!

Next we can look at how copy-on-write scales with the number of threads involved, jumping up to 20 threads, things don’t go well for it:

The left plot shows something any lock-free algorithm is vulnerable to: outliers. Lock-freedom guarantees some thread makes progress at each step, but makes no guarantees about fairness or starvation. In the case of the left plot, above about the 95th percentile, the blocking list (“LockList”) has lower latency. On the right, the setup is the same except it consists of only 5% reads. In this case, the y-axis is the logarithm of latency. In other words the copy-on-write list is a scaling disaster.

The last obvious knob we can turn is the capacity of the list in question. The copy-on-write list must do more work the larger lists. Switching to 1000-element lists with 95% reads for 4 threads (left) and 20 threads (right) shows the copy-on-write list again is vulnerable to outliers, more so than a trivial fully blocking list (“LockList”):

Aside from the latency distributions, we can look at the throughput of the various types of list:

The four graphs above show that for a 1000-element list, the copy-on-write list is always lower throughput than simpler blocking “LockList” alternative (again, why the reader-writer lock is so poor is a topic for another day!). Moreover, its throughput is only higher for a small, 100-element list for workloads that are overwhelmingly writes.

All the same, copy-on-write can still be attractive in the case where there is just a single writer. In that case, there is no question of outliers due to multiple writes competing (although they might still occur if the data structure was large).

It seems natural to wonder if there is some intermediate solution between reads being as cheap as in the single-threaded case, and writing involving copying the entire data structure. That is, is there a solution where reads involve more work than copy-on-write, but writes require less. It turns out that the Left-Right construction is an example that fills this gap.

A First Look at Left-Right

Copy-on-write creates copies of the data structure only at the moment when writes occur, instead of this, what if we always maintained two copies of the data structure?

A write operation would always write to both instances, but it would only write to an instance when it can be certain that no reader is reading that instance.

This is basically the first idea of Left-Right. The remaining details are the ingenious mechanism for coordinating reading and writing threads cheaply and with strong progress guarantees.

Left-Right combines the following surprisingly powerful guarantees:

• It’s generic, works with any data structure, and doesn’t affect the time complexity of any operations.
• Reads and writes can happen concurrently.
• Reading is always wait free.
• No allocation is required, so no garbage is generated, and no garbage collector is required.
• A writer cannot be starved by readers.

• It’s more complicated than copy-on-write.
• It requires performing every mutation of the data structure twice.
• Multiple writers block each other (but actually, this scales better than copy-on-write, and at least matches a simple blocking solution).

Next Time

This post is getting a bit long, but hopefully it has piqued your interest in Left-Right!

There is an evolving landscape of solutions to the type of problem addressed by Left-Right (as well as simple old copy-on-write). If you’re interested in an overview and a ‘state of the art’ approach, Chen et al’s 2016 paper Fast Consensus Using Bounded Staleness for Scalable Read-Mostly Synchronization is a great read.

In some future posts I hope to describe a C# implementation of Left-Right, and how to validate the implementation using RelaSharp. Be sure to check out the blog of Perdro Ramalhete and Andreia Craveiro where you can find lots of information on the creation of Left-Right.

On the off chance you’d like to see the (not very nice!) bench-marking code used for copy-on-write it’s available at https://github.com/nicknash/COWBench

Dark Origins

This is a short post about a tough puzzle I came across a few months back, originally at Two Guys Arguing. Since it’s the Matrix Puzzle From Hell, and it’s nearly Halloween, I thought it was ingeniously topical. I’m sure you agree.

If anyone knows the true origin of the puzzle, I’d love to hear it.

The Puzzle

A matrix has the property that every entry is the smallest nonnegative integer not appearing either above it in the same column or to its left in the same row.

Give an algorithm to compute the entries of this matrix.

An Example

Here’s an example matrix:

0 1 2 3 4 5 6 7
1 0 3 2 5 4 7 6
2 3 0 1 6 7 4 5
3 2 1 0 7 6 5 4
4 5 6 7 0 1 2 3
5 4 6 7 1 0 3 2
6 7 4 5 2 3 0 1
7 6 5 4 3 2 1 0

It looks like this matrix is full of patterns, so what’s a good algorithm for computing its entries?

As with any puzzle, it’s probably worth trying it before reading the spoilers below. I’ll reveal the answer slowly though.

A (Very) Naive Algorithm

Without really thinking about the puzzle, we can at least give a correct algorithm. To compute a particular entry, compute all the entries above it, and to its left, and choose the smallest nonnegative integer not included in the union of those two.

If we’re considering an $n \times n$ matrix, then to compute a given entry we can be forced to compute up to $\Theta(n^2)$ other entries, because we need to compute all the entries above it and to its left (and all the entries above them and to their left). Once we have the entries above us and to our left we can sort them, in say, $\Theta(n \log n)$ time and then find the smallest excluded element with a linear scan. So the algorithm takes $\Theta(n^3 \log n)$ time.

This certainly feels very naive. We’re not using any of the problem’s structure. At least we have a time complexity though.

Patterns

Let’s look at a small example matrix

0 1 2 3
1 0 3 2
2 3 0 1
3 2 1 0

There’s lots of symmetry here, and quite a few patterns to consider. Let’s just look at the top-left 2×2 entries though:

0 1
1 0

To generate the 2×2 sub-matrix of entries in the top-right of our original 4×4 matrix, we just add 2 to these entries. To generate the 2×2 sub-matrix of entries in the bottom-left, we just add 2 to these entries. To generate the 2×2 sub-matrix in the bottom-right, we just copy these entries.

We now have our 4×4 matrix, if we extend it in the same way, adding 4 this time, we’ll get our original 8×8 example matrix.

We can even start with a 1×1 matrix: 0, and extend it into the initial 2×2 matrix, by adding one (and copying).

So this looks like a very useful pattern. We’ve not thought about why, or if, it’s right just yet. It looks right though. So let’s try and bash out a better algorithm.

A Better Algorithm

It looks like we can generate the entire matrix by the “doubling” pattern just described: Start with the 1×1 matrix (which is just 0). Double it to give the 2×2 matrix, double that to give a 4×4 matrix, etc.

To pick out just a single entry, generating the whole matrix isn’t going to be the right way to go though. We’d still only be down to $\Theta(n^2)$ time (and space) — not as bad as $\Theta(n^3 \log n)$, but we can do better.

Instead, we can work backwards. To work backwards from a given row and column $r, c$, we just need to find the smallest number $m$ such that $r, c < 2^m$. Then we know which $2^m \times 2^m$ matrix the entry “belongs” to. Then we find out which quadrant of that matrix the entry is in, and then which quadrant of that quadrant the entry is in, and so on, until we reach the top-left position.

More precisely, we’re saying of our matrix $M$ that given $r, c$:

• Choose the largest integer $p$ such that $2^p \leq r$
• Choose the largest integer $q$ such that $2^q \leq c$
• If $p = q$ (i.e. bottom-right quadrant), then the answer is $M[r - 2^p, c - 2^p]$
• If $p > q$ (i.e. bottom-left quadrant), then the answer is $2^p + M[r - 2^p, c]$
• If $p < q$ (i.e. top-right quadrant), then the answer is $2^q + M[r, c - 2^q]$

Repeating this until we get to $r = c = 0$ only takes $\Theta(\log \max\lbrace r, c\rbrace)$ time (because we always more than half either our row or column position), and constant space. That’s a lot better than our first attempt.

But can we do even better still? It turns out we can. A lot better.

Harder, but Easier

When I first heard this puzzle, there was a condition attached to the time complexity of the solution. If it had been that there is a solution working in $O(n^{1/3}(\log n)^2)$ time, I think I would have taken forever trying to think up some complicated algorithm to solve it.

Instead, it said the solution had to work in $O(1)$ time and space per entry. In theory, having to solve a problem in a better time is harder. In the land of puzzles though, knowing we can do it in constant time makes getting to an answer easier. It’s almost a spoiler.

Knowing it can be done in constant time and space means there must be a formula for the entries. Let’s start simple, and go through our tools. Does $M[r, c] = r + c$. No, obviously not. What about the good old bitwise operators though? How about $r \mbox{ OR } c$? Well, no because $1 \mbox{ OR } 1 = 1$ but we can see $M[1, 1] = 0$ in our examples.

What about XOR? It turns out that’s it. We can just XOR our row and column numbers to get the entry. Try it!

But Why?

It’s a bit of a shocker to know the answer is just plain-old XOR. Who knew it computed the minimum excluded number in a matrix?

Well, if we look at our logarithmic time algorithm, it’s pretty easy to see that it’s nothing more than an XOR of the row and column. When it said “Choose the largest integer $p$ such that $2^p \leq r$“, we’re really asking about the most-significant bit of our row number. It’s the same with our column, then we compare these bits, in fact, XORing them into our answer, and repeat with the remaining bits.

This at least explains how we get to XOR from our quadrant-based algorithm. We still haven’t really explained why our initial “doubling” construction was correct though. Before we do that, let’s have a look at a pretty picture.

Pretty Picture

As a neat little aside, it turns out that if you draw a picture of our matrix (zero being black), here’s what you get:

Pretty much any demoscener (or graphics hacker) will recognize this as the XOR texture, the easiest of all textures to generate. Just try googling it!

Speaking of the demoscene, here’s video of 4k Amiga intro making extensive use of the XOR pattern (not usually a badge of honor in the demoscene world, incidentally): Humus 4.

I think it’s neat to know that texture is all about minimum excluded numbers, after all these years.

But Really, Why?

We never really explained why our doubling construction is actually correct. It’s pretty obvious, but let’s see about that now.

Suppose we have a $2^k \times 2^k$ version of our matrix $M$. Our doubling construction gives us a $2^{k+1} \times 2^{k+1}$ matrix, such that:

• The upper-left corner is $M$
• The bottom-left corner is the entries of $M$ plus $2^k$
• The bottom-right corner is $M$
• The upper-right corner is the entries of $M$ plus $2^k$

Now let’s think about the simplest version of our matrix. If $k = 0$, we have a $1 \times 1$ matrix (containing a single entry, 0) for which every row and every column contains all the numbers in $\lbrace 0, \ldots, 2^k - 1\rbrace$. Because of this property of its rows and columns, we’ll call that matrix “complete”.

If we apply our doubling construction, we have a $2 \times 2$ matrix that is also complete. Inductively, it’s easy to see we’ll continue to get complete power-of-two-sized matrices by doubling complete power-of-two-sized matrices.

The question is now, in a doubled matrix, is each entry the minimum excluded number? Consider any $2^{k + 1} \times 2^{k + 1}$ matrix of minimum excluded numbers. Take the upper-right corner (UR), it can only contain numbers in $\lbrace 2^k, \ldots, 2^{k + 1} - 1 \rbrace$, both in order to guarantee exclusion (because the upper-left corner (UL) is complete) and also in order that the entries are as small as possible. Now, if any entry in UR was smaller than $2^k$ plus the corresponding entry in UL, then we could subtract $2^k$ from all the numbers in UR and get a $2^k \times 2^k$ matrix of minimum excluded numbers that had a smaller entry than UL somewhere. This is obviously impossible, because we know UL already contains the (unique) minimum excluded numbers in each entry.

We can make similar arguments for the other two quadrants. That’s another way to see the XOR connection. The upper-left and bottom-left quadrants are restricted in the numbers they can choose as entries because they have only either a different row or a different column (i.e. exclusive or) to the entries in the top-left quadrant. Whereas, the entries in the bottom-right quadrant are unrestricted by those in the top-left quadrant having both a different row and column.

Nim, Nimbers and Beyond

To finish up, it’s worth noting that there’s a kid’s game played with piles of sticks called Nim, which the solution to this puzzle represents a winning strategy for.

In fact, there are even things called Nimbers, and XOR corresponds to Nimber addition. There’s also a Nimber multiplication operator. I wonder what it looks like as a texture, compared to what the XOR texture looks like?

Knuth’s Wisdom

Prologue

In a previous post I talked about trying to bring out the worst in sorting algorithms. Given the torrent of interest in that post, I’m going to continue with the sorting theme. So, in this post, I’m going to talk about a sorting algorithm that brings out the worst in a CPU: bubble sort. This is clearly a point of deep industrial and practical interest, so, fasten your seat belts!

It turns out that the real reason a bubble sort performs so badly on today’s CPUs is just a little surprising. Actually, I wrote about this some time ago with some friends in a paper, but it was a long paper, and I felt this fun little bubble sort nugget got a bit lost inside it.

Bubble Sort

Mundane and all as it is, let’s remind ourselves what bubble sort is.

For an array of $n$ elements, bubble sort works by performing roughly $n$ sweeps over the array. A sweep just scans from left-to-right, swapping an element with its immediate right neighbour if that neighbour is smaller than it.

So in C, the sweep step might look like this:

void maybe_swap(int* a, int i)
{
if(a[i + 1] < a[i])
{
swap(a[i + 1], a[i]);
}
}

void bubble_sweep(int* a, int n, int sweep_number)
{
for(int i = 0; i < n - sweep_number; ++i)
{
maybe_swap(a, i)
}
}

In all its glory, bubble sort just calls the function above with “sweep_number” ranging from 1 to $n - 1$. So it looks like this:

void bubble_sort(int* a, int n)
{
for(int i = 1; i < n; ++i)
{
bubble_sweep(a, n, i);
}
}

I’m ignoring an improvement that can be made to bubble sort here. That is, if no elements are swapped in an entire sweep, the whole algorithm can just terminate. I’m leaving that improvement out because adding it makes the point I’m trying to make in this post a bit messier without adding anything very interesting.

Without getting too side-tracked into implementations, I can’t resist including one in Clojure:

(defn maybe-swap [v i] (let [j (+ i 1)] (if (> (v i) (v j)) (swap v i j) v)))
(defn bubble-sweep [v sweep-num]
(reduce #(maybe-swap %1 %2) v (range (- (count v) sweep-num))))
(defn bubble-sort [v] (reduce #(bubble-sweep %1 %2) v (rest (range (count v)))))

With our coding done, we can run a simple experiment that has a surprising result.

An Experiment

A quick look at the bubble sort implementation above should convince us it takes $\Theta(n^2)$ time. This is obvious: each call to “bubble_sweep” takes $\Theta(n)$ time, and we make $\Theta(n)$ calls.

If we were feeling very curious, we might try timing individual calls to “bubble_sweep” as it’s called by bubble sort, just to see that the time taken is indeed linear each time. If we do that, here’s what we get:

The vertical axis here is time (shown in CPU cycles), and the horizontal axis is the “sweep_number” from our code. The input array is 50,000 32-bit integers.

I don’t know about you, but until now I felt like I understood the performance of the humble-looking “bubble_sweep”. Now, I’m confused. Of course we know it’s really taking $\Theta(n)$ time per sweep. It just looks like the constant hidden in there is varying pretty strangely. How come?

Not the Caches

A natural first thought is that maybe the number of swaps varies per sweep, and more swaps are more expensive? Maybe things are exacerbated by a CPU cache effect?

This is jumping the gun a little. It’s not a cache effect, because the entire array fits in the (L2) cache where I did the measurements. The first half of this idea is good though.

The number of swaps does vary per sweep, and the real expense is due to branch mispredictions.

Due to What?

Let’s quickly remind ourselves what branch mispredictions are. CPUs don’t just work on a single instruction at once. Instead, they divide the work for each instruction into say, 5 stages. Then after one instruction finishes stage 1, the CPU fetches the next instruction and starts it in stage 1 (while the first instruction is now in stage 2). This is called pipelining. In this case we’d have a 5 stage pipeline.

So if processing a single instruction took time $T$, and we divide it into $N$ stages, we still take time $T$ to process a single instruction (and actually, a little longer because of the expense of administering the pipeline). On the other hand, once the pipeline is full, a new instruction is completed every $T/N$ units of time. This is great. There’s a little extra latency for each instruction, but we’ve improved throughput by a factor of $N$.

The wrinkle of course is that the CPU needs to know the next instruction to keep its pipeline full. When we have conditional branches (like that “if” statement in “bubble_sweep” above), the CPU can’t know the next instruction. So it has to make a guess. The hardware on the CPU to do this is called a branch predictor.

When the branch predictor correctly guesses the next instruction, the pipeline can stay full. When it doesn’t it needs to abort the in-flight instructions and re-start from the correct instruction. Depending on the length of the pipeline, this can be very expensive.

Another Experiment

Let’s take a look at another experiment to try and confirm our branch prediction theory.
The top-most curve in this image, labelled “hardware predictor”, is the number of branch mispredictions, measured from the performance counter on the CPU, caused by the calls to “bubble_sweep” in bubble sort.

It certainly looks like branch mispredictions are indeed what’s at work. But why?

Intuitive Analysis

Looking at our experimental results again, it seems like bubble sort’s branches get harder and harder to predict, before becoming easier again. So what’s going on?

Thinking very intuitively, we might imagine that the swapping of array elements to the right in bubble sort is a kind of partial sorting. When the data is unordered, the branch is likely to be taken, when the data is somewhat ordered, it’s 50/50, and when the data is nearly sorted it’s very unlikely to be taken. So perhaps it’s moving between these two extremes of predictability that confounds the branch predictor.

We can do better than this though.

Detailed Analysis

Let’s think about the probability that a particular comparison branch is taken. So let $Q^k_l$ be the probability that the $l^{th}$ comparison branch of the inner-loop (i.e. “bubble_sweep”) is taken on the $k^{th}$ sweep of bubble sort.

Now, the $l^{th}$ inner-loop comparison branch can be taken only if $\texttt{a[l]}$ is not larger than all of $\texttt{a[0..l - 1]}$. For this to be the case, the $(l + 1)^{th}$ branch in the previous sweep must have been taken, since otherwise the previous sweep left the maximum of $\texttt{a[0..l - 1]}$ in $\texttt{a[l]}$. But the probability that the $(l + 1)^{th}$ branch of the previous sweep was taken is just $Q^{k-1}_{l + 1}$.

Given that the $(l + 1)^{th}$ branch of the previous sweep was taken, that probability that $\texttt{a[l]}$ is not larger than all of $\texttt{a[0..l - 1]}$ is $1 - 1/(l + 1)$, and so we have:

$Q^k_l = Q^{k-1}_{l + 1}\left(1 - \frac{1}{l + 1}\right)$

The penultimate step is just a result of the fact that the probability that the $l^{th}$ element of a random permutation is a left-to-right maximum is $1/l$.

Filling in the bases $Q^1_l = 1 - 1/(l + 1)$ for $1 \leq l < n$ and $Q^1_n = 0$, we see that when $l \leq n - k$ then $Q^k_l = \frac{l}{l + k}$, and $Q^k_l = 0$ otherwise.

So as our experiment showed us, and our rough intuitive analysis too, in the first sweep, $k = 1$, and the branches are highly likely to be taken, on the other hand, on the last sweep $k = n - 1$ and the branches are unlikely to be taken.

Or, said another way, this analysis tells us that initially left-to-right maxima are quite rare and branches are predictable, but the repeated shifting of them to the right peppers them throughout the array making branches unpredictable, until they are dense enough that things become predictable again.

Matching the Hardware

To round off our mighty analysis, we can try and reproduce the curves we saw in our experiments where we measured the total number of branch mispredictions per sweep.

If we let $S(k)$ be the average total mispredictions in sweep number $k$, then we know
$S(k) = \sum_{l=1}^{n-k}M(Q^k_l)$. Where $M(p)$ is the probability that a branch that is taken with probability $p$ is mispredicted.

One way of modelling $M$ is just to imagine an ideal branch predictor: if the branches are taken (independently) with probability $p$, then the ideal branch predictor can be wrong with probability at most $\min(p, 1 - p)$. This will of course, lower bound the branch mispredictions, because in reality branch predictors work hard to guess what $p$ is.

Filling in for the ideal predictor gives $S(k) = \sum_{l=1}^{n-k}\frac{1}{l + k}\min(l, k)$. After a lot of messy algebra we’ll find that $S(k) = (1 + H_n + H_k - 2H_{2k})k$ when $k \leq n/2$ and $S(k) = n - (1 + H_n - H_k)k$ otherwise (here $H_n \approx \ln n$ is the $n^{th}$ harmonic number).

Plotting this gives us the bottom-most curve, labelled “perfect predictor”, in the branch mispredictions plot above — lower bounding the real number of mispredictions, as we expect.

Shaker Sort

Bubble sort has been around a long time, and people have tried to hack it into something more efficient. One attempt at this is shaker sort. In shaker sort, there are two “bubble_sweep”s that alternate. One shifts large elements right as we did above, and the other shifts small elements left (the opposite of what we did above).

Shaker sort still takes $\Theta(n^2)$ time, but is more efficient than a naive bubble sort in practice. Shaker sort is still a dog though. A selection sort implementation easily beats it, while being much simpler (and insertion sort is better still).

But, this is where the plot thickens: the reason shaker sort is so much slower than selection sort is almost entirely down to branch mispredictions. If you care to measure, shaker sort causes fewer cache misses than selection sort, and executes a small number of extra instructions — but performs much worse due to branch mispredictions. On an unpipelined machine, this bubble sort variant might actually beat selection sort.

So What?

If there’s anything to conclude here it’s that, first and foremost, bubble sort is awful! Not only is it (in naive implementations) more instruction hungry and cache unfriendly than the other quadratic-time sorting algorithms — it manages to be fantastically confusing to branch predictors, on average at least.

The relative performance of the quadratic-time sorting algorithms is important in practice. This is the reason good quicksort implementations know to switch to insertion sort (which, as it happens causes only a linear number of branch mispredictions despite executing a quadratic number of branches).

As a result, I think it’s important to understand what’s going on behind the simple looking code of the quadratic-time sorting algorithms. It also shows that even simple and very similar looking pieces of code can behave quite differently.

And lastly, all this messing with bubble sort reminds us of Don Knuth’s near omniscience. In Volume 3 of The Art of Computer Programming, he remarks:

In short, the bubble sort seems to have nothing to recommend it, except
a catchy name and the fact that it leads to some interesting theoretical
problems.

Knuth isn’t referring to instruction pipelines when he makes this comment, but somehow he’s right about how bubble sort gets on with them too!

Discussion at HackerNews
Code from our paper.

The Quaternion Julia Set, in Clojure

Graphics programming is one of my favourite things to do when learning a new programming language. Since I have been learning a bit of Clojure recently, I decided to see if anyone had written up a quaternion Julia set renderer. I couldn’t find one after a bit of googling, so here we are.

My goal here isn’t to put together the julia set renderer to end them all, but just some Clojure code that produces interesting output while still being pretty simple. Hopefully it’ll be interesting to other people learning Clojure.

Of course, there are implementations out there in plenty of other languages, especially GPU targetted. Some of them produce very beautiful real-time renders, see for example Iñigo Quilez’s neat article on 3D julia sets rendered on a GPU. I originally rendered the quaternion Julia set in C++/OpenGL some years back after seeing Paul Bourke’s POVRay renders. I’ve also seen a nice description of the Buddhabrot drawn using Clojure.

The Julia Set

So what are we talking about? We want to draw some pictures of the quaternion Julia set.
A point, $z_0$ is in the Julia set if the sequence defined by

$z_{n + 1} = z_{n}^2 + c$

doesn’t tend to infinity. In our case, the points are quaternions and we just iterate the above until $|z_n|$ gets too large, or we hit an upper bound on the maximum number of iterations and give up on trying to make the series diverge.

This is what that looks like in Clojure:

(defn outside-julia? [q c max-iterations]
(loop [iter 0 z q]
(if (> (vec-length z) 4)
true
(if (< iter max-iterations)
(recur (inc iter) (vec-add (quat-mul z z) c))
false))))


This code uses 4 as the magic length at which we decide the series is diverging. We’ll take a look at implenting the functions vec-length, vec-add and quat-mul next.

Vectors and Quaternions

Like any demoscener or hobbiest graphics programmer, I’ve implemented linear algebra libraries many times. Compared to C++ or Java, Clojure is just beautiful. I wouldn’t like to know how much slower it is, but, compared to say C++ it’s amazing that these one-liners give the generality of n-dimensional vector operations so clearly and concisely.

(defn vec-add [& args] (vec (apply map + args)))
(defn vec-sub [& args] (vec (apply map - args)))
(defn vec-dot [u v] (reduce + (map * u v)))
(defn vec-length [v] (Math/sqrt (vec-dot v v)))
(defn vec-scale [r v] (vec (map #(* r %) v)))
(defn vec-normalize [v] (vec-scale (/ 1 (vec-length v)) v))


We can re-use these functions on quaternions, but we do need to define quaternion multiplication, for which we’ll need the cross product:

(defn det-2x2 [a b c d] (- (* a d) (* b c)))
(defn vec-cross-3d [u v]
(vector (det-2x2 (u 1) (u 2) (v 1) (v 2))
(det-2x2 (u 2) (u 0) (v 2) (v 0))
(det-2x2 (u 0) (u 1) (v 0) (v 1))))
(defn quat-mul [q1 q2]
(let [r1 (first q1) v1 (vec (rest q1))
r2 (first q2) v2 (vec (rest q2))]
(cons (- (* r1 r2) (vec-dot v1 v2))
(vec-scale r2 v1)
(vec-cross-3d v1 v2)))))


Now that we have the tools to generate terms in the julia set sequence, we can start thinking about how to draw it.

Naive Ray Marching

One easy way to take a look at a 3D slice of the quaternion julia set is just to follow parallel rays through each pixel on the screen-area we’re drawing to. To do this for each pixel at screen position $(x', y')$ we pick some minimum $z$ and consider a series of quaternions

$(x, y, z, p), (x, y, z + \Delta, p), (x, y, z + 2 \Delta, p), \ldots$

At each quaternion, we use the outside-julia? function shown above to see if we’re inside or outside the set.

Here $\Delta$ is the amount we march along the ray between checking if we’re inside or outside the set. The parameter $p$ just controls where we’re taking the 3D slice from, since the set is 4D after all.

After marching along in $\Delta$-sized steps if we find that we’re inside, we refine the estimate of the intersection point by marching in the opposite direction along the ray in smaller increments, checking the quaternions beginning at the first $z$-coordinate, $z_{in}$ we found to be inside the set, until we find one outside again:

$(x, y, z_{in}, p), (x, y, z_{in} - \delta, p), (x, y, z_{in} - 2 \delta, p), \ldots$

The Clojure implementation of this forward and then backward ray-marching looks like this:

(defn ray-march-z [keep-marching? get-next-z initial-z min-z max-z num-steps]
(let [dz (float (/ (- max-z min-z) num-steps))]
(loop [z initial-z]
(if (keep-marching? z)
(recur (get-next-z z dz)) z))))

(defn ray-march-forward-then-backward [outside? min-z max-z
num-coarse-steps num-fine-steps]
(let [z (ray-march-z #(and (outside? %) (< % max-z))
#(+ %1 %2)
min-z min-z max-z num-coarse-steps)]
(if (< z max-z) (ray-march-z #(and (not (outside? %)) (> % min-z))
#(- %1 %2)
z min-z max-z num-fine-steps) max-z)))


After ray-marching a bunch of pixels, we’ll still only have a buffer of their z-coordinates, and we still need to render them somehow. We can do that by defining a simple lighting model.

Lighting

Now that we have our buffer of z-coordinates (z-buffer), we need to colour it somehow. I’ll take a very simple approach here, just using the good-old Blinn-Phong lighting model.

Before we can do that though, we’ll need surface normals – or some approximation to them. We’ll use the so-called central differencing technique: compute vectors between neighbouring pixels in the z-buffer and take their cross product. The implementation looks like this:

(defn get-buf-vector [buf-pos buf-pos-to-viewport buf-getter size]
(let [before (clamp-range (- buf-pos 1) size)
after (clamp-range (+ buf-pos 1) size)
vp-before (buf-pos-to-viewport before)
vp-after (buf-pos-to-viewport after)
buf-before (buf-getter before)
buf-after (buf-getter after)]
[(- vp-before vp-after) (- buf-before buf-after)]))

(defn get-normal
"Compute a normal to the z-buffer at buf-x, buf-y by differencing neighbours"
[buf-x buf-y
x-buffer-to-viewport y-buffer-to-viewport
zbuffer size]
(let [u (get-buf-vector buf-x x-buffer-to-viewport #(aget zbuffer % buf-y) size)
v (get-buf-vector buf-y y-buffer-to-viewport #(aget zbuffer buf-x %) size)]
(vec-normalize (vec-cross-3d (vector (u 0) 0 (v 1)) (vector 0 (v 0) (v 1))))))


As for the Blinn-Phong lighting itself, it’s just a single function that returns the light intensity given the surface normal, view direction, light direction and material properties.

(defn zero-clamp [v] (if (< v 0) 0 v))

(defn phong-light [normal view-dir light-dir diffuse specular shininess]
(let [half-vec (vec-scale 0.5 (vec-add view-dir light-dir))
n-dot-l (vec-dot normal light-dir)
clamped-diffuse (zero-clamp n-dot-l)
h-dot-v (vec-dot half-vec view-dir)
clamped-specular (zero-clamp h-dot-v)
spec-exp (Math/pow clamped-specular shininess)]
(vec-add (vec-scale clamped-diffuse diffuse) (vec-scale spec-exp specular))))


We can now march some rays and switch on the lights. When we do, we’ll get something like this:

The thing about our render is that, although it’s kind of pretty, and definitely interesting, it’s a bit rough looking. We could try and hack around to fix this. One obvious idea is to try and smooth the surface normals, with a Gaussian blur or something.

If we think about it a little, the real problem isn’t the normals though, it’s the way we’ve done our ray-marching. At least part of the problem is that we don’t always find the first intersection, sometimes we’ll step too far while marching forwards, and then march back to the wrong exterior point. Luckily, computer graphics people have been rendering these types of fractals since the ’80s and there is a different approach to naive ray marching that has nicer properties and is much more efficient.

Distance Estimation

Instead of naively stepping along the rays to find intersection points, it turns out to be simple to compute a so-called unbounding volume from any point in space, that is, a sphere that is guaranteed not to contain any of the fractal. The details of this can be found in the original paper. IQ has also written a few nice articles about this.

With unbounding volumes, we compute a series of distances of decreasing size, which we use to march along our rays. A minimum distance is defined at which we decide to stop marching. This minimum distance, combined with guaranteeing never to over-step the mark and find an intersection that isn’t the closest gives a much smoother look to the renders. They’re also much quicker to generate. Here’s an example of a render using this technique:

The Clojure for doing ray-marching with distance estimation is very simple:

(defn julia-distance-estimate [q c max-iterations]
(loop [iter 0 dz [1.0 0 0 0] z q]
(let [z-length (vec-length z)]
(if (and (<= z-length 4) (< iter max-iterations))
(recur (inc iter)
(vec-add (vec-scale 2.0 (quat-mul z dz)) [1.0 0 0 0])
(/ (* z-length (Math/log z-length)) (vec-length dz))))))

(defn unbounding-ray-march [c max-julia-iterations param max-distance-iterations
epsilon min-z vp-x vp-y]
(loop [iter 0 z min-z dist (+ 1 epsilon)]
(if (and (< iter max-distance-iterations) (> dist epsilon))
(let [ dist-estimate (julia-distance-estimate [vp-x vp-y z param] c max-julia-iterations)]
(recur (inc iter) (+ z dist-estimate) dist-estimate)) z)))


Code

The snippets above are collected together at my Github:

https://github.com/nicknash/quat_julia

Prologue

Everyone loves sorting. Even people who don’t know they do, because our phones, tablets and laptops are always sorting. Keeping things in order is just part of life, and especially so if you’re a computer. As each internet packet boinks through the routers of the Internet, its IP address is compared in an ordered way. So these days little bits of sorting are happening almost constantly. As rain falls, sun shines and humanity breathes: computers sort.

A while ago, I needed to stress test a sorting-like algorithm. It turned out I needed something a bit smarter than say, McIlroy’s quicksort killer. So I came up with the naive (and somewhat impractical) algorithm described below. It’s just a bit of idle curiosity, but I hope it’s interesting. I’ll also present a minor “application” of the algorithm.

So: sorting algorithms sort data. Here, we’re going to enter the mind of the data. We’re also going to be bad. We’re going to think: As a bunch of data, how can I make the job of the sorting algorithm as hard as possible?

Killing Quicksort

We all remember quicksort only works in $\Theta (n \log n)$ time on average. Its worst case is quadratic. Yikes.

What’s more, the famous Doug McIlroy described a simple way of making pretty much any practical quicksort implementation go quadratic.

In C, we sort things by saying

qsort (values, num_values, sizeof(int), compare);

McIlroy’s quicksort killer lives inside the compare function. It never lies, but it doesn’t have anything else in mind other than to make quicksort’s job difficult. To do this, it uses a simple heuristic to guess which element is the pivot, and assigns it the smallest possible value consistent with the results of previous calls to compare. The result is that any quicksort examining only a constant number of elements to determine its pivot will go quadratic, always.

Inside the Killer

The killer works by regarding the entire array to be sorted as gas to begin with. An element of the array qualifies as gas if the sorting algorithm hasn’t asked enough questions about it yet to force the killer to reveal the element’s position in the sorted output. As the sorting algorithm asks questions about the input (by calling compare) , it pins down where some elements belong in the output. The killer keeps track of this by freezing these elements into solid values. It always freezes gas into the smallest consistent solid value available.

The rules for returning results from the compare function are then simple, if we’re comparing:

• Two solid values, just compare them, the killer has already divulged their relative order.
• Solid to gas, the gas is always larger (when the killer freezes the gas later, it’ll be consistent about this)
• Gas to gas, freeze one of them with a value larger than any currently frozen value (i.e. $\texttt{num\_solid++}$)

The final detail is that, in gas to gas comparisons, we’ll freeze the one we think is the pivot. Since we know the partition step repeatedly compares elements to the pivot, we will freeze the gas element that has been involved in the previous comparison (tie-breaking with one or the other if neither has). This matches the goal of trying to freeze the pivot as soon as possible, and thus forcing it to be as small as possible.

McIlroy’s code is here.

Keeping Secrets

McIlroy’s program is pretty neat, but it’s specific to quicksort. It’s imagining those pivot comparisons to decide what to freeze. Can we come up with something that’ll work against any old sorting algorithm, or sorting-like algorithm?

Let’s think about it this way — we’re the space of all possible input arrays. From the sorting algorithm’s point of view, we’re some particular array $x[1\ldots n]$. The sorting algorithm is going to ask us questions, like $\mbox{Is } x[1] < x[2]\mbox{?''}$ . Here’s the key: Each time we answer, we want to give out as little information as possible.

Secret Descendants

Let’s get down to business. We can represent the information the sorting algorithm has extracted about the input $x[1 \ldots n]$ as a directed acyclic graph (DAG) on $n$ nodes, labelled $V_1, \ldots, V_n$.

Initially, the DAG has no edges. An edge from $V_i$ to $V_j$ means that $x[i] < x[j]$. So if the sorting algorithm asks us $\mbox{Is } x[i] < x[j]\mbox{?''}$, we either say “Yes”, and add an edge from $V_i$ to $V_j$ in the DAG, or say “No”, and add an edge from $V_j$ to $V_i$.

Now imagine $V_i$ has 10,000 descendants, and $V_j$ has only 1. If the sorting algorithm asks $\mbox{Is } x[i] < x[j]\mbox{?''}$ we had better say a loud “No”. Answering “Yes” tells it that $x[j]$ is greater than $x[i]$ as well as the 10,000 elements corresponding to the descendants of $V_i$. Instead, answering “No”, tells the sorting algorithm about the relative order of just 3 elements.

DAG in Pictures

Let’s look at a few pictures of what this looks like. Suppose the DAG adversary is being quizzed about an array $x[1\ldots 7]$, and it has already answered a few times, giving:

Now it gets asked $\mbox{Is } x[1] < x[6]\mbox{?''}$. If it says “Yes” we are left with this DAG

Whereas answering “No” gives this one:

Saying “No” told the sorting algorithm more than we had to. It got to know not only that $x[6] \geq x[1]$, but that $x[6]$ isn’t less than any of $x[1]$‘s descendants.

Rules of Secret Descendants

Each time the sorting algorithm asks $\mbox{Is } x[i] < x[j]\mbox{?''}$, it answers as follows:

1. If there is a directed path from $V_i$ to $V_j$ , it says “Yes.”
2. Otherwise, if there is a directed path from $V_j$ to $V_i$ , it says “No.”
3. Otherwise, if $V_i$ has fewer descendants than $V_j$ , it says “No.”, and adds an edge from $V_j$ to $V_i$.
4. Otherwise ($V_j$ has no more descendants than $V_i$) it says “Yes.” and adds an edge from $V_i$ to $V_j$.

The first two rules just maintain consistency with previous answers the DAG adversary has given. The last two try and leak as little information as possible, just as was described earlier.

Does it Work?

If we implement the DAG adversary and give it a whirl against, say C#’s List(T).Sort function, here’s what happens:

Success! It certainly looks like we’ve hurt the C# library function — the red-line is the number of comparisons against input size.  The green-line is a guessed at quadratic matching the measured number of comparisons. This is kind of nice, because the adversary knows nothing in particular about the sorting algorithm’s implementation, unlike McIlroy’s nifty quicksort killer.

Stepping Back

What’s really going on with this DAG? At any instant, it represents the information the sorting algorithm has extracted about the input. What’s more, any topological sorting of the DAG (i.e. a listing of the vertices such that $V_i$ always comes before $V_j$ if there is an edge from $V_i$ to $V_j$) is an ordering of $x[1\ldots n]$ consistent with the questions the sorting algorithm has asked.

Initially, there are no edges, and any ordering of the vertices of the DAG is a topological sort. When the sorting algorithm is done, they’ll just be a single topological sort possible. The DAG will just be one long chain of vertices.

So, the goal of the DAG adversary should be to ensure the number of topological sortings of the DAG is maximized after its answer. Looking at our pictures above, the “Yes” DAG has $5! = 120$ topological sortings, whereas the “No” DAG, has only $2 \times 4! = 48$ topological sortings. Our “count the descendants” heuristic payed off. But why the heuristic, why not just count?

Counting is Hard

So, we have a simple new adversary right? When it gets asked a question, it sees if adding the “Yes edge” or “No edge” gives a DAG with a larger number of topological sortings, and answers accordingly.

Unfortunately, a little Googling convinces us we’d end up with an amazingly slow adversary. Counting topological sortings isn’t just NP-complete, it’s #P-complete.

What about an approximation? More Googling.  That too, is rather slow. Even approximately counting takes (roughly speaking) $O(n^3 (\log L(P))^2)$ time, here $L(P)$ is the number of topological sortings of the DAG (or, linear extensions of the partial order). In our case, early on, since the DAG has no edges, we’ll have $L(P) = n!$, and so the time just to answer one question (i.e. compare two elements!) will be, again ignoring messy terms, $O(n^5)$.

Our DAG is Slow

Even with the simple count-the-descendants heuristic. It’s worth coming to terms with the fact that our DAG adversary is sadly quite slow. A simple implementation of the underlying DAG is an adjacency list representation. This will take $\Theta(1)$ time to add a new a edge, $\Theta(n)$ time to query for a directed path, and $\Theta(n)$ time to count descendants.

This is pretty bad, especially if we’re driving a sorting algorithm into quadratic territory — our program will need cubic time! Perhaps worse, the DAG will have a quadratic number of edges in this case.

Tweaking the DAG

There’s scarily extensive literature on maintaining transitive closures (or reductions) of DAGs — probably relevant if we want to optimize our implementation. Rather than do all that reading though, we’ll just attack the constant factors to get an implementation that’s fast enough to be of use.

It turns out just one idea gives about a factor of 2 speed-up: Our adversary is designed so that the queries for directed paths should almost always not find any path at all (or if you prefer, in the worst case, this is what must happen). This means our path queries will explore all the descendants of their source node. Aaaand, one more thing: we only ever count descendants having performed a path query.

With that little idea in hand, we can implement a DAG that’s only guaranteed to report descendant counts correctly if we’ve just performed a path query — which we’ll force to always explore all descendants. This should save about half the work of the adversary. You’ll be excited to know I’ve tested it and it does.

One or two other uninteresting tweaks give us about a 1.5x speed-up: Removing some pretty Linq and avoiding repeatedly clearing and initializing traversal data by working in “epochs”.

So, is there any use for this DAG adversary? Well, kind of.

A while ago, I wanted to stress test some selection algorithms. These are quicksort-like algorithms that, given some whole number $k$, then re-arrange an array $A$ such that $A[k]$ has everything less than it to its left, and everything greater to its right.

The details of the algorithms don’t matter too much. They’re based on Floyd and Rivest’s SELECT algorithm. Also, they’re variations on “introspective” algorithms. That is, they start by using one algorithm, and if they notice that algorithm approaching its worst case time, they switch to an algorithm with better worst case time. They do this because although the algorithm they switch to is better in the worst case, it’s much slower on average. The idea is to get the good average case performance of say, quicksort, with the worst case guarantee of heapsort.

In the case of these selection algorithms, the algorithm they switch to is labelled “Fallback” (red-line). Here’s a plot of their performance against McIlroy’s quicksort killer:

The plot above should be a bit worrying. Both the Intro-SELECT and Hybrid-SELECT algorithm switch to the Fallback when they hit their primary algorithm’s worst case. They spend some time realizing things are going badly, so in those cases, they perform worse than just running the Fallback in the first place. Except that, in the plot above, Hybrid-SELECT is doing better than the Fallback: It hasn’t encountered its worst case, despite the efforts of McIlroy’s quicksort killer. Meanwhile, Intro-SELECT has hit its worst case — it’s taking longer than the Fallback.

The next plot shows what happens against the DAG adversary:

Here, we can see what we expect (well, except for that weird dip for Hybrid-SELECT — I haven’t looked into that!) the Fallback is fastest, and the two other algorithms perform worse. The DAG adversary has forced them to switch to the Fallback after some wasted work.

So, in this rather unusual situation, the DAG adversary is more powerful than McIlroy’s quicksort killer. Of course, it’s also much slower, but at least we’re getting something in return for the time and CPU heat.

Code

I’ve implemented the adversaries described here in C#. I’m sure you’re just itching to take a look: