Dark Origins

This is a short post about a tough puzzle I came across a few months back, originally at Two Guys Arguing. Since it’s the Matrix Puzzle From Hell, and it’s nearly Halloween, I thought it was ingeniously topical. I’m sure you agree.

If anyone knows the true origin of the puzzle, I’d love to hear it.

The Puzzle

A matrix has the property that every entry is the smallest nonnegative integer not appearing either above it in the same column or to its left in the same row.

Give an algorithm to compute the entries of this matrix.

An Example

Here’s an example matrix:

0 1 2 3 4 5 6 7
1 0 3 2 5 4 7 6
2 3 0 1 6 7 4 5
3 2 1 0 7 6 5 4
4 5 6 7 0 1 2 3
5 4 6 7 1 0 3 2
6 7 4 5 2 3 0 1
7 6 5 4 3 2 1 0

It looks like this matrix is full of patterns, so what’s a good algorithm for computing its entries?

As with any puzzle, it’s probably worth trying it before reading the spoilers below. I’ll reveal the answer slowly though.

A (Very) Naive Algorithm

Without really thinking about the puzzle, we can at least give a correct algorithm. To compute a particular entry, compute all the entries above it, and to its left, and choose the smallest nonnegative integer not included in the union of those two.

If we’re considering an $n \times n$ matrix, then to compute a given entry we can be forced to compute up to $\Theta(n^2)$ other entries, because we need to compute all the entries above it and to its left (and all the entries above them and to their left). Once we have the entries above us and to our left we can sort them, in say, $\Theta(n \log n)$ time and then find the smallest excluded element with a linear scan. So the algorithm takes $\Theta(n^3 \log n)$ time.

This certainly feels very naive. We’re not using any of the problem’s structure. At least we have a time complexity though.

Patterns

Let’s look at a small example matrix

0 1 2 3
1 0 3 2
2 3 0 1
3 2 1 0

There’s lots of symmetry here, and quite a few patterns to consider. Let’s just look at the top-left 2×2 entries though:

0 1
1 0

To generate the 2×2 sub-matrix of entries in the top-right of our original 4×4 matrix, we just add 2 to these entries. To generate the 2×2 sub-matrix of entries in the bottom-left, we just add 2 to these entries. To generate the 2×2 sub-matrix in the bottom-right, we just copy these entries.

We now have our 4×4 matrix, if we extend it in the same way, adding 4 this time, we’ll get our original 8×8 example matrix.

We can even start with a 1×1 matrix: 0, and extend it into the initial 2×2 matrix, by adding one (and copying).

So this looks like a very useful pattern. We’ve not thought about why, or if, it’s right just yet. It looks right though. So let’s try and bash out a better algorithm.

A Better Algorithm

It looks like we can generate the entire matrix by the “doubling” pattern just described: Start with the 1×1 matrix (which is just 0). Double it to give the 2×2 matrix, double that to give a 4×4 matrix, etc.

To pick out just a single entry, generating the whole matrix isn’t going to be the right way to go though. We’d still only be down to $\Theta(n^2)$ time (and space) — not as bad as $\Theta(n^3 \log n)$, but we can do better.

Instead, we can work backwards. To work backwards from a given row and column $r, c$, we just need to find the smallest number $m$ such that $r, c < 2^m$. Then we know which $2^m \times 2^m$ matrix the entry “belongs” to. Then we find out which quadrant of that matrix the entry is in, and then which quadrant of that quadrant the entry is in, and so on, until we reach the top-left position.

More precisely, we’re saying of our matrix $M$ that given $r, c$:

• Choose the largest integer $p$ such that $2^p \leq r$
• Choose the largest integer $q$ such that $2^q \leq c$
• If $p = q$ (i.e. bottom-right quadrant), then the answer is $M[r - 2^p, c - 2^p]$
• If $p > q$ (i.e. bottom-left quadrant), then the answer is $2^p + M[r - 2^p, c]$
• If $p < q$ (i.e. top-right quadrant), then the answer is $2^q + M[r, c - 2^q]$

Repeating this until we get to $r = c = 0$ only takes $\Theta(\log \max\lbrace r, c\rbrace)$ time (because we always more than half either our row or column position), and constant space. That’s a lot better than our first attempt.

But can we do even better still? It turns out we can. A lot better.

Harder, but Easier

When I first heard this puzzle, there was a condition attached to the time complexity of the solution. If it had been that there is a solution working in $O(n^{1/3}(\log n)^2)$ time, I think I would have taken forever trying to think up some complicated algorithm to solve it.

Instead, it said the solution had to work in $O(1)$ time and space per entry. In theory, having to solve a problem in a better time is harder. In the land of puzzles though, knowing we can do it in constant time makes getting to an answer easier. It’s almost a spoiler.

Knowing it can be done in constant time and space means there must be a formula for the entries. Let’s start simple, and go through our tools. Does $M[r, c] = r + c$. No, obviously not. What about the good old bitwise operators though? How about $r \mbox{ OR } c$? Well, no because $1 \mbox{ OR } 1 = 1$ but we can see $M[1, 1] = 0$ in our examples.

What about XOR? It turns out that’s it. We can just XOR our row and column numbers to get the entry. Try it!

But Why?

It’s a bit of a shocker to know the answer is just plain-old XOR. Who knew it computed the minimum excluded number in a matrix?

Well, if we look at our logarithmic time algorithm, it’s pretty easy to see that it’s nothing more than an XOR of the row and column. When it said “Choose the largest integer $p$ such that $2^p \leq r$“, we’re really asking about the most-significant bit of our row number. It’s the same with our column, then we compare these bits, in fact, XORing them into our answer, and repeat with the remaining bits.

This at least explains how we get to XOR from our quadrant-based algorithm. We still haven’t really explained why our initial “doubling” construction was correct though. Before we do that, let’s have a look at a pretty picture.

Pretty Picture

As a neat little aside, it turns out that if you draw a picture of our matrix (zero being black), here’s what you get:

Pretty much any demoscener (or graphics hacker) will recognize this as the XOR texture, the easiest of all textures to generate. Just try googling it!

Speaking of the demoscene, here’s video of 4k Amiga intro making extensive use of the XOR pattern (not usually a badge of honor in the demoscene world, incidentally): Humus 4.

I think it’s neat to know that texture is all about minimum excluded numbers, after all these years.

But Really, Why?

We never really explained why our doubling construction is actually correct. It’s pretty obvious, but let’s see about that now.

Suppose we have a $2^k \times 2^k$ version of our matrix $M$. Our doubling construction gives us a $2^{k+1} \times 2^{k+1}$ matrix, such that:

• The upper-left corner is $M$
• The bottom-left corner is the entries of $M$ plus $2^k$
• The bottom-right corner is $M$
• The upper-right corner is the entries of $M$ plus $2^k$

Now let’s think about the simplest version of our matrix. If $k = 0$, we have a $1 \times 1$ matrix (containing a single entry, 0) for which every row and every column contains all the numbers in $\lbrace 0, \ldots, 2^k - 1\rbrace$. Because of this property of its rows and columns, we’ll call that matrix “complete”.

If we apply our doubling construction, we have a $2 \times 2$ matrix that is also complete. Inductively, it’s easy to see we’ll continue to get complete power-of-two-sized matrices by doubling complete power-of-two-sized matrices.

The question is now, in a doubled matrix, is each entry the minimum excluded number? Consider any $2^{k + 1} \times 2^{k + 1}$ matrix of minimum excluded numbers. Take the upper-right corner (UR), it can only contain numbers in $\lbrace 2^k, \ldots, 2^{k + 1} - 1 \rbrace$, both in order to guarantee exclusion (because the upper-left corner (UL) is complete) and also in order that the entries are as small as possible. Now, if any entry in UR was smaller than $2^k$ plus the corresponding entry in UL, then we could subtract $2^k$ from all the numbers in UR and get a $2^k \times 2^k$ matrix of minimum excluded numbers that had a smaller entry than UL somewhere. This is obviously impossible, because we know UL already contains the (unique) minimum excluded numbers in each entry.

We can make similar arguments for the other two quadrants. That’s another way to see the XOR connection. The upper-left and bottom-left quadrants are restricted in the numbers they can choose as entries because they have only either a different row or a different column (i.e. exclusive or) to the entries in the top-left quadrant. Whereas, the entries in the bottom-right quadrant are unrestricted by those in the top-left quadrant having both a different row and column.

Nim, Nimbers and Beyond

To finish up, it’s worth noting that there’s a kid’s game played with piles of sticks called Nim, which the solution to this puzzle represents a winning strategy for.

In fact, there are even things called Nimbers, and XOR corresponds to Nimber addition. There’s also a Nimber multiplication operator. I wonder what it looks like as a texture, compared to what the XOR texture looks like?

Knuth’s Wisdom

Prologue

In a previous post I talked about trying to bring out the worst in sorting algorithms. Given the torrent of interest in that post, I’m going to continue with the sorting theme. So, in this post, I’m going to talk about a sorting algorithm that brings out the worst in a CPU: bubble sort. This is clearly a point of deep industrial and practical interest, so, fasten your seat belts!

It turns out that the real reason a bubble sort performs so badly on today’s CPUs is just a little surprising. Actually, I wrote about this some time ago with some friends in a paper, but it was a long paper, and I felt this fun little bubble sort nugget got a bit lost inside it.

Bubble Sort

Mundane and all as it is, let’s remind ourselves what bubble sort is.

For an array of $n$ elements, bubble sort works by performing roughly $n$ sweeps over the array. A sweep just scans from left-to-right, swapping an element with its immediate right neighbour if that neighbour is smaller than it.

So in C, the sweep step might look like this:

void maybe_swap(int* a, int i)
{
if(a[i + 1] < a[i])
{
swap(a[i + 1], a[i]);
}
}

void bubble_sweep(int* a, int n, int sweep_number)
{
for(int i = 0; i < n - sweep_number; ++i)
{
maybe_swap(a, i)
}
}

In all its glory, bubble sort just calls the function above with “sweep_number” ranging from 1 to $n - 1$. So it looks like this:

void bubble_sort(int* a, int n)
{
for(int i = 1; i < n; ++i)
{
bubble_sweep(a, n, i);
}
}

I’m ignoring an improvement that can be made to bubble sort here. That is, if no elements are swapped in an entire sweep, the whole algorithm can just terminate. I’m leaving that improvement out because adding it makes the point I’m trying to make in this post a bit messier without adding anything very interesting.

Without getting too side-tracked into implementations, I can’t resist including one in Clojure:

(defn maybe-swap [v i] (let [j (+ i 1)] (if (> (v i) (v j)) (swap v i j) v)))
(defn bubble-sweep [v sweep-num]
(reduce #(maybe-swap %1 %2) v (range (- (count v) sweep-num))))
(defn bubble-sort [v] (reduce #(bubble-sweep %1 %2) v (rest (range (count v)))))

With our coding done, we can run a simple experiment that has a surprising result.

An Experiment

A quick look at the bubble sort implementation above should convince us it takes $\Theta(n^2)$ time. This is obvious: each call to “bubble_sweep” takes $\Theta(n)$ time, and we make $\Theta(n)$ calls.

If we were feeling very curious, we might try timing individual calls to “bubble_sweep” as it’s called by bubble sort, just to see that the time taken is indeed linear each time. If we do that, here’s what we get:

The vertical axis here is time (shown in CPU cycles), and the horizontal axis is the “sweep_number” from our code. The input array is 50,000 32-bit integers.

I don’t know about you, but until now I felt like I understood the performance of the humble-looking “bubble_sweep”. Now, I’m confused. Of course we know it’s really taking $\Theta(n)$ time per sweep. It just looks like the constant hidden in there is varying pretty strangely. How come?

Not the Caches

A natural first thought is that maybe the number of swaps varies per sweep, and more swaps are more expensive? Maybe things are exacerbated by a CPU cache effect?

This is jumping the gun a little. It’s not a cache effect, because the entire array fits in the (L2) cache where I did the measurements. The first half of this idea is good though.

The number of swaps does vary per sweep, and the real expense is due to branch mispredictions.

Due to What?

Let’s quickly remind ourselves what branch mispredictions are. CPUs don’t just work on a single instruction at once. Instead, they divide the work for each instruction into say, 5 stages. Then after one instruction finishes stage 1, the CPU fetches the next instruction and starts it in stage 1 (while the first instruction is now in stage 2). This is called pipelining. In this case we’d have a 5 stage pipeline.

So if processing a single instruction took time $T$, and we divide it into $N$ stages, we still take time $T$ to process a single instruction (and actually, a little longer because of the expense of administering the pipeline). On the other hand, once the pipeline is full, a new instruction is completed every $T/N$ units of time. This is great. There’s a little extra latency for each instruction, but we’ve improved throughput by a factor of $N$.

The wrinkle of course is that the CPU needs to know the next instruction to keep its pipeline full. When we have conditional branches (like that “if” statement in “bubble_sweep” above), the CPU can’t know the next instruction. So it has to make a guess. The hardware on the CPU to do this is called a branch predictor.

When the branch predictor correctly guesses the next instruction, the pipeline can stay full. When it doesn’t it needs to abort the in-flight instructions and re-start from the correct instruction. Depending on the length of the pipeline, this can be very expensive.

Another Experiment

Let’s take a look at another experiment to try and confirm our branch prediction theory.
The top-most curve in this image, labelled “hardware predictor”, is the number of branch mispredictions, measured from the performance counter on the CPU, caused by the calls to “bubble_sweep” in bubble sort.

It certainly looks like branch mispredictions are indeed what’s at work. But why?

Intuitive Analysis

Looking at our experimental results again, it seems like bubble sort’s branches get harder and harder to predict, before becoming easier again. So what’s going on?

Thinking very intuitively, we might imagine that the swapping of array elements to the right in bubble sort is a kind of partial sorting. When the data is unordered, the branch is likely to be taken, when the data is somewhat ordered, it’s 50/50, and when the data is nearly sorted it’s very unlikely to be taken. So perhaps it’s moving between these two extremes of predictability that confounds the branch predictor.

We can do better than this though.

Detailed Analysis

Let’s think about the probability that a particular comparison branch is taken. So let $Q^k_l$ be the probability that the $l^{th}$ comparison branch of the inner-loop (i.e. “bubble_sweep”) is taken on the $k^{th}$ sweep of bubble sort.

Now, the $l^{th}$ inner-loop comparison branch can be taken only if $\texttt{a[l]}$ is not larger than all of $\texttt{a[0..l - 1]}$. For this to be the case, the $(l + 1)^{th}$ branch in the previous sweep must have been taken, since otherwise the previous sweep left the maximum of $\texttt{a[0..l - 1]}$ in $\texttt{a[l]}$. But the probability that the $(l + 1)^{th}$ branch of the previous sweep was taken is just $Q^{k-1}_{l + 1}$.

Given that the $(l + 1)^{th}$ branch of the previous sweep was taken, that probability that $\texttt{a[l]}$ is not larger than all of $\texttt{a[0..l - 1]}$ is $1 - 1/(l + 1)$, and so we have:

$Q^k_l = Q^{k-1}_{l + 1}\left(1 - \frac{1}{l + 1}\right)$

The penultimate step is just a result of the fact that the probability that the $l^{th}$ element of a random permutation is a left-to-right maximum is $1/l$.

Filling in the bases $Q^1_l = 1 - 1/(l + 1)$ for $1 \leq l < n$ and $Q^1_n = 0$, we see that when $l \leq n - k$ then $Q^k_l = \frac{l}{l + k}$, and $Q^k_l = 0$ otherwise.

So as our experiment showed us, and our rough intuitive analysis too, in the first sweep, $k = 1$, and the branches are highly likely to be taken, on the other hand, on the last sweep $k = n - 1$ and the branches are unlikely to be taken.

Or, said another way, this analysis tells us that initially left-to-right maxima are quite rare and branches are predictable, but the repeated shifting of them to the right peppers them throughout the array making branches unpredictable, until they are dense enough that things become predictable again.

Matching the Hardware

To round off our mighty analysis, we can try and reproduce the curves we saw in our experiments where we measured the total number of branch mispredictions per sweep.

If we let $S(k)$ be the average total mispredictions in sweep number $k$, then we know
$S(k) = \sum_{l=1}^{n-k}M(Q^k_l)$. Where $M(p)$ is the probability that a branch that is taken with probability $p$ is mispredicted.

One way of modelling $M$ is just to imagine an ideal branch predictor: if the branches are taken (independently) with probability $p$, then the ideal branch predictor can be wrong with probability at most $\min(p, 1 - p)$. This will of course, lower bound the branch mispredictions, because in reality branch predictors work hard to guess what $p$ is.

Filling in for the ideal predictor gives $S(k) = \sum_{l=1}^{n-k}\frac{1}{l + k}\min(l, k)$. After a lot of messy algebra we’ll find that $S(k) = (1 + H_n + H_k - 2H_{2k})k$ when $k \leq n/2$ and $S(k) = n - (1 + H_n - H_k)k$ otherwise (here $H_n \approx \ln n$ is the $n^{th}$ harmonic number).

Plotting this gives us the bottom-most curve, labelled “perfect predictor”, in the branch mispredictions plot above — lower bounding the real number of mispredictions, as we expect.

Shaker Sort

Bubble sort has been around a long time, and people have tried to hack it into something more efficient. One attempt at this is shaker sort. In shaker sort, there are two “bubble_sweep”s that alternate. One shifts large elements right as we did above, and the other shifts small elements left (the opposite of what we did above).

Shaker sort still takes $\Theta(n^2)$ time, but is more efficient than a naive bubble sort in practice. Shaker sort is still a dog though. A selection sort implementation easily beats it, while being much simpler (and insertion sort is better still).

But, this is where the plot thickens: the reason shaker sort is so much slower than selection sort is almost entirely down to branch mispredictions. If you care to measure, shaker sort causes fewer cache misses than selection sort, and executes a small number of extra instructions — but performs much worse due to branch mispredictions. On an unpipelined machine, this bubble sort variant might actually beat selection sort.

So What?

If there’s anything to conclude here it’s that, first and foremost, bubble sort is awful! Not only is it (in naive implementations) more instruction hungry and cache unfriendly than the other quadratic-time sorting algorithms — it manages to be fantastically confusing to branch predictors, on average at least.

The relative performance of the quadratic-time sorting algorithms is important in practice. This is the reason good quicksort implementations know to switch to insertion sort (which, as it happens causes only a linear number of branch mispredictions despite executing a quadratic number of branches).

As a result, I think it’s important to understand what’s going on behind the simple looking code of the quadratic-time sorting algorithms. It also shows that even simple and very similar looking pieces of code can behave quite differently.

And lastly, all this messing with bubble sort reminds us of Don Knuth’s near omniscience. In Volume 3 of The Art of Computer Programming, he remarks:

In short, the bubble sort seems to have nothing to recommend it, except
a catchy name and the fact that it leads to some interesting theoretical
problems.

Knuth isn’t referring to instruction pipelines when he makes this comment, but somehow he’s right about how bubble sort gets on with them too!

Discussion at HackerNews
Code from our paper.