Hacker's Delight 6 : 64 Choose 4

 

Challenge: Write the fastest function possible to enumerate all possible combinations of 4 bits in a 64-bit word.
View the complete problem statement.

Thank you for your tremendous interest in this Hacker's Delight challenge. We received over a hundred individual submission, representing every Electronic Arts studio, and some affiliates. Your enthusiasm in these challenges is what keeps us motivated to bring you more.

 

How we tested your submissions

Each and every submission was hooked up into our testbed, and interspersed with other people's submissions, was run 1200 times. Of these twelve hundred executions, only the fastest time was recorded for each entry (not the average time). This was to reduce thread preemption, and other non-determanistic behavior from compromising the testing. Although we saw slight differences in measured time, the "order" of the winning solutions was the same over 300, 600, and 1200 runs. So, while the absolute time measurements may never be completely correct on a pc, they seem to be correct in a relative sense.

 

The Reference Solution

The snippet of code below is a naive reference solution that we employed as a comparison point against your solutions. It's simple and un-optimized. This reference solution ran, at best, 5.0759 ms.

 

Did we forget? It's All About Memory...

As the responses started rolling in, we noticed that each and every one performed almost exactly as well as the reference solution. It soon became obvious to us that most solutions were bottlenecked on writing to memory. On that subject, several entries pre-computed the entire answer, and their entry was just a giant memcpy(). This, while it sounds like a great cheat, is horribly slow (almost 5mb must be pulled into the CPU, and written out). Welcome to moving memory around on a modern PC. A Theoretical Max speed? At Hacker's headquarters, we did a few tests with both SSE and MMX instructions, and with the help of some code donated to us by Charlie Skilbeck (EA-UK), we were able to simply MemFill() the entire 4.84 megabyte buffer in - at best - 1.6832 ms. So, theoretically this is the fastest any submission should be able to correctly execute on our test machine*.

* Single-core Intel 3.0Ghz P4 with 2 logical processors. Assumption is a single-threaded solution that uses the CPU. Sadly, no GPU entries were received.

A hand-full of submitters realized this "bandwidth-limitation". They applied the "MMX" instruction set, (and associated streaming store instructions). Ultimately, every single MMX solution beat every non-MMX solution (several times over). The fastest solution using standard (not-write-combined) stores was hit by Brad Reimer (EAC), weighing in at just under 4.6ms. This is also the fastest "straight C" solution.

Congratulations to Cedrick Collomb (EA-UK), for the fastest submission. (Although, clearly it was a very tight finish). It is quite impressive to see the top entries approach the theoretical speed of a memory fill. Cedrick's entry is measurably slower than the optimized MemFill, but only slightly.

A special thanks to the Florida Interactive Entertainment Academy for your enthusiastic participation, and to J. Gritton (FIEA) for the top submission from the group.

Submissions Under 10ms

Rank
Submission Name
Cycles per element
(time in ms)
Theoretical Max Bandwidth
9.5370 1.6832
1
9.5378 1.6834
2
Choose64_4_glewin_mmx
9.5440 1.6845
3
Choose64_4_jleesteere_mmx
9.5459 1.6848
4
Choose64_4_dwinter_mmx
9.6116 1.6964
5
Choose64_4_Senzee_mmx
9.6583 1.7046
6
Choose64_4_grahamh_mmx
9.7511 1.7210
8
28.0791 4.9558
9
choose64_4_ringram
28.1353 4.9657
10
Choose64_4_johnwhite
28.1482 4.9680
11
FIEA_Choose64_4_JGritton
28.2336 4.9830
12
Choose64_4_Senzee_fast
28.3022 4.9952
13
Choose64_4_iseale
28.3128 4.9970
14
FIEA_Choose64_4_pdutka
28.3184 4.9980
15
FIEA_Choose64_4_tcarbone1
28.3354 5.0010
16
Choose64_4_msanchez
28.3355 5.0010
17
Choose64_4_tdills
28.3783 5.0086
18
Choose64_4_julianperez
28.3932 5.0112
19
Choose64_4_mwhitton78
28.4043 5.0132
20
FIEA_Choose64_4_nforget
28.4063 5.0135
21
FIEA_Choose64_4_ckeyser
28.4654 5.0240
22
Choose64_4_darrin1
28.4691 5.0246
23
Choose64_4_mweisel
28.4915 5.0286
24
FIEA_Choose64_4_tcarbone0
28.5074 5.0314
25
Choose64_4_darrin2
28.5608 5.0408
26
FIEA_Choose64_4_RogierVanEtten
28.6210 5.0514
27
FIEA_Choose64_4_ccamilleri
28.6399 5.0548
28
Choose64_4_Senzee_faster
28.6419 5.0551
29
Choose64_4_darrin3
28.6601 5.0583
30
Choose64_4_Senzee_lookup32
28.7206 5.0690
Reference_Solution
28.7595 5.0759
31
Choose64_4_CrowtherAntony
29.8312 5.2650
32
Choose64_4_ljinsong
29.8368 5.2660
33
Choose64_4_mwhitton_2
31.3545 5.5339
34
Choose64_4_tmcsweeney
31.6163 5.5801
35
Choose64_4_spenso
38.9861 6.8808
36
FIEA_Choose64_4_gfields
39.4063 6.9550
37
Choose64_4_glewin
40.8485 7.2095
38
Choose64_4_jqg
47.3640 8.3594
39
Choose64_4_nheckel
55.0565 9.7171
* Click highlighted rows for a link to the code

 

Honorable Mentions

Paul Senzee's (Tiburon) solution clocked in at an amazing 0.0001ms (!!), and passed our 'correctness' validation tests. While his submission was disqualified from the main competition (he is part of the HD staff - so he knows how to cheat), his technique is interesting enough that we would like to showcase it below. Don't hurt your brain too much trying to figure out how he did it. Bravo Paul : )

 

As Hacker's Delight is never all about speed, we would like to recognize Paul Pedriana (Maxis) for the "Most Elegant Solution in C",

--hackers