Hacker's Delight 6 : 64 Choose 4
Challenge: Write the fastest function possible to enumerate all possible combinations of 4 bits in a 64-bit word.
View the complete problem statement.
Thank you for your tremendous interest in this Hacker's Delight challenge. We received over a hundred individual submission, representing every Electronic Arts studio, and some affiliates. Your enthusiasm in these challenges is what keeps us motivated to bring you more.
How we tested your submissions
Each and every submission was hooked up into our testbed, and interspersed with other people's submissions, was run 1200 times. Of these twelve hundred executions, only the fastest time was recorded for each entry (not the average time). This was to reduce thread preemption, and other non-determanistic behavior from compromising the testing. Although we saw slight differences in measured time, the "order" of the winning solutions was the same over 300, 600, and 1200 runs. So, while the absolute time measurements may never be completely correct on a pc, they seem to be correct in a relative sense.
The Reference Solution
The snippet of code below is a naive reference solution that we employed as a comparison point against your solutions. It's simple and un-optimized. This reference solution ran, at best, 5.0759 ms.

Did we forget? It's All About Memory...
As the responses started rolling in, we noticed that each and every one performed almost exactly as well as the reference solution. It soon became obvious to us that most solutions were bottlenecked on writing to memory. On that subject, several entries pre-computed the entire answer, and their entry was just a giant memcpy(). This, while it sounds like a great cheat, is horribly slow (almost 5mb must be pulled into the CPU, and written out). Welcome to moving memory around on a modern PC. A Theoretical Max speed? At Hacker's headquarters, we did a few tests with both SSE and MMX instructions, and with the help of some code donated to us by Charlie Skilbeck (EA-UK), we were able to simply MemFill() the entire 4.84 megabyte buffer in - at best - 1.6832 ms. So, theoretically this is the fastest any submission should be able to correctly execute on our test machine*.* Single-core Intel 3.0Ghz P4 with 2 logical processors. Assumption is a single-threaded solution that uses the CPU. Sadly, no GPU entries were received.
A hand-full of submitters realized this "bandwidth-limitation". They applied the "MMX" instruction set, (and associated streaming store instructions). Ultimately, every single MMX solution beat every non-MMX solution (several times over). The fastest solution using standard (not-write-combined) stores was hit by Brad Reimer (EAC), weighing in at just under 4.6ms. This is also the fastest "straight C" solution.
Congratulations to Cedrick Collomb (EA-UK), for the fastest submission. (Although, clearly it was a very tight finish). It is quite impressive to see the top entries approach the theoretical speed of a memory fill. Cedrick's entry is measurably slower than the optimized MemFill, but only slightly.
A special thanks to the Florida Interactive Entertainment Academy for your enthusiastic participation, and to J. Gritton (FIEA) for the top submission from the group.
Submissions Under 10ms
Rank |
Submission Name |
Cycles per element |
(time in ms) |
Theoretical Max Bandwidth |
9.5370 | 1.6832 | |
1 |
9.5378 | 1.6834 | |
2 |
Choose64_4_glewin_mmx |
9.5440 | 1.6845 |
3 |
Choose64_4_jleesteere_mmx |
9.5459 | 1.6848 |
4 |
Choose64_4_dwinter_mmx |
9.6116 | 1.6964 |
5 |
Choose64_4_Senzee_mmx |
9.6583 | 1.7046 |
6 |
Choose64_4_grahamh_mmx |
9.7511 | 1.7210 |
8 |
28.0791 | 4.9558 | |
9 |
choose64_4_ringram |
28.1353 | 4.9657 |
10 |
Choose64_4_johnwhite |
28.1482 | 4.9680 |
11 |
FIEA_Choose64_4_JGritton |
28.2336 | 4.9830 |
12 |
Choose64_4_Senzee_fast |
28.3022 | 4.9952 |
13 |
Choose64_4_iseale |
28.3128 | 4.9970 |
14 |
FIEA_Choose64_4_pdutka |
28.3184 | 4.9980 |
15 |
FIEA_Choose64_4_tcarbone1 |
28.3354 | 5.0010 |
16 |
Choose64_4_msanchez |
28.3355 | 5.0010 |
17 |
Choose64_4_tdills |
28.3783 | 5.0086 |
18 |
Choose64_4_julianperez |
28.3932 | 5.0112 |
19 |
Choose64_4_mwhitton78 |
28.4043 | 5.0132 |
20 |
FIEA_Choose64_4_nforget |
28.4063 | 5.0135 |
21 |
FIEA_Choose64_4_ckeyser |
28.4654 | 5.0240 |
22 |
Choose64_4_darrin1 |
28.4691 | 5.0246 |
23 |
Choose64_4_mweisel |
28.4915 | 5.0286 |
24 |
FIEA_Choose64_4_tcarbone0 |
28.5074 | 5.0314 |
25 |
Choose64_4_darrin2 |
28.5608 | 5.0408 |
26 |
FIEA_Choose64_4_RogierVanEtten |
28.6210 | 5.0514 |
27 |
FIEA_Choose64_4_ccamilleri |
28.6399 | 5.0548 |
28 |
Choose64_4_Senzee_faster |
28.6419 | 5.0551 |
29 |
Choose64_4_darrin3 |
28.6601 | 5.0583 |
30 |
Choose64_4_Senzee_lookup32 |
28.7206 | 5.0690 |
Reference_Solution |
28.7595 | 5.0759 | |
31 |
Choose64_4_CrowtherAntony |
29.8312 | 5.2650 |
32 |
Choose64_4_ljinsong |
29.8368 | 5.2660 |
33 |
Choose64_4_mwhitton_2 |
31.3545 | 5.5339 |
34 |
Choose64_4_tmcsweeney |
31.6163 | 5.5801 |
35 |
Choose64_4_spenso |
38.9861 | 6.8808 |
36 |
FIEA_Choose64_4_gfields |
39.4063 | 6.9550 |
37 |
Choose64_4_glewin |
40.8485 | 7.2095 |
38 |
Choose64_4_jqg |
47.3640 | 8.3594 |
39 |
Choose64_4_nheckel |
55.0565 | 9.7171 |
Honorable Mentions
Paul Senzee's (Tiburon) solution clocked in at an amazing 0.0001ms (!!), and passed our 'correctness' validation tests. While his submission was disqualified from the main competition (he is part of the HD staff - so he knows how to cheat), his technique is interesting enough that we would like to showcase it below. Don't hurt your brain too much trying to figure out how he did it. Bravo Paul : )

As Hacker's Delight is never all about speed, we would like to recognize Paul Pedriana (Maxis) for the "Most Elegant Solution in C",

--hackers