NOCAP: A not-so-useful precompilation framework

A walk through my final project of university

I deeply enjoyed Advanced Compilers - my final technical course in university and my first foray into formally studying compilers. For the final project, I teamed up with classmates to build a compilers (or compilers-adjacent) research project.

Influenced by ideas from this paper presenting an approximate computing framework, we sought to build on approximate computing, a field that trades program accuracy for improvements in speed and/or memory usage. The quintessential example of approximate computing is loop perforation, which modifies the source to skip loop iterations. After surveying other relevant papers, we converged on a few actionable ideas:

  • Expand ACCEPT to include more approximate computing techniques. Ironically, I could not compile the ACCEPT compiler framework itself after days of hacking away and squashing build errors as they appeared (dependencies suck). That idea was out of the question.
  • Implement alternate algorithm selection (such as using a less accurate but faster matrix multiplication algorithm in certain scenarios for image processing or machine learning training/inference) or alternate data structure selection (can we determine at compile time which implementation of a key-value store is best for a given workload?)

While we whiteboarded ideas and potential timelines, two techniques lingered in my mind: small-angle approximation (the property that $\sin(x) \approx x$ when $x$ is small) and Taylor approximations (you can estimate smooth functions using their derivatives at a point). We realized that, in general, if we assume a function is continuous, we can approximate its values quite easily. The accuracy of the approximation depends on the function's gradient and the point around which we approximate; a flat horizontal line is the perfect case. We decided to maximize speed by assigning a fixed value to each input interval - equivalent to a zeroth-order Taylor approximation.
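To spell out the Taylor picture: around a point $a$, $f(x) \approx f(a) + f'(a)(x - a) + \frac{f''(a)}{2}(x - a)^2 + \cdots$, and truncating after the very first term leaves the zeroth-order approximation $f(x) \approx f(a)$ - exact for a flat function, and increasingly wrong the larger $|f'|$ is near $a$.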

Thus Nearby-operand Continuous Approximation (NOCAP) was born. The user annotates candidate double -> double functions. The NOCAP profiler identifies a range of inputs from sample runs. Then, within a user-specified memory budget, NOCAP replaces calls to candidate functions with table lookups, falling back to the original function if the input is out of range. NOCAP trades accuracy and memory usage for speed.
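For concreteness, here is a minimal C sketch of the kind of rewrite NOCAP performs for exp(). The bounds, step size, and names are illustrative choices of mine (a profiled range of [-10, 10) with a 0.01 step), not the framework's actual generated code:

    #include <math.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative constants: assume profiling saw inputs in [-10, 10). */
    #define LO   -10.0
    #define HI    10.0
    #define STEP   0.01
    #define N      2000            /* (HI - LO) / STEP entries, ~16KB of doubles */

    static double exp_table[N];

    /* Fill the table once, e.g. at program startup. */
    static void nocap_init(void) {
        for (size_t i = 0; i < N; i++)
            exp_table[i] = exp(LO + i * STEP);
    }

    /* A rewritten call site: table query, falling back to the original
     * function when the input is outside the profiled range. */
    static double nocap_exp(double x) {
        if (x < LO || x >= HI)
            return exp(x);
        return exp_table[(size_t)((x - LO) / STEP)];
    }

    int main(void) {
        nocap_init();
        printf("approx exp(1.505) = %f (true %f)\n", nocap_exp(1.505), exp(1.505));
        printf("approx exp(42.0)  = %f (out of range, falls back)\n", nocap_exp(42.0));
        return 0;
    }

The table is just the function sampled at the left endpoint of each interval, which is why accuracy hinges on the local gradient.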

We used two benchmarks: a toy example using the built-in exp() function from math.h, and an implementation of Black-Scholes that uses exp(), log(), and sqrt() from math.h. NOCAP shows a 60% speedup in the toy example (200 million calls to exp()) but less than a 1% speedup for Black-Scholes. These speedups are statistically verified. NOCAP demonstrates a speedup, but in code where math functions are such a small fraction of CPU time, the gain is negligible. exp() is also a poor candidate for NOCAP beyond demonstrating speedup: its gradient is very small at one end of the input range and very large at the other, which leads to accuracy losses.
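The toy benchmark was essentially a tight loop over exp(); something along these lines (a rough reconstruction, not our exact benchmark code) is enough for the math call to dominate the runtime:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double acc = 0.0;
        /* 200 million calls to exp(), with inputs cycling through [-10, 10) */
        for (long i = 0; i < 200000000L; i++)
            acc += exp((i % 2000) * 0.01 - 10.0);
        printf("%f\n", acc);   /* keep the result live so the loop isn't optimized away */
        return 0;
    }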

However, table lookups can be bottlenecked by cache and RAM size.

Get ready for some napkin math…

Napkin math

Based on ChatGPT and some Google/Quora/Stack Overflow searches for "modern processors" (I can't be bothered to keep track of sources for napkin math): addition/subtraction/multiplication/L1 cache hit (~1-2 cycles) <= division/L3 cache hit (~30 cycles) <= main-memory SRAM/DRAM access (~100 cycles or more, with SRAM faster than DRAM). Are these numbers totally accurate? Definitely not, but a general pattern emerges: it takes a lot of arithmetic to be slower than a RAM access, but not that much to be slower than an L3 cache hit.

Suppose we have modern hardware with a relatively large L3 cache - 32MB. NOCAP can fit a maximum of 4 million table entries there. With a 0.01 interval size, that covers an input range of only 40,000. If instead we spend $15,000, we can get a 1.1GB L3 cache, giving us a max range of ~1.3 million - a bit better!
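Spelling out the arithmetic behind those numbers: each table entry is an 8-byte double, so 32MB / 8B = 4 million entries, and the covered range is entries × interval size = 4 million × 0.01 = 40,000. The other figures follow the same formula.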

Maybe the function is really slow, and using RAM is worth it. 64GB of RAM affords you a range of ~83 million, and 1TB of RAM affords you a range of ~1.3 billion - a lot better!

Note that the ranges aren't quite exact, because doubles are stored in the IEEE 754 floating-point format, where representable values aren't evenly spaced.

Conclusion

NOCAP was a fun project to work on, and maybe the ideas presented are useful for some specific use case. However, I don't think NOCAP is useful for 99.999999% of continuous functions, especially those computed with arithmetic alone (i.e. no blocking operations or syscalls). Perhaps it beats a function dominated by division, but I have not verified that claim. In practice, most NOCAP table lookups are at best L3 cache hits, and the L3 cache is almost always too small for any reasonable range of inputs.

Enjoy the demo slides and our paper on the project.

Demo Slides

Paper

Neel Shah
Engineer