Improving raytracing performance by improving libbu

Development Log

Go to current week

Week 1

Created this document; sent agreement email; set up a development environment on macOS and Windows.
Submitted a patch for bu_ptbl unit tests.

Week 2

Began reading the librt code in order to understand how the raytracer works, starting with structures (e.g., struct soltab, struct partition).
Began writing some notes on how the parts of librt relevant to my project work. I'm still writing it as I learn more, and I expect there to be some inaccuracies. After fixing those, if people think it's useful, I will post it on the wiki.
Ran rt(1) with librt debug flags (DEBUG_PARTITION) enabled in order to better understand the code path.
Figured out how the tree for a region is built.

Week 3

Discussed the code I read so far and cleared up some misunderstandings on Zulip.
Completed the GSoC checklist.
Created plots of the calls to bool_eval (seg.svg and sol.svg) in order to better understand its behaviour.
Understood the current issues with bu_bitv and profiled alternatives (unordered_set and bitset).

Week 4

Monday–Thursday: Had to do some homework for school.
Friday: Replaced the bu_ptbl iteration inside bool_eval with bu_bitv. Currently, rt_boolfinal is slower because of bit vector initialisation, but bool_eval went from 14 seconds to 12 seconds. A lot of time is still spent checking if solids are inside the partition (as expected from the previous graphs), but each check is faster.

After that, I used bitset and noted some benchmarks here. Initialising the bitset is cheap, but the set method still shows up in profiles, likely because it performs bounds checking, and operator[] cannot be used for setting a bit (only getting).

I submitted a patch with my changes to bool.c.
Saturday: Fixed compilation of librt on my Windows setup. Submitted a patch.
Sunday: Ran the benchmark script on Release builds with and without my changes replacing bu_ptbl iteration (results: trunk and local changes).

Also cleaned up an earlier patch and fixed compilation of mged on my Windows setup.

Week 5

Monday: Tried a new way to cache results of bu_ptbl iteration. It's faster on some benchmarks (e.g., moss), but way worse on some (e.g., world goes from ~2.2 M to ~1.2 M rays/sec). Next, I want to try caching only when the number of segments in a partition passes a threshold, hopefully getting the best of both worlds.
Tuesday: Tried more ways of caching, but none resulted in performance improvements. Looked at the behaviour of the raytracer more, and I think that the next step would be to replace the bu_ptbl of struct segs altogether, instead of augmenting it.
Wednesday: Wrote little code, just devised a plan for the following days. Replace bu_ptbl pt_seglist with each of vector, set, unordered_set, vector<bool>, bu_bitv, map, unordered_map and record the results of running the benchmark 5 times with each.

Hopefully, at least one of them will perform better on all tests, but I can't be sure or anticipate which one (main factors are locality, initialisation, and how big the constant factors are for each operation).

I will write my thoughts in order to rate my understanding. I expect vector to perform better because it's the same as bu_ptbl, but with better micro-optimisations. set and unordered_set have better asymptotic complexities, but worse locality (trees vs contiguous memory) and maybe bigger constant factors for each operation. vector<bool> and bu_bitv should perform worse despite O(1) operations because of expensive initialisation, but I'm testing it in case my intuition is wrong. I don't know about map and unordered_map. I expect their bad locality to make them worse all around, but I'll test them to be thorough.
Thursday: Replaced bu_ptbl with vector. This required changing all the files which use struct partition to be compiled with C++. I spent the entire day on this and it's still not done. I'm getting some undefined symbols when building, and I had to disable a warning which required casting from string literals to const char *.

I'm not sure if went about it the right way. I removed the ray_partition.h header from raytrace.h and included it individually where it's needed, changing the files from .c to .cpp. I also think that making these changes, the API becomes unavailable from C, since now struct partition has STL containers and requires a C++ compiler.

Tomorrow I should be able to build and run the first benchmarks using vector. After I fix the build, it should be straight-forward to test the other data structures.
Friday: Continued fixing build errors with the new code. Because some files were changed to C++, their functions don't have C linkage anymore, so I have to add extern "C" where it's needed.
Saturday: Did not code.
Sunday: Fixed all build errors and benchmarked performance with vector. This showed performance improvements for the first since I've begun.

Week 6

Monday: Benchmarked with unordered_set but stopped short because it slowed everything down considerably (from 0.06 sec for the first item in the benchmark to more than 5 sec). Profiling showed that most of the time is spent in the rehash method of the container. Since unordered_map uses the same data structure, I don't think it's worth testing it.

I can't replace the list of segments with a list of soltabs because other members of the segment structure are needed in other parts of the code. This means that bit vectors cannot replace the list of segments.

In conclusion: there are a large number of partitions, each consisting of little work; using a more sophisticated data structure to turn O(N) to O(log N) gives worse performance since N is small for one partition; a bit vector is not feasible since other information about a segment is needed.
Tuesday: Began reading the code for libbu's red-black tree implementation. I also want to look at the code for std::vector to try and figure out why it's faster than bu_ptbl, maybe I can merge some ideas into libbu.
Wednesday: Benchmarked bu_ptbl using an initial capacity of 16 (down from 64) and it showed results similar to vector without having to use C++. Created a plot (and a version without moss) with the results of benchmarking trunk, vector, bu_ptbl 16.
Thursday: Modified the red-black tree implementation to take a user data argument when walking the tree and modified the raytracer to use red-black trees.
Friday: Got a bu_rb_tree version of bool.c to work and tested it. It performed way worse than everything else, mainly because of acquiring/releasing semaphores. Also submitted three patches.
Weekend: Did not work.

Week 7

Monday: Began integrating the rb3ptr implementation of red-black trees.
Tuesday: Abandoned testing the rb3ptr implementation after talking on Zulip. Spent the day understanding this gist, figuring out how to display Counters in Instruments, and reading more about vectorisation.
Wednesday: Read more about auto-vectorisation in Clang, and devised a plan for working on this aspect of librt. I compiled with vectorisation reports and I'll go through each error in turn, seeing if I can refactor the code to enable vectorisation.
Thursday-Friday: Discussed next steps and focusing on bu_list replacement.
Weekend: Did not work.

Week 8

Monday: Did not work (exam).
Tuesday: Began moving librt's struct region to SoA. Also committed older changes.
Wednesday–Thursday: Haven't done much, as I would like to discuss some things on Zulip before proceeding. I played around with xxhash and read some documentation on Instruments.
Friday: Began going through an article in order to learn more about memory organisation.
Weekend: Did not work.

Week 9

Monday–Friday: Finished reading the article.
Weekend: Did not work.

Week 10

Monday: Ran cachegrind on rt(1) and produced a table sorted by L1 cache misses. Also tried to debug a compilation error (unsuccessfully).
Tuesday: I received feedback on Zulip about my previous approach, so I had to reconsider. I also debugged the compilation error further, and committed some fixes to CMakeLists.txt.
Wednesday: Did not code.
Thursday: Began moving data structures inside libbu to SoA. Made the change to temp.c, and found broken usage of a bu_list inside malloc.c.
Friday: Converted bu_mappedfile to SoA, but some regression tests were failing.

Week 11

Monday–Tuesday: Kept debugging the failing tests.
Wednesday: Finished debugging the failing tests. Found two previous bugs, fixed in r71109 and r71111. Also committed r71110 and r71112. Posted a diff of the work on bu_mappedfile to discuss it further on Zulip.
Thursday–Friday: Fixed compilation with GCC 6.3 under Debian. Submitted small patches fixing other errors.
Saturday: Did not code.
Sunday: Fixed a bug related to select(2) which resulted in rtwizard regression tests timing out on macOS (and possibly BSDs).

Week 12

All week: Spent it trying to fix the GCC 6.3 bug or recreate the bug on a smaller file for easier debug (and maybe submitting a bug report), but no luck. Also, a new release of Debian became available, and I wanted to check if the bug is fixed there, but I couldn't install it (it failed to connect to the server to get its pacakges; I think it was too new, even the links to the ISOs on the main page were broken).

Week 13

Monday–Tuesday: Discussions on Zulip, generated an ICC auto-vectorization report, tried running a program in a VM.
Wednesday: Committed the removal of bu_list from libbu/temp.c.
Thursday: Did not code.
Friday–Saturday: Committed the removal of bu_list from libbu/mappedfile.c.

Week 14

I did not get much work done. I had to perform some duties in the first part of the week, and took a break during the second part.

Week 15

Monday: I removed bu_list from bu_observer, but I can't build or test the code because of a previous patch (posted on Zulip).
Friday: Committed removal of bu_list from bu_observer and bu_hook.
Saturday: Committed removal of bu_list from bu_cmd, fb_obj and bu_redblack.

Project Plan

Background

Want to work with BRL-CAD because I was previously involved as both a student and mentor during GCI (2012 and 2016, respectively).
Interested in working on the performance aspects because they combine low-level programming and algorithms and data structures, two areas I focused on during my undergraduate studies.

Project Summary

Analysing current raytracing performance (by running rt(1) under a profiler) reveals various areas where improvements can be made. Among those, the bu_list and bu_ptbl components of libbu have a noticeable impact on performance.
- Profiling was done using Apple's Instruments on a quad-core i7 laptop running macOS 10.13. BRL-CAD was latest code from SVN, compiled for Release with Apple's clang.
Specifically, rt_boolfinal makes use of bu_ptbl_cat_unique, a function which takes two sets and returns their union. It does so naively in O(|S1|*|S2|), where |S1| and |S2| are the lengths of each set.
- During profiling, time spent inside bu_ptbl_cat_unique seems to consistently be approximately 1/5 of the time spent in bool_eval.
Inside bool_eval, 1/5 of the time is spent searching for an element in a bu_ptbl structure. It does so in linear time, by looking at each element inside the structure.
On Zulip, it was suggested that I should take a top-down approach to optimisations, trying to reduce the number of calls to bool_eval, then improving performance of the remaining calls, and so on.
Other things I should take into account:
- How different setups affect performance (e.g., Visual C++ on Windows, GCC on Mac, GCC on Linux).
- Changes I make to the code must be run against regression tests to be sure correctness is not affected.

Deliverables

Remove any unnecessary work (boolean evaluation) from rt_boolfinal. This will require finding such work, and coming up with ways to remove it.
Considering the overhead of bu_ptbl_cat_unique and bu_list search, data structures with better performance for these operations will almost surely be needed.
Various smaller improvements that might show up.

Development Schedule

During the community bonding period, I plan to find out as much as possible about how BRL-CAD does raytracing and related pieces of code. This will help me understand what code to write later, what data structures are appropriate.
During the first part of the coding session, I plan to focus on the top levels of rt_boolfinal, such as eliminating unnecessary calls to bool_eval.
Next, I will attempt to optimise the remaining bool_eval calls by replacing bu_list and bu_ptbl with better suited data structures.
Lastly, I will take care of loose ends and smaller improvements.
I will make sure that the code compiles and is correct (by running regression tests) as I go. I will also write comments and keep to the coding conventions with each change, instead of cleaning things up at the end.
I plan to have multiple setups so that I can measure the impact my changes have after each step.
- At the very least, I will use macOS 10.13 with both Apple's clang and the latest GCC, and Windows 10 with Visual C++. I could also use Linux with GCC, but that would be run inside a VM and results might not be representative (but they could be evaluated by comparison to previous runs).
- If there are usage statistics for BRL-CAD, such as which platforms are most popular, profiling could be done primarily on those.