Improving raytracing performance by improving libbu
Go to current week
- Created this document; sent agreement email; set up
a development environment on macOS and Windows.
- Began reading the
librt code in
order to understand how the raytracer works,
starting with structures
(e.g., struct soltab,
- Began writing some notes on
how the parts of
to my project work. I'm still writing it as I learn
more, and I expect there to be some inaccuracies.
After fixing those, if people think it's useful, I
will post it on the wiki.
- Ran rt(1) with
enabled in order to better understand the code path.
- Figured out how the tree for a region is built.
- Discussed the code I read so far and cleared up some
misunderstandings on Zulip.
- Completed the GSoC checklist.
- Created plots of the calls to
sol.svg) in order to
better understand its behaviour.
- Understood the current issues with
- Monday–Thursday: Had to do some homework for
- Friday: Replaced the
is slower because of bit vector initialisation,
went from 14 seconds to 12 seconds. A lot of time is
still spent checking if solids are inside the
partition (as expected from the previous graphs),
but each check is faster.
After that, I used
noted some benchmarks
here. Initialising the bitset is cheap, but the
set method still
shows up in profiles, likely because
performs bounds checking, and
cannot be used for setting a bit (only getting).
a patch with my changes to
Saturday: Fixed compilation of
librt on my
Submitted a patch.
Sunday: Ran the benchmark script on Release builds
with and without my changes replacing
Also cleaned up an earlier patch and fixed
mged on my
- Monday: Tried a new way to cache results of
iteration. It's faster on some benchmarks (e.g.,
moss), but way
worse on some (e.g.,
world goes from
~2.2 M to ~1.2 M rays/sec). Next, I want to try
caching only when the number of segments in a
partition passes a threshold, hopefully getting the
best of both worlds.
- Tuesday: Tried more ways of caching, but none
resulted in performance improvements. Looked at the
behaviour of the raytracer more, and I think that
the next step would be to replace the
altogether, instead of augmenting it.
- Wednesday: Wrote little code, just devised a plan
for the following days. Replace
with each of
and record the results of running the benchmark 5
times with each.
Hopefully, at least one of them will perform better
on all tests, but I can't be sure or anticipate
which one (main factors are locality,
initialisation, and how big the constant factors are
for each operation).
I will write my thoughts in order to rate my
understanding. I expect
perform better because it's the same as
with better micro-optimisations.
have better asymptotic complexities, but worse
locality (trees vs contiguous memory) and maybe
bigger constant factors for each operation.
should perform worse despite O(1) operations
because of expensive initialisation, but I'm testing
it in case my intuition is wrong. I don't know about
I expect their bad locality to make them worse all
around, but I'll test them to be thorough.
- Thursday: Replaced
This required changing all the files which use
to be compiled with C++. I spent the entire day
on this and it's still not done. I'm getting
some undefined symbols when building, and I had
to disable a warning which required casting from
string literals to
const char *.
I'm not sure if went about it the right way. I
and included it individually where it's needed,
changing the files from
I also think that making these changes, the API
becomes unavailable from C, since now
has STL containers and requires a C++ compiler.
Tomorrow I should be able to build and run the
first benchmarks using
After I fix the build, it should be straight-forward
to test the other data structures.
- Friday: Continued fixing build errors with the new
code. Because some files were changed to C++, their
functions don't have C linkage anymore, so I have to
where it's needed.
- Saturday: Did not code.
- Sunday: Fixed all build errors and benchmarked
showed performance improvements for the first
since I've begun.
- Monday: Benchmarked with
but stopped short because it slowed everything down
considerably (from 0.06 sec for the first item in
the benchmark to more than 5 sec). Profiling
showed that most of the time is spent in the
rehash method of
the container. Since
uses the same data structure, I don't think it's
worth testing it.
I can't replace the list of segments with a list of
other members of the segment structure are needed
in other parts of the code. This means that bit
vectors cannot replace the list of segments.
In conclusion: there are a large number of
partitions, each consisting of little work; using a
more sophisticated data structure to turn O(N) to
O(log N) gives worse performance since N is
small for one partition; a bit vector is not
feasible since other information about a segment is
- Tuesday: Began reading the code for libbu's
red-black tree implementation. I also want to look
at the code for
try and figure out why it's faster than
bu_ptbl, maybe I
can merge some ideas into libbu.
- Wednesday: Benchmarked
bu_ptbl using an
initial capacity of 16 (down from 64) and it showed
results similar to
having to use C++. Created
a plot (and
a version without
the results of
- Thursday: Modified the red-black tree implementation
to take a user data argument when walking the tree
and modified the raytracer to use red-black trees.
- Friday: Got a
bool.c to work
and tested it. It performed way worse than
everything else, mainly because of
acquiring/releasing semaphores. Also submitted
- Weekend: Did not work.
- Monday: Began integrating the rb3ptr implementation of
- Tuesday: Abandoned testing the rb3ptr implementation
after talking on Zulip. Spent the day understanding
figuring out how to display Counters in Instruments,
and reading more about vectorisation.
- Wednesday: Read more about auto-vectorisation in
Clang, and devised a plan for working on this aspect
of librt. I
compiled with vectorisation reports and I'll go
through each error in turn, seeing if I can refactor
the code to enable vectorisation.
- Thursday-Friday: Discussed next steps and focusing
- Weekend: Did not work.
- Monday: Did not work (exam).
- Tuesday: Began moving
to SoA. Also committed older changes.
- Wednesday–Thursday: Haven't done much, as I
would like to discuss some things on Zulip before
proceeding. I played around with
and read some documentation on Instruments.
- Friday: Began going through
in order to learn more about memory
- Weekend: Did not work.
- Monday–Friday: Finished reading the article.
- Weekend: Did not work.
- Monday: Ran
produced a table sorted by L1
cache misses. Also tried to debug a compilation
- Tuesday: I received feedback on Zulip about my
previous approach, so I had to reconsider. I also
debugged the compilation error further, and
committed some fixes to
- Wednesday: Did not code.
- Thursday: Began moving data structures inside
libbu to SoA.
Made the change to
found broken usage of a
- Friday: Converted
SoA, but some regression tests were failing.
- Monday–Tuesday: Kept debugging the failing tests.
- Wednesday: Finished debugging the failing tests.
Found two previous bugs, fixed in r71109 and r71111.
Also committed r71110 and r71112. Posted
a diff of the work on
bu_mappedfile to discuss it further on Zulip.
- Thursday–Friday: Fixed compilation with GCC
6.3 under Debian. Submitted small patches fixing
- Saturday: Did not code.
- Sunday: Fixed a bug related to
regression tests timing out on macOS (and possibly
All week: Spent it trying to fix the GCC 6.3 bug or
recreate the bug on a smaller file for easier debug (and
maybe submitting a bug report), but no luck. Also, a new
release of Debian became available, and I wanted to
check if the bug is fixed there, but I couldn't install
it (it failed to connect to the server to get its
pacakges; I think it was too new, even the links to the
ISOs on the main page were broken).
- Monday–Tuesday: Discussions on Zulip, generated
an ICC auto-vectorization report, tried running a program
in a VM.
- Wednesday: Committed the removal of
- Thursday: Did not code.
- Friday–Saturday: Committed the removal of
- I did not get much work done. I had to perform some
duties in the first part of the week, and took a
break during the second part.
- Monday: I removed
I can't build or test the code because of a previous
patch (posted on Zulip).
- Friday: Committed removal of
- Saturday: Committed removal of
- Want to work with BRL-CAD because I was previously
involved as both a student and mentor during GCI
(2012 and 2016, respectively).
- Interested in working on the performance aspects
because they combine low-level programming and
algorithms and data structures, two areas I focused
on during my undergraduate studies.
- Analysing current raytracing performance (by running
a profiler) reveals various areas where improvements
can be made. Among those, the
components of libbu have a noticeable impact on
- Profiling was done using Apple's Instruments on a
quad-core i7 laptop running macOS 10.13. BRL-CAD was
latest code from SVN, compiled for Release with Apple's
- Specifically, rt_boolfinal makes use
a function which takes two sets and returns their union. It does
so naively in O(|S1|*|S2|), where
|S1| and |S2| are the lengths of each set.
- During profiling, time spent inside
seems to consistently be approximately 1/5 of the time
spent in bool_eval.
- Inside bool_eval, 1/5 of the time is spent searching for an
element in a bu_ptbl structure. It
does so in linear time, by
looking at each element inside the structure.
- On Zulip, it was suggested that I should take a top-down
approach to optimisations, trying to reduce the number of calls
to bool_eval, then improving
performance of the remaining calls, and so on.
- Other things I should take into account:
- How different setups affect performance (e.g., Visual
C++ on Windows, GCC on Mac, GCC on Linux).
- Changes I make to the code must be run against
regression tests to be sure correctness is not
- Remove any unnecessary work (boolean evaluation) from
rt_boolfinal. This will require
finding such work, and coming up with ways to remove it.
- Considering the overhead of
search, data structures with better performance for
these operations will almost surely be needed.
- Various smaller improvements that might show up.
- During the community bonding period, I plan to find
out as much as possible about how BRL-CAD does
raytracing and related pieces of code. This will
help me understand what code to write later, what
data structures are appropriate.
- During the first part of the coding session, I plan
to focus on the top levels of
such as eliminating unnecessary calls to
- Next, I will attempt to optimise the remaining
calls by replacing
better suited data structures.
- Lastly, I will take care of loose ends and smaller
- I will make sure that the code compiles and is
correct (by running regression tests) as I go. I
will also write comments and keep to the coding
conventions with each change, instead of cleaning
things up at the end.
- I plan to have multiple setups so that I can measure
the impact my changes have after each step.
- At the very least, I will use macOS 10.13
with both Apple's clang and the latest GCC,
and Windows 10 with Visual C++. I could also
use Linux with GCC, but that would be run
inside a VM and results might not be
representative (but they could be evaluated
by comparison to previous runs).
- If there are usage statistics for BRL-CAD,
such as which platforms are most popular,
profiling could be done primarily on