Improving raytracing performance by improving libbu
Cezar El-Nazli
Development Log
Go to current week
Week 1
- Created this document; sent agreement email; set up
a development environment on macOS and Windows.
- Submitted
a
patch for
bu_ptbl unit
tests.
Week 2
- Began reading the
librt code in
order to understand how the raytracer works,
starting with structures
(e.g., struct soltab,
struct partition).
- Began writing some notes on
how the parts of
librt relevant
to my project work. I'm still writing it as I learn
more, and I expect there to be some inaccuracies.
After fixing those, if people think it's useful, I
will post it on the wiki.
- Ran rt(1) with
librt debug
flags
(DEBUG_PARTITION)
enabled in order to better understand the code path.
- Figured out how the tree for a region is built.
Week 3
- Discussed the code I read so far and cleared up some
misunderstandings on Zulip.
- Completed the GSoC checklist.
- Created plots of the calls to
bool_eval
(seg.svg and
sol.svg) in order to
better understand its behaviour.
- Understood the current issues with
bu_bitv and
profiled alternatives
(unordered_set
and bitset).
Week 4
- Monday–Thursday: Had to do some homework for
school.
- Friday: Replaced the
bu_ptbl iteration
inside bool_eval
with bu_bitv.
Currently,
rt_boolfinal
is slower because of bit vector initialisation,
but bool_eval
went from 14 seconds to 12 seconds. A lot of time is
still spent checking if solids are inside the
partition (as expected from the previous graphs),
but each check is faster.
After that, I used
bitset and
noted some benchmarks
here. Initialising the bitset is cheap, but the
set method still
shows up in profiles, likely because
it
performs bounds checking, and
operator[]
cannot be used for setting a bit (only getting).
I submitted
a patch with my changes to
bool.c.
-
Saturday: Fixed compilation of
librt on my
Windows setup.
Submitted a patch.
-
Sunday: Ran the benchmark script on Release builds
with and without my changes replacing
bu_ptbl
iteration (results:
trunk and
local changes).
Also cleaned up an earlier patch and fixed
compilation of
mged on my
Windows setup.
Week 5
- Monday: Tried a new way to cache results of
bu_ptbl
iteration. It's faster on some benchmarks (e.g.,
moss), but way
worse on some (e.g.,
world goes from
~2.2 M to ~1.2 M rays/sec). Next, I want to try
caching only when the number of segments in a
partition passes a threshold, hopefully getting the
best of both worlds.
- Tuesday: Tried more ways of caching, but none
resulted in performance improvements. Looked at the
behaviour of the raytracer more, and I think that
the next step would be to replace the
bu_ptbl of
struct segs
altogether, instead of augmenting it.
- Wednesday: Wrote little code, just devised a plan
for the following days. Replace
bu_ptbl pt_seglist
with each of
vector,
set,
unordered_set,
vector<bool>,
bu_bitv,
map,
unordered_map
and record the results of running the benchmark 5
times with each.
Hopefully, at least one of them will perform better
on all tests, but I can't be sure or anticipate
which one (main factors are locality,
initialisation, and how big the constant factors are
for each operation).
I will write my thoughts in order to rate my
understanding. I expect
vector to
perform better because it's the same as
bu_ptbl, but
with better micro-optimisations.
set and
unordered_set
have better asymptotic complexities, but worse
locality (trees vs contiguous memory) and maybe
bigger constant factors for each operation.
vector<bool>
and bu_bitv
should perform worse despite O(1) operations
because of expensive initialisation, but I'm testing
it in case my intuition is wrong. I don't know about
map
and
unordered_map.
I expect their bad locality to make them worse all
around, but I'll test them to be thorough.
- Thursday: Replaced
bu_ptbl with
vector.
This required changing all the files which use
struct partition
to be compiled with C++. I spent the entire day
on this and it's still not done. I'm getting
some undefined symbols when building, and I had
to disable a warning which required casting from
string literals to
const char *.
I'm not sure if went about it the right way. I
removed the
ray_partition.h
header from
raytrace.h
and included it individually where it's needed,
changing the files from
.c to
.cpp.
I also think that making these changes, the API
becomes unavailable from C, since now
struct partition
has STL containers and requires a C++ compiler.
Tomorrow I should be able to build and run the
first benchmarks using
vector.
After I fix the build, it should be straight-forward
to test the other data structures.
- Friday: Continued fixing build errors with the new
code. Because some files were changed to C++, their
functions don't have C linkage anymore, so I have to
add
extern "C"
where it's needed.
- Saturday: Did not code.
- Sunday: Fixed all build errors and benchmarked
performance with
vector. This
showed performance improvements for the first
since I've begun.
Week 6
- Monday: Benchmarked with
unordered_set
but stopped short because it slowed everything down
considerably (from 0.06 sec for the first item in
the benchmark to more than 5 sec). Profiling
showed that most of the time is spent in the
rehash method of
the container. Since
unordered_map
uses the same data structure, I don't think it's
worth testing it.
I can't replace the list of segments with a list of
soltabs because
other members of the segment structure are needed
in other parts of the code. This means that bit
vectors cannot replace the list of segments.
In conclusion: there are a large number of
partitions, each consisting of little work; using a
more sophisticated data structure to turn O(N) to
O(log N) gives worse performance since N is
small for one partition; a bit vector is not
feasible since other information about a segment is
needed.
- Tuesday: Began reading the code for libbu's
red-black tree implementation. I also want to look
at the code for
std::vector to
try and figure out why it's faster than
bu_ptbl, maybe I
can merge some ideas into libbu.
- Wednesday: Benchmarked
bu_ptbl using an
initial capacity of 16 (down from 64) and it showed
results similar to
vector without
having to use C++. Created
a plot (and
a version without
moss) with
the results of
benchmarking
trunk,
vector,
bu_ptbl 16.
- Thursday: Modified the red-black tree implementation
to take a user data argument when walking the tree
and modified the raytracer to use red-black trees.
- Friday: Got a
bu_rb_tree
version of
bool.c to work
and tested it. It performed way worse than
everything else, mainly because of
acquiring/releasing semaphores. Also submitted
three patches.
- Weekend: Did not work.
Week 7
- Monday: Began integrating the rb3ptr implementation of
red-black trees.
- Tuesday: Abandoned testing the rb3ptr implementation
after talking on Zulip. Spent the day understanding
this gist,
figuring out how to display Counters in Instruments,
and reading more about vectorisation.
- Wednesday: Read more about auto-vectorisation in
Clang, and devised a plan for working on this aspect
of librt. I
compiled with vectorisation reports and I'll go
through each error in turn, seeing if I can refactor
the code to enable vectorisation.
- Thursday-Friday: Discussed next steps and focusing
on bu_list
replacement.
- Weekend: Did not work.
Week 8
- Monday: Did not work (exam).
- Tuesday: Began moving
librt's
struct region
to SoA. Also committed older changes.
- Wednesday–Thursday: Haven't done much, as I
would like to discuss some things on Zulip before
proceeding. I played around with
xxhash
and read some documentation on Instruments.
- Friday: Began going through
an article
in order to learn more about memory
organisation.
- Weekend: Did not work.
Week 9
- Monday–Friday: Finished reading the article.
- Weekend: Did not work.
Week 10
- Monday: Ran
cachegrind on
rt(1) and
produced a table sorted by L1
cache misses. Also tried to debug a compilation
error (unsuccessfully).
- Tuesday: I received feedback on Zulip about my
previous approach, so I had to reconsider. I also
debugged the compilation error further, and
committed some fixes to
CMakeLists.txt.
- Wednesday: Did not code.
- Thursday: Began moving data structures inside
libbu to SoA.
Made the change to
temp.c, and
found broken usage of a
bu_list inside
malloc.c.
- Friday: Converted
bu_mappedfile to
SoA, but some regression tests were failing.
Week 11
- Monday–Tuesday: Kept debugging the failing tests.
- Wednesday: Finished debugging the failing tests.
Found two previous bugs, fixed in r71109 and r71111.
Also committed r71110 and r71112. Posted
a diff of the work on
bu_mappedfile to discuss it further on Zulip.
- Thursday–Friday: Fixed compilation with GCC
6.3 under Debian. Submitted small patches fixing
other errors.
- Saturday: Did not code.
- Sunday: Fixed a bug related to
select(2) which
resulted in
rtwizard
regression tests timing out on macOS (and possibly
BSDs).
Week 12
-
All week: Spent it trying to fix the GCC 6.3 bug or
recreate the bug on a smaller file for easier debug (and
maybe submitting a bug report), but no luck. Also, a new
release of Debian became available, and I wanted to
check if the bug is fixed there, but I couldn't install
it (it failed to connect to the server to get its
pacakges; I think it was too new, even the links to the
ISOs on the main page were broken).
Week 13
- Monday–Tuesday: Discussions on Zulip, generated
an ICC auto-vectorization report, tried running a program
in a VM.
- Wednesday: Committed the removal of
bu_list from
libbu/temp.c.
- Thursday: Did not code.
- Friday–Saturday: Committed the removal of
bu_list from
libbu/mappedfile.c.
Week 14
- I did not get much work done. I had to perform some
duties in the first part of the week, and took a
break during the second part.
Week 15
- Monday: I removed
bu_list from
bu_observer, but
I can't build or test the code because of a previous
patch (posted on Zulip).
- Friday: Committed removal of
bu_list from
bu_observer and
bu_hook.
- Saturday: Committed removal of
bu_list from
bu_cmd,
fb_obj and
bu_redblack.
Project Plan
Background
- Want to work with BRL-CAD because I was previously
involved as both a student and mentor during GCI
(2012 and 2016, respectively).
- Interested in working on the performance aspects
because they combine low-level programming and
algorithms and data structures, two areas I focused
on during my undergraduate studies.
Project Summary
- Analysing current raytracing performance (by running
rt(1) under
a profiler) reveals various areas where improvements
can be made. Among those, the
bu_list and
bu_ptbl
components of libbu have a noticeable impact on
performance.
- Profiling was done using Apple's Instruments on a
quad-core i7 laptop running macOS 10.13. BRL-CAD was
latest code from SVN, compiled for Release with Apple's
clang.
- Specifically, rt_boolfinal makes use
of bu_ptbl_cat_unique,
a function which takes two sets and returns their union. It does
so naively in O(|S1|*|S2|), where
|S1| and |S2| are the lengths of each set.
- During profiling, time spent inside
bu_ptbl_cat_unique
seems to consistently be approximately 1/5 of the time
spent in bool_eval.
- Inside bool_eval, 1/5 of the time is spent searching for an
element in a bu_ptbl structure. It
does so in linear time, by
looking at each element inside the structure.
- On Zulip, it was suggested that I should take a top-down
approach to optimisations, trying to reduce the number of calls
to bool_eval, then improving
performance of the remaining calls, and so on.
- Other things I should take into account:
- How different setups affect performance (e.g., Visual
C++ on Windows, GCC on Mac, GCC on Linux).
- Changes I make to the code must be run against
regression tests to be sure correctness is not
affected.
Deliverables
- Remove any unnecessary work (boolean evaluation) from
rt_boolfinal. This will require
finding such work, and coming up with ways to remove it.
- Considering the overhead of
bu_ptbl_cat_unique
and bu_list
search, data structures with better performance for
these operations will almost surely be needed.
- Various smaller improvements that might show up.
Development Schedule
- During the community bonding period, I plan to find
out as much as possible about how BRL-CAD does
raytracing and related pieces of code. This will
help me understand what code to write later, what
data structures are appropriate.
- During the first part of the coding session, I plan
to focus on the top levels of
rt_boolfinal,
such as eliminating unnecessary calls to
bool_eval.
- Next, I will attempt to optimise the remaining
bool_eval
calls by replacing
bu_list and
bu_ptbl with
better suited data structures.
- Lastly, I will take care of loose ends and smaller
improvements.
- I will make sure that the code compiles and is
correct (by running regression tests) as I go. I
will also write comments and keep to the coding
conventions with each change, instead of cleaning
things up at the end.
- I plan to have multiple setups so that I can measure
the impact my changes have after each step.
- At the very least, I will use macOS 10.13
with both Apple's clang and the latest GCC,
and Windows 10 with Visual C++. I could also
use Linux with GCC, but that would be run
inside a VM and results might not be
representative (but they could be evaluated
by comparison to previous runs).
- If there are usage statistics for BRL-CAD,
such as which platforms are most popular,
profiling could be done primarily on
those.