PS3 Ray Tracer: Real-time ray tracing on the Cell Broadband Engine

This is my dissertation for the MSc Games Programming at the University of Hull. Given a list of dissertations to pick from, and despite a few being very interesting, none of the projects involved the PS3, which I found fascinating. Therefore, instead of picking one from the list, I came up with my own idea: a ray tracer on the PS3. This was quite ambitious, as I had never programmed on the PS3 before, nor had I written a ray tracer.

My main goal was to make the program fully parallel and use the entire Cell Broadband Engine (CBE). If you are unfamiliar with the PS3's architecture, a brief overview can be found here. If you would like more detail about this project, feel free to contact me.


Here we have a Cornell box, a scene commonly used to demonstrate reflections and refractions. It runs at ~2.3 frames per second (FPS). The program features:
• Reflections (left most sphere).
• Refractions with Beer's law (large right sphere).
• Phong shading with Schlick's approximation (diffuse on the near red sphere, specular on the large right sphere); a small sketch of Beer's law and Schlick's approximation follows this list.
• Hard shadows.
• Aspect ratio correction.
• Sphere and plane primitives.
• BVH acceleration structure.
• Adaptive depth cut off.
• Full use of all the PS3's SPUs and PPU with load balancing.
• PS3 controller input, enabling:
    • Change ray depth cut off.
    • Toggle reflections.
    • Toggle refractions.
    • Pitch and yaw the camera.
    • Move the position of the camera.
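As mentioned in the feature list, the refraction and specular terms use Beer's law and Schlick's approximation. Below is a minimal sketch of both, written the way they are usually presented; the types and names (Vec3, schlick, beerAttenuation) are mine for illustration, not code from the dissertation.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Schlick's approximation: reflectance at an interface, given the cosine of
// the incidence angle and the refractive indices n1 and n2 of the two media.
float schlick(float cosTheta, float n1, float n2)
{
    float r0 = (n1 - n2) / (n1 + n2);
    r0 *= r0;
    return r0 + (1.0f - r0) * std::pow(1.0f - cosTheta, 5.0f);
}

// Beer's law: attenuate light travelling a distance d through a medium with
// per-channel absorption coefficients 'absorb'.
Vec3 beerAttenuation(const Vec3& colour, const Vec3& absorb, float d)
{
    return { colour.x * std::exp(-absorb.x * d),
             colour.y * std::exp(-absorb.y * d),
             colour.z * std::exp(-absorb.z * d) };
}
```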


This scene is made to demonstrate colours being added together in reflections. The spheres are arranged in a six-sided colour wheel and gain colour from the reflections of each other. For example, the reflection of a white sphere on a red sphere appears pinkish. This scene runs at ~2.9 FPS.


This scene also shows reflections, but its main purpose is to show the ray tracer running at higher speeds: it runs at ~7.8 FPS. Considering the screen resolution is 1280x720 and the PS3 is clocked at 3.2 GHz, these frame rates are respectable.

Ray Tracing Optimisations

If you were to profile a naive ray tracer, you would find that most of the time is spent checking whether rays collide with objects, since the ray tracer uses the brute force method of comparing every ray with every primitive in the scene, an O(rays × primitives) amount of work. This is extremely expensive and only gets worse the more primitives you have. There are three main categories of optimisation: faster ray-primitive intersections / fewer intersection tests, fewer rays, and generalising rays. The optimisations featured in my ray tracer are as follows:
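To make the cost concrete, here is a rough sketch of what the brute force closest-hit test looks like. The Ray, Primitive and Hit types are illustrative placeholders (reusing the Vec3 struct from the sketch above), not the dissertation's actual classes.

```cpp
#include <limits>
#include <vector>

// Illustrative placeholder types.
struct Ray { Vec3 origin; Vec3 direction; };
struct Primitive
{
    virtual bool intersect(const Ray& ray, float& t) const = 0;
    virtual ~Primitive() {}
};
struct Hit { float t; const Primitive* object; };

// Brute force closest-hit search: every ray is tested against every primitive,
// so the cost per frame is O(rays * primitives).
bool closestHit(const Ray& ray, const std::vector<const Primitive*>& scene, Hit& out)
{
    out.t = std::numeric_limits<float>::max();
    out.object = nullptr;
    for (const Primitive* p : scene)      // no culling whatsoever
    {
        float t;
        if (p->intersect(ray, t) && t < out.t)
        {
            out.t = t;
            out.object = p;
        }
    }
    return out.object != nullptr;
}
```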

Acceleration structures (fewer intersection tests)

There are many different acceleration structures, all with their own advantages and disadvantages. However, for the PS3 the structure needed a small memory footprint, both to avoid memory accesses as much as possible and to make sure it would fit on the SPUs, which ensures a good compute-to-memory ratio. After much research, a Bounding Volume Hierarchy (BVH) was chosen and implemented. BVHs are a bit like bounding volumes in a physics engine: each object is enveloped in a bigger object that is faster to test for collision against. In addition to those bounding volumes, why not have an even larger bounding volume encapsulating them? This is exactly what a BVH is. A ray is tested against the highest level of the hierarchy; if it collides, the ray is then tested against the child volumes. This is repeated until we have a collision with an actual primitive, severely reducing the number of intersection tests. The BVH brings the cost per ray down to roughly O(log n) in the number of primitives, so it scales quite nicely.
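In code terms, the traversal amounts to a recursive descent that skips whole subtrees whose bounding volume the ray misses. A minimal sketch, assuming sphere bounding volumes and reusing the placeholder Ray / Primitive / Hit types from the sketch above:

```cpp
#include <vector>

// Sphere bounding volume and an assumed ray-sphere overlap test.
struct Sphere { Vec3 centre; float radius; };
bool intersects(const Ray& ray, const Sphere& bounds);   // assumed helper

// A node is either an internal node with child volumes or a leaf holding a primitive.
struct BVHNode
{
    Sphere bounds;
    std::vector<const BVHNode*> children;   // empty for leaves
    const Primitive* primitive;             // non-null for leaves only
};

void traverse(const Ray& ray, const BVHNode* node, Hit& best)
{
    if (!intersects(ray, node->bounds))     // miss the volume: skip the entire subtree
        return;

    if (node->primitive)                    // leaf: test the actual primitive
    {
        float t;
        if (node->primitive->intersect(ray, t) && t < best.t)
        {
            best.t = t;
            best.object = node->primitive;
        }
        return;
    }

    for (const BVHNode* child : node->children)
        traverse(ray, child, best);
}
```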

BVH bottom layer


BVHs can be very complicated and time consuming to implement; it could have easily taken the whole duration of the project to make a particularly effective one. Therefore, my implementation was kept relatively simple. The image above shows the lowest layer of the BVH in action on the testing scene. The green spheres are the test primitives that have bounding volumes created for them. The purple (actually blue, but red gets added to their colour) larger spheres around the green spheres are their bounding volumes at the lowest level of the BVH. Above them are the purple nodes' bounding volumes, which are the even larger red spheres. The turquoise sphere, in the centre of the image, is the calculated centre of the scene (the average of all primitive positions); it is used as a reference point for the top layer.

BVH top layer


The top layer is created by using the centre of the scene as a reference to split the world into eight giant zones, similar to the top layer of an octree. If any nodes are present in a zone, then a bounding volume (node) is made for them, and they all become sub-nodes of it. The yellow spheres in the image are these zones / top-layer nodes. So to conclude, the BVH is created before ray tracing begins and doesn't change as long as the scene is static. Each ray is then tested against the BVH, and the tests occur in this order: first the yellow spheres are tested against, next the red spheres, then the purple(ish) spheres and finally the primitives inside the purple spheres, to get the collided object. This also works for rays coming from inside a bounding volume, so the method improves not only the primary rays but also secondary rays such as reflection and refraction rays.
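Assigning a node to one of the eight zones only needs the sign of its position relative to the scene centre. A small sketch of that split (my own illustrative code, reusing the Vec3 type from the earlier sketches):

```cpp
// Classify a node's position into one of eight octant zones around the scene centre.
int zoneIndex(const Vec3& position, const Vec3& sceneCentre)
{
    int index = 0;
    if (position.x > sceneCentre.x) index |= 1;
    if (position.y > sceneCentre.y) index |= 2;
    if (position.z > sceneCentre.z) index |= 4;
    return index;   // 0..7
}
```

A top-layer node is then only created for zones that actually contain nodes, and those nodes become its children.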

Adaptive depth cutoff (fewer rays)

Each ray's potential impact on the image is calculated before it is cast. If that impact is so small that its result wouldn't be visible in the image, the ray isn't cast.
For example, let Kr = 0.5 be the reflective index of an object, so the maximum contribution to that ray's colour is 0.5. As the ray hits a reflective object, a reflection ray is created. Now assume this ray collides with another reflective object with the same reflective index of 0.5. To calculate the maximum contribution to the primary ray's colour, we multiply the current contribution by the current object's reflective index; in this case, 0.5 * 0.5 = 0.25. We can continue in this manner, going deeper and deeper, casting more and more rays, e.g. 0.25 * 0.5 = 0.125, 0.125 * 0.5 = 0.0625 and so on, with each ray contributing less and less to the colour of the primary ray. Once this maximum contribution drops below a small threshold, the ray simply isn't cast.
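Put into code, the idea is just a threshold check on the contribution that each recursive call carries along. A minimal sketch, reusing the placeholder types from the earlier sketches; the Colour type, the helper functions and the threshold value are all illustrative assumptions, not the dissertation's code:

```cpp
#include <limits>

// Illustrative colour type with just the operations used below.
struct Colour
{
    float r, g, b;
    Colour operator*(float s) const { return { r * s, g * s, b * s }; }
    Colour operator+(const Colour& o) const { return { r + o.r, g + o.g, b + o.b }; }
};

// Assumed helpers, not code from the dissertation.
Colour backgroundColour();
Colour shadeLocal(const Hit& hit);                    // Phong shading + shadow rays
float  reflectiveIndex(const Hit& hit);               // e.g. 0.5 in the worked example
Ray    reflectedRay(const Ray& ray, const Hit& hit);

const float CONTRIBUTION_EPSILON = 0.01f;             // assumed cut-off threshold

// 'contribution' is the maximum amount this ray can still add to the primary
// ray's colour. Primary rays start at 1.0; each reflective bounce multiplies
// it by the surface's reflective index (1.0 -> 0.5 -> 0.25 -> 0.125 -> ...).
Colour trace(const Ray& ray, const BVHNode* root, float contribution)
{
    Hit hit{ std::numeric_limits<float>::max(), nullptr };
    traverse(ray, root, hit);
    if (!hit.object)
        return backgroundColour();

    Colour result = shadeLocal(hit);

    float kr = reflectiveIndex(hit);
    if (kr > 0.0f && contribution * kr > CONTRIBUTION_EPSILON)   // adaptive cut-off
        result = result + trace(reflectedRay(ray, hit), root, contribution * kr) * kr;

    return result;   // refraction rays would be cut off in exactly the same way
}
```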

Parallelising a Ray Tracer on the PS3

There are two approaches to concurrency:
• Heterogeneous - each SPE runs a separate part of the ray tracing algorithm (like a pipeline) and passes its results on to whichever SPE's job it is to carry on, i.e. the primary ray SPE passes work to the reflection ray SPE, and so on.
• Homogeneous - each SPE runs a whole ray tracer and operates on a different part of the image.

Heterogeneous works very well for streams of media, because the amount of data is consistent. With a ray tracer, however, this is not true: there would be many primary rays, some reflection rays, even fewer refraction rays and plenty of shadow rays, all varying in number based on the scene. Creating a pipeline with dynamic task allocation for each SPE to maintain good workloads between the SPEs would be time consuming and unnecessary, since despite the interconnection speed between the SPEs being very high, it is still more efficient not to intercommunicate at all. This makes the homogeneous approach the most appropriate for the task at hand.

Implementation

As outlined above, a homogeneous approach was most appropriate and was taken; each of the five SPEs available to the programmer has its own ray tracer. At the start of the program the BVH is constructed by the PPE, which then passes the constructed BVH to each of the SPEs via a DMA transfer. Each SPE runs its ray tracer in a "task list", which is quite similar to a thread in that it can run for the entirety of the program and is suited to larger-grain tasks. Since each SPE had its own ray tracer in a thread-like object, manual synchronisation had to be implemented. This was the core point of the dissertation, and I took great joy in implementing this part.

Synchronisation

Each SPE had to somehow know what part of the image to work on. This was achieved by conceptually splitting the image up into a grid. Each SPE has its own memory for output, and all the SPEs access a shared variable (a libatomic variable) to check which part of the image hasn't yet been processed. This can be visualised like so:


Here we have an image of the refraction scene with a grid (to scale) drawn on top. The number in each cell is that cell's ID. The shared variable represents the current progress on the image: for example, if the variable was twenty-one, every cell before twenty-one has already been claimed for processing. Each SPE checks the current progress by accessing this variable (safely, via libatomic) and then increments it by the number of cells it takes. The number of cells had to be tunable: on the one hand, each SPE should stay busy for as long as possible and transfer data as infrequently as possible; on the other hand, each SPE has a small memory capacity and can only buffer so much output data for syncing with the image. Therefore I implemented it so that the number of blocks / cells (the area of the image) each SPE takes could be changed, so that the optimum number of cells per SPE could be found.

For example, the start of a frame would happen like so: SPE 1 comes along and sees that nought is the next free cell. It then changes the current free cell to four. Next, SPE 2 comes along, sees that cell number four is free and increments the current free cell to eight, and so on until the image is finished. When an SPE finishes its cells, it informs the PPE that it has done its job and copies its output buffers into a buffer on the PPE, which then updates the final image. After that, the SPE goes back for another task.
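The same scheme can be sketched with std::atomic standing in for the shared libatomic counter; the names, the cells-per-claim value and the grid size below are illustrative, not the project's actual numbers:

```cpp
#include <algorithm>
#include <atomic>

std::atomic<int> nextFreeCell{0};     // shared progress counter (the libatomic variable on the PS3)
const int CELLS_PER_CLAIM = 4;        // tunable: how many cells an SPE takes per claim
const int TOTAL_CELLS     = 160;      // e.g. a 16 x 10 grid over the image

void traceCell(int cell);             // assumed helper: render one cell into the local buffer

void workerLoop()                     // conceptually, one SPE's ray tracer task
{
    for (;;)
    {
        // Claim a block of cells atomically, exactly like reading and bumping
        // the shared variable described above.
        int first = nextFreeCell.fetch_add(CELLS_PER_CLAIM);
        if (first >= TOTAL_CELLS)
            break;                    // no cells left: report "I'm done!" and wait at the barrier

        int last = std::min(first + CELLS_PER_CLAIM, TOTAL_CELLS);
        for (int cell = first; cell < last; ++cell)
            traceCell(cell);

        // ...then DMA the local output buffer to the PPE and raise an event...
    }
}
```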

Load balancing

Each SPE's task could take a different amount of time based on its contents, which means load balancing is required. However, with this implementation it is already occurring (which is what brought about the idea for this implementation in the first place). Since the SPEs check for the current free cell, it makes no difference if some of the previous cells are taking a long time to finish. For example, say SPE 1 got four cells all inside the reflective sphere on the right. Those four cells will take much longer than all the other cells. But this doesn't matter, since SPEs 2 - 5 are still busy working as fast as they can. They will get through the almost empty cells that contain only the background plane very quickly, thus catching up for SPE 1's busy time (in terms of cell progress). Each SPE might not do the same number of cells, but each will do almost the same amount of work.

End of frame synchronisation

Each SPE communicates with the PPE via an event queue. When its cells are done, it queues an event for the PPE event thread to inform the PPE that it has data to be copied into the main image. The PPE then sees this event and copies the data out of the buffer into the image buffer. This continues until the image is complete. Then the SPEs notice there are no more unprocessed cells for them to work on, so they send an event to the PPE saying "I'm done!", hit a barrier (barrier synchronisation) and wait. Once the PPE has received this message from all the active SPEs, the frame is finished and rendered. The PPE then continues to wait for more data from the SPEs and the whole process starts again. The barrier synchronisation that the SPURS library provides is great: the PPE doesn't have to know what's going on in the SPEs, since the barrier is for the SPEs; they pause and continue all on their own. The only intercommunication between the PPE and the SPEs is the event queue used to inform the PPE that a DMA transfer has occurred.

PS3 optimisations

Some optimisations are unique to the PS3: they are part of the PS3 SDK and target bottlenecks specific to the platform. The ones used in the program are detailed below.

Using the PPE for more

The PPE can run several threads, and its context switch is much faster than that of the SPEs. While the SPEs are busy, its job is to copy buffered data from the SPEs into the frame buffer and render a texture, which isn't really that demanding for the PPE. So I made a PPE thread that acted just like one of the SPEs (it looks at the current free cell and processes it). This required accessing variables on the SPEs from the PPE and some additional synchronisation checks on the PPE, but it improved performance.

24 bit colours in 32 bit buffer?

I noticed that my colours are 24 bits (RGB), with zeroes shifted into the remaining 8 bits to pad each pixel out to 32 bits before it was copied via DMA to the PPE. "Hmm... I have wasted some bytes there," I thought. I changed the SPE buffers to store the packed 24-bit pixels so they could accommodate more of them, and moved the zero padding to the PPE. There was a large improvement in performance: since each SPE could now buffer a larger area of the screen, fewer DMA transfers occurred, thus keeping the SPEs hard at work for longer.
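A sketch of the idea: the SPE writes 3 bytes per pixel into its output buffer, and the PPE expands each pixel to a 4-byte word when it copies the data into the frame buffer. The names and the exact layout below are illustrative, not the project's actual buffer format:

```cpp
// PPE side: expand tightly packed 24-bit RGB pixels into 32-bit words,
// with the top 8 bits left as zero (the shifting the SPEs used to do).
void expandToXRGB(const unsigned char* rgb, unsigned int* dst, int pixelCount)
{
    for (int i = 0; i < pixelCount; ++i)
    {
        unsigned int r = rgb[i * 3 + 0];
        unsigned int g = rgb[i * 3 + 1];
        unsigned int b = rgb[i * 3 + 2];
        dst[i] = (r << 16) | (g << 8) | b;
    }
}
```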

Improved branch prediction

Mispredicting a branch on the SPEs is quite expensive on the PS3, but the SDK has a trick to alleviate this: a special function in C (with inline assembly) that takes a comparison as an argument and predicts the branch before the SPE starts to process it, which I find quite amazing. First, I rearranged my code to improve branch prediction success, then added this special function at the most problematic branches, and the frame rate went up by ~0.2 FPS.
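I can't reproduce the SDK's exact hint function here, but GCC's generic __builtin_expect plays the same role and gives a feel for how such a hint is used; the surrounding function and names are purely illustrative:

```cpp
// Branch hint macros built on GCC's __builtin_expect; the PS3 SPU compiler
// exposes an SDK-specific hint that serves the same purpose.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Illustrative use: in most cells the cheap path is taken, so hint the
// expensive reflective path as the unlikely one.
float reflectionWeight(float kr)
{
    if (UNLIKELY(kr > 0.0f))
        return kr;      // rare path: a reflection ray would be spawned here
    return 0.0f;        // common path: nothing to reflect
}
```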

SIMD

The PS3 has its own implementation of SIMD. I had begun preparing the code for a SIMD conversion even though it wasn't part of my original specification. To my disappointment I fell short of time, as my Gantt chart had made clear, but I wanted to add it anyway. The speed gain would have been very impressive, since each SPE could have run 2 to 6 times faster.