SIMD-pathtracer

Project Proposal: SIMD Parallelization of a Path Tracer

Authors: Michelle Ma, Rain Du

Summary

We plan to modify a sequential path tracer to utilize SIMD parallelization and achieve as much speedup as we can. Our baseline goal is to accelerate it using ISPC; our stretch goal, if time permits, is to port most of the computation to CUDA. While working with ISPC, we will develop and benchmark on two MacBook Pro laptops. If we later start working with CUDA, we will move to the GHC machines that have RTX 2080 GPUs.

Background

Path tracing is a classic and well-developed method for rendering convincing images of 3D scenes from their mathematical descriptions. It is widely used in industries like animation because of the high-fidelity images it can generate, but its high computational cost has kept it out of many other, mostly interactive, applications. Fortunately, the algorithm itself imposes little data dependency, which has allowed continued research into accelerating path tracing through parallelization. Historically, the most common way to parallelize a path tracer has been to run it across multiple CPU cores, but exploiting SIMD parallelism is also an attractive direction.

Below is a summary of the path tracing algorithm and an overview of our existing sequential implementation.

Algorithm Overview

The path tracer generates camera rays originating at the camera position and directed toward the pixels of the virtual image plane. For each ray, it returns the sum of (1) the emissive color of the primitive that the ray hits and (2) the light reflected or refracted from that same point on the primitive. Retrieving (1) requires a query to the primitive’s material, and calculating (2) requires integrating the indirect light contribution over the hemisphere around the hit normal. For a Monte Carlo path tracer like ours, this integral is estimated with Monte Carlo integration: at every ray hit where indirect lighting is needed, we generate another sample ray originating at the hit point and trace it to obtain an approximation. The algorithm is thus inherently recursive. The recursion stops when the ray doesn’t hit any scene primitive, when it reaches an upper bound on the number of recursive calls (maximum ray bounces), or when it is probabilistically terminated because the subsequent rays’ contribution is expected to be small (Russian Roulette).

In our original sequential implementation, we also generate rays from camera ray hit points directly toward the light sources in the scene in order to reduce noise, but we might exclude this optimization for the sake of simplicity.

Sequential Implementation Code Structure

This is a pseudo-code summary of the key function trace_ray that we need to parallelize:

Color trace_ray(Ray& ray)
{
    if (ray hits a scene primitive P)
    {
        Color L = Color(0);
        L += P's emissive color;
        // Potentially terminate the trace due to Russian Roulette.
        if (!terminated)
        {
            Ray ray_next = ray that originates at the hit point
                and points toward a reflected/refracted direction,
                sampled based on P's material (BSDF);
            L += (trace_ray(ray_next), weighted and adjusted depending on P's
                material, ray_next's angle with the hit normal, Russian Roulette,
                etc.);
        }
        return L;
    }
    else return Color(0);
}

trace_ray is a recursive function that takes a generated ray as input (initially one of the camera rays), recursively calls itself to calculate indirect lighting, and returns the color contribution for that pixel. Notably, this is not tail recursion; the caller’s return value depends on what its callee returns. We might have to work around this dependency so that a chain of recursive calls can be broken into multiple “tasks”.
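
One standard way to remove this dependency, which we may adopt, is to rewrite the recursion as a loop that carries an accumulated throughput, so each bounce only reads state produced by earlier bounces. The sketch below is illustrative only: intersect(), russian_roulette(), and sample_bsdf() are hypothetical helpers, and the Color, Ray, and Hit types (with componentwise + and * on Color) are assumed from our existing code base.

Color trace_ray_iterative(Ray ray, int max_bounces)
{
    Color L(0);            // radiance accumulated for this pixel sample
    Color throughput(1);   // product of BSDF weights along the path so far

    for (int bounce = 0; bounce < max_bounces; bounce++)
    {
        Hit hit;
        if (!intersect(ray, &hit))          // hypothetical scene query
            break;                          // ray escaped the scene

        L += throughput * hit.emissive;     // emissive term, pre-weighted

        if (russian_roulette(&throughput))  // hypothetical: terminate with some
            break;                          // probability, reweighting survivors

        // Hypothetical: sample the next direction from the hit material's BSDF,
        // then fold the returned weight into the running throughput.
        Color weight;
        ray = sample_bsdf(hit, ray, &weight);
        throughput = throughput * weight;
    }
    return L;
}

Because the loop body no longer waits on a recursive return value, each bounce of each ray can be treated as an independent unit of work.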

The Challenge

As described above, the path tracing algorithm traces rays backwards: it starts at the camera position, projects a ray through each pixel, and follows the ray through reflection and refraction events at the objects in the scene until the ray reaches the maximum number of bounces, fails to hit any scene primitive, or is terminated by Russian Roulette because of its small contribution to the pixel color. For each ray that hits an object and is not terminated, a new ray is generated for the Monte Carlo approximation of the reflected and refracted light, and the result of that subsequent ray feeds back into the final pixel color. Our sequential version expresses this relationship with a recursive implementation, which creates significant dependencies among rays when computing a pixel’s value and makes SIMD parallelization hard to implement.

Moreover, considering the packet approach we discussed in class, we expect SIMD vector utilization to be a major challenge. Suppose each packet contains a group of rays that are processed by the same sequence of SIMD instructions. Even two rays that start close together with similar directions can follow drastically different paths, so the number of reflection and refraction events per lane can differ greatly and some vector lanes finish much earlier than others. The rays that finish early must still wait for every other ray in their packet to finish tracing, leading to poor vector lane utilization.
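
To make the problem concrete, here is a rough sketch of the packet layout we have in mind: an 8-wide structure-of-arrays packet with a per-lane active mask. The type and field names are our own placeholders, not existing code.

const int PACKET_WIDTH = 8;   // assumed to match the SIMD width we target

// Hypothetical structure-of-arrays ray packet: each SIMD lane owns one ray.
struct RayPacket
{
    float ox[PACKET_WIDTH], oy[PACKET_WIDTH], oz[PACKET_WIDTH];   // origins
    float dx[PACKET_WIDTH], dy[PACKET_WIDTH], dz[PACKET_WIDTH];   // directions
    bool  active[PACKET_WIDTH];                                   // lane still tracing?
};

// The packet keeps bouncing until *every* lane has terminated, so a lane that
// finishes after one bounce sits idle while the longest path keeps the whole
// packet alive; this is the utilization loss described above.
bool any_active(const RayPacket& p)
{
    for (int lane = 0; lane < PACKET_WIDTH; lane++)
        if (p.active[lane]) return true;
    return false;
}

Re-filling or re-compacting the inactive lanes between bounces is one of the utilization improvements we want to explore.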

Additionally, locality can be another challenge. Since the path of each ray is highly unpredictable and varies greatly with its direction, its origin, and the textures of the objects in the scene, rays whose starting points are close together cannot be assumed to request similar data during computation. Locality issues could therefore be significant. The situation resembles the galaxy model we discussed in class, except that the entities move much faster: rays that start close together can diverge at every reflection and refraction step.
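
One coherence heuristic we may try, in the spirit of the semi-dynamic re-packing mentioned in the schedule below, is to bin secondary rays by their dominant direction before forming packets, so rays in the same packet are more likely to traverse similar parts of the scene. This is only a sketch and the binning criterion is a placeholder.

#include <cmath>

// Hypothetical heuristic: map a ray direction to one of six coarse bins
// (+/-x, +/-y, +/-z) so packets can be formed from rays that point the same
// general way and hopefully touch similar geometry.
int direction_bin(float dx, float dy, float dz)
{
    float ax = std::fabs(dx), ay = std::fabs(dy), az = std::fabs(dz);
    if (ax >= ay && ax >= az) return dx >= 0 ? 0 : 1;
    if (ay >= az)             return dy >= 0 ? 2 : 3;
    return dz >= 0 ? 4 : 5;
}

Secondary rays would then be pushed into per-bin queues, and packets would be formed bin by bin instead of in pixel order.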

In terms of the constraints of the problem, imbalanced mapping of computation onto different cores can be another challenge. Consider an image with a shiny glass ball in one corner. If we statically partition the image into blocks or rows, the mapping might be unbalanced: only the rays from a small region of pixels (near where the ball is located) experience significantly more reflection and refraction events and therefore cost much more to compute. We might need to map work to cores dynamically or semi-dynamically, which can introduce significant overhead if the granularity is not chosen carefully, given the large number of reflection and refraction events for a single ray.
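
A simple semi-dynamic scheme we are considering is a shared atomic tile counter, which keeps scheduling overhead to one atomic increment per tile. The sketch below is illustrative only; render_tile() is a stub standing in for our per-tile path tracing loop.

#include <atomic>
#include <thread>
#include <vector>

// Hypothetical per-tile renderer; the real version would path trace the
// pixels inside one tile of the image.
void render_tile(int /*x0*/, int /*y0*/, int /*tile_size*/) { /* ... */ }

// Workers repeatedly grab the next tile index from a shared atomic counter,
// so an expensive tile (e.g. the one containing the glass ball) does not
// stall a statically assigned core.
void render_dynamic(int image_w, int image_h, int tile_size, int num_threads)
{
    int tiles_x = (image_w + tile_size - 1) / tile_size;
    int tiles_y = (image_h + tile_size - 1) / tile_size;
    std::atomic<int> next_tile(0);

    auto worker = [&]() {
        int t;
        while ((t = next_tile.fetch_add(1)) < tiles_x * tiles_y)
            render_tile((t % tiles_x) * tile_size,
                        (t / tiles_x) * tile_size, tile_size);
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < num_threads; i++) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}

The tile size is the granularity knob: larger tiles amortize the scheduling overhead, while smaller tiles balance load better.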

We would like to experiment with different approaches and parameter tuning to address the challenges described above, and hopefully arrive at a good tradeoff among these competing concerns.

Resources

Hardware

If our project remains ISPC only and does not end up using CUDA, we plan to develop and benchmark on the following devices:

MacBook Pro 16”
CPU: 2.4 GHz 8-Core Intel Core i9
Memory: 32 GB 2667 MHz DDR4
Graphics: Intel UHD Graphics 630 1536 MB

MacBook Pro 13”
CPU: 2.4 GHz Intel Core i5
Memory: 8 GB 2133 MHz LPDDR3
Graphics: Intel Iris Plus Graphics 655 1536 MB

If our project ends up involving CUDA, we will develop and benchmark the CUDA portion on the GHC machines that have RTX 2080 GPUs.

Software

We’re starting from a ray tracer implementation that does not utilize SIMD parallelization. Its GitHub repository is here.

Other than this base code, we’re also borrowing ideas from the “Application Case Studies” lecture, which introduced the ray packet idea and some possible optimization techniques around it.

We may also reference other online resources, including several papers we found that might be related to our project; the exact list will be included in future writeups.

Goals and Deliverables

As a baseline, we aim to achieve a reasonable speedup for rendering the classic Cornell box using ISPC, for example a 4x speedup with 8-wide vector lanes. This estimate is based on reports of similar efforts: https://ispc.github.io/perf.html. We plan to implement ray packet tracing, which packs rays into packets and maps the packet operations onto wide SIMD execution.

To address the challenges described in the section above, we also plan to implement a work assignment method that improves locality, if it turns out to be the bottleneck, and improves vector lane utilization. Additionally, we plan to partition the ray tracing process into path traversal and pixel value computation. By separating the two phases (see the sketch below), we hope to reduce the dependencies of the recursive implementation and improve the parallelizability of the code.
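
As a rough illustration of what this separation could look like (not a design we have committed to): the traversal phase records one entry per bounce, and a second phase walks the records backwards to produce the pixel color. Color and its + and * operators are assumed from our existing code; BounceRecord is a hypothetical type.

#include <vector>

// Hypothetical record written during the traversal phase: one per bounce.
struct BounceRecord
{
    Color emissive;   // emission at the hit point
    Color weight;     // BSDF weight toward the next bounce
};

// Second phase: fold the records back to front, which reproduces the value
// the recursion would have returned: L = e0 + w0 * (e1 + w1 * (...)).
Color shade_path(const std::vector<BounceRecord>& path)
{
    Color L(0);
    for (int i = (int)path.size() - 1; i >= 0; i--)
        L = path[i].emissive + path[i].weight * L;
    return L;
}

The traversal phase is then free of the return-value dependency and can be packetized, while the per-sample reduction is cheap and simple.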

If time permits and we start working with CUDA, we hope to see speedups of at least 20x, making the renderer nearly usable in real time. Since the GPU offers far more raw computational power and many more concurrent execution units, mapping our tasks (the ISPC tasks that we map onto CPU cores) onto CUDA thread blocks should yield further performance improvements.

As a deliverable, we will at least have a command-line-only build that renders an image, prints the execution time, and outputs a .ppm file (see the sketch below). This will be the version we use for measuring speedup against the initial sequential version. If time permits and the speedup is good enough that the renderer operates almost in real time, we may also hook our parallelized code into the interactive window, just like our sequential version works right now: as tiles of the image get rendered, the result is progressively updated on the screen.
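
For reference, here is a minimal sketch of the output side of that command-line build, assuming an 8-bit row-major RGB framebuffer; render() is a placeholder for the renderer itself.

#include <chrono>
#include <cstdio>
#include <vector>

// Write the framebuffer as a binary PPM (P6): a small text header followed
// by raw 8-bit RGB triplets, row-major.
void write_ppm(const char* path, int w, int h, const std::vector<unsigned char>& rgb)
{
    FILE* f = std::fopen(path, "wb");
    if (!f) return;
    std::fprintf(f, "P6\n%d %d\n255\n", w, h);
    std::fwrite(rgb.data(), 1, rgb.size(), f);
    std::fclose(f);
}

// Usage sketch:
//   auto t0 = std::chrono::steady_clock::now();
//   render(framebuffer);   // placeholder for the (sequential or SIMD) renderer
//   auto t1 = std::chrono::steady_clock::now();
//   std::printf("render time: %.3f s\n",
//               std::chrono::duration<double>(t1 - t0).count());
//   write_ppm("out.ppm", width, height, framebuffer);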

At the virtual poster session, we will show images rendered by our SIMD path tracer along with the time spent rendering them and speedup graphs comparing against the sequential implementation. We will also explain anything surprising or non-ideal in the results. Since the session is virtual and we cannot demo anything in person, we will at minimum link to our source code, provide build and usage instructions, and provide direct download links to binary builds so people can quickly try it out.

Platform Choice

Our project can in theory run on any device that supports the targeted SIMD instruction set. We’re starting with ISPC because it is the most convenient option for both team members: our Apple laptops do not support CUDA. Keeping everything on the CPU may also save us some initial debugging effort.

We’re still targeting CUDA as a stretch goal. We may have to resolve obstacles when porting our ISPC code to CUDA. Once we gather speedup data with CUDA, we can compare it against our ISPC results, and the two experiences will hopefully give us more insight into both SIMD tools.

Tentative Schedule

Week of 11.2

Clean up the sequential implementation, port it to ISPC and make sure it compiles and runs, regardless of speedup.

Week of 11.9

Do our research and learn from other people’s previous efforts in SIMD parallelization of path tracing. Implement an initial parallel version of the code and run it with ISPC.

Implement packet ray tracing. Modify the sequential version to remove the dependencies between earlier and subsequent ray bounces so that we can create tasks for packets of rays and queue those tasks. This should give us some initial speedup.

Week of 11.16 (checkpoint)

Make some progress, or at least a serious effort, toward improving ray coherence and SIMD vector utilization by semi-dynamically re-organizing rays into packets based on specific heuristics.

Measure any speedup we have so far and prepare for the checkpoint writeup.

Reflect on our progress and see if the schedule needs to be adjusted.

Week of 11.23

Continue our effort to improve SIMD vector utilization; optimize iteratively as much as we can. If we’re ahead of schedule, start porting our code to CUDA and make sure it compiles and runs correctly on the GHC machines.

Week of 11.30

Buffer time in case we fall behind schedule; otherwise, port our optimizations to CUDA and measure any speedup gains.

Week of 12.7

Prepare and clean up code for our deliverable(s), poster, website, and final project writeup.