Performance Study of General-Purpose GPU Computations Using Kernel Fusion

Open Access
- Author:
- Paluda, Michael
- Area of Honors:
- Computer Science
- Degree:
- Bachelor of Science
- Document Type:
- Thesis
- Thesis Supervisors:
- Anand Sivasubramaniam, Thesis Supervisor
- Danfeng Zhang, Thesis Honors Advisor
- Keywords:
- GPU
- CUDA
- KERNEL
- DYNAMIC PARALLELISM
- Abstract:
- The highly parallel architecture of modern GPUs has ushered in an era of general-purpose GPU computation powered by Nvidia’s CUDA parallel computing platform. The GPU applications built on this software stack range from neural networks, bioinformatics, molecular dynamics, and fast Fourier transforms to many other scientific modeling applications. A meaningful performance gain in CUDA programming would therefore have beneficial downstream effects on all of these applications. The research in this thesis studied several optimization techniques for reducing the time spent launching kernels in CUDA programs. Initial attempts were made to use an extension of the CUDA model known as dynamic parallelism; however, launching kernels from the CPU proved considerably more efficient than launching them through dynamic parallelism. We then employed a newer feature of the CUDA platform known as CUDA graphs, and found that the execution time of kernels launched through a CUDA graph was asymptotically the same as launching a series of kernels from the CPU, only with a higher initial overhead. Lastly, we fused the kernels together and synchronized them manually on the GPU. This method yielded significant performance gains for moderately sized kernels, but performance degraded for exceptionally large ones. Based on these results, further work should investigate the optimal number of kernels to fuse and synchronize manually, as well as synchronization optimizations for exceptionally large kernels. Finally, while dynamic parallelism provides an elegant way to write CUDA programs, an iterative approach driven from the CPU can often yield better results.
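
To make the first comparison concrete, here is a minimal sketch (not the thesis's actual benchmark) of the two launch strategies: an iterative launch loop on the CPU versus a parent kernel that launches children through dynamic parallelism. The kernel `step` and the names `data`, `n`, and `iters` are hypothetical stand-ins for the thesis's workloads.

```cuda
#include <cuda_runtime.h>

// Hypothetical worker kernel standing in for the thesis's workloads.
__global__ void step(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

// Dynamic parallelism: a parent kernel launches children from the GPU.
// Requires sm_35+ and compilation with -rdc=true. Note that device-side
// cudaDeviceSynchronize() is deprecated in recent CUDA toolkits.
__global__ void parent(float *data, int n, int iters) {
    for (int k = 0; k < iters; ++k) {
        step<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();  // wait for the child before the next launch
    }
}

int main() {
    const int n = 1 << 20, iters = 100;
    float *data;
    cudaMalloc(&data, n * sizeof(float));

    // Baseline: iterative launches driven from the CPU.
    for (int k = 0; k < iters; ++k)
        step<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    // Dynamic-parallelism variant: one launch, children spawned on-device.
    parent<<<1, 1>>>(data, n, iters);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```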
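The CUDA graphs experiment can be sketched the same way: capture the launch loop once into a graph, pay the instantiation cost up front, and replay it. This reuses the hypothetical `step`, `data`, `n`, and `iters` from the sketch above; `runAsGraph` is likewise an illustrative name, not code from the thesis.

```cuda
// Capture the same launch sequence once, then replay it as a graph.
void runAsGraph(float *data, int n, int iters) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // Record the launch loop into a graph instead of executing it eagerly.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < iters; ++k)
        step<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiation is the one-time overhead the abstract mentions; later
    // launches of graphExec are cheap. (Five-argument signature as used by
    // CUDA 10.x/11.x; CUDA 12 changed this call to take a flags argument.)
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```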
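Finally, kernel fusion with manual on-device synchronization. The abstract does not spell out the thesis's synchronization scheme, so this sketch uses the cooperative-groups grid barrier as one standard way to replace kernel boundaries with an in-kernel barrier; `fusedSteps` and `launchFused` are hypothetical names.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Fused kernel: the per-iteration work is inlined and the grid is
// synchronized manually instead of relaunching a kernel per iteration.
// Grid-wide sync requires compute capability 6.0+ and -rdc=true.
__global__ void fusedSteps(float *data, int n, int iters) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < iters; ++k) {
        if (i < n) data[i] = data[i] * 0.5f + 1.0f;
        grid.sync();  // grid-wide barrier replaces the kernel boundary
    }
}

// Cooperative kernels must be launched with cudaLaunchCooperativeKernel,
// and every block must be co-resident on the device for grid.sync() to be
// valid (a residency limit that is one plausible contributor to the
// degradation the thesis observed for exceptionally large kernels).
void launchFused(float *data, int n, int iters) {
    int block = 256;
    int blocks = (n + block - 1) / block;  // must not exceed residency limits
    void *args[] = { &data, &n, &iters };
    cudaLaunchCooperativeKernel((void *)fusedSteps, dim3(blocks), dim3(block),
                                args, 0, 0);
    cudaDeviceSynchronize();
}
```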