Written by Max Röhrbein-Kling and Johannes Kuhlmann
When we started working on the Android release of Galaxy on Fire 3 - Manticore, our goal was for everyone to be able to enjoy the same visual fidelity, no matter if playing on Android or iOS. On iOS, we were already using Metal to render the game. This helped us to push a lot of draw calls. We therefore decided to add Vulkan support to our in-house game engine, the ABYSS® Engine, and render our game with Vulkan on Android.
Vulkan was and still is quite a young API. It is the latest addition to the family of low-level graphics APIs that make rendering 3D graphics more efficient. We are not going to go into detail on all the advantages and features Vulkan has to offer. Instead we will focus on our experience with bringing our game to Vulkan in general and with Android as the target platform in particular.
While there is a lot of high-quality content out there, you will not find many first-hand reports on what actually happened to game developers or engine developers brave enough to go down the Vulkan rabbit hole. So, in this series of blog posts, we will mainly focus on the most interesting aspects of our implementation, before we shift over to the lessons we learned along the way.
There will be five separate posts:
- Introduction and Fundamentals (this post)
- Handling Resources and Assets
- What We have Learned
- Vulkan on Android
- Stats & Summary
As a disclaimer, the main objective of our Vulkan renderer was to ship a game. That means it is more pragmatic than perfected. Also, we have mainly done what worked for us. We have not used any fancy stuff like custom allocators, parallel command generation, reusing command buffers and so on. We do believe, though, that our implementation is still reasonably versatile and well done.
So, for the first post, let us talk about some of the most interesting aspects of our Vulkan renderer implementation.
DescriptorSets are rather unique to Vulkan as there is nothing like them in, for example, Metal. A DescriptorSet is basically a group of bindings that is used to get your data into shaders. To get your uniform/constant values and textures in there, you must update these DescriptorSets. The tricky thing is that you cannot update them while they are in use, and "in use" here covers both the CPU (your code as well as the driver's) and the GPU. You can only update a DescriptorSet when the GPU does not need it anymore, i.e. when it has finished rendering with it.
So, in order to update DescriptorSets (using
vkUpdateDescriptorSets()) and reuse them once they are free again, you would need to do a lot of tracking and management. We have ended up with two different solutions for uniforms and textures that solve this problem for us.
For uniforms, we use dynamic buffer offsets (
VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC). This way we can have one DescriptorSet per shader that we can reuse each time the shader is used for rendering. It has to be one per shader as the layout/bindings in the DescriptorSet depend on the uniforms declared in the shader. The dynamic offset removes the need for us to update the DescriptorSets per frame, but forces us to put all uniform data into a single buffer. This buffer needs to be large enough to fit the entirety of uniform data for however many frames we have simultaneously in the pipeline (again, CPU and GPU pipelines combined).
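To make the single-buffer idea more concrete, here is a minimal CPU-side sketch of how such a buffer could be carved up per frame. It is not the engine's actual code: the class name UniformRing and its methods are hypothetical, and the storage is a plain std::vector standing in for the mapped Vulkan buffer. The alignment parameter is assumed to come from VkPhysicalDeviceLimits::minUniformBufferOffsetAlignment, and the value returned by push() is what would be passed as a dynamic offset to vkCmdBindDescriptorSets().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model of one large uniform buffer shared by all frames in flight.
// The std::vector stands in for the mapped memory of the real Vulkan buffer.
class UniformRing {
public:
    UniformRing(uint32_t bytesPerFrame, uint32_t framesInFlight, uint32_t alignment)
        : bytesPerFrame_(bytesPerFrame),
          alignment_(alignment),
          storage_(static_cast<size_t>(bytesPerFrame) * framesInFlight) {
        // Each frame's region must start on an aligned offset.
        assert(alignment_ > 0 && bytesPerFrame_ % alignment_ == 0);
    }

    // Call once per frame, after the fence guarding this frame slot has been waited on.
    void beginFrame(uint32_t frameIndex) {
        assert(static_cast<size_t>(frameIndex + 1) * bytesPerFrame_ <= storage_.size());
        base_ = frameIndex * bytesPerFrame_;
        cursor_ = 0;
    }

    // Copies the uniforms of one draw call and returns the dynamic offset to bind with.
    uint32_t push(const void* data, uint32_t size) {
        const uint32_t aligned = alignUp(cursor_);
        assert(aligned + size <= bytesPerFrame_ && "per-frame uniform budget exceeded");
        std::memcpy(storage_.data() + base_ + aligned, data, size);
        cursor_ = aligned + size;
        return base_ + aligned;
    }

private:
    uint32_t alignUp(uint32_t value) const {
        return (value + alignment_ - 1) / alignment_ * alignment_;
    }

    uint32_t bytesPerFrame_;
    uint32_t alignment_;
    std::vector<uint8_t> storage_;
    uint32_t base_ = 0;    // start of the current frame's region
    uint32_t cursor_ = 0;  // bytes used within the current frame's region
};
```

Because every frame slot writes only into its own region, the CPU never touches uniform data that a frame still in flight on the GPU may be reading, and the DescriptorSet itself never has to change.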
The texture bindings highlighted a homemade problem for us: our renderer interface allows us to render any model (mesh + shader) with arbitrary textures. As a result, the Vulkan renderer needed to be able to deal with objects changing textures every frame, even if that feature is not used all too often.
At first, we tried tracking and managing DescriptorSets for textures, reusing them once the GPU was done with them and basically calling
vkUpdateDescriptorSets() once for each object to be drawn. However, this was way too slow. It more or less doubled the time we spent generating draw calls.
We then tried caching DescriptorSets for textures based on the shader and bound textures - a procedure that allows us to potentially share them between similar objects. Consequently, we now only need to update them when a previously unseen combination occurs. In practice, this works really well for us and means we almost never have to update the DescriptorSets again after the cache has been filled.
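The caching idea can be sketched as a hash map keyed on the shader plus the bound textures. This is an illustration under assumptions rather than our actual implementation: the handle types, TextureSetKey and TextureSetCache are hypothetical names, and the factory callback stands in for allocating a set and calling vkUpdateDescriptorSets() on it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical opaque handles; real code would use Vulkan handles instead.
using ShaderId = uint32_t;
using TextureId = uint32_t;
using DescriptorSetHandle = uint64_t;

// A cache key: which shader is used and which textures are bound, in binding order.
struct TextureSetKey {
    ShaderId shader;
    std::vector<TextureId> textures;

    bool operator==(const TextureSetKey& other) const {
        return shader == other.shader && textures == other.textures;
    }
};

struct TextureSetKeyHash {
    size_t operator()(const TextureSetKey& key) const {
        size_t hash = std::hash<uint32_t>{}(key.shader);
        for (TextureId texture : key.textures) {
            // Order-dependent hash combine.
            hash ^= std::hash<uint32_t>{}(texture) + 0x9e3779b9u + (hash << 6) + (hash >> 2);
        }
        return hash;
    }
};

class TextureSetCache {
public:
    // The factory stands in for allocating a DescriptorSet and updating it once.
    using Factory = std::function<DescriptorSetHandle(const TextureSetKey&)>;

    explicit TextureSetCache(Factory factory) : factory_(std::move(factory)) {
        assert(factory_ && "a factory callback is required");
    }

    // Returns the cached set for this combination, creating it on first use only.
    DescriptorSetHandle get(const TextureSetKey& key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            return it->second;
        }
        const DescriptorSetHandle set = factory_(key);
        cache_.emplace(key, set);
        return set;
    }

    size_t size() const { return cache_.size(); }

private:
    Factory factory_;
    std::unordered_map<TextureSetKey, DescriptorSetHandle, TextureSetKeyHash> cache_;
};
```

Since a cached set is never updated after it has been created, it does not matter whether the GPU is still using it; the only cost is paid the first time a combination shows up.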
Getting the synchronization right was probably one of the hardest parts of implementing the Vulkan renderer. While it is not an extraordinarily tough problem per se, we did not even have to think about it in OpenGL or Metal, where the driver does the synchronization for you. In Vulkan, however, you may have artifacts appear randomly in the rendered image. To fix this, you need to find the place where you did not synchronize correctly.
Generally, there are two categories of synchronization problems you need to safeguard against:
- CPU <-> GPU synchronization
- GPU command <-> GPU command synchronization
An example of the first category is how we create meshes. Since we are targeting mobile devices where CPU and GPU share unified memory, we place our mesh data in
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT memory. To upload the data, we map the mesh's buffer(s), memcpy the data into the buffer and then unmap the buffer.
First, we need to make sure the GPU does not access that buffer before we are done with the memcpy. In our case this is already guaranteed by our rendering interface. Our implementation provides a thread-safe way of creating mesh objects which is done synchronously on the calling thread before returning a pointer to the mesh object. This guarantees that a mesh cannot possibly be submitted for rendering before its data upload is complete.
But we still need to tell the GPU that we modified the memory because the device may have cached parts of the mesh buffer. We can either choose our buffer memory to also have the
VK_MEMORY_PROPERTY_HOST_COHERENT_BIT or manually declare our modification by calling
vkFlushMappedMemoryRanges() to commit our changes and invalidate relevant GPU caches.
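One detail to keep in mind when flushing manually: for non-coherent memory, the offset of a flushed range has to be a multiple of VkPhysicalDeviceLimits::nonCoherentAtomSize, and the size has to be a multiple of it as well unless the range ends at the end of the allocation. The following sketch only shows the arithmetic; FlushRange and alignedFlushRange are hypothetical helpers, and the result would be copied into the offset and size fields of a VkMappedMemoryRange.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical plain struct mirroring the offset/size fields of VkMappedMemoryRange.
struct FlushRange {
    uint64_t offset;
    uint64_t size;
};

// Expands the written byte range so that it satisfies the alignment rules for flushing.
// atomSize models nonCoherentAtomSize, allocationSize the size of the memory allocation.
FlushRange alignedFlushRange(uint64_t writeOffset, uint64_t writeSize,
                             uint64_t atomSize, uint64_t allocationSize) {
    assert(atomSize > 0);
    assert(writeOffset + writeSize <= allocationSize);

    // Round the start down and the end up to the next atom boundary.
    const uint64_t begin = writeOffset / atomSize * atomSize;
    uint64_t end = (writeOffset + writeSize + atomSize - 1) / atomSize * atomSize;

    // A range that reaches the end of the allocation is allowed to be unaligned there.
    end = std::min(end, allocationSize);
    return {begin, end - begin};
}
```

Choosing VK_MEMORY_PROPERTY_HOST_COHERENT_BIT memory avoids this bookkeeping entirely, which is one reason it can be the more pragmatic option.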
Next, we need to place a
VkBufferMemoryBarrier in a command buffer before the first use of the mesh data. In this case, the barrier signals a write access by the host which needs to be made visible to shaders reading that buffer. As such the barrier needs to take effect between
The second category of synchronization problems is mostly relevant for data that is read and modified during the rendering of a single frame. To highlight what problems can occur and how to solve them, let us take a look at how we generate command buffers and submit them to the graphics queue.
But first a bit of setup: for the sake of simplicity, let us assume we created a swapchain with three images. In theory, all three images can be in processing at different stages of the graphics pipeline at the same time. Since we cannot reuse command buffers before they are done executing, we also need three command buffers and three fences. We will be referring to these by index. We need two semaphores: the imageReadySemaphore and the
renderingFinishedSemaphore. We also need to keep track of the
frameNumber, i.e. how many frames we have rendered already.
With that sorted out, we are ready to start. We need to acquire an image to render into by calling
vkAcquireNextImageKHR(). Here, we pass in the
imageReadySemaphore which will then be signalled once the image is ready to be accessed again. Noteworthy is that the images may be acquired in an arbitrary order.
Before we can start filling a command buffer, we need to check if it is ready. We determine the
frameIndex (between 0 and 2) via the frameNumber:
frameIndex = frameNumber % 3
We then wait for the
fence[frameIndex]. Now we can reset
fence[frameIndex] as well as
commandBuffer[frameIndex] and then start filling the command buffer with commands. This part is entirely up to you. The next interesting thing for synchronization purposes is the submission. In the
vkQueueSubmit() of your rendering commands, you need to pass in the
imageReadySemaphore to wait on. The
renderingFinishedSemaphore and the
fence[frameIndex] have to be signalled. We need to wait on the
imageReadySemaphore to prevent rendering into our image before the image is actually ready for reuse as determined by the swapchain. Similarly, we want to signal the
renderingFinishedSemaphore, which will be used for presentation. We also have the device signal the fence corresponding to the command buffer we are submitting once it is done, so that we know when that command buffer can be reused.
Finally, it is time to present.
vkQueuePresentKHR() needs to wait on the
renderingFinishedSemaphore, otherwise the image might be displayed on the screen before all commands have finished.
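The per-frame bookkeeping described above can be modelled without any Vulkan calls. The sketch below is a simplified simulation, not renderer code: FrameSlot and FrameRing are hypothetical names, and a boolean flag stands in for each fence. The comments name the Vulkan call that would sit at each step.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr uint32_t kFramesInFlight = 3;

// Hypothetical stand-in for one fence and the command buffer it guards.
struct FrameSlot {
    bool inFlight = false;  // true between submission and completion on the GPU
};

class FrameRing {
public:
    // Picks the slot for this frame. mustWait reports whether the CPU would have
    // to block on the fence because the GPU has not finished the slot's last use.
    uint32_t beginFrame(uint64_t frameNumber, bool& mustWait) {
        const uint32_t frameIndex = static_cast<uint32_t>(frameNumber % kFramesInFlight);
        FrameSlot& slot = slots_[frameIndex];
        mustWait = slot.inFlight;  // vkWaitForFences() on fence[frameIndex] blocks here
        slot.inFlight = false;     // then vkResetFences() and vkResetCommandBuffer()
        return frameIndex;
    }

    // vkQueueSubmit(): waits on imageReadySemaphore, signals renderingFinishedSemaphore
    // and fence[frameIndex]. vkQueuePresentKHR() then waits on renderingFinishedSemaphore.
    void submit(uint32_t frameIndex) {
        assert(frameIndex < kFramesInFlight);
        slots_[frameIndex].inFlight = true;
    }

    // Models the GPU signalling fence[frameIndex] once the command buffer has executed.
    void gpuFinished(uint32_t frameIndex) {
        assert(frameIndex < kFramesInFlight);
        slots_[frameIndex].inFlight = false;
    }

private:
    std::array<FrameSlot, kFramesInFlight> slots_{};
};
```

With three slots, the CPU can record up to three frames ahead before it has to wait; the fourth frame reuses slot 0 and therefore blocks until the GPU has finished the first one.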
What we found helpful when tracking down synchronization issues was to place
vkDeviceWaitIdle() calls in code sections which are most likely to have synchronization problems. Sometimes the nature of the rendering artifacts lets you make assumptions about what is going wrong. The
vkDeviceWaitIdle() will not tell you how to fix the issue correctly and you certainly do not want to use it in production code. Still, it can highlight where synchronization is missing in the code.
This post addressed two challenges you are likely to face with Vulkan: DescriptorSets and synchronization. Neither is necessarily common in other graphics APIs, and both may therefore be difficult to integrate into an existing engine. It is, however, essential to get them right as you will be plagued by crashes and visual corruption otherwise.
In the next post, we will talk about assets like textures and shaders as well as graphics pipelines.