Written by Max Röhrbein-Kling and Johannes Kuhlmann
This is part three of our series of blog posts on our experience with bringing Galaxy on Fire 3 - Manticore to Vulkan.
Our posts follow this structure:
- Introduction and Fundamentals
- Handling Resources and Assets
- What We Have Learned (this post)
- Vulkan on Android
- Stats & Summary
Before we dive into the matter, we would like to point out once more that the focus of our Vulkan renderer was to ship a game. Thus, it is more pragmatic than perfect and we have mainly done what has worked for us - so there is no guarantee that our best practices will also work for you. We do believe, however, that our implementation is still reasonably versatile and well done. And we hope that you can learn a thing or two from our approach.
This third post covers a bunch of little problems we encountered while implementing Vulkan support and getting it to work on different devices.
You Have to Pay Respect
One of the things we learned when bringing our Vulkan renderer to different devices and GPUs was that it is very important to pay respect to the targeted devices' limits, properties, and capabilities. With Vulkan, almost anything can be queried, and most things have to be queried before you go ahead and use them. Some of these things we knew about, some came as a surprise. To shed more light on this issue, we are now going to discuss the examples that struck us as the most important ones.
One such example is maxMemoryAllocationCount of a Vulkan device. This number tells you how many VkDeviceMemory allocations can be live at the same time. Per spec, the minimum is 4,096. As it turns out, there are numerous devices out there that have a maximum of 4,096 (Adreno GPUs, for example) - a number we are apparently able to exceed with all the vertex/index buffers, uniform buffers, textures, and so on. So, as most Vulkan tutorials recommend, you should think about memory management early on and not make one allocation per buffer.
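To illustrate the idea of not making one allocation per buffer, here is a minimal sketch of sub-allocating from one large block. It is not the renderer's actual allocator: MemoryBlock and suballocate are hypothetical names, the block stands in for a single VkDeviceMemory allocation, and freeing is left out entirely.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one large VkDeviceMemory allocation that many
// buffers share. Real code would call vkAllocateMemory() once for the block
// and pass each returned offset to vkBindBufferMemory().
struct MemoryBlock {
    std::uint64_t size = 0;    // total bytes in the block
    std::uint64_t cursor = 0;  // first unused byte

    // Reserves `bytes` at an offset that is a multiple of `alignment`
    // (which must be a power of two). Returns false if the block is full.
    bool suballocate(std::uint64_t bytes, std::uint64_t alignment, std::uint64_t& offsetOut) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uint64_t offset = (cursor + alignment - 1) & ~(alignment - 1);
        if (offset > size || bytes > size - offset) {
            return false;
        }
        cursor = offset + bytes;
        offsetOut = offset;
        return true;
    }
};
```

With a scheme like this, the number of live allocations grows with the number of blocks rather than with the number of buffers and textures.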
We also had problems with the alignment of our data. There is the device's minMemoryMapAlignment and then there is the alignment you get from vkGetBufferMemoryRequirements(). In some cases, we had to take the maximum of the two in order to get the correct alignment.
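Taking the maximum of the two alignments can be sketched as follows. The function names are hypothetical, and the sketch assumes both values are powers of two, which is what the specification requires of them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Rounds `value` up to the next multiple of `alignment` (a power of two).
inline std::uint64_t alignUp(std::uint64_t value, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// `requirementAlignment` stands for VkMemoryRequirements::alignment and
// `minMapAlignment` for VkPhysicalDeviceLimits::minMemoryMapAlignment.
// The stricter (larger) of the two decides where the data may start.
inline std::uint64_t alignedOffset(std::uint64_t offset,
                                   std::uint64_t requirementAlignment,
                                   std::uint64_t minMapAlignment) {
    return alignUp(offset, std::max(requirementAlignment, minMapAlignment));
}
```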
You will also want to check the API version of the device you have. Make sure it is actually the one you programmed against. We had some Vulkan implementations where the major version was 0 and things like the VK_KHR_swapchain extension or validation layers were not available at all.
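A version check along these lines might look like the sketch below. It unpacks the value the way the VK_VERSION_MAJOR, VK_VERSION_MINOR, and VK_VERSION_PATCH macros do (10, 10, and 12 bits). The struct and function names are hypothetical, and the packed value would come from VkPhysicalDeviceProperties::apiVersion in real code.

```cpp
#include <cassert>
#include <cstdint>

struct ApiVersion {
    std::uint32_t majorPart;
    std::uint32_t minorPart;
    std::uint32_t patchPart;
};

// Packs a version like VK_MAKE_VERSION: 10 bits major, 10 bits minor, 12 bits patch.
inline std::uint32_t packApiVersion(std::uint32_t majorPart, std::uint32_t minorPart,
                                    std::uint32_t patchPart) {
    assert(majorPart < 1024u && minorPart < 1024u && patchPart < 4096u);
    return (majorPart << 22) | (minorPart << 12) | patchPart;
}

// Splits a packed version back into its three parts.
inline ApiVersion unpackApiVersion(std::uint32_t packed) {
    ApiVersion v;
    v.majorPart = packed >> 22;
    v.minorPart = (packed >> 12) & 0x3FFu;
    v.patchPart = packed & 0xFFFu;
    return v;
}

// True if the device reports at least the version the renderer was written against.
inline bool meetsMinimumVersion(std::uint32_t deviceVersion, std::uint32_t requiredMajor,
                                std::uint32_t requiredMinor) {
    const ApiVersion v = unpackApiVersion(deviceVersion);
    if (v.majorPart != requiredMajor) {
        return v.majorPart > requiredMajor;
    }
    return v.minorPart >= requiredMinor;
}
```

A device reporting a major version of 0 would fail this check against 1.0 and could be routed to a fallback renderer.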
A Vulkan implementation must support at least one of the following texture compression methods: ASTC, BC, or ETC2. If you want to be compatible with all Vulkan devices, you had better support all three texture compression formats. However, on Android, we have gotten away with only supporting ETC2 so far. This is due to the fact that all devices we target support it (presumably due to it being required by OpenGL ES 3).
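Choosing a texture family at startup could be sketched like this. CompressionSupport is a stand-in for the three feature booleans in VkPhysicalDeviceFeatures, and the preference order (ETC2, then ASTC, then BC) is purely an assumption made for the example.

```cpp
#include <cassert>

// Stand-in for three booleans in VkPhysicalDeviceFeatures:
// textureCompressionETC2, textureCompressionASTC_LDR, textureCompressionBC.
struct CompressionSupport {
    bool etc2;
    bool astcLdr;
    bool bc;
};

enum class TextureFamily { ETC2, ASTC, BC };

// Picks the texture family to load assets in. The order of preference is an
// assumption for this sketch, not something the specification prescribes.
inline TextureFamily chooseTextureFamily(const CompressionSupport& support) {
    // The specification requires at least one of the three to be supported.
    assert(support.etc2 || support.astcLdr || support.bc);
    if (support.etc2) {
        return TextureFamily::ETC2;
    }
    if (support.astcLdr) {
        return TextureFamily::ASTC;
    }
    return TextureFamily::BC;
}
```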
The framebuffer formats also vary from device to device (be it RGBA, BGRA, or something else). Depth and stencil formats differ wildly between GPUs, too. For example, depth and stencil may or may not be available as a combined format.
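One common way to cope with this is to walk a list of candidate formats and take the first one the device accepts. The sketch below assumes hypothetical DepthFormat values standing in for VkFormat entries, and the predicate stands in for a real query such as vkGetPhysicalDeviceFormatProperties().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical identifiers; real code would use VkFormat values such as
// VK_FORMAT_D24_UNORM_S8_UINT, VK_FORMAT_D32_SFLOAT_S8_UINT, VK_FORMAT_D16_UNORM.
enum class DepthFormat { D24S8, D32FS8, D16 };

// Writes the first candidate accepted by `isSupported` to `chosen` and
// returns true, or returns false if the device accepts none of them.
template <typename Predicate>
bool pickFirstSupported(const std::vector<DepthFormat>& candidates,
                        Predicate isSupported, DepthFormat& chosen) {
    assert(!candidates.empty());
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (isSupported(candidates[i])) {
            chosen = candidates[i];
            return true;
        }
    }
    return false;
}
```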
Additionally, there are some other small things like the alpha compositing mode (VkDisplayPlaneAlphaFlagBitsKHR) or the surface transform (VkSurfaceTransformFlagBitsKHR) for which different values are supported by different devices.
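Since these capabilities are reported as bitmasks, a small helper can pick a supported value. The sketch below is an assumption about how one might do it: pickSupportedFlag is a hypothetical name, and the mask would come from fields such as VkSurfaceCapabilitiesKHR::supportedTransforms or ::supportedCompositeAlpha.

```cpp
#include <cassert>
#include <cstdint>

// Returns `preferred` if the device supports it. Otherwise falls back to the
// lowest bit set in `supportedMask`, or 0 if the mask is empty.
inline std::uint32_t pickSupportedFlag(std::uint32_t supportedMask, std::uint32_t preferred) {
    // `preferred` must be exactly one flag bit.
    assert(preferred != 0 && (preferred & (preferred - 1)) == 0);
    if (supportedMask & preferred) {
        return preferred;
    }
    return supportedMask & (~supportedMask + 1u);
}
```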
Validity Does not Imply Correctness
If you have worked with Vulkan, you have hopefully also used its validation layers. These layers basically sit in between your game/application and the Vulkan driver, making sure your usage is valid. The layer concept is not exclusive to validation, but it is normally the first use case you come into contact with. There are different validation layers that focus on different areas where problems might occur. For example, there are layers for parameter validation, API state, and correct threading. The validation layers are open source and can be found on GitHub.
The beauty of the layer concept is you can enable error checking and validation when you need it and disable it to have zero overhead when you do not need it. We would like to emphasize that you really should use the validation layers whenever something is wrong. Make them super-easy to enable (without rebuilding) and maybe even run with them enabled by default if performance permits.
The validation layers are still subject to constant change. Even which individual layers exist and which layer does what is not set in stone yet. Therefore, you should get them from GitHub and build them yourself if you want the latest checks.
You should load the validation layers in order. Contrary to what we had assumed in the beginning, the validation results may be influenced by the order you specify the layers to be loaded in. There is a recommended order which can be found here and here.
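Keeping the recommended order while skipping layers a device does not offer can be sketched as below. The function name is hypothetical. In real code, the available names would come from vkEnumerateInstanceLayerProperties() and the recommended list from the documentation for your SDK version.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns every layer from `recommendedOrder` that is also in `available`,
// preserving the recommended order regardless of how the driver lists them.
inline std::vector<std::string> layersInOrder(const std::vector<std::string>& recommendedOrder,
                                              const std::vector<std::string>& available) {
    std::vector<std::string> result;
    for (const std::string& name : recommendedOrder) {
        if (std::find(available.begin(), available.end(), name) != available.end()) {
            result.push_back(name);
        }
    }
    return result;
}
```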
The validation layers are specific to the device you are running on. Of course, their code is the same for all devices, but the layers ask the given Vulkan device for its limits and capabilities. The layers do their validation based on that information. So, just because you do not have any validation issues on one device, it does not mean you will not have any issues on other Vulkan devices.
One very Android-specific issue we faced was that at some point our validation layers were not found by the Vulkan driver anymore. It just kept looking in the wrong folders. This happened after we integrated the Google Play services into our game. The same engine was able to find and load the layers successfully in a rather empty project without anything from Google Play in it. This bug seems to be specific to ARM Mali GPUs. We have reported the problem and hopefully it is going to be fixed at some point.
It is Easy to Lose Your Device
One thing that eventually will happen with Vulkan is that you do something wrong. On Android, you will then most likely get the dreaded VK_ERROR_DEVICE_LOST result, which means your driver or GPU had to be restarted. Your Vulkan device will probably not recover after this without a restart of the game. It may continue rendering, albeit with worse performance as the error keeps occurring.
This problem is very difficult to debug because everything is asynchronous. Via the Vulkan API you submit commands to the driver. While these commands may already be processed asynchronously, they will most likely also be executed asynchronously on the GPU. The result from the execution will then again be sent back to the driver asynchronously. If there is an error in this procedure, your Vulkan code will only detect it when it tries to call one of the Vulkan functions that may return the error. So, it is rather difficult to pinpoint the one Vulkan call causing the problem.
If you get VK_ERROR_DEVICE_LOST, it is best to simplify what you are rendering until the problem does not occur anymore. From there, you might be able to debug whatever problem you have with one specific asset or its render setup.
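While narrowing things down, it can help to record which call first reported the lost device. Because of the asynchrony described above, that call is usually not the one that caused the problem, but it gives an upper bound. The sketch below uses a stand-in Result enum (the two values match VkResult in the Vulkan headers) and hypothetical names for everything else.

```cpp
// Stand-in for VkResult; the numeric values match the Vulkan headers.
enum Result { kSuccess = 0, kErrorDeviceLost = -4 };

struct CallSite {
    const char* expression;
    const char* file;
    int line;
};

// Remembers the first call that returned a lost-device error.
struct DeviceLostTracker {
    bool lost = false;
    CallSite first{nullptr, nullptr, 0};

    // Passes the result through unchanged so the caller can still react to it.
    Result check(Result result, const char* expression, const char* file, int line) {
        if (result == kErrorDeviceLost && !lost) {
            lost = true;
            first = CallSite{expression, file, line};
        }
        return result;
    }
};

// Wraps a call so that its source text and location are recorded on failure.
#define TRACK_RESULT(tracker, call) (tracker).check((call), #call, __FILE__, __LINE__)
```

In a real renderer, every function that can return VK_ERROR_DEVICE_LOST (queue submission, fence waits, presentation) would be wrapped this way.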
For us, causes of a lost device included:
- reading and computing things with garbage uniform data
- sampling textures to which nothing was ever assigned
- synchronization issues, i.e. reusing or destroying resources while they were still used elsewhere
Of course, there were certainly other causes as well. But the ones listed above struck us as the most prominent.
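For the synchronization case, one widely used remedy is to delay destruction until the frames that might still use a resource have finished. The following is a minimal sketch of that idea, not the renderer's actual code. DeferredDestroyQueue is a hypothetical class, and it assumes the renderer waits on a per-frame fence before calling collect().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

class DeferredDestroyQueue {
public:
    // `framesInFlight` is how many frames the GPU may still be working on.
    explicit DeferredDestroyQueue(std::uint64_t framesInFlight) : delay_(framesInFlight) {}

    // Schedules `destroy` to run `framesInFlight` frames after `currentFrame`.
    void retire(std::uint64_t currentFrame, std::function<void()> destroy) {
        assert(destroy != nullptr);
        pending_.push_back(std::make_pair(currentFrame + delay_, std::move(destroy)));
    }

    // Call once per frame after waiting on that frame's fence. Runs every
    // destroy callback whose release frame has been reached.
    void collect(std::uint64_t currentFrame) {
        std::vector<Entry> stillPending;
        for (Entry& entry : pending_) {
            if (entry.first <= currentFrame) {
                entry.second();
            } else {
                stillPending.push_back(std::move(entry));
            }
        }
        pending_.swap(stillPending);
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    typedef std::pair<std::uint64_t, std::function<void()> > Entry;
    std::uint64_t delay_;
    std::vector<Entry> pending_;
};
```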
Drivers Have Issues, too
Normally, when something goes wrong, we first suspect it to be our fault - a notion that very often turns out to be true. However, this changed a bit when we were working with Vulkan on Android. At various points, we had to accept that it might not have been our fault and that there might simply be problems we could not fix at all.
In addition to the validation layer problem in one of the previous sections, there were two cases where we concluded that the driver must be at fault. Both happened on Qualcomm Adreno GPUs. Of course, to say it again, there is still a chance that these problems were caused by something we did. But in both cases, we could work around the symptoms by avoiding the features in question.
In the first case, our uniform structs were assigned to the wrong uniforms in the shader. So, basically our rendered objects behaved very weirdly as their uniforms did not have any sensible values. We tracked this problem down to our descriptor sets somehow being wrong. We made sure our binding numbers were correct in VkDescriptorSetLayoutBinding. Everything seemed fine, but our uniforms were still just as wrong as before. As it turned out, the driver was not using the binding numbers to bind the uniform structs, but instead their index in the bindings array of VkDescriptorSetLayoutCreateInfo. Simply sorting that array by binding number fixed the problem after days of debugging.
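The workaround amounts to a sort before the layout is created. In the sketch below, LayoutBinding is a stand-in for VkDescriptorSetLayoutBinding reduced to the one relevant field, and debugName is a hypothetical addition for readability.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for VkDescriptorSetLayoutBinding.
struct LayoutBinding {
    std::uint32_t binding;  // mirrors VkDescriptorSetLayoutBinding::binding
    const char* debugName;  // hypothetical, only here to make the example readable
};

// Orders the array so that each entry's index follows its binding number.
// This is harmless on conforming drivers and avoids the index/binding mix-up.
inline void sortByBindingNumber(std::vector<LayoutBinding>& bindings) {
    std::sort(bindings.begin(), bindings.end(),
              [](const LayoutBinding& a, const LayoutBinding& b) { return a.binding < b.binding; });
    // Binding numbers must be unique within one layout.
    for (std::size_t i = 1; i < bindings.size(); ++i) {
        assert(bindings[i - 1].binding != bindings[i].binding);
    }
}
```

Note that array index and binding number only coincide after sorting if the binding numbers are contiguous and start at 0.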
Another problem manifested itself with lots of flickering of one or multiple objects. The flickering intensified when pulling down the Android menu from the top of the screen. This problem was apparently somehow caused by our usage of dynamic pipeline states. Specifically, we had the viewport and scissor rect marked as being dynamic which results in them not being part of the graphics pipeline state. Disabling these features made the flickering go away. Luckily, we did not really make use of the ability to change around the viewport or scissor rect a lot.
In addition to things simply being broken, we also had problems with various GPUs behaving drastically differently from each other. These were problems where we were wrong according to the Vulkan specification, but everything was still working correctly on a lot of devices. On others, however, this was not the case.
For example, we could map a memory buffer twice (using vkMapMemory()) even though that should not be possible. Or we forgot to specify external dependencies for our subpasses (VkRenderPassCreateInfo), but everything still rendered fine on all devices, at least most of the time. On some devices, individual objects were occasionally missing for one frame at a time. Another example is that we specified a depth of 0 in VkImageCreateInfo for our 2D textures and it still worked on quite a few devices. The correct value should have been 1, obviously.
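Mistakes like the zero depth are cheap to catch on your own side before the driver ever sees them. As a small illustration, with Extent3D standing in for VkExtent3D and both function names being hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for VkExtent3D.
struct Extent3D {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t depth;
};

// Every dimension of an image extent must be at least 1, including depth.
inline bool isValidImageExtent(const Extent3D& extent) {
    return extent.width >= 1 && extent.height >= 1 && extent.depth >= 1;
}

// Builds the extent for a 2D texture; depth is 1, not 0.
inline Extent3D extentFor2D(std::uint32_t width, std::uint32_t height) {
    assert(width >= 1 && height >= 1);
    Extent3D extent = {width, height, 1};
    return extent;
}
```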
There are lots of little - or not so little - things that can go wrong in a Vulkan application. Luckily, the validation layers report most of these issues.
And then there are issues where you spend ages looking for the problem, have checked the Vulkan standard a dozen times, but everything looks correct to you. In such cases, it might help to assume your driver may not be implementing the standard correctly and see if that reveals a workaround to you.
In the next post, we will talk about some of the challenges we faced when bringing the Vulkan renderer to Android.