Staging buffer

In the previous chapter we went over memory allocation and buffers in Vulkan, and extended our first triangle application to draw triangles from memory using vertex and index buffers.

However, our vertex and index data still resides in system memory instead of VRAM. In this chapter we will move the vertex and index buffers into VRAM, create a staging buffer in system memory, write our vertex and index data into the staging buffer, and issue transfer commands that copy the data from the staging buffer into the vertex and index buffers.

This tutorial is in open beta. There may be bugs in the code and misinformation and inaccuracies in the text. If you find any, feel free to open a ticket on the repo of the code samples.

Introduction

So far our vertex and index buffers reside in system memory. This is fine for integrated GPUs, and it's not noticeably slow for our triangle and quad on a dedicated GPU either, but it will become a problem once the program needs to draw anything nontrivial.

Part of the solution is moving our vertex and index buffers into VRAM. We can select VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT as a required memory property. The problem is that a memory type that is device local, host visible and host coherent may not be available, and even if it is, it may be limited. Unless resizable BAR is enabled, the memory heap backing such a memory type typically exposes only 256MB of VRAM. You don't want to waste this memory.

So we need to move data into VRAM without writing it directly from the CPU. How can we do that? With transfer commands. We create a buffer in system memory, called the staging buffer, write our vertex and index data into it like we did in the previous tutorial, and then copy this data into our vertex and index buffers in VRAM. Transfer commands are recorded into a command buffer and submitted to a device queue.

Now we can piece together how to copy our vertex and index data into VRAM:

1. Create a staging buffer in host visible, host coherent system memory.
2. Map the staging buffer's memory and write the vertex and index data into it.
3. Create the vertex and index buffers in device local memory, marked as transfer destinations.
4. Record copy commands into a command buffer and submit it to a queue.
5. Wait for the transfer to complete and clean up the transfer resources.

Creating the Staging buffer

Creating the staging buffer is almost the same as creating the vertex and index buffers. The only differences are the usage flags and the size.

Let's calculate the minimum size for the staging buffer! It needs to hold the vertex and index data.


    //
    // Staging buffer size
    //

    let staging_buffer_size = vertex_data_size + index_data_size;

    let staging_buf_mem_props = (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) as VkMemoryPropertyFlags;

We also know we want to put it into system memory, so the memory property flags will be VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and VK_MEMORY_PROPERTY_HOST_COHERENT_BIT.
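A note on the size calculation above: the plain sum works here because both vertex elements (f32) and index elements (u32) are 4 bytes wide, so the index data naturally lands on a 4 byte boundary. If you pack data with stricter alignment requirements into a single allocation, you need to round each offset up to the required alignment. A minimal sketch (align_up is my own helper, not part of the sample code):

```rust
// Round `offset` up to the next multiple of `alignment`.
// `alignment` must be a power of two, as Vulkan alignments always are.
fn align_up(offset: u64, alignment: u64) -> u64 {
    debug_assert!(alignment.is_power_of_two());
    (offset + alignment - 1) & !(alignment - 1)
}

fn main() {
    let vertex_data_size: u64 = 84; // e.g. 7 vertices * 3 floats * 4 bytes
    // Place the index data after the vertex data, aligned to 16 bytes.
    let index_data_offset = align_up(vertex_data_size, 16);
    let index_data_size: u64 = 24;
    let staging_buffer_size = index_data_offset + index_data_size;
    println!("index data at {}, total size {}", index_data_offset, staging_buffer_size);
    // prints "index data at 96, total size 120"
}
```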

This buffer will not be used as a vertex or index buffer, instead it will be the data source of copy commands. We express this by adding VK_BUFFER_USAGE_TRANSFER_SRC_BIT to the usage flags.

The creation logic will be the following:


    //
    // Staging buffer
    //

    // Create buffer

    let staging_buffer_create_info = VkBufferCreateInfo {
        sType: VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
        pNext: core::ptr::null(),
        flags: 0x0,
        size: staging_buffer_size as VkDeviceSize,
        usage: VK_BUFFER_USAGE_TRANSFER_SRC_BIT as VkBufferUsageFlags,
        sharingMode: VK_SHARING_MODE_EXCLUSIVE,
        queueFamilyIndexCount: 0,
        pQueueFamilyIndices: core::ptr::null()
    };

    println!("Creating staging buffer.");
    let mut staging_buffer = core::ptr::null_mut();
    let result = unsafe
    {
        vkCreateBuffer(
            device,
            &staging_buffer_create_info,
            core::ptr::null(),
            &mut staging_buffer
        )
    };

    if result != VK_SUCCESS
    {
        panic!("Failed to create staging buffer. Error: {}.", result);
    }

    // Create memory

    let mut mem_requirements = VkMemoryRequirements::default();
    unsafe
    {
        vkGetBufferMemoryRequirements(
            device,
            staging_buffer,
            &mut mem_requirements
        );
    }

    let mut chosen_memory_type = phys_device_mem_properties.memoryTypeCount;
    for i in 0..phys_device_mem_properties.memoryTypeCount
    {
        if mem_requirements.memoryTypeBits & (1 << i) != 0 &&
            (phys_device_mem_properties.memoryTypes[i as usize].propertyFlags & staging_buf_mem_props) == staging_buf_mem_props
        {
            chosen_memory_type = i;
            break;
        }
    }

    if chosen_memory_type == phys_device_mem_properties.memoryTypeCount
    {
        panic!("Could not find memory type.");
    }

    let staging_buffer_alloc_info = VkMemoryAllocateInfo {
        sType: VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        pNext: core::ptr::null(),
        allocationSize: mem_requirements.size,
        memoryTypeIndex: chosen_memory_type
    };

    println!("Staging buffer size: {}", mem_requirements.size);
    println!("Staging buffer align: {}", mem_requirements.alignment);

    println!("Allocating staging buffer memory.");
    let mut staging_buffer_memory = core::ptr::null_mut();
    let result = unsafe
    {
        vkAllocateMemory(
            device,
            &staging_buffer_alloc_info,
            core::ptr::null(),
            &mut staging_buffer_memory
        )
    };

    if result != VK_SUCCESS
    {
        panic!("Could not allocate memory. Error: {}.", result);
    }

    // Bind buffer to memory

    println!("Binding staging buffer memory.");
    let result = unsafe
    {
        vkBindBufferMemory(
            device,
            staging_buffer,
            staging_buffer_memory,
            0
        )
    };

    if result != VK_SUCCESS
    {
        panic!("Failed to bind memory to staging buffer. Error: {}.", result);
    }
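The memory type search loop above can also be expressed as a small standalone function. Here is a sketch using plain integers so it is independent of any particular set of bindings (the function name and signature are my own):

```rust
// Find the index of a memory type that is allowed by `memory_type_bits`
// (taken from VkMemoryRequirements) and whose property flags (the
// VkMemoryType::propertyFlags values, passed in as a slice) contain all
// bits of `required_props`. Returns None if no suitable type exists.
fn find_memory_type(
    memory_type_bits: u32,
    type_property_flags: &[u32],
    required_props: u32,
) -> Option<u32> {
    (0..type_property_flags.len() as u32).find(|&i| {
        memory_type_bits & (1 << i) != 0
            && type_property_flags[i as usize] & required_props == required_props
    })
}

fn main() {
    // Hypothetical device: type 0 is device local (0x1),
    // type 1 is host visible | host coherent (0x2 | 0x4 = 0x6).
    let types = [0x1u32, 0x6];
    // The buffer accepts both types; we need host visible + coherent.
    println!("{:?}", find_memory_type(0b11, &types, 0x6)); // prints "Some(1)"
}
```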

At the end of the application we need to clean up as follows.


    //
    // Cleanup
    //

    let result = unsafe
    {
        vkDeviceWaitIdle(device)
    };

    // ...

    println!("Deleting staging buffer device memory");
    unsafe
    {
        vkFreeMemory(
            device,
            staging_buffer_memory,
            core::ptr::null_mut()
        );
    }

    println!("Deleting staging buffer");
    unsafe
    {
        vkDestroyBuffer(
            device,
            staging_buffer,
            core::ptr::null_mut()
        );
    }

Now that we have our staging buffer and system memory backing it, it's time to copy the vertex and index data into it.

Uploading data to Staging buffer

Copying the vertex and index data into the staging buffer is done the same way as we previously did with the vertex and index buffer: map the memory, copy the data and unmap the memory.

The only twist is that instead of copying into the memory of different VkDeviceMemory objects, we copy into different memory regions of a single VkDeviceMemory.


    //
    // Uploading to Staging buffer
    //

    let vertex_data_offset = 0;
    let index_data_offset = vertex_data_size as u64;

    unsafe
    {
        let mut data = core::ptr::null_mut();
        let result = vkMapMemory(
            device,
            staging_buffer_memory,
            0,
            staging_buffer_size as VkDeviceSize,
            0,
            &mut data
        );

        if result != VK_SUCCESS
        {
            panic!("Failed to map memory. Error: {}.", result);
        }

        let vertex_data_void = data.offset(vertex_data_offset as isize);
        let vertex_data_typed = vertex_data_void as *mut f32;
        core::ptr::copy_nonoverlapping::<f32>(
            vertices.as_ptr(),
            vertex_data_typed,
            vertices.len()
        );

        let index_data_void = data.offset(index_data_offset as isize);
        let index_data_typed = index_data_void as *mut u32;
        core::ptr::copy_nonoverlapping::<u32>(
            indices.as_ptr(),
            index_data_typed,
            indices.len()
        );

        vkUnmapMemory(
            device,
            staging_buffer_memory
        );
    }

We created two variables, vertex_data_offset and index_data_offset. These are byte offsets into the staging buffer: the first points to the start of the region reserved for vertex data, the second to the start of the region reserved for index data.
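The same layout can be demonstrated without any Vulkan objects: the sketch below packs vertex and index data into a plain Vec<u8> standing in for the mapped staging memory, using safe byte-wise copies instead of pointer casts (the function name is my own):

```rust
// Write vertex data at offset 0 and index data right after it,
// mirroring the mapped-memory copies above, but into a byte vector.
fn pack_vertex_index_data(vertices: &[f32], indices: &[u32]) -> Vec<u8> {
    let vertex_data_size = vertices.len() * 4;
    let index_data_size = indices.len() * 4;
    let index_data_offset = vertex_data_size;

    let mut staging = vec![0u8; vertex_data_size + index_data_size];
    for (i, v) in vertices.iter().enumerate() {
        staging[i * 4..i * 4 + 4].copy_from_slice(&v.to_ne_bytes());
    }
    for (i, n) in indices.iter().enumerate() {
        let at = index_data_offset + i * 4;
        staging[at..at + 4].copy_from_slice(&n.to_ne_bytes());
    }
    staging
}

fn main() {
    let staging = pack_vertex_index_data(&[0.0, 0.5, -0.5], &[0, 1, 2]);
    // The first 12 bytes hold the floats, the next 12 the indices.
    println!("{} bytes packed", staging.len()); // prints "24 bytes packed"
}
```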

After copying the data into the staging buffer, we move the vertex and index buffer into VRAM.

Modifying Vertex and Index buffer

Now it's time to move our vertex and index buffers into VRAM. We modify the required memory properties and add a new usage flag to each buffer. Then we delete the old code that copied the vertex and index data directly, as that is no longer possible.

The new property flag will be VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT.


    //
    // Vertex and Index data
    //

    // ...

    // Vertex and Index buffer size

    // ...

    let vertex_buf_mem_props = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT as VkMemoryPropertyFlags;
    let index_buf_mem_props = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT as VkMemoryPropertyFlags;

Now it's time to update the buffers' usage flags. Just as we marked the staging buffer as the data source for transfer commands, we need to mark the vertex and index buffers as destinations. The usage flag for this is VK_BUFFER_USAGE_TRANSFER_DST_BIT.

First we add it to the vertex buffer.


    //
    // Vertex buffer
    //

    // Create buffer

    let vertex_buffer_create_info = VkBufferCreateInfo {
        sType: VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
        pNext: core::ptr::null(),
        flags: 0x0,
        size: vertex_data_size as VkDeviceSize,
        usage: (VK_BUFFER_USAGE_VERTEX_BUFFER_BIT |
                VK_BUFFER_USAGE_TRANSFER_DST_BIT) as VkBufferUsageFlags, // We changed this.
        sharingMode: VK_SHARING_MODE_EXCLUSIVE,
        queueFamilyIndexCount: 0,
        pQueueFamilyIndices: core::ptr::null()
    };

    // ...

Then we add it to the index buffer.


    //
    // Index buffer
    //

    // Create buffer

    let index_buffer_create_info = VkBufferCreateInfo {
        sType: VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
        pNext: core::ptr::null(),
        flags: 0x0,
        size: index_data_size as VkDeviceSize,
        usage: (VK_BUFFER_USAGE_INDEX_BUFFER_BIT |
                VK_BUFFER_USAGE_TRANSFER_DST_BIT) as VkBufferUsageFlags, // We changed this.
        sharingMode: VK_SHARING_MODE_EXCLUSIVE,
        queueFamilyIndexCount: 0,
        pQueueFamilyIndices: core::ptr::null()
    };

    // ...

Now that we no longer require VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and VK_MEMORY_PROPERTY_HOST_COHERENT_BIT for the memory of the vertex and index buffer, we can no longer map them and copy into them. That code needs to be deleted.



    //
    // Uploading to Vertex buffer
    //

    // This must die.
    unsafe
    {
        let mut data = core::ptr::null_mut();

        // ...
    }

    //
    // Uploading to Index buffer
    //

    // This must die.
    unsafe
    {
        let mut data = core::ptr::null_mut();

        // ...
    }

Memory transfer to Vertex and Index buffer

This is the part where we create a command buffer, record copy commands into it and submit it.

Let's get started! We create a new block for the transfer code and its resources.


    //
    // Memory transfer
    //

    {
        // ...
    }

We will need a command pool and a command buffer that we can record transfer commands into, so let's create them! We create them pretty much the same way we did in the chapter about clearing the screen.


    //
    // Memory transfer
    //

    {
        let cmd_pool_create_info = VkCommandPoolCreateInfo {
            sType: VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
            pNext: core::ptr::null(),
            flags: 0x0,
            queueFamilyIndex: chosen_graphics_queue_family
        };

        println!("Creating transfer command pool.");
        let mut cmd_pool = core::ptr::null_mut();
        let result = unsafe
        {
            vkCreateCommandPool(
                device,
                &cmd_pool_create_info,
                core::ptr::null(),
                &mut cmd_pool
            )
        };

        if result != VK_SUCCESS
        {
            panic!("Failed to create transfer command pool. Error: {}.", result);
        }

        println!("Allocating transfer command buffers.");
        let cmd_buffer_alloc_info = VkCommandBufferAllocateInfo {
            sType: VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
            pNext: core::ptr::null(),
            commandPool: cmd_pool,
            level: VK_COMMAND_BUFFER_LEVEL_PRIMARY,
            commandBufferCount: 1
        };

        let mut transfer_cmd_buffer = core::ptr::null_mut();
        let result = unsafe
        {
            vkAllocateCommandBuffers(
                device,
                &cmd_buffer_alloc_info,
                &mut transfer_cmd_buffer
            )
        };

        if result != VK_SUCCESS
        {
            panic!("Failed to create transfer command buffer. Error: {}.", result);
        }

        // ...
    }

Then we start recording to our command buffer. Begin...


    //
    // Memory transfer
    //

    {
        // ...

        let cmd_buffer_begin_info = VkCommandBufferBeginInfo {
            sType: VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
            pNext: core::ptr::null(),
            flags: VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT as VkCommandBufferUsageFlags,
            pInheritanceInfo: core::ptr::null()
        };

        let result = unsafe
        {
            vkBeginCommandBuffer(
                transfer_cmd_buffer,
                &cmd_buffer_begin_info
            )
        };

        if result != VK_SUCCESS
        {
            panic!("Failed to start recording the command buffer. Error: {}.", result);
        }

        // ...
    }

This is almost the same as the vkBeginCommandBuffer call during rendering commands, but following the example of Alexander Overvoorde's staging buffer tutorial we add VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT to the flags, since we only submit this command buffer once. It's good practice to tell the driver as much as possible about our application, because it may be able to optimize.

Now we record our copy commands.


    //
    // Memory transfer
    //

    {
        // ...

        let copy_region = [
            VkBufferCopy {
                srcOffset: vertex_data_offset,
                dstOffset: 0,
                size: vertex_data_size as u64
            }
        ];
        unsafe
        {
            vkCmdCopyBuffer(
                transfer_cmd_buffer,
                staging_buffer,
                vertex_buffer,
                copy_region.len() as u32,
                copy_region.as_ptr()
            );
        }

        let copy_region = [
            VkBufferCopy {
                srcOffset: index_data_offset,
                dstOffset: 0,
                size: index_data_size as u64
            }
        ];
        unsafe
        {
            vkCmdCopyBuffer(
                transfer_cmd_buffer,
                staging_buffer,
                index_buffer,
                copy_region.len() as u32,
                copy_region.as_ptr()
            );
        }


        // ...
    }

The vkCmdCopyBuffer function takes an array of VkBufferCopy structs, each defining a source offset, a destination offset, and the size of the data to transfer. The srcBuffer and dstBuffer parameters define the source and destination buffers. The intended way of using this function is to collect all copies that share the same source and destination buffer into one array and issue a single vkCmdCopyBuffer call per buffer pair. Our two copies have different destination buffers, which is why we issue two calls.
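To illustrate the batching idea: if, hypothetically, the vertex and index data shared a single device-local buffer, both regions could go into one array for a single vkCmdCopyBuffer call. A sketch with a mock struct mirroring VkBufferCopy's fields (names are my own, not the sample code's):

```rust
// Mock of VkBufferCopy, only for illustrating how regions batch.
#[derive(Debug, PartialEq)]
struct BufferCopy { src_offset: u64, dst_offset: u64, size: u64 }

// Both copies share the same (src, dst) buffer pair, so they can be
// collected into one array and recorded with a single copy command.
fn batch_regions(vertex_size: u64, index_size: u64) -> Vec<BufferCopy> {
    vec![
        BufferCopy { src_offset: 0, dst_offset: 0, size: vertex_size },
        BufferCopy { src_offset: vertex_size, dst_offset: vertex_size, size: index_size },
    ]
}

fn main() {
    let regions = batch_regions(84, 24);
    println!("{} regions in one copy command", regions.len()); // prints "2 regions in one copy command"
}
```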

Now we can end recording the command buffer.


    //
    // Memory transfer
    //

    {
        // ...

        let result = unsafe
        {
            vkEndCommandBuffer(
                transfer_cmd_buffer
            )
        };

        if result != VK_SUCCESS
        {
            panic!("Failed to end recording the command buffer. Error: {}.", result);
        }

        // ...
    }

Now that the transfer command buffer is recorded, we need to submit it so the GPU can execute it. For the sake of simplicity we submit it to the graphics queue.

In a real world application you may want to submit transfer commands from dedicated transfer queues. Look for the VK_QUEUE_TRANSFER_BIT flag during queue family selection!
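Selecting such a family usually means preferring one that supports transfers but not graphics or compute, falling back to any transfer-capable family otherwise. A sketch with plain bitflags (the constants match the Vulkan headers; the function name is my own):

```rust
const VK_QUEUE_GRAPHICS_BIT: u32 = 0x1;
const VK_QUEUE_COMPUTE_BIT: u32 = 0x2;
const VK_QUEUE_TRANSFER_BIT: u32 = 0x4;

// Pick a queue family for transfers: prefer a dedicated transfer family
// (no graphics or compute bits), otherwise take any family that can do
// transfers. Graphics and compute families implicitly support transfer
// operations, so their bits count as transfer-capable too.
fn pick_transfer_queue_family(queue_flags: &[u32]) -> Option<u32> {
    let dedicated = queue_flags.iter().position(|&f| {
        f & VK_QUEUE_TRANSFER_BIT != 0
            && f & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT) == 0
    });
    dedicated
        .or_else(|| {
            queue_flags.iter().position(|&f| {
                f & (VK_QUEUE_TRANSFER_BIT | VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT) != 0
            })
        })
        .map(|i| i as u32)
}

fn main() {
    // Hypothetical device: family 0 = graphics+compute+transfer,
    // family 1 = compute+transfer, family 2 = transfer only.
    let families = [0x7, 0x6, 0x4];
    println!("{:?}", pick_transfer_queue_family(&families)); // prints "Some(2)"
}
```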



    //
    // Memory transfer
    //

    {
        // ...

        let cmd_buffer = [transfer_cmd_buffer];

        let submit_info = VkSubmitInfo {
            sType: VK_STRUCTURE_TYPE_SUBMIT_INFO,
            pNext: core::ptr::null(),
            waitSemaphoreCount: 0,
            pWaitSemaphores: core::ptr::null(),
            pWaitDstStageMask: core::ptr::null(),
            commandBufferCount: cmd_buffer.len() as u32,
            pCommandBuffers: cmd_buffer.as_ptr(),
            signalSemaphoreCount: 0,
            pSignalSemaphores: core::ptr::null()
        };

        let result = unsafe
        {
            vkQueueSubmit(
                graphics_queue,
                1,
                &submit_info,
                core::ptr::null_mut()
            )
        };

        if result != VK_SUCCESS
        {
            panic!("Failed to submit transfer commands: {:?}.", result);
        }

        // ...
    }

Instead of supplying a fence to vkQueueSubmit, this time we wait for the transfer to finish with vkQueueWaitIdle. This is not a particularly good practice, but for our example application it simplifies life.


    //
    // Memory transfer
    //

    {
        // ...

        //
        // Cleanup
        //

        let _result = unsafe
        {
            vkQueueWaitIdle(
                graphics_queue
            )
        };

        // ...
    }

In a real world application you probably want to create a fence and check it instead. This enables loading models during rendering: check the fence every frame, and only render a model once its transfer has completed. You can live with objects popping in, show a loading screen, or keep a door closed until the next room is loaded.

Now that the command buffer has executed, we no longer need it, so for the sake of simplicity we destroy its command pool, which also frees the command buffer allocated from it.

In a real world application you probably don't want to delete the command pool reserved for loading. Instead you probably want to reset it and reuse it when you need to load a new model.


    //
    // Memory transfer
    //

    {
        //
        // Cleanup
        //

        // ...

        println!("Deleting transfer command pool.");
        unsafe
        {
            vkDestroyCommandPool(
                device,
                cmd_pool,
                core::ptr::null_mut()
            );
        }
    }

...and that's it! Now we can compile our application and our triangle will be drawn again, but this time, from VRAM! (Assuming you have a discrete graphics card.)

Should I use a staging buffer?

When we get to texturing, we will need a staging buffer. For buffers on the other hand, it depends.

For integrated GPUs (such as notebook APUs and mobile GPUs) it's a waste of time and performance: you would literally copy data from system memory to system memory using the transfer queue.

For dedicated GPUs with resizable BAR, all of VRAM is host visible, and using transfer queues to upload buffer contents is also a waste of time. Adam Sawicki mentions this in his blog post about Vulkan memory types. In such cases you should place your buffers into device local, host visible memory.

For dedicated GPUs without resizable BAR, you are better off with staging buffers, because host visible VRAM is limited (one of my GPUs has 256MB of host visible VRAM) and you don't want to waste it: you may need it for data you upload every frame, such as the uniform buffers in the next chapter, vertex buffers for CPU particles, or other dynamically updated meshes. Buffers that aren't updated frequently (as in every or almost every frame) should go into device local VRAM, because keeping them in system memory can cause a performance hit, as it did for the Cherno in his Hazel engine.

If your application needs to run on desktops with both dedicated and integrated cards, and maybe also on mobile, you probably want to implement both schemes. Nowadays notebook APUs can run games, so supporting them may well be worth the effort.

In the rest of the chapters I will keep using staging buffers for the sake of simplicity, but this is not recommended for every kind of GPU. These are tutorial samples, not production applications.

Bonus: Notes on memory types

Now that we have elaborated a bit on memory types that may come from different heaps, it's time to fulfill a promise I made in the previous chapter.

It may happen that you run out of VRAM. In such cases the memory allocation may or may not fail. Right now if it does, we panic, but this may not be acceptable in a real world application. It is an option to try to allocate memory from a different memory type and memory heap, such as system memory. It may slow the application down, but at least it doesn't crash.

Naturally this cannot be done naively. An interesting case study comes from Feral Interactive's port of Warhammer 40,000: Dawn of War III, where the resource allocation strategy caused render targets (images used as attachments during rendering) to be allocated in system memory when VRAM ran out. This caused a slowdown compared to the OpenGL version, and they solved it by making room for the render targets in VRAM through defragmentation and eviction of other resources. The full presentation can be found on YouTube.

Wrapping up

In this chapter we moved our vertex and index buffers into VRAM, created a staging buffer through which we can load resources, and used transfer commands to move our models from system memory to VRAM.

However, so far our application only draws a single triangle and a single quad. If we wanted to draw many of them, we would need to upload a copy of the vertex and index data for every object. In the next chapter we will extend our application to render scenes with many instances of the same vertex and index data.

The sample code for this tutorial can be found here.

The tutorial continues here.