Learn about some of the main factors that affect bandwidth between a GPU and a system on a Mac.
The bandwidth between a GPU and a system is a crucial topic when developing high-performance Metal apps. Some GPUs are very powerful on their own, but this power can be severely degraded if a user has a suboptimal system setup or if your app is using a suboptimal GPU for a specific task.
In general, external GPUs are more powerful than many built-in GPUs (integrated or discrete). However, external GPUs typically have lower bandwidth to the system than built-in GPUs. Thus, data transfers between a system and an external GPU can be more expensive than data transfers between a system and its built-in GPUs. Additionally, data transfers between GPUs incur a significant cost because data generally can't be transferred directly between GPUs; the transfer typically has to pass through system memory as an intermediary step.
Bandwidth is largely determined by the bus that connects a GPU to a system. This bus varies according to different types of GPUs:
Integrated GPUs are built-in GPUs that use the same system memory and bus as the CPU; they don't have a separate interface.
Discrete GPUs are built-in GPUs that are connected to a system by an internal PCIe bus. Depending on the specific GPU and Mac model, this type of bus can have a width of 8 lanes (PCIe x8) or 16 lanes (PCIe x16).
External GPUs are connected to a system by an external Thunderbolt 3 bus.
PCIe x16 has twice as much bandwidth as PCIe x8 and four times as much bandwidth as Thunderbolt 3.
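The ratios above can be turned into a rough cost model. The following is a minimal sketch, not a measurement: `BusLink`, `relativeTransferCost`, `gpuKind`, and `printGPUSummary` are hypothetical names introduced for illustration, and the bandwidth values use Thunderbolt 3 as the unit, so only the ratios are meaningful (PCIe x8 works out to twice Thunderbolt 3 from the stated figures). The classification assumes that a removable device is an external GPU and that a non-removable device with unified memory is an integrated GPU.

```swift
#if canImport(Metal)
import Metal
#endif

/// Bus links compared in the text. The values are ratios relative to
/// Thunderbolt 3 (= 1), not measured throughput.
enum BusLink: String {
    case thunderbolt3 = "Thunderbolt 3"
    case pcieX8 = "PCIe x8"
    case pcieX16 = "PCIe x16"

    var relativeBandwidth: Double {
        switch self {
        case .thunderbolt3: return 1
        case .pcieX8: return 2
        case .pcieX16: return 4
        }
    }
}

/// Relative cost of moving `bytes` over `link`, in arbitrary units.
/// Only comparisons between links are meaningful.
func relativeTransferCost(bytes: Int, over link: BusLink) -> Double {
    return Double(bytes) / link.relativeBandwidth
}

/// Hypothetical helper that maps two device flags to the GPU types above.
/// Assumption: removable means external; unified memory means integrated.
func gpuKind(isRemovable: Bool, hasUnifiedMemory: Bool) -> String {
    if isRemovable { return "external" }
    return hasUnifiedMemory ? "integrated" : "discrete"
}

#if canImport(Metal) && os(macOS)
/// Lists every GPU visible to Metal with its assumed type.
func printGPUSummary() {
    for device in MTLCopyAllDevices() {
        let kind = gpuKind(isRemovable: device.isRemovable,
                           hasUnifiedMemory: device.hasUnifiedMemory)
        print("\(device.name): \(kind)")
    }
}
#endif
```

Under this model, the same transfer costs four times as much over Thunderbolt 3 as over PCIe x16, which is why moving work to an external GPU pays off only when the data stays in that GPU's video memory for a while.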
Resource Storage Modes
Bandwidth costs are minimized when data transfers across a bus are also minimized. This optimization is largely influenced by the storage mode of a resource, which determines the memory location and access permissions of a resource.
Shared resources are stored in system memory. Shared resources can be accessed by both the CPU and the GPU. This memory location means that a discrete GPU can access the resource only via a PCIe bus, and an external GPU can access the resource only via a Thunderbolt 3 bus. Compared to accessing video memory, accessing system memory is relatively slow for a discrete GPU and considerably slower for an external GPU. Thus, shared resources incur the highest bandwidth and data transfer costs.
Private resources are stored in video memory. Private resources can be accessed only by the GPU. This memory location means that discrete and external GPUs can access a resource directly from within their own video memory. Compared to accessing system memory, accessing video memory is much faster for discrete and external GPUs. Thus, private resources incur the lowest bandwidth and data transfer costs.
Managed resources are stored as a dual copy in both system memory and video memory. The resource copy in system memory can be accessed only by the CPU, and the resource copy in video memory can be accessed only by the GPU. These memory locations mean that discrete and external GPUs have fast access to the resource copy in video memory, but slower access via a blit operation to the resource copy in system memory. Managed resources have some costs associated with accessing system memory, but these costs are reduced by efficient blits. (Infrequent blits between system memory and video memory are much faster than frequent and direct system memory access.)
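The trade-offs between the three storage modes can be sketched as a simple selection rule followed by the synchronization calls that a managed buffer needs. This is one possible heuristic under stated assumptions, not a rule prescribed by Metal: `CPUAccess`, `StorageChoice`, `chooseStorage`, `makeManagedBuffer`, and `encodeReadback` are hypothetical names, and the thresholds between "occasional" and "frequent" CPU access are left to the app.

```swift
#if canImport(Metal)
import Metal
#endif

/// Hypothetical description of how often the CPU touches a resource.
enum CPUAccess { case never, occasional, frequent }

/// Mirrors the three storage modes discussed above.
enum StorageChoice { case shared, managed, gpuPrivate }

/// Assumed heuristic: keep GPU-only data in video memory, keep data the CPU
/// touches constantly in system memory, and use a managed dual copy in between
/// when the GPU has its own video memory.
func chooseStorage(cpuAccess: CPUAccess, hasUnifiedMemory: Bool) -> StorageChoice {
    switch cpuAccess {
    case .never: return .gpuPrivate
    case .occasional: return hasUnifiedMemory ? .shared : .managed
    case .frequent: return .shared
    }
}

#if canImport(Metal) && os(macOS)
extension StorageChoice {
    /// The Metal resource option that corresponds to each choice.
    var resourceOptions: MTLResourceOptions {
        switch self {
        case .shared: return .storageModeShared
        case .managed: return .storageModeManaged
        case .gpuPrivate: return .storageModePrivate
        }
    }
}

/// Fills a managed buffer from the CPU, then reports the modified byte range
/// so Metal can update the copy in video memory.
func makeManagedBuffer(device: MTLDevice, values: [Float]) -> MTLBuffer? {
    let length = values.count * MemoryLayout<Float>.stride
    guard length > 0,
          let buffer = device.makeBuffer(length: length, options: .storageModeManaged)
    else { return nil }
    values.withUnsafeBytes { source in
        buffer.contents().copyMemory(from: source.baseAddress!, byteCount: length)
    }
    buffer.didModifyRange(0..<length)
    return buffer
}

/// Encodes a blit that copies GPU-side changes back to the system memory copy.
/// The CPU should read the buffer only after the command buffer completes.
func encodeReadback(of buffer: MTLBuffer, into commandBuffer: MTLCommandBuffer) {
    guard let blit = commandBuffer.makeBlitCommandEncoder() else { return }
    blit.synchronize(resource: buffer)
    blit.endEncoding()
}
#endif
```

Batching CPU writes and calling `didModifyRange(_:)` once per update, rather than after every small change, keeps the number of bus transfers low.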
Presenting a drawable on a display incurs significant bandwidth costs if the drawable has to be transferred between GPUs. Each display, whether it's built in or external, is driven by a single GPU. Therefore, the fastest path to present a drawable to any given display is to render that drawable with the GPU that drives the display. Otherwise, the drawable has to be transferred from the GPU that renders it to the GPU that drives the display.
An example is a Mac with a discrete GPU, connected to an external GPU that's also connected to an external display (where the external GPU drives the external display). If a drawable is rendered with the discrete GPU, the system has to transfer this drawable to the external GPU via the Thunderbolt 3 bus. To avoid this transfer, the drawable should instead be rendered with the external GPU.
In Macs with multiple built-in GPUs, drawable transfers may also occur if different GPUs render and present the drawable. An example is a MacBook Pro with an integrated GPU and a discrete GPU, with automatic graphics switching enabled (where the integrated GPU can drive the MacBook Pro's display). If a drawable is rendered with the discrete GPU, the system has to transfer this drawable to the integrated GPU via the PCIe bus. To avoid this transfer, the drawable should instead be rendered with the integrated GPU.
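Both examples come down to rendering with the GPU that drives the target display. The sketch below shows one way to look up that GPU, assuming the app already knows the display's `CGDirectDisplayID`. The helper names `needsCrossGPUTransfer` and `preferredDevice(forDisplay:)` are hypothetical; the fallback to the system default device is an assumption about what an app might want when no device is reported.

```swift
#if canImport(Metal) && canImport(CoreGraphics) && os(macOS)
import CoreGraphics
import Metal
#endif

/// Returns true when the GPU that renders a drawable differs from the GPU
/// that drives the display, which implies a transfer between GPUs.
/// The IDs are expected to come from `MTLDevice.registryID`.
func needsCrossGPUTransfer(renderingRegistryID: UInt64,
                           displayRegistryID: UInt64) -> Bool {
    return renderingRegistryID != displayRegistryID
}

#if canImport(Metal) && canImport(CoreGraphics) && os(macOS)
/// Looks up the GPU currently driving `displayID`, falling back to the
/// system default device if none is reported (an assumed policy).
func preferredDevice(forDisplay displayID: CGDirectDisplayID) -> MTLDevice? {
    return CGDirectDisplayCopyCurrentMetalDevice(displayID)
        ?? MTLCreateSystemDefaultDevice()
}

/// Example: check whether rendering on `renderer` for the main display
/// would require a drawable transfer.
func mainDisplayRequiresTransfer(from renderer: MTLDevice) -> Bool {
    guard let displayDevice = preferredDevice(forDisplay: CGMainDisplayID()) else {
        return false
    }
    return needsCrossGPUTransfer(renderingRegistryID: renderer.registryID,
                                 displayRegistryID: displayDevice.registryID)
}
#endif
```

Because a window can move between displays and the GPU driving a display can change, an app would likely need to repeat this lookup when its window changes screens rather than cache the result once at launch.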