  • Optimize custom machine learning operations with Metal tensors

    Unlock advanced machine learning performance with the Metal Tensor API and the Metal Performance Primitives (MPP) Tensor Ops library. Discover how to build portable operations that take advantage of the Neural Accelerators in Apple's M5 and A19 GPUs. Learn how to write custom machine learning kernels for Core AI apps, and find out how to work efficiently with quantized data formats and GPU memory optimization.

    Chapters

    • 0:00 - Introduction
    • 0:21 - Apple's ML software stack
    • 2:25 - Managing quantized data
    • 4:23 - Multi-plane tensors
    • 5:17 - Quantized matrix multiplication
    • 9:31 - Building advanced ops
    • 13:35 - Integrating custom ops into Core AI
    • 15:25 - Next steps

    Resources

    • Running inline ML operations in a shader with Metal 4
    • Machine learning passes
    • Download the Metal Performance Primitives (MPP) Programming Guide
    • Metal Performance Shaders

    Related videos

    Tech Talks

    • Accelerate your machine learning workloads with the M5 and A19 GPUs

    WWDC25

    • Combine Metal 4 machine learning and graphics
    • 3:53 - Create a quantized MTLTensor

      // Creating a tensor with a quantized data type from device
      
      #define RANK 2
      
      MTLTensorDescriptor *tensorDesc = [MTLTensorDescriptor new];
      
      tensorDesc.dataType = MTLTensorDataTypeMetalFloat8E4M3;
      tensorDesc.usage = MTLTensorUsageCompute;
      
      NSInteger dimensions[RANK] = {NumCols, NumRows};
      tensorDesc.dimensions = [[MTLTensorExtents alloc] initWithRank:RANK values:dimensions];
      
      NSError *err = nil;
      id <MTLTensor> tensor = [device newTensorWithDescriptor:tensorDesc error:&err];
    • 4:48 - Declare a multi-plane tensor with scale factors

      // Creating a tensor with a scales auxiliary plane from device
      
      #define RANK 2
      
      MTLTensorAuxiliaryPlaneDescriptor *planeDesc = [MTLTensorAuxiliaryPlaneDescriptor new];
      planeDesc.dataType = MTLTensorDataTypeMetalFloat8UE8M0;
      
      NSInteger blockFactors[RANK] = {32, 1};
      planeDesc.blockFactors = [[MTLTensorExtents alloc] initWithRank:RANK values:blockFactors];
      
      MTLTensorAuxiliaryPlaneDescriptorMap *auxiliaryPlanes =
          [MTLTensorAuxiliaryPlaneDescriptorMap new];
      [auxiliaryPlanes setDescriptor:planeDesc forPlane:MTLTensorPlaneTypeScales];
      
      MTLTensorDescriptor *tensorDesc = [MTLTensorDescriptor new];
      tensorDesc.dataType = MTLTensorDataTypeMetalFloat8E4M3;
      tensorDesc.usage = MTLTensorUsageCompute;
      
      NSInteger dimensions[RANK] = {NumCols, NumRows};
      tensorDesc.dimensions = [[MTLTensorExtents alloc] initWithRank:RANK values:dimensions];
      tensorDesc.auxiliaryPlanes = auxiliaryPlanes;
      
      NSError *err = nil;
      id <MTLTensor> tensor = [device newTensorWithDescriptor:tensorDesc error:&err];
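
To make the relationship between the element plane and the scales plane concrete, here is a minimal CPU-side sketch in C++ of what dequantizing one row of such a tensor amounts to. It assumes that Metal's E4M3 and UE8M0 types follow the OCP Microscaling bit layouts (bias 7 with no infinities for E4M3, and a bias-127 power-of-two exponent for E8M0); the page does not confirm this. The helper names `decode_e8m0`, `decode_e4m3`, and `dequantize_row` are hypothetical and are not part of any Metal API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode an E8M0 scale: 8 exponent bits, no sign, no mantissa.
// Assumption: OCP MX convention, value = 2^(bits - 127), 0xFF encodes NaN.
inline float decode_e8m0(std::uint8_t bits) {
    if (bits == 0xFF) return NAN;
    return std::ldexp(1.0f, static_cast<int>(bits) - 127);
}

// Decode an FP8 E4M3 element: 1 sign bit, 4 exponent bits, 3 mantissa bits.
// Assumption: OCP FP8 convention, bias 7, no infinities, S.1111.111 is NaN.
inline float decode_e4m3(std::uint8_t bits) {
    const int sign = bits >> 7;
    const int exponent = (bits >> 3) & 0xF;
    const int mantissa = bits & 0x7;
    float magnitude;
    if (exponent == 0xF && mantissa == 0x7) {
        magnitude = NAN;
    } else if (exponent == 0) {
        // Subnormal: (mantissa / 8) * 2^-6 == mantissa * 2^-9
        magnitude = std::ldexp(static_cast<float>(mantissa), -9);
    } else {
        magnitude = std::ldexp(1.0f + mantissa / 8.0f, exponent - 7);
    }
    return sign ? -magnitude : magnitude;
}

// Dequantize one row: every `block` consecutive elements share one scale,
// mirroring the {32, 1} block factors used in the descriptor above.
std::vector<float> dequantize_row(const std::vector<std::uint8_t>& elements,
                                  const std::vector<std::uint8_t>& scales,
                                  std::size_t block = 32) {
    assert(scales.size() * block >= elements.size());
    std::vector<float> out(elements.size());
    for (std::size_t i = 0; i < elements.size(); ++i) {
        out[i] = decode_e4m3(elements[i]) * decode_e8m0(scales[i / block]);
    }
    return out;
}
```

With block factors of {32, 1}, a row of 64 elements needs two scale bytes, so the scales plane adds one byte per 32 element bytes.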
    • 6:07 - MSL type aliases for an MXFP8 tensor handle

      // Type aliases for an MXFP8 multi-plane tensor handle
      
      #include <metal_tensor>
      
      using namespace metal;
      
      using scales_plane = tensor_blockwise<tensor_plane_scales,
                                            device metal_fp8_ue8m0_format,
                                            32, 1>;
      
      using mxfp8_tensor = tensor<device metal_fp8_e4m3_format,
                                  dextents<int, 2>,
                                  tensor_handle,
                                  scales_plane>;
      
      kernel void matmul(mxfp8_tensor matrixA [[buffer(0)]],
                         mxfp8_tensor matrixB [[buffer(1)]],
                         tensor<device half, dextents<int, 2>> matrixC [[buffer(2)]])
      {
          // ...
      }
    • 6:51 - Declare an inline MXFP8 tensor on the stack

      // Type aliases for an inline MXFP8 multi-plane tensor
      
      #include <metal_tensor>
      
      using namespace metal;
      
      using scales_plane = tensor_blockwise<tensor_plane_scales,
                                            device metal_fp8_ue8m0_format,
                                            32, 1>;
      
      using mxfp8_tensor_inline = tensor<device metal_fp8_e4m3_format,
                                         dextents<int, 2>,
                                         tensor_inline,
                                         scales_plane>;
      
      // Construct tensor on the stack from buffer pointers
      mxfp8_tensor_inline matrixA(dataBufferA,
                                   dextents<int, 2>(K, M),
                                   array<int, 2>({ 1, K }),
                                   scales_plane(scalesBufferA));
    • 7:19 - Slice tensors and run a quantized matmul

      // Slice the tensors to extract the relevant tile
      auto tA = matrixA.slice(0, tgid.y * TILEM);
      auto tB = matrixB.slice(tgid.x * TILEN, 0);
      auto tC = matrixC.slice(tgid.x * TILEN, tgid.y * TILEM);
      
      // Set up the matmul descriptor
      constexpr auto descriptor = matmul2d_descriptor(TILEM,                  // M
                                                      TILEN,                  // N
                                                      dynamic_length_v<int>,  // K
                                                      false,   // Left matrix transposed
                                                      false);  // Right matrix transposed
      
      matmul2d<descriptor, execution_simdgroups<4>> op;
      
      // Run the op — TensorOps handles dequantization automatically
      op.run(tA, tB, tC);
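
The slicing arithmetic above can be checked against a plain CPU reference. The C++ sketch below is an illustration only: it assumes, based on the inline tensor snippet, that dimension 0 is the column index and dimension 1 is the row index, so that `slice(0, tgid.y * TILEM)` selects a band of rows of A. The `Matrix` type, the tile sizes, and the edge clamping are hypothetical choices made for this sketch, and the inputs are ordinary floats, so no dequantization happens here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix: element (row, col) lives at data[row * cols + col].
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Hypothetical tile sizes; a real kernel would tune these.
constexpr std::size_t TILEM = 4;
constexpr std::size_t TILEN = 8;

// Work done by one threadgroup with ID (tgid_x, tgid_y): it owns the
// TILEM x TILEN tile of C whose origin is (row0, col0).
void matmul_tile(const Matrix& A, const Matrix& B, Matrix& C,
                 std::size_t tgid_x, std::size_t tgid_y) {
    const std::size_t row0 = tgid_y * TILEM;  // tA and tC row origin
    const std::size_t col0 = tgid_x * TILEN;  // tB and tC column origin
    const std::size_t rowEnd = std::min(row0 + TILEM, C.rows);
    const std::size_t colEnd = std::min(col0 + TILEN, C.cols);
    for (std::size_t r = row0; r < rowEnd; ++r) {
        for (std::size_t c = col0; c < colEnd; ++c) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < A.cols; ++k) {  // full K extent
                acc += A.at(r, k) * B.at(k, c);
            }
            C.at(r, c) = acc;
        }
    }
}

// Emulate the dispatch grid: one matmul_tile call per threadgroup.
void matmul_tiled(const Matrix& A, const Matrix& B, Matrix& C) {
    assert(A.cols == B.rows && C.rows == A.rows && C.cols == B.cols);
    const std::size_t gridX = (C.cols + TILEN - 1) / TILEN;
    const std::size_t gridY = (C.rows + TILEM - 1) / TILEM;
    for (std::size_t y = 0; y < gridY; ++y) {
        for (std::size_t x = 0; x < gridX; ++x) {
            matmul_tile(A, B, C, x, y);
        }
    }
}
```

Each tile reads a full band of A and a full band of B, which is why K is left as a dynamic length in the descriptor while M and N are fixed to the tile size.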
    • 10:27 - Set up simdgroup-scoped QxK multiplication

      // Setup QxK matrix multiplication op
      constexpr auto mul_qk_op_desc = matmul2d_descriptor(/* ... */);
      matmul2d<mul_qk_op_desc, execution_simdgroup> mul_qk_op;
      
      // Slice Q, K, V
      auto tQSlice = tQ.slice<D, ROWS_PER_SIMD>(0, sgid * ROWS_PER_SIMD);
      auto tKSlice = tK.slice<D, BK>(0, k);
      auto tVSlice = tV.slice<D, BK>(0, k);
      
      // Create cooperative tensor to store tile of QxK
      auto ctQK = mul_qk_op.get_destination_cooperative_tensor<decltype(tQSlice),
                                                               decltype(tKSlice),
                                                               float>();
      
      // Multiply QxK
      mul_qk_op.run(tQSlice, tKSlice, ctQK);
    • 11:18 - Compute row-wise reduction for SoftMax

      // Create a cooperative tensor to store row reduction output
      auto ctTileRowMax = mul_qk_op.get_row_reduction_destination_cooperative_tensor<
                              decltype(tQSlice),
                              decltype(tKSlice),
                              float>();
      
      // Compute max over each row of QxK tile
      reduce_rows(ctQK, ctTileRowMax, reduction_operation::max, -INFINITY);
    • 11:56 - Compute element-wise SoftMax with map_iterator

      // Iterate over elements of QxK tile
      #pragma clang loop unroll(full)
      for (auto it = ctQK.begin(); it != ctQK.end(); it++) {
          // Fetch row max corresponding to this element
          auto row_it = ctRowMax.map_iterator(it);
      
          // Subtract row max from each element and compute exponent
          *it = exp(*it - *row_it);
      }
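
For reference, the row reduction and the element-wise step can be written as ordinary CPU code. The C++ sketch below works on a small row-major tile in a `std::vector` rather than on a cooperative tensor. The function names are hypothetical, and the `i / cols` lookup stands in for what `map_iterator` provides in the shader. The result is the unnormalized numerator of SoftMax; dividing by the row sums is a separate step.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Maximum of each row of a rows x cols row-major tile, starting from
// -infinity as the identity value, like the reduce_rows call above.
std::vector<float> reduce_rows_max(const std::vector<float>& tile,
                                   std::size_t rows, std::size_t cols) {
    assert(tile.size() == rows * cols);
    std::vector<float> rowMax(rows, -std::numeric_limits<float>::infinity());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            rowMax[r] = std::max(rowMax[r], tile[r * cols + c]);
        }
    }
    return rowMax;
}

// Replace each element x with exp(x - max of its row). Subtracting the row
// maximum keeps every exponent at or below zero, so exp cannot overflow.
void exp_minus_row_max(std::vector<float>& tile,
                       const std::vector<float>& rowMax,
                       std::size_t cols) {
    assert(cols > 0 && tile.size() == rowMax.size() * cols);
    for (std::size_t i = 0; i < tile.size(); ++i) {
        tile[i] = std::exp(tile[i] - rowMax[i / cols]);  // i / cols is the row
    }
}
```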
    • 12:33 - Reuse cooperative tensor as matmul input

      constexpr auto mul_sv_op_desc = matmul2d_descriptor(/* ... */);
      matmul2d<mul_sv_op_desc, metal::execution_simdgroup> mul_sv_op;
      
      if (mul_sv_op.is_compatible_as_left_input<float, half, float>(ctQK)) {
          // Directly reuse cooperative tensor as input
          auto ctQKIn = mul_sv_op.get_left_input_cooperative_tensor<float, half, float>(ctQK);
          mul_sv_op.run(ctQKIn, tVSlice, ctO);
      } else {
          // Store and reload through threadgroup memory if layout is not compatible
          ctQK.store(tgTensor);
          simdgroup_barrier(mem_flags::mem_threadgroup);
      
          auto ctQKIn = mul_sv_op.get_left_input_cooperative_tensor<float, half, float>();
          ctQKIn.load(tgTensor);
          mul_sv_op.run(ctQKIn, tVSlice, ctO);
      }
    • 0:00 - Introduction
    • Overview of how Metal tensors and TensorOps enable you to write optimized custom ML kernels on Apple silicon.

    • 0:21 - Apple's ML software stack
    • A tour of Apple's ML software stack, from high-level frameworks like Core AI and MLX down to Metal Performance Shaders, Metal Performance Primitives, and the TensorOps library — and why you might want to work at the Metal level.

    • 2:25 - Managing quantized data
    • How quantization reduces memory bandwidth requirements for large models, and the new quantized data types natively supported by TensorOps, including MX scaling formats.

    • 4:23 - Multi-plane tensors
    • How a single MTLTensor object can now represent both quantized element data and scale factors as separate planes, and how to configure multi-plane tensor descriptors in your Metal shaders.

    • 5:17 - Quantized matrix multiplication
    • How to extend a tiled matrix multiplication kernel to support quantized inputs, including binding scales planes, using inline tensors, slicing with threadgroup IDs, and handling custom dequantization formats.

    • 9:31 - Building advanced ops
    • How to implement Flash Attention with TensorOps, covering custom SIMD group mappings, cooperative tensors, row reductions, SoftMax, and the new API for passing cooperative tensors directly as matrix multiplication inputs — eliminating the threadgroup memory round-trip.
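
The streaming SoftMax at the heart of Flash Attention can be sketched on the CPU for a single query row. The C++ below is an illustrative reference under stated assumptions, not the session's kernel: it processes keys and values in blocks of `BK`, keeps a running maximum and running sum, and rescales the partial output whenever the maximum grows. The function name and the row-major layouts are hypothetical, and the usual 1/sqrt(D) scaling is left out, so the caller would pre-scale `q` if needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Attention output for one query row q (length D), streaming over N keys and
// values in blocks of BK. K and V are N x D, row-major.
std::vector<float> attention_row_streaming(const std::vector<float>& q,
                                           const std::vector<float>& K,
                                           const std::vector<float>& V,
                                           std::size_t D, std::size_t BK) {
    assert(D > 0 && BK > 0 && q.size() == D);
    assert(!K.empty() && K.size() % D == 0 && V.size() == K.size());
    const std::size_t N = K.size() / D;

    float runningMax = -std::numeric_limits<float>::infinity();
    float runningSum = 0.0f;
    std::vector<float> out(D, 0.0f);
    std::vector<float> scores;

    for (std::size_t k0 = 0; k0 < N; k0 += BK) {
        const std::size_t kEnd = std::min(k0 + BK, N);

        // Scores for this block (the QxK tile) and their maximum.
        scores.assign(kEnd - k0, 0.0f);
        float tileMax = -std::numeric_limits<float>::infinity();
        for (std::size_t j = k0; j < kEnd; ++j) {
            float s = 0.0f;
            for (std::size_t d = 0; d < D; ++d) s += q[d] * K[j * D + d];
            scores[j - k0] = s;
            tileMax = std::max(tileMax, s);
        }

        // Rescale what has been accumulated so far to the new maximum.
        // On the first block exp(-inf) is 0, which leaves the zeros in place.
        const float newMax = std::max(runningMax, tileMax);
        const float correction = std::exp(runningMax - newMax);
        runningSum *= correction;
        for (std::size_t d = 0; d < D; ++d) out[d] *= correction;

        // Accumulate this block's contribution.
        for (std::size_t j = k0; j < kEnd; ++j) {
            const float p = std::exp(scores[j - k0] - newMax);
            runningSum += p;
            for (std::size_t d = 0; d < D; ++d) out[d] += p * V[j * D + d];
        }
        runningMax = newMax;
    }

    // Final normalization by the SoftMax denominator.
    for (std::size_t d = 0; d < D; ++d) out[d] /= runningSum;
    return out;
}
```

Because the score tile is consumed as soon as it is produced, the full N-wide score row never has to be stored, which is the property that makes keeping the tile in a cooperative tensor attractive.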

    • 13:35 - Integrating custom ops into Core AI
    • How to integrate a custom Metal TensorOps kernel into a Core AI application, using Core AI's Python tools to convert PyTorch models and plug in custom Metal operations.

    • 15:25 - Next steps
    • A summary of the TensorOps features covered — quantized types, multi-plane tensors, Flash Attention, and Core AI integration — with pointers to sample code and related sessions on Core AI and Metal.

  • WWDC26
    Copyright © 2026 Apple Inc. All rights reserved.