View in English

  • Apple 开发者
    • 入门汇总

    探索“入门汇总”

    • 概览
    • 学习
    • Apple Developer Program

    及时了解最新动态

    • 最新动态
    • 开发者你好
    • 平台

    探索“平台”

    • Apple 平台
    • iOS
    • iPadOS
    • macOS
    • Apple tvOS
    • visionOS
    • watchOS
    • App Store

    精选

    • 设计
    • 分发
    • 游戏
    • 配件
    • 网页
    • Home
    • CarPlay 车载
    • 技术

    探索“技术”

    • 概览
    • Xcode
    • Swift
    • SwiftUI

    精选

    • 辅助功能
    • App Intents
    • Apple 智能
    • 游戏
    • 机器学习与 AI
    • 安全性
    • Xcode Cloud
    • 社区

    探索“社区”

    • 概览
    • “与 Apple 会面交流”活动
    • 社区主导的活动
    • 开发者论坛
    • 开源

    精选

    • WWDC
    • Swift Student Challenge
    • 开发者故事
    • App Store 大奖
    • Apple 设计大奖
    • Apple Developer Centers
    • 文档

    探索“文档”

    • 文档库
    • 技术概述
    • 示例代码
    • 《人机界面指南》
    • 视频

    发布说明

    • 精选更新
    • iOS
    • iPadOS
    • macOS
    • watchOS
    • visionOS
    • Apple tvOS
    • Xcode
    • 下载

    探索“下载”

    • 所有下载
    • 操作系统
    • 应用程序
    • 设计资源

    精选

    • Xcode
    • TestFlight
    • 字体
    • SF Symbols
    • Icon Composer
    • 支持

    探索“支持”

    • 概览
    • 帮助指南
    • 开发者论坛
    • “反馈助理”
    • 联系我们

    精选

    • 《开发者账户帮助》
    • 《App 审核指南》
    • 《App Store Connect 帮助》
    • 即将实行的要求
    • 协议和准则
    • 系统状态
  • 快速链接

    • 活动
    • 新闻
    • 论坛
    • 示例代码
    • 视频
 

视频

打开菜单 关闭菜单
  • 专题
  • 所有视频
  • 关于

更多视频

  • 简介
  • 概要
  • 代码
  • 使用 Metal 张量优化自定机器学习运算

    使用 Metal Tensor API 和 Metal Performance Primitives (MPP) Tensor Ops 资源库,解锁强大的机器学习性能。探索如何利用 Apple M5 和 A19 GPU 中的神经网络加速器构建可移植的运算。了解如何为你的 Core AI 应用程序构建自定机器学习内核,以及如何高效地处理量化数据格式并优化 GPU 内存。

    章节

    • 0:00 - Introduction
    • 0:21 - Apple's ML software stack
    • 2:25 - Managing quantized data
    • 4:23 - Multi-plane tensors
    • 5:17 - Quantized matrix multiplication
    • 9:31 - Building advanced ops
    • 13:35 - Integrating custom ops into Core AI
    • 15:25 - Next steps

    资源

    • Running inline ML operations in a shader with Metal 4
    • Machine learning passes
    • Download the Metal Performance Primitives (MPP) Programming Guide
    • Metal Performance Shaders
      • 高清视频
      • 标清视频

    相关视频

    Tech Talks

    • 使用 M5 和 A19 GPU 加速机器学习工作负载

    WWDC25

    • 实现 Metal 4 机器学习与图形应用程序的完美融合
  • 搜索此视频…
    • 3:53 - Create a quantized MTLTensor

      // Creating a tensor with a quantized data type from device
      
      #define RANK 2
      
      MTLTensorDescriptor *tensorDesc = [MTLTensorDescriptor new];
      
      tensorDesc.dataType = MTLTensorDataTypeMetalFloat8E4M3;
      tensorDesc.usage = MTLTensorUsageCompute;
      
      NSInteger dimensions[RANK] = {NumCols, NumRows};
      tensorDesc.dimensions = [[MTLTensorExtents alloc] initWithRank:RANK values:dimensions];
      
      NSError *err = nil;
      id <MTLTensor> tensor = [device newTensorWithDescriptor:tensorDesc error:&err];
    • 4:48 - Declare a multi-plane tensor with scale factors

      // Creating a tensor with a scales auxiliary plane from device
      
      #define RANK 2
      
      MTLTensorAuxiliaryPlaneDescriptor *planeDesc = [MTLTensorAuxiliaryPlaneDescriptor new];
      planeDesc.dataType = MTLTensorDataTypeMetalFloat8UE8M0;
      
      NSInteger blockFactors[RANK] = {32, 1};
      planeDesc.blockFactors = [[MTLTensorExtents alloc] initWithRank:RANK values:blockFactors];
      
      MTLTensorAuxiliaryPlaneDescriptorMap *auxiliaryPlanes =
          [MTLTensorAuxiliaryPlaneDescriptorMap new];
      [auxiliaryPlanes setDescriptor:planeDesc forPlane:MTLTensorPlaneTypeScales];
      
      MTLTensorDescriptor *tensorDesc = [MTLTensorDescriptor new];
      tensorDesc.dataType = MTLTensorDataTypeMetalFloat8E4M3;
      tensorDesc.usage = MTLTensorUsageCompute;
      
      NSInteger dimensions[RANK] = {NumCols, NumRows};
      tensorDesc.dimensions = [[MTLTensorExtents alloc] initWithRank:RANK values:dimensions];
      tensorDesc.auxiliaryPlanes = auxiliaryPlanes;
      
      NSError *err = nil;
      id <MTLTensor> tensor = [device newTensorWithDescriptor:tensorDesc error:&err];
    • 6:07 - MSL type aliases for an MXFP8 tensor handle

      // Type aliases for a MXFP8 multi-plane tensor handle
      
      #include <metal_tensor>
      
      using namespace metal;
      
      using scales_plane = tensor_blockwise<tensor_plane_scales,
                                            device metal_fp8_ue8m0_format,
                                            32, 1>;
      
      using mxfp8_tensor = tensor<device metal_fp8_e4m3_format,
                                  dextents<int, 2>,
                                  tensor_handle,
                                  scales_plane>;
      
      kernel void matmul(mxfp8_tensor matrixA [[buffer(0)]],
                         mxfp8_tensor matrixB [[buffer(1)]],
                         tensor<device half, dextents<int, 2>> matrixC [[buffer(2)]])
      {
          // ...
      }
    • 6:51 - Declare an inline MXFP8 tensor on the stack

      // Type aliases for a MXFP8 multi-plane tensor inline
      
      #include <metal_tensor>
      
      using namespace metal;
      
      using scales_plane = tensor_blockwise<tensor_plane_scales,
                                            device metal_fp8_ue8m0_format,
                                            32, 1>;
      
      using mxfp8_tensor_inline = tensor<device metal_fp8_e4m3_format,
                                         dextents<int, 2>,
                                         tensor_inline,
                                         scales_plane>;
      
      // Construct tensor on the stack from buffer pointers
      mxfp8_tensor_inline matrixA(dataBufferA,
                                   dextents<int, 2>(K, M),
                                   array<int, 2>({ 1, K }),
                                   scales_plane(scalesBufferA));
    • 7:19 - Slice tensors and run a quantized matmul

      // Slice the tensors to extract the relevant tile
      auto tA = matrixA.slice(0, tgid.y * TILEM);
      auto tB = matrixB.slice(tgid.x * TILEN, 0);
      auto tC = matrixC.slice(tgid.x * TILEN, tgid.y * TILEM);
      
      // Set up the matmul descriptor
      constexpr auto descriptor = matmul2d_descriptor(TILEM,                  // M
                                                      TILEN,                  // N
                                                      dynamic_length_v<int>,  // K
                                                      false,   // Left matrix transposed
                                                      false);  // Right matrix transposed
      
      matmul2d<descriptor, execution_simdgroups<4>> op;
      
      // Run the op — TensorOps handles dequantization automatically
      op.run(tA, tB, tC);
    • 10:27 - Set up simdgroup-scoped QxK multiplication

      // Setup QxK matrix multiplication op
      constexpr auto mul_qk_op_desc = matmul2d_descriptor(/* ... */);
      matmul2d<mul_qk_op_desc, execution_simdgroups> mul_qk_op;
      
      // Slice Q, K, V
      auto tQSlice = tQ.slice<D, ROWS_PER_SIMD>(0, sgid * ROWS_PER_SIMD);
      auto tKSlice = tK.slice<D, BK>(0, k);
      auto tVSlice = tV.slice<D, BK>(0, k);
      
      // Create cooperative tensor to store tile of QxK
      auto ctQK = mul_qk_op.get_destination_cooperative_tensor<decltype(tQSlice),
                                                               decltype(tKSlice),
                                                               float>();
      
      // Multiply QxK
      mul_qk_op.run(tQSlice, tKSlice, ctQK);
    • 11:18 - Compute row-wise reduction for SoftMax

      // Create a cooperative tensor to store row reduction output
      auto ctTileRowMax = mul_qk_op.get_row_reduction_destination_cooperative_tensor<
                              decltype(tQSlice),
                              decltype(tKSlice),
                              float>();
      
      // Compute max over each row of QxK tile
      reduce_rows(ctQK, ctTileRowMax, reduction_operation::max, -INFINITY);
    • 11:56 - Compute element-wise SoftMax with map_iterator

      // Iterate over elements of QxK tile
      #pragma clang loop unroll(full)
      for (auto it = ctQK.begin(); it != ctQK.end(); it++) {
          // Fetch row max corresponding to this element
          auto row_it = ctRowMax.map_iterator(it);
      
          // Subtract row max from each element and compute exponent
          *it = exp(*it - *row_it);
      }
    • 12:33 - Reuse cooperative tensor as matmul input

      constexpr auto mul_sv_op_desc = matmul2d_descriptor(/* ... */);
      matmul2d<mul_sv_op_desc, metal::execution_simdgroup> mul_sv_op;
      
      if (mul_sv_op.is_compatible_as_left_input<float, half, float>(ctQK)) {
          // Directly reuse cooperative tensor as input
          auto ctQKIn = mul_sv_op.get_left_input_cooperative_tensor<float, half, float>(ctQK);
          mul_sv_op.run(ctQKIn, tVSlice, ctO);
      } else {
          // Store and reload through threadgroup memory if layout is not compatible
          ctQK.store(tgTensor);
          simdgroup_barrier(mem_flags::mem_threadgroup);
      
          auto ctQKIn = mul_sv_op.get_left_input_cooperative_tensor<float, half, float>();
          ctQKIn.load(tgTensor);
          mul_sv_op.run(ctQKIn, tVSlice, ctO);
      }
    • 0:00 - Introduction
    • Overview of how Metal tensors and TensorOps enable you to write optimized custom ML kernels on Apple Silicon.

    • 0:21 - Apple's ML software stack
    • A tour of Apple's ML software stack, from high-level frameworks like Core AI and MLX down to Metal Performance Shaders, Metal Performance Primitives, and the TensorOps library — and why you might want to work at the Metal level.

    • 2:25 - Managing quantized data
    • How quantization reduces memory bandwidth requirements for large models, and the new quantized data types natively supported by TensorOps, including MX scaling formats.

    • 4:23 - Multi-plane tensors
    • How a single MTLTensor object can now represent both quantized element data and scale factors as separate planes, and how to configure multi-plane tensor descriptors in your Metal shaders.

    • 5:17 - Quantized matrix multiplication
    • How to extend a tiled matrix multiplication kernel to support quantized inputs, including binding scales planes, using inline tensors, slicing with threadgroup IDs, and handling custom dequantization formats.

    • 9:31 - Building advanced ops
    • How to implement Flash Attention with TensorOps, covering custom SIMD group mappings, cooperative tensors, row reductions, SoftMax, and the new API for passing cooperative tensors directly as matrix multiplication inputs — eliminating the threadgroup memory round-trip.

    • 13:35 - Integrating custom ops into Core AI
    • How to integrate a custom Metal TensorOps kernel into a Core AI application, using Core AI's Python tools to convert PyTorch models and plug in custom Metal operations.

    • 15:25 - Next steps
    • A summary of the TensorOps features covered — quantized types, multi-plane tensors, Flash Attention, and Core AI integration — with pointers to sample code and related sessions on Core AI and Metal.

Developer Footer

  • 视频
  • WWDC26
  • 使用 Metal 张量优化自定机器学习运算
  • 打开菜单 关闭菜单
    • iOS
    • iPadOS
    • macOS
    • Apple tvOS
    • visionOS
    • watchOS
    打开菜单 关闭菜单
    • Swift
    • SwiftUI
    • Swift Playground
    • TestFlight
    • Xcode
    • Xcode Cloud
    • SF Symbols
    打开菜单 关闭菜单
    • 辅助功能
    • 配件
    • Apple 智能
    • App 扩展
    • App Store
    • 音频与视频 (英文)
    • 增强现实
    • 设计
    • 分发
    • 教育
    • 字体 (英文)
    • 游戏
    • 健康与健身
    • App 内购买项目
    • 本地化
    • 地图与位置
    • 机器学习与 AI
    • 开源资源 (英文)
    • 安全性
    • Safari 浏览器与网页 (英文)
    打开菜单 关闭菜单
    • 完整文档 (英文)
    • 部分主题文档 (简体中文)
    • 教程
    • 下载
    • 论坛 (英文)
    • 视频
    打开菜单 关闭菜单
    • 支持文档
    • 联系我们
    • 错误报告
    • 系统状态 (英文)
    打开菜单 关闭菜单
    • Apple 开发者
    • App Store Connect
    • 证书、标识符和描述文件 (英文)
    • 反馈助理
    打开菜单 关闭菜单
    • Apple Developer Program
    • Apple Developer Enterprise Program
    • App Store Small Business Program
    • MFi Program (英文)
    • Mini Apps Partner Program
    • News Partner Program (英文)
    • Video Partner Program (英文)
    • 安全赏金计划 (英文)
    • Security Research Device Program (英文)
    打开菜单 关闭菜单
    • 与 Apple 会面交流
    • Apple Developer Center
    • App Store 大奖 (英文)
    • Apple 设计大奖
    • Apple Developer Academies (英文)
    • WWDC
    阅读最近新闻。
    获取 Apple Developer App。
    版权所有 © 2026 Apple Inc. 保留所有权利。
    使用条款 隐私政策 协议和准则