Metal Performance Shaders

RSS for tag

Optimize graphics and compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family using Metal Performance Shaders.

Posts under Metal Performance Shaders tag

21 Posts
Sort by:

Post

Replies

Boosts

Views

Activity

Example Usage of sliceUpdateDataTensor
Where can I find an example of using this MPSGraph function? I'm trying to use it to paste an image into a larger canvas at certain coordinates. func sliceUpdateDataTensor( _ dataTensor: MPSGraphTensor, update updateTensor: MPSGraphTensor, starts: [NSNumber], ends: [NSNumber], strides: [NSNumber], startMask: UInt32, endMask: UInt32, squeezeMask: UInt32, name: String? ) -> MPSGraphTensor
0
0
141
1w
Metal Compute Overhead
Hello, We are experimenting with Metal to accelerate some peculiar numerical computation. Our workloads are relatively small, so the ability to avoid moving data to and from the GPU's memory is very appealing. However, we are observing higher overhead compared to CUDA, which negates the benefits of avoiding data transfer. In our tests using an empty kernel, CUDA completes in 0.001 ms (Intel i7 10700K, RTX 3080), while Metal's waitUntilCompleted takes 0.12 ms (M2 Max). As we do not have prior experience with Metal, we are wondering if we are using the APIs just fine and this timing is expected, or if there is a way to reduce it. Thank you in advance for any comment! test-metal.cpp
0
0
195
2w
Metal Inline Functions
Hi! How to define and call an inline function in Metal? Or simple function that will return some value. Case: inline uint index4D(constant _4D& shape, constant uint& n, constant uint& c, constant uint& h, constant uint& w) { return n * shape.C * shape.H * shape.W + c * shape.H * shape.W + h * shape.W + w; } When I call it in my kernel function I get No matching function for call error. Thx in advance.
2
0
176
2w
Normally distributed MPSMatrixRandom number generation generates NaN
When generating large arrays of random numbers, NaNs show up. They also show up at the same indices when using the same seed, leading me to believe that this is a bug with MPSMatrixRandom's normally distributed Float32 random number distribution. Happens with both Philox and MTGP32. Is this intentional and how do I work around this? See the original post for a MWE in Swift and Julia: https://github.com/JuliaGPU/Metal.jl/issues/474
0
1
178
3w
Generating procedural textures sample code error.
Screenshot: Specific error message: validateComputeFunctionArguments:1149: failed assertion `Compute Function(textureShader): Shader uses texture(texture[0]) as read-write, but hardware does not support read-write texture of this pixel format.' OS: visionOS 2.1 (22N5548c) simulator. Link: https://developer.apple.com/documentation/visionos/generating-procedural-textures-in-visionos
1
0
231
Oct ’24
Issue of viewing MPSGraph compiled for iOS platform
We convert a .onnx file to mpsgraphpackage for iOS deploymentPlatform with command “Mpsgraphtool convert -deploymentPlatform iOS -minimumDeploymentTarget17.0.0 model.onnx -path .” When open output.mpsgraphpackage with Xcode16, there are only “generic” and “ Apple M2(MTLDevice)” options in the “Device” selection list. Cannot find any option for iOS device. How can we view mpsgraph compiled for iOS platform? We use Xcode16 on a MacBook Pro M2 with macOS 15.
0
0
204
Oct ’24
Many inputs to `MPSNNGraph::encodeBatchToCommandBuffer`
I understand we can use MPSImageBatch as input to [MPSNNGraph encodeBatchToCommandBuffer: ...] method. That being said, all inputs to the MPSNNGraph need to be encapsulated in a MPSImage(s). Suppose I have an machine learning application that trains/infers on thousands of input data where each input has 4 feature channels. Metal Performance Shaders is chosen as the primary AI backbone for real-time use. Due to the nature of encodeBatchToCommandBuffer method, I will have to create a MTLTexture first as a 2D texture array. The texture has pixel width of 1, height of 1 and pixel format being RGBA32f. The general set up will be: #define NumInputDims 4 MPSImageBatch * infBatch = @[]; const uint32_t totalFeatureSets = N; // Each slice is 4 (RGBA) channels. const uint32_t totalSlices = (totalFeatureSets * NumInputDims + 3) / 4; MTLTextureDescriptor * descriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat: MTLPixelFormatRGBA32Float width: 1 height: 1 mipmapped: NO]; descriptor.textureType = MTLTextureType2DArray descriptor.arrayLength = totalSlices; id<MTLTexture> texture = [mDevice newTextureWithDescriptor: descriptor]; // bytes per row is `4 * sizeof(float)` since we're doing one pixel of RGBA32F. [texture replaceRegion: MTLRegionMake3D(0, 0, 0, 1, 1, totalSlices) mipmapLevel: 0 withBytes: inputFeatureBuffers[0].data() bytesPerRow: 4 * sizeof(float)]; MPSImage * infQueryImage = [[MPSImage alloc] initWithTexture: texture featureChannels: NumInputDims]; infBatch = [infBatch arrayByAddingObject: infQueryImage]; The training/inference will be: MPSNNGraph * mInferenceGraph = /*some MPSNNGraph setup*/; MPSImageBatch * returnImage = [mInferenceGraph encodeBatchToCommandBuffer: commandBuffer sourceImages: @[infBatch] sourceStates: nil intermediateImages: nil destinationStates: nil]; // Commit and wait... // Read the return image for the inferred result. As you can see, the setup is really ad hoc - a lot of 1x1 pixels just for this sole purpose. Is there any better way I can achieve the same result while still on Metal Performance Shaders? I guess a further question will be: can MPS handle general machine learning cases other than CNN? I can see the APIs are revolved around convolution network, both from online documentations and header files. Any response will be helpful, thank you.
0
0
263
Oct ’24
Why is the speed of metal shading kernel so slow?
Hi, I am recently writing metal shader language to parallelize the algorithms to accelerate the speed of it. I created a simple example to show the acceleration result of it. Since Rust is used in our algorithm, so I used metal-rs as the wrapper to execute the MSL kernels from rust side. In this example, I am calculating the result of two arrays, and kernel looks like: kernel void two_array_addition_2( constant uint* a [[buffer(0)]], constant uint* b [[buffer(1)]], device uint* c [[buffer(2)]], uint idx [[thread_position_in_grid]] ) { c[idx] = a[idx] + b[idx]; } in the main.rs, you can see a function called execute_kernel() , this function has all it needs to execute the kernel in MSL (such as commandEncoder, piplelineState, etc). use core::mem; use metal::{Buffer, MTLSize}; use objc::rc::autoreleasepool; use std::time::Instant; use two_array_addition::abstractions::state::MetalState; fn execute_kernel( name: &str, state: &MetalState, input_a: &Buffer, input_b: &Buffer, output_c: &Buffer, ) -> Vec<u32> { // assert!(input_a.len() == input_b.len() && input_a.len() == output_c.len()); // let len = input_a.len() as u64; let len = input_a.length() as u64 / mem::size_of::<u32>() as u64; // 1. Init the MetalState // - we inited it // 2. Set up Pipeline State let pipeline = state.setup_pipeline(name).unwrap(); // 3. Allocate the buffers for A, B, and C // - we allocated outside of this function let mut result: &[u32] = &[]; autoreleasepool(|| { // 4. Create the command buffer & command encoder let (command_buffer, command_encoder) = state.setup_command( &pipeline, Some(&[(0, input_a), (1, input_b), (2, output_c)]), ); // 5. command encoder dispatch the threadgroup size and num of threads per threadgroup let threadgroup_count = MTLSize::new((len + 256 - 1) / 256, 1, 1); let thread_per_threadgroup = MTLSize::new(256, 1, 1); // let grid_size = MTLSize::new(len, 1, 1); // let threadgroup_count = MTLSize::new(pipeline.max_total_threads_per_threadgroup(), 1, 1); command_encoder.dispatch_thread_groups(threadgroup_count, thread_per_threadgroup); command_encoder.end_encoding(); command_buffer.commit(); command_buffer.wait_until_completed(); // 6. Copy the result back to the host let start = Instant::now(); result = MetalState::retrieve_contents::<u32>(output_c); let duration = start.elapsed(); println!("Duration for copying result back to host: {:?}", duration); }); result.to_vec() } The performance of the result is kinda interesting to me. This is the result: $ cargo run -r This is expected to run for a while... please wait... Generating input arrays... Generating input arrays... Generating output array... Generating expected output... Duration for allocating buffers: 2.015258s Executing 1st kernel (1)... Duration for copying result back to host: 5.75µs Executing 1st kernel (2)... Duration for copying result back to host: 542ns Executing 2nd kernel (1)... Duration for copying result back to host: 1µs Executing 2nd kernel (2)... Duration for copying result back to host: 458ns Duration expected: 183.406167ms Duration for 1st kernel (1): 1.894994875s Duration for 1st kernel (2): 537.318208ms Duration for 2nd kernel (1): 501.33275ms Duration for 2nd kernel (2): 497.339916ms You have successfully run the kernels! The speed is slower when executing in the MSL kernel, while I reckon of the dataset is quite big ($2^{29}$) The first kernel execution takes more time to launch. Is there any way to optimize the MSL in this case? And in most case, when you design the algorithm into parallelism, what would be the concerns? The machine I am using is M1 Pro with 14-core GPU and 16 GB memory. Does anyone have idea / explanation for why these happen? Thank you
1
0
333
Sep ’24
To use ARSCNView to capture a 3D model of a scene and obtain the mesh information, how can I retrieve the texture information for the mesh?
arScnView = ARSCNView(frame: CGRect.zero, options: nil) arScnView.delegate = self arScnView.automaticallyUpdatesLighting = true arScnView.allowsCameraControl = true addSubview(arScnView) arSession = arScnView.session arSession.delegate = self config = ARWorldTrackingConfiguration() config.sceneReconstruction = .meshWithClassification config.environmentTexturing = .automatic func session(_ session: ARSession, didAdd anchors: [ARAnchor]) { anchors.forEach({ anchor in if let meshAnchor = anchor as? ARMeshAnchor { let node = meshAnchor.toSCNNode() self.arScnView.scene.rootNode.addChildNode(node) } if let environmentProbeAnchor = anchor as? AREnvironmentProbeAnchor { // Can I retrieve the texture map corresponding to ARMeshAnchor from Environment Probe Anchor? // Or how can I retrieve the texture map corresponding to ARMeshAnchor? } }) } How can I scan a 3D scene and save it as USDZ? I want to achieve the following scenario?
0
0
348
Sep ’24
Metal Performance Shader color issue with yCbCr buffer
I'm making an app that reads a ProRes file, processes each frame through metal to resize and scale it, then outputs a new ProRes file. In the future the app will support other codecs but for now just ProRes. I'm reading the ProRes 422 buffers in the kCVPixelFormatType_422YpCbCr16 pixel format. This is what's recommended by Apple in this video https://developer.apple.com/wwdc20/10090?time=599. When the MTLTexture is run through a metal performance shader, the colorspace seems to force RGB or is just not allowing yCbCr textures as the output is all green/purple. If you look at the render code, you will see there's a commented out block of code to just blit copy the outputTexture, if you perform the copy instead of the scaling through MPS, the output colorspace is fine. So it appears the issue is from Metal Performance Shaders. Side note - I noticed that when using this format, it brings in the YpCbCr texture as a single plane. I thought it's preferred to handle this as two separate planes? That said, if I have two separate planes, that makes my app more complicated as I would need to scale both planes or merge it to RGB. But I'm going for the most performance possible. A sample project can be found here: https://www.dropbox.com/scl/fo/jsfwh9euc2ns2o3bbmyhn/AIomDYRhxCPVaWw9XH-qaN0?rlkey=sp8g0sb86af1u44p3xy9qa3b9&dl=0 Inside the supporting files, there is a test movie. For ease, I would move this to somewhere easily accessible (i.e Desktop). Load and run the example project. Click 'Select Video' Select that video you placed on your desktop It will now output a new video next to the selected one, named "Output.mov" The new video should just be scaled at 50%, but the colorspace is all wrong. Below is a photo of before and after the metal performance shader.
3
0
457
Aug ’24
MacOS 15 Beta3: Metal Shader with newLibraryWithSource didn't work if the executable path contains Chinese character.
Here is the test code run in a macOS app (MacOS 15 Beta3). If the excutable path does not contain Chinese character, every thing go as We expect. Otherwise(simply place excutable in a Chinese named directory) , the MTLLibrary We made by newLibraryWithSource: function contains no functions, We just got logs: "Library contains the following functions: {}" "Function 'squareKernel' not found." Note: macOS 14 works fine id<MTLDevice> device = MTLCreateSystemDefaultDevice(); if (!device) { NSLog(@"not support Metal."); } NSString *shaderSource = @ "#include <metal_stdlib>\n" "using namespace metal;\n" "kernel void squareKernel(device float* data [[buffer(0)]], uint gid [[thread_position_in_grid]]) {\n" " data[gid] *= data[gid];\n" "}"; MTLCompileOptions *options = [[MTLCompileOptions alloc] init]; options.languageVersion = MTLLanguageVersion2_0; NSError *error = nil; id<MTLLibrary> library = [device newLibraryWithSource:shaderSource options:options error:&error]; if (error) { NSLog(@"New MTLLibrary error: %@", error); } NSArray<NSString *> *functionNames = [library functionNames]; NSLog(@"Library contains the following functions: %@", functionNames); id<MTLFunction> computeShaderFunction = [library newFunctionWithName:@"squareKernel"]; if (computeShaderFunction) { NSLog(@"Found function 'squareKernel'."); NSError *pipelineError = nil; id<MTLComputePipelineState> pipelineState = [device newComputePipelineStateWithFunction:computeShaderFunction error:&pipelineError]; if (pipelineError) { NSLog(@"Create pipeline state error: %@", pipelineError); } NSLog(@"Create pipeline state succeed!"); } else { NSLog(@"Function 'squareKernel' not found."); }
3
5
783
Jul ’24
Unity IOS shader vector array bug
Unity 2022.3.33f1 For some reason modifying MeshRenderer material shader SetVectorArray doesn't work on IOS, but it works on android and windows builds!! I was working on Fog Of War, where I used SimpleFOW by Revision3 it's very simple FOW shader where it manipulates the alpha based on UVs vertices. This is the FogOfWarShaderControl.cs script using System.Collections; using System.Collections.Generic; using UnityEngine; namespace SimpleFOW { [RequireComponent(typeof(MeshRenderer))] public class FogOfWarShaderControl : MonoBehaviour { public static FogOfWarShaderControl Instance { get; private set; } [Header("Maximum amount of revealing points")] [SerializeField] private uint maximumPoints = 512; [Header("Game Camera")] [SerializeField] private Camera mainCamera; private List<Vector4> points = new List<Vector4>(); private Vector2 meshSize, meshExtents; private Vector4[] sendBuffer; private MeshRenderer meshRenderer; private void Awake() { Instance = this; Init(); } // Initialize required variables public void Init() { meshRenderer = GetComponent<MeshRenderer>(); meshExtents = meshRenderer.bounds.extents; meshSize = meshRenderer.bounds.size; points = new List<Vector4>(); sendBuffer = new Vector4[maximumPoints]; } // Transform world point to UV coordinate of FOW mesh public Vector2 WorldPointToMeshUV(Vector2 wp) { Vector2 toRet = Vector2.zero; toRet.x = (transform.position.x - wp.x + meshExtents.x) / meshSize.x; toRet.y = (transform.position.y - wp.y + meshExtents.y) / meshSize.y; return toRet; } // Show or hide FOW public void SetEnabled(bool on) { meshRenderer.enabled = on; } // Add revealing point to FOW renderer if amount of points is lower than MAX_POINTS public void AddPoint(Vector2 worldPoint) { if (points.Count < maximumPoints) { points.Add(WorldPointToMeshUV(worldPoint)); } } // Remove FOW revealing point public void RemovePoint(Vector2 worldPoint) { if (worldPoint == new Vector2(0, 0)) { return; } if (points.Contains(WorldPointToMeshUV(worldPoint))) { points.Remove(WorldPointToMeshUV(worldPoint)); } } // Send any change to revealing point list to shader for rendering public void SendPoints() { points.ToArray().CopyTo(sendBuffer, 0); meshRenderer.material.SetVectorArray("_PointArray", sendBuffer); meshRenderer.material.SetInt("_PointCount", points.Count); } // Send new range value to shader public void SendRange(float range) { meshRenderer.material.SetFloat("_RadarRange", range); } // Send new scale value to shader public void SendScale(float scale) { meshRenderer.material.SetFloat("_Scale", scale); } } } And this is the FogOfWar.shader Shader "Revision3/FogOfWar" { Properties { _MainTex ("Texture", 2D) = "black" {} _PointCount("Point count", Range(0,512)) = 0 _Scale("Scale", Float) = 1.0 _RadarRange("Range", Float) = .5 _MaxAlpha("Maximum Alpha", Float) = 1.0 } SubShader { Tags { "RenderType"="Transparent" "Queue"="Transparent" } LOD 100 ZWrite Off Blend SrcAlpha OneMinusSrcAlpha Pass { CGPROGRAM #pragma vertex vert #pragma fragment frag // make fog work #pragma multi_compile_fog #include "UnityCG.cginc" struct appdata { float4 vertex : POSITION; float2 uv : TEXCOORD0; }; struct v2f { float2 uv : TEXCOORD0; UNITY_FOG_COORDS(1) float4 vertex : SV_POSITION; }; sampler2D _MainTex; float4 _MainTex_ST; float _RadarRange; uint _PointCount; float _Scale; float _MaxAlpha; float2 _PointArray[512]; v2f vert (appdata v) { v2f o; o.vertex = UnityObjectToClipPos(v.vertex); o.uv = TRANSFORM_TEX(v.uv, _MainTex); return o; } float getDistance(float2 pa[512], float2 uv) { float cdist = 99999.0; for (uint i = 0; i < _PointCount; i++) { cdist = min(cdist, distance(pa[i]*_Scale, uv)); } return cdist; } fixed4 frag (v2f i) : SV_Target { // sample the texture fixed4 col = tex2D(_MainTex, i.uv); i.uv *= _Scale; if (_PointCount > 0) col.w = min(_MaxAlpha, max(0.0f, getDistance(_PointArray, i.uv) - _RadarRange)); else col.w = _MaxAlpha; return col; } ENDCG } } } Now I create a gameobject called FogOfWar as follows And then in Unit.cs script and Building.cs script I add the following logic private Vector3 lastPos; private void Update() { if (lastPos != transform.position) { FogOfWarShaderControl.Instance.RemovePoint(lastPos); FogOfWarShaderControl.Instance.AddPoint(transform.position); lastPos = transform.position; FogOfWarShaderControl.Instance.SendPoints(); } } Now this gives me the effect of FOW as follows on IOS where the result should be as follows on other devices I don't know what causes this to happen only on IOS devices. The logic works fine on android/windows/Linux/editor but not IOS devices. So why metal API doesn't support shader set vector array?!
0
0
551
Jul ’24
Tips on getting started with volumetric rendering?
I'm brand new to Metal. I've googled, but can't get the right answer to come up. (Thanks, unhelpful ChatGPT generated answers polluting everything, but I digress...) Ultimately, I'm trying to figure out how to use Metal to render 3D DICOM data on iOS specifically. If you're not familiar with DICOM, let's just say I've got a whole stack of CT image slices. Or to get really simple, I've got a cube of voxel values with differing values at each voxel coordinate. Where do I even start in Metal to render something like this? (I was trying to get the VTK toolkit compiled for iOS, which uses OpenGL, but that appears to be a dead end. And besides, Metal is supposed to be so much better.) Thanks for any tips/leads/suggestions/general pointers.
2
0
722
Apr ’24
MPSNNGraph: use custom compute/render metal during training?
Hello, I have been following the excellent/informative "Metal for Machine Learning" from WWDC19 to learn how to do on device training (I have a specific use case for this) and it is all working really well using the MPSNNGraph. However, I would like to call my own metal compute/render function/pipeline to transform the inference result before calculating the loss, does anyone know if this possible and what would this look like in code? Please see my current code below, at the comment I need to call an intermediate compute/render function to transform the inference result image before passing to the MPSNNForwardLossNode. let rgbImageNode = MPSNNImageNode(handle: nil) let inferGraph = makeInferenceGraph() let reshape = MPSNNReshapeNode(source: inferGraph.resultImage, resultWidth: 64, resultHeight: 64, resultFeatureChannels: 4) //Need to call render or compute pipeline to post process in the inference result image let rgbLoss = MPSNNForwardLossNode(source:reshape.resultImage, labels:rgbImageNode, lossDescriptor:lossDescriptor) let initGrad = MPSNNInitialGradientNode(source:rgbLoss.resultImage) let gradNodes = initGrad.trainingGraph(withSourceGradient:nil, nodeHandler:nil) guard let trainGraph = MPSNNGraph(device: device, resultImage: gradNodes![0].resultImage, resultImageIsNeeded: true) else{ fatalError("Unable to get training graph.") } Thanks
0
0
674
Mar ’24
Unsupported method: -[MTLComputeCommandEncoder. encodeStartWhile: offset: comparison: referenceValue:]
it appears that the Metal Debugging interface does not support this method, at least the function hashing algorithm does not have a pattern for it in the symbol dictionary as presented. Where do we get updated C- libraries and functions that sync with the things that are presented in the Demo Kits and Samples that Apple puts in the user domain? Why does this stuff get out into the wild insufficiently tested? It seems thet the demo kits made available to users should be included in the test domain used to verify new code releases. I came from a development environment where the 6 month release cycle involved automated execution of the test suite before it went beta or anywhere else.
1
0
712
Feb ’24
Encountering Error in MetalPerformanceShaders: Assertion Failure in MPSNDArray.mm
Hello, I'm currently facing an issue while working with MetalPerformanceShaders in testing a Python project. Code: https://github.com/Thinklab- SJTU/Crossformer Error Description: /AppleInternal/Library/BuildRoots/4e1473ee-9f66-11ee-8daf-cedaeb4cabe2/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:126: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[2] (15)' I've tried updating macOS to the latest version. Also, I've attempted running the code with MPS_NO_DEVICE_CHECK=1 to bypass device checks. But errors happened again. I'm seeking insights or solutions to this problem and If anyone has encountered a similar issue or has suggestions on how to troubleshoot and resolve this assertion failure, I would greatly appreciate your help. Project Details: Language: Python Environment: PyCharm CE Relevant Technologies: MetalPerformanceShaders, MPSNDArray Thank you for your time and assistance!
2
1
737
Feb ’24