Metal Performance Shaders


Use Metal Performance Shaders to optimize graphics and compute performance with kernels fine-tuned for the unique characteristics of each Metal GPU family.

Posts under Metal Performance Shaders tag

29 Posts

Can I run CatBoost/XGBoost on my GPU(s) on my Mac?
I'm interested in using CatBoost and XGBoost for some machine learning projects on my Mac, and I was wondering if it's possible to run these algorithms on my GPU(s) to speed up training times. I have a Mac with an AMD Radeon Pro 5600M GPU and an Intel UHD Graphics 630 GPU, and I'm running macOS Ventura 13.2.1. I've read that both CatBoost and XGBoost support GPU acceleration, but I'm not sure if this is possible on my system. Can anyone point me in the right direction for getting started with GPU-accelerated CatBoost/XGBoost on macOS? Are there any specific drivers or tools I need to install, or any other considerations I should be aware of? Thank you.
1 reply · 0 boosts · 1.9k views · Sep ’23
MPSNDArrayConvolutionA14.mm:3967: failed assertion `destination kernel width and filter kernel width mismatch'
Hi, I am training an adversarial autoencoder using PyTorch 2.0.0 on an Apple M2 (Ventura 13.1), with conda 23.1.0 as the environment manager. I encountered this error:

    /AppleInternal/Library/BuildRoots/5b8a32f9-5db2-11ed-8aeb-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayConvolutionA14.mm:3967: failed assertion `destination kernel width and filter kernel width mismatch'
    /Users/vk/miniconda3/envs/betavae/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown

To my knowledge, the code breaks when running self.manual_backward(loss["g_loss"]) in this block:

    g_opt.zero_grad()
    self.manual_backward(loss["g_loss"])
    g_opt.step()

The same code runs without problems on a Linux distribution. Any thoughts on how to fix it are highly appreciated!
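A possible mitigation while the MPS kernel asserts (an assumption, not a confirmed fix for this particular failure) is PyTorch's CPU-fallback switch, which routes unsupported or misbehaving MPS operations to the CPU. It must be set before torch is imported:

    import os
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # read once, at torch import time
    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

The trade-off is speed: the offending convolution then runs on the CPU, but training can proceed while the underlying kernel bug is investigated.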
2 replies · 0 boosts · 1.3k views · Jul ’23
Bindless/GPU-Driven approach with dynamic scenes?
I have been experimenting with different rendering approaches in Metal and am hitting a wall when it comes to reconciling "bindless" or GPU-driven approaches* with a dynamic scene where meshes can be added, removed, and changed. All the examples I have found of such approaches use fixed scenes, where all the data is fixed before the first draw call in something like a MeshBuffer that holds all scene geometry in the form of Mesh objects (for instance). While I can assume that recreating a MeshBuffer from scratch each frame would be possible but completely undesirable, and that there may be some clever tricks with pointers to update a MeshBuffer as needed, I would like to know if there is an established or optimal solution to this problem, or if these approaches are simply incompatible with dynamic geometry. Any example projects I may have missed that do what I am asking would be appreciated, too.

* I know these are not the same, but they seem to share some common characteristics, namely providing your entire geometry to the GPU at once. Looping over an array of meshes and calling drawIndexedPrimitives from the CPU does not pose any such obstacles, but it also precludes some of the benefits of offloading work to the GPU, or of having access to all geometry on the GPU for things like path tracing.
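One established pattern, sketched here under assumptions (this is not an official Metal recipe, and SlotAllocator is a made-up name): keep one large persistent buffer, hand each mesh a stable slot from a free list, and rewrite only the slots that changed in a given frame instead of rebuilding everything. The bookkeeping is language-agnostic; in Metal, each slot would map to a fixed-size range of an MTLBuffer referenced from an argument buffer:

    class SlotAllocator:
        """Stable slots over a fixed-capacity pool; only dirty slots are re-uploaded."""
        def __init__(self, capacity):
            self.free = list(range(capacity))  # every slot starts unused
            self.slot_of = {}                  # mesh id -> slot index
            self.dirty = set()                 # slots whose GPU copy is stale

        def add(self, mesh_id):
            slot = self.free.pop()             # raises IndexError when the pool is full
            self.slot_of[mesh_id] = slot
            self.dirty.add(slot)
            return slot

        def update(self, mesh_id):
            self.dirty.add(self.slot_of[mesh_id])

        def remove(self, mesh_id):
            self.free.append(self.slot_of.pop(mesh_id))

Per frame, only the buffer ranges for slots in dirty get copied to the GPU, so the cost scales with how much changed rather than with scene size.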
3 replies · 1 boost · 1.2k views · Jun ’23
How to sample a mesh in a Metal ray-tracing structure
I am learning from the "Accelerating ray tracing using Metal" sample code. The area light has its own struct in this sample, but I want to sample rays directly from the LightMesh. Can I get the instances and geometry of lightMesh without using the resources buffer? It seems the geometry is already loaded on the GPU, because Metal 3 is able to do the intersection test. However, I can only get primitive_data during the intersection, and cannot get that information when I try to do sampling. Thanks a lot!
2 replies · 1 boost · 840 views · Jun ’23
Build and execute a Metal app which performs calculations on the GPU without using Xcode
I am following https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu on building a Metal app that performs a GPU calculation. I am not able to figure out how to build and execute the project from the command line. Any help on how to build a main.m file using xcrun would be useful. I have tried xcrun -sdk macosx clang MetalComputeBasic/main.m but it doesn't work.
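For reference, the usual pipeline splits into compiling the shader into a metallib and then compiling/linking the Objective-C sources. This is a sketch assuming the sample's file layout; the flags are not verified against every SDK:

    # Compile the Metal shader to AIR, then package it as the default library
    xcrun -sdk macosx metal -c MetalComputeBasic/add.metal -o add.air
    xcrun -sdk macosx metallib add.air -o default.metallib

    # Compile and link all the .m files, not just main.m
    xcrun -sdk macosx clang -fobjc-arc -framework Metal -framework Foundation -framework CoreGraphics MetalComputeBasic/*.m -o compute

    # Run with default.metallib next to the executable so the default library is found
    ./compute

Linking CoreGraphics is the non-obvious part: MTLCreateSystemDefaultDevice() can return nil in a bare command-line tool unless it is linked. If newDefaultLibrary still returns nil outside an app bundle, loading the metallib explicitly with newLibraryWithURL: is the fallback.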
0 replies · 0 boosts · 670 views · Jun ’23
Where is MPSGraphTool?
In the video here, the speaker refers to MPSGraphTool, which is supposed to convert from CoreML and other formats to the new MPSGraphPackage format. Searching for MPSGraphTool on Google returns only that video, and there is no mention of it on the forums here or elsewhere. When can we expect the tool to be released? How can we find out more information about it? My use case is that the ANECompilerService that runs on the Mac / iOS devices to compile CoreML Models / Programs is extremely slow and unreliable for large models. It often crashes entirely, sitting at 100% CPU usage forever and never completing the task at hand, meaning the user is stuck in a loading state. This also applies in Xcode when running a performance test. I would really like to compile the graph once and just run it on device directly.
1 reply · 0 boosts · 850 views · Jul ’23
Transferring data to a Metal MTLBuffer dynamically in Objective-C
I have the following MTLBuffer. How can I send INPUTVALUE to the memINPUT buffer? I need to send it repeatedly in Objective-C.

    // header file
    @property id<MTLBuffer> memINPUT;

    // main file
    int length = 1000;
    ...
    memINPUT = [_device newBufferWithLength:(sizeof(float)*length) options:0];
    ...
    float INPUTVALUE[length];
    for (int i = 0; i < length; i++) {
        INPUTVALUE[i] = (float)i;
    }
    // How to send INPUTVALUE to memINPUT?
    ...

The following is the Swift version; I am looking for the Objective-C equivalent.

    memINPUT.contents().copyMemory(from: INPUTVALUE, byteCount: length * MemoryLayout<Float>.stride)
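Since options:0 yields a shared-storage buffer, the Objective-C equivalent of the Swift copyMemory call is most likely a plain memcpy into the buffer's contents pointer (a sketch against the names above, not verified in the asker's project):

    // -contents returns a CPU-visible void *; with shared storage no extra sync call is needed
    memcpy(self.memINPUT.contents, INPUTVALUE, sizeof(float) * length);

If the buffer were instead created with managed storage on macOS, each CPU-side write would need to be followed by a didModifyRange: call.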
1 reply · 0 boosts · 567 views · Jul ’23
Is the MPSDynamicScene example correctly computing the motion vector texture?
I'm trying to implement de-noising of AO in my app, using the MPSDynamicScene example as a guide: https://developer.apple.com/documentation/metalperformanceshaders/animating_and_denoising_a_raytraced_scene

In that example, motion vectors are computed in UV coordinates, resulting in very small values:

    // Compute motion vectors
    if (uniforms.frameIndex > 0) {
        // Map current pixel location to 0..1
        float2 uv = in.position.xy / float2(uniforms.width, uniforms.height);

        // Unproject the position from the previous frame then transform it from
        // NDC space to 0..1
        float2 prevUV = in.prevPosition.xy / in.prevPosition.w * float2(0.5f, -0.5f) + 0.5f;

        // Next, remove the jittering which was applied for antialiasing from both
        // sets of coordinates
        uv -= uniforms.jitter;
        prevUV -= prevUniforms.jitter;

        // Then the motion vector is simply the difference between the two
        motionVector = uv - prevUV;
    }

Yet the documentation for MPSSVGF seems to indicate the offsets should be expressed in texels:

    "The motion vector texture must be at least a two channel texture representing how many
    texels each texel in the source image(s) have moved since the previous frame. The remaining
    channels will be ignored if present. This texture may be nil, in which case the motion
    vector is assumed to be zero, which is suitable for static images."

Is this a mistake in the example code? I ask because doing something similar in my own app leaves AO trails, which would indicate the motion vector texture values are too small in magnitude. I don't really see trails in the example, even when I speed up the animation, but that could be due to the example being monochrome.

Update: If I multiply the UV offsets by the size of the texture, I get a bad result, which seems to indicate the header comment is misleading and the values are in fact expected in UV coordinates. So perhaps the trails I'm seeing in my app have some other cause. I also wonder who is actually using this API other than me; I would think most game engines are doing their own thing. Perhaps some of Apple's own code uses it.
0 replies · 0 boosts · 543 views · Aug ’23
Maximize memory read bandwidth on M1 Ultra/M2 Ultra
I am in the process of developing a matrix-vector multiplication kernel. While conducting performance evaluations, I've noticed that on M1/M1 Pro/M1 Max, the kernel demonstrates an impressive memory bandwidth utilization of around 90%. However, when executed on the M1 Ultra/M2 Ultra, this figure drops to approximately 65%. My suspicion is that this discrepancy is attributed to the dual-die architecture of the M1 Ultra/M2 Ultra. It's plausible that the necessary data might be stored within the L2 cache of the alternate die. Could you kindly provide any insights or recommendations for mitigating the occurrence of on-die L2 cache misses on the Ultra chips? Additionally, I would greatly appreciate any general advice aimed at enhancing memory load speeds on these particular chips.
0 replies · 0 boosts · 614 views · Aug ’23
Why is my neural network slower on MPS (Apple silicon) than on the CPU?
On my MacBook (M2), the code works correctly on both the CPU and the GPU, but it is much slower on the GPU! I have loaded my data and my model on the GPU, and that seems to work. I printed my code's runtime: when the following train function is called, the loop inside it runs extraordinarily slowly.

    def train(net, device, train_features, train_labels, test_features, test_labels,
              num_epochs, learning_rate, weight_decay, batch_size):
        train_ls, test_ls = [], []
        train_iter = d2l.load_array((train_features, train_labels), batch_size, device)
        # Adam
        optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate, weight_decay=weight_decay)
        for epoch in range(num_epochs):
            for X, y in train_iter:
                optimizer.zero_grad()
                l = loss(net(X), y)
                l.backward()
                optimizer.step()
            # train_ls.append(log_rmse(net, train_features, train_labels))
        return train_ls, test_ls
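Two things are worth checking before concluding the GPU is slower (a generic diagnostic sketch, not a diagnosis of this model): MPS work is queued asynchronously, so naive timings mostly measure dispatch rather than execution, and small batches leave the GPU under-occupied so per-step launch overhead dominates. In recent PyTorch builds:

    import time
    import torch

    torch.mps.synchronize()          # drain any already-queued GPU work
    start = time.perf_counter()
    # ... run a fixed number of training steps on the "mps" device ...
    torch.mps.synchronize()          # wait for the queued kernels to actually finish
    print(f"elapsed: {time.perf_counter() - start:.3f} s")

If the synchronized timing is still slower than the CPU, trying a larger batch_size is the cheapest experiment, since M-series GPUs need fairly large workloads to beat the CPU on small models.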
0 replies · 0 boosts · 645 views · Aug ’23
Meshlet
Hello. I'm working with Metal on Apple Vision Pro, and I had assumed that I could use mesh shaders to work with meshlets. But when creating the render pipeline, I get the following error message: "device does not support mesh shaders". The test is on the simulator, and my question is: will Apple Vision Pro support mesh shaders on physical devices? Thanks.
1 reply · 0 boosts · 553 views · Aug ’23
MPSMatrixDecompositionCholesky Status code
Hi, I am trying to extend the PyTorch library by adding an MPS-native Cholesky decomposition. I finally got it working (mostly), but I am struggling to implement the status codes. What I did:

    // init status
    id<MTLBuffer> status = [device newBufferWithLength:sizeof(int) options:MTLResourceStorageModeShared];
    if (status) {
        int* statusPtr = (int*)[status contents];
        *statusPtr = 42; // Set the initial content to 42
        NSLog(@"Status Value: %d", *statusPtr);
    } else {
        NSLog(@"Failed to allocate status buffer");
    }
    ...
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> commandBuffer) {
        // Your completion code here
        int* statusPtr = (int*)[status contents];
        int statusVal = *statusPtr;
        NSLog(@"Status Value: %d", statusVal);
        // Update the 'info' tensor here based on statusVal
        // ...
    }];

    for (const auto i : c10::irange(batchSize)) {
        ...
        [filter encodeToCommandBuffer:commandBuffer
                         sourceMatrix:sourceMatrix
                         resultMatrix:solutionMatrix
                               status:status];
    }

(Full code here: https://github.com/pytorch/pytorch/blob/ab6a550f35be0fdbb58b06ff8bfda1ab0cc236d0/aten/src/ATen/native/mps/operations/LinearAlgebra.mm)

But this code prints the following when given a non-positive-definite tensor:

    2023-09-02 19:06:24.167 python[11777:2982717] Status Value: 42
    2023-09-02 19:06:24.182 python[11777:2982778] Status Value: 0

    initial tensor:
    tensor([[-0.0516, 0.7090, 0.9474],
            [ 0.8520, 0.3647, -1.5575],
            [ 0.5346, -0.3149, 1.9950]], device='mps:0')

    L:
    tensor([[-0.0516, 0.0000, 0.0000],
            [ 0.8520, -0.3612, 0.0000],
            [ 0.5346, -0.3149, 1.2689]], device='mps:0')

What am I doing wrong? Why do I get a 0 (success) status even though the matrix is not positive definite? Thank you in advance!
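For reference, this is how the status would eventually surface at the Python level once wired into the info tensor; torch.linalg.cholesky_ex is the existing PyTorch API for this (shown on CPU, independent of the MPS work above):

    import torch

    A = torch.tensor([[-0.05, 0.71], [0.71, 0.36]])  # symmetric but not positive definite
    L, info = torch.linalg.cholesky_ex(A)
    print(info)  # nonzero: order of the leading minor that failed; 0 means success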
0 replies · 0 boosts · 513 views · Sep ’23
Metal API supported files for models?
Hello everyone! I have a small concern about one little thing when it comes to programming in Metal. There are some models that I wish to use, along with animations and skins on them; the file format is glTF. glTF has been used in a number of projects, such as Unity, Unreal Engine, Godot, and Blender. I was wondering whether Metal supports this file format or not. Does anyone here know the answer?
3 replies · 1 boost · 1.2k views · Sep ’23
MPS Graph Neural Network Training Produces NaN Loss on Xcode 15.0 beta 8 + iOS 17.0
Hello, I've been working on an app that involves training a neural network model on the iPhone, using the Metal Performance Shaders Graph (MPS Graph) for this purpose. During training, the loss becomes NaN on iOS 17 (21A329). I noticed that the official sample code for Training a Neural Network using MPS Graph (link) works perfectly fine on Xcode 14.3.1 with iOS 16.6.1. However, when I run the same code on Xcode 15.0 beta 8 with iOS 17.0 (21A329), the training process produces a NaN loss in the function updateProgressCubeAndLoss. The official sample code and my own app exhibit the same issue. Has anyone else experienced this? Is this a known bug, or is there something specific that needs to be adjusted for iOS 17? Any guidance would be greatly appreciated. Thank you!
1 reply · 0 boosts · 664 views · Oct ’23
TensorFlow is slow after upgrading to Sonoma
Hello - I have been struggling to find a solution online, and I hope you can help me in a timely manner. I have installed the latest tensorflow and tensorflow-metal, and I even went as far as installing tensorflow-nightly. My app generates the following as a result of my fit function on a CNN model with 8 layers:

    2023-09-29 22:21:06.115768: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
    2023-09-29 22:21:06.115846: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
    2023-09-29 22:21:06.116048: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
    2023-09-29 22:21:06.116264: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
    2023-09-29 22:21:06.116483: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )

Most importantly, the learning process is very slow, and I'd like to take advantage of all the new features of the latest versions. What can I do?
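One way to isolate whether the Metal plugin is the bottleneck (a generic A/B sketch, not a known Sonoma fix) is to hide the GPU and time the identical fit on the CPU:

    import tensorflow as tf

    tf.config.set_visible_devices([], "GPU")  # must run before any op touches the GPU
    # ... build and fit the same 8-layer CNN, then compare per-epoch times ...

If the CPU run is faster, the slowdown is in the tensorflow-metal path for this model; if not, the regression lies elsewhere (input pipeline, model changes, thermal state).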
7 replies · 2 boosts · 1.9k views · Oct ’23
JAX Metal error: failed to legalize operation 'mhlo.scatter'
I only get this error when using the JAX Metal device (CPU is fine). It seems to be a problem whenever I want to modify values of an array in-place using at and set.

    note: see current operation: %2903 = "mhlo.scatter"(%arg3, %2902, %2893) ({
    ^bb0(%arg4: tensor<f32>, %arg5: tensor<f32>):
      "mhlo.return"(%arg5) : (tensor<f32>) -> ()
    }) {indices_are_sorted = true, scatter_dimension_numbers = #mhlo.scatter<update_window_dims = [0, 1], inserted_window_dims = [1], scatter_dims_to_operand_dims = [1]>, unique_indices = true} : (tensor<10x100x4xf32>, tensor<1xsi32>, tensor<10x4xf32>) -> tensor<10x100x4xf32>

The triggering code is of the form:

    blocks = blocks.at[i].set( ...
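Until the Metal backend can legalize mhlo.scatter, one workaround is to express the in-place update as a select rather than a scatter (a sketch, assuming the update targets a single slot along the leading axis; jnp.where lowers to a select):

    import jax.numpy as jnp

    def set_slot(x, i, update):
        # True only at index i along the leading axis; reshaped to broadcast against x
        mask = (jnp.arange(x.shape[0]) == i).reshape((-1,) + (1,) * (x.ndim - 1))
        return jnp.where(mask, update, x)

    # e.g. blocks = set_slot(blocks, i, new_block) instead of blocks.at[i].set(new_block)

This does O(n) work instead of a pointwise scatter, which is usually fine for small arrays but worth noting for large ones.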
6 replies · 5 boosts · 996 views · Nov ’23
visionOS and coloring entities using Metal shaders
I'm experimenting with visionOS and Apple Vision Pro using the Xcode beta (Xcode 15.1 beta with visionOS 1.0 beta 4). I'm currently doing a project where I draw a polygon using a mesh generated from MeshDescriptor/MeshResource and present it in an ImmersiveView. I want to change the color of parts (i.e., not all) of my 3D-rendered polygon, and I want to do it dynamically, for example when the user presses a button. I have gotten into shaders and RealityKit's CustomMaterial, only to find out that CustomMaterial is not supported on visionOS! Does anyone know how I can color portions/parts of a mesh that is generated from MeshDescriptor and MeshResource?
2 replies · 0 boosts · 898 views · Oct ’23
What are the gaps between individual compute shader invocations in a Metal GPU trace?
Device: iPhone 11
OS: iOS 15.6

I have a Metal application on iOS where a series of compute shaders are encoded, then dispatched and committed together at the end. When I capture a GPU trace of my application, however, I notice gaps between each compute shader invocation, and these gaps seem to take up a large part of the GPU time. I'm wondering what these gaps are and what is causing them. Since all compute dispatch commands are committed together at once, these gaps shouldn't be synchronizations between the CPU and GPU.

PS: In my application, later compute commands mostly depend on earlier ones and use the result buffers from earlier invocations. But as shown in the picture, bandwidth and the read/write buffer limiters are not high, as far as I can tell.
2 replies · 0 boosts · 574 views · Oct ’23