I'm interested in using CatBoost and XGBoost for some machine learning projects on my Mac, and I was wondering if it's possible to run these algorithms on my GPU(s) to speed up training times.
I have a Mac with two GPUs, an AMD Radeon Pro 5600M and an Intel UHD Graphics 630, and I'm running macOS Ventura 13.2.1. I've read that both CatBoost and XGBoost support GPU acceleration, but I'm not sure whether that applies to my system.
Can anyone point me in the right direction for getting started with GPU-accelerated CatBoost/XGBoost on macOS? Are there any specific drivers or tools I need to install, or any other considerations I should be aware of?
Thank you.
Metal Performance Shaders
Optimize graphics and compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family using Metal Performance Shaders.
Hi,
I am training an adversarial autoencoder using PyTorch 2.0.0 on an Apple M2 (Ventura 13.1), with conda 23.1.0 as the package manager.
I encountered this error:
/AppleInternal/Library/BuildRoots/5b8a32f9-5db2-11ed-8aeb-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayConvolutionA14.mm:3967: failed assertion `destination kernel width and filter kernel width mismatch'
/Users/vk/miniconda3/envs/betavae/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
As far as I can tell, the code fails at self.manual_backward(loss["g_loss"]) in this block:
g_opt.zero_grad()
self.manual_backward(loss["g_loss"])
g_opt.step()
The same code runs without problems on a Linux distribution.
Any thoughts on how to fix it are highly appreciated!
I have been experimenting with different rendering approaches in Metal and am hitting a wall when it comes to reconciling "bindless" or GPU-driven approaches* with a dynamic scene where meshes can be added, removed, and changed. All the examples I have found of such approaches use fixed scenes, where all the data is fixed before the first draw call into something like a MeshBuffer that holds all scene geometry in the form of Mesh objects (for instance).
While I can assume that recreating a MeshBuffer from scratch each frame would be possible but completely undesirable, and that there may be some clever tricks with pointers to update a MeshBuffer as needed, I would like to know if there is an established or optimal solution to this problem, or if these approaches are simply incompatible with dynamic geometry. Any example projects that do what I am asking that I may have missed would be appreciated, too.
* I know these are not the same, but they seem to share some common characteristics, namely providing your entire geometry to the GPU at once. Looping over an array of meshes and calling drawIndexedPrimitives from the CPU does not pose any such obstacles, but it also precludes some of the benefits of offloading work to the GPU, or of having access to all geometry on the GPU for things like path tracing.
I am studying the "Accelerating ray tracing using Metal" sample code. The area light has its own struct in this sample, but I want to sample rays directly from the LightMesh. Can I get the instances and geometry of lightMesh without using the resources buffer? It seems the geometry is already loaded on the GPU, because Metal 3 is able to do the intersection test. However, I can only get primitive_data during the intersection, and I cannot get this information when I try to do the sampling.
Thanks a lot!
I am following this article, https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu, to build a Metal app that performs a calculation on the GPU.
I am not able to figure out how to build and execute the project from the command line. Any help on how to build a main.m file using xcrun will be useful.
I have tried xcrun -sdk macosx clang MetalComputeBasic/main.m but it doesn't work.
Hello, I am doing ray tracing and plan to run multiple intersection tests with different rays in one kernel (a shading loop). It works fine with two intersection tests, but the GPU breaks down when there are three. Are there any rules I need to follow? Thanks.
In the video here, the speaker refers to MPSGraphTool, which is supposed to convert from CoreML and other formats to the new MPSGraphPackage format.
Searching for MPSGraphTool on Google returns only that video, and there is no mention of it on the forums here or elsewhere.
When can we expect the tool to be released? How can we find out more information about it?
My use case is that the ANECompilerService that runs on the Mac / iOS devices to compile CoreML Models / Programs is extremely slow and unreliable for large models. It often crashes entirely, sitting at 100% CPU usage forever and never completing the task at hand, meaning the user is stuck in a loading state. This also applies in Xcode when running a performance test. I would really like to compile the graph once and just run it on device directly.
Hello,
I am trying to use the GPU on my Apple machine for a machine learning task, with "mps" as the device, but it is not working. I am using the stable version of PyTorch. How can I use the MacBook GPU for machine learning tasks?
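As a starting point for diagnosing this, a minimal sketch of device selection is shown below. It assumes PyTorch 1.12 or later (the first release with the MPS backend); the helper name pick_device is hypothetical and not part of PyTorch. It falls back to the CPU when PyTorch is missing or reports no usable Metal backend.

```python
def pick_device() -> str:
    """Return "mps" when PyTorch reports a usable Metal backend, else "cpu"."""
    try:
        import torch  # assumption: PyTorch 1.12 or later is installed
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The model and every input tensor then need to be moved with .to(device); a model on "mps" fed with CPU tensors raises an error rather than running on the GPU.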
I have created the following MTLBuffer. How can I send INPUTVALUE to the memINPUT buffer? I need to send it repeatedly, in Objective-C.
// header file
@property id<MTLBuffer> memINPUT;
// main file
int length = 1000;
...
memINPUT = [_device newBufferWithLength:(sizeof(float)*length) options:0];
...
float INPUTVALUE[length];
for (int i = 0; i < length; i++) {
    INPUTVALUE[i] = (float)i;
}
// How to send INPUTVALUE to memINPUT?
...
The following is the Swift version. I am looking for the Objective-C version.
memINPUT.contents().copyMemory(from: INPUTVALUE, byteCount: length * MemoryLayout<Float>.stride);
I'm trying to implement de-noising of AO in my app, using the MPSDynamicScene example as a guide: https://developer.apple.com/documentation/metalperformanceshaders/animating_and_denoising_a_raytraced_scene
In that example, it computes motion vectors in UV coordinates, resulting in very small values:
// Compute motion vectors
if (uniforms.frameIndex > 0) {
    // Map current pixel location to 0..1
    float2 uv = in.position.xy / float2(uniforms.width, uniforms.height);
    // Unproject the position from the previous frame then transform it from
    // NDC space to 0..1
    float2 prevUV = in.prevPosition.xy / in.prevPosition.w * float2(0.5f, -0.5f) + 0.5f;
    // Next, remove the jittering which was applied for antialiasing from both
    // sets of coordinates
    uv -= uniforms.jitter;
    prevUV -= prevUniforms.jitter;
    // Then the motion vector is simply the difference between the two
    motionVector = uv - prevUV;
}
Yet the documentation for MPSSVGF seems to indicate the offsets should be expressed in texels:
The motion vector texture must be at least a two channel texture representing how many texels each texel in the source image(s) have moved since the previous frame. The remaining channels will be ignored if present. This texture may be nil, in which case the motion vector is assumed to be zero, which is suitable for static images.
Is this a mistake in the example code?
I'm asking because doing something similar in my own app leaves AO trails, which would indicate that the motion vector texture values are too small in magnitude. I don't really see trails in the example, even when I speed up the animation, but that could be because the example is monochrome.
Update:
If I multiply the UV offsets by the size of the texture, I get a bad result, which seems to indicate that the header is misleading and the values are in fact in UV coordinates. So perhaps the trails I'm seeing in my app have some other cause.
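For reference, the two unit conventions discussed here differ only by a scale factor equal to the texture size. The sketch below is plain Python rather than Metal, the function name uv_to_texels and the 1920x1080 size are hypothetical, and it only illustrates the difference in magnitude; it does not settle which convention MPSSVGF actually expects.

```python
def uv_to_texels(motion_uv, width, height):
    """Scale a (du, dv) offset in 0..1 UV space to an offset in texels."""
    du, dv = motion_uv
    return (du * width, dv * height)


# Hypothetical values: an offset of 2 texels in x and -1 texel in y
# on a 1920x1080 render target, expressed in UV units.
motion_uv = (2 / 1920, -1 / 1080)
motion_texels = uv_to_texels(motion_uv, 1920, 1080)
print(motion_uv, motion_texels)
```

At this resolution the UV values are roughly three orders of magnitude smaller than the texel values, so feeding one convention to a filter that expects the other would look like either no motion or wildly excessive motion.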
I also wonder who is actually using this API other than me. I would think most game engines are doing their own thing. Perhaps some of Apple's own code uses it.
I am in the process of developing a matrix-vector multiplication kernel. While conducting performance evaluations, I've noticed that on M1/M1 Pro/M1 Max, the kernel demonstrates an impressive memory bandwidth utilization of around 90%. However, when executed on the M1 Ultra/M2 Ultra, this figure drops to approximately 65%. My suspicion is that this discrepancy is attributed to the dual-die architecture of the M1 Ultra/M2 Ultra. It's plausible that the necessary data might be stored within the L2 cache of the alternate die.
Could you kindly provide any insights or recommendations for mitigating the occurrence of on-die L2 cache misses on the Ultra chips? Additionally, I would greatly appreciate any general advice aimed at enhancing memory load speeds on these particular chips.
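The post does not say how the utilization figure was measured. One common definition, sketched below under that assumption, divides the minimum number of bytes a matrix-vector product must move by the elapsed time and by the nominal peak bandwidth. The function name, matrix size, timing, and peak figure are all hypothetical placeholders, not measurements from the post.

```python
def bandwidth_utilization(rows, cols, bytes_per_element, seconds, peak_bytes_per_s):
    """Fraction of peak memory bandwidth achieved by y = A @ x.

    Counts the bytes that must move at least once: the matrix A, the input
    vector x, and the output vector y.
    """
    bytes_moved = (rows * cols + cols + rows) * bytes_per_element
    return bytes_moved / seconds / peak_bytes_per_s


# Hypothetical measurement: a 16384 x 16384 float32 matrix processed in 3 ms
# on a chip with a nominal 400 GB/s of memory bandwidth.
utilization = bandwidth_utilization(16384, 16384, 4, 3.0e-3, 400e9)
print(f"{utilization:.1%}")
```

Because the matrix dominates the byte count, the result is sensitive mainly to the timing and to which peak figure is used as the denominator, which matters when comparing single-die and dual-die chips.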
with my MacBook m2.
The code works correctly on both the CPU and the GPU, but it is much slower on the GPU!
I have loaded my data and my model onto the GPU, and that seemed to work.
I printed the runtime of my code. When the following function "train" is called, the loops inside it run extraordinarily slowly.
def train(net, device, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size, device)
    # Adam
    optimizer = torch.optim.Adam(net.parameters(), lr = learning_rate, weight_decay = weight_decay)
    for epoch in range(num_epochs):
        for X, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            optimizer.step()
        train_ls.append(log_rmse(net, train_features, train_labels))
    return train_ls, test_ls
Hello.
I'm working with Metal in Apple Vision Pro, and I've assumed that I can use Mesh shaders to work with Meshlets. But when creating the RenderPipeline, I get the following error message: "device does not support mesh shaders". The test is on the simulator, and my question is: Will Apple Vision Pro support Mesh shaders on physical devices?
Thanks.
Hi,
I am trying to extend the pytorch library. I would like to add MPS native Cholesky Decomposition. I finally got it working (mostly).
But I am struggling to implement the status codes.
What I did:
// init status
id<MTLBuffer> status = [device newBufferWithLength:sizeof(int) options:MTLResourceStorageModeShared];
if (status) {
    int* statusPtr = (int*)[status contents];
    *statusPtr = 42; // Set the initial content to 42
    NSLog(@"Status Value: %d", *statusPtr);
}
else {
    NSLog(@"Failed to allocate status buffer");
}
...
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> commandBuffer) {
    // Your completion code here
    int* statusPtr = (int*)[status contents];
    int statusVal = *statusPtr;
    NSLog(@"Status Value: %d", statusVal);
    // Update the 'info' tensor here based on statusVal
    // ...
}];
for (const auto i : c10::irange(batchSize)) {
    ...
    [filter encodeToCommandBuffer:commandBuffer
                     sourceMatrix:sourceMatrix
                     resultMatrix:solutionMatrix
                           status:status];
}
(full code here: https://github.com/pytorch/pytorch/blob/ab6a550f35be0fdbb58b06ff8bfda1ab0cc236d0/aten/src/ATen/native/mps/operations/LinearAlgebra.mm)
But this code prints the following when given a tensor that is not positive definite:
2023-09-02 19:06:24.167 python[11777:2982717] Status Value: 42
2023-09-02 19:06:24.182 python[11777:2982778] Status Value: 0
initial tensor: tensor([[-0.0516, 0.7090, 0.9474],
[ 0.8520, 0.3647, -1.5575],
[ 0.5346, -0.3149, 1.9950]], device='mps:0')
L: tensor([[-0.0516, 0.0000, 0.0000],
[ 0.8520, -0.3612, 0.0000],
[ 0.5346, -0.3149, 1.2689]], device='mps:0')
What am I doing wrong? Why do I get a 0 (success) status even though the matrix is not positive definite?
Thank you in advance!
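As a CPU-side cross-check of what a failing factorization should report, here is a minimal pure-Python sketch. It is not the MPS kernel, and its status convention (0 for success, otherwise the 1-based index of the first non-positive pivot, as in LAPACK's potrf) is an assumption that does not match the MPS status enum values. The function name cholesky_status is hypothetical.

```python
import math


def cholesky_status(a):
    """Attempt a Cholesky factorization of the square matrix `a`.

    Returns (status, L): status is 0 on success, or the 1-based index of the
    first pivot that is not strictly positive. Only the lower triangle of
    `a` is read.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if pivot <= 0.0:
            return j + 1, L  # not positive definite: stop at this pivot
        L[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return 0, L


# The matrix printed in the post above.
a = [[-0.0516, 0.7090, 0.9474],
     [0.8520, 0.3647, -1.5575],
     [0.5346, -0.3149, 1.9950]]
status, _ = cholesky_status(a)
print("status:", status)
```

For the matrix from the post the very first pivot is negative, so a reference implementation stops immediately with a nonzero status, which gives a known-bad input to compare the GPU status buffer against.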
Hello everyone! I have a small question about programming in Metal. There are some models that I want to use along with their animations and skins, and they are stored in the glTF file format. glTF is used in a number of projects such as Unity, Unreal Engine, Godot, and Blender. I was wondering whether Metal supports this file format or not. Does anyone here know the answer?
Hello,
I've been working on an app that involves training a neural network model on the iPhone. I've been using the Metal Performance Shaders Graph (MPS Graph) for this purpose. During training, the loss becomes NaN on iOS 17 (21A329).
I noticed that the official sample code for Training a Neural Network using MPS Graph (link) works perfectly fine on Xcode 14.3.1 with iOS 16.6.1. However, when I run the same code on Xcode 15.0 beta 8 with iOS 17.0 (21A329), the training process produces a NaN loss in function updateProgressCubeAndLoss. The official sample code and my own app exhibit the same issue.
Has anyone else experienced this issue? Is this a known bug, or is there something specific that needs to be adjusted for iOS 17?
Any guidance would be greatly appreciated.
Thank you!
Hello - I have been struggling to find a solution online and I hope you can help me soon. I have installed the latest tensorflow and tensorflow-metal, and I even went on to install tensorflow-nightly. My app generates the following output when I call the fit function on a CNN model with 8 layers.
2023-09-29 22:21:06.115768: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2023-09-29 22:21:06.115846: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2023-09-29 22:21:06.116048: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2023-09-29 22:21:06.116264: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-29 22:21:06.116483: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
Most importantly, the learning process is very slow, and I'd like to take advantage of all the new features of the latest versions. What can I do?
I only get this error when using the JAX Metal device (CPU is fine). It seems to be a problem whenever I want to modify values of an array in-place using at and set.
note: see current operation:
%2903 = "mhlo.scatter"(%arg3, %2902, %2893) ({
^bb0(%arg4: tensor<f32>, %arg5: tensor<f32>):
"mhlo.return"(%arg5) : (tensor<f32>) -> ()
}) {indices_are_sorted = true, scatter_dimension_numbers = #mhlo.scatter<update_window_dims = [0, 1], inserted_window_dims = [1], scatter_dims_to_operand_dims = [1]>, unique_indices = true} : (tensor<10x100x4xf32>, tensor<1xsi32>, tensor<10x4xf32>) -> tensor<10x100x4xf32>
blocks = blocks.at[i].set(
...
I'm experimenting with visionOS and Apple Vision Pro using the Xcode beta. I'm using Xcode 15.1 beta and visionOS 1.0 beta 4.
I'm currently doing a project where I draw a polygon using a mesh generated from MeshDescriptor/MeshResource and present it in an ImmersiveView.
I want to change the color of only some parts of my 3D rendered polygon, not all of it, and I want to do it dynamically, for example when the user presses a button.
I have looked into shaders and the CustomMaterial from RealityKit, only to find out that CustomMaterial is not supported on visionOS!
Does anyone know how I can color portions/parts of a mesh that is generated from MeshDescriptor and MeshResource?
Device: iPhone 11, OS: iOS 15.6
I have a Metal application on iOS where a series of compute shaders are encoded, then dispatched and committed together at the end. When I capture a GPU trace of my application, however, I notice gaps between the compute shader invocations, and these gaps seem to take up a large part of the GPU time.
I'm wondering what these gaps are and what is causing them. Since all compute dispatch commands are committed together at once, these gaps shouldn't be synchronization between the CPU and GPU.
PS: In my application, later compute commands mostly depend on earlier ones and use the result buffers from earlier invocations. But as shown in the picture, the bandwidth and read/write buffer limiters are not high, as far as I can tell.