ML Compute


Accelerate training and validation of neural networks using the CPU and GPUs.

ML Compute Documentation

Posts under ML Compute tag

40 Posts
Post not yet marked as solved
2 Replies
44 Views
This does not seem to be affecting the training, but it seems somewhat important (I have no clue how to read it, however):
Error: command buffer exited with error status. The Metal Performance Shaders operations encoded on it may not have completed. Error: (null) Internal Error (0000000e:Internal Error)
<AGXG13XFamilyCommandBuffer: 0x29b027b50> label = <none> device = <AGXG13XDevice: 0x12da25600> name = Apple M1 Max commandQueue = <AGXG13XFamilyCommandQueue: 0x106477000> label = <none> device = <AGXG13XDevice: 0x12da25600> name = Apple M1 Max retainedReferences = 1
This happens while training a "heavy" model on a "heavy" dataset, so it may be related to a memory issue, but I have no clue how to address it.
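If the error does turn out to be GPU memory pressure, one low-risk experiment is to shrink the per-step working set. Below is a minimal sketch assuming a tf.keras setup; the toy model and random data are stand-ins, not the code from this post:
import numpy as np
import tensorflow as tf

# Stand-ins for the real model and data.
x_train = np.random.rand(1024, 64).astype("float32")
y_train = np.random.randint(0, 2, size=(1024,)).astype("float32")
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Stream small batches through tf.data so only one batch at a time is resident
# on the GPU; lowering the batch size is the usual first lever when command
# buffer errors look memory related.
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(1024)
           .batch(16)
           .prefetch(tf.data.AUTOTUNE))
model.fit(dataset, epochs=2)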
Posted Last updated
.
Post not yet marked as solved
7 Replies
14k Views
I just got my new MacBook Pro with the M1 Max chip and am setting up Python. I've tried several combinations of settings to test speed, and now I'm quite confused. First, my questions:
Why is Python running natively on the M1 Max greatly (~100%) slower than on my old MacBook Pro 2016 with an Intel i5?
On the M1 Max, why is there no significant speed difference between a native run (via Miniforge) and a run through Rosetta (via Anaconda), which is supposed to be ~20% slower?
On the M1 Max with a native run, why is there no significant speed difference between conda-installed NumPy and TensorFlow-installed NumPy, which is supposed to be faster?
On the M1 Max, why are runs from the PyCharm IDE consistently ~20% slower than runs from the terminal, which doesn't happen on my old Intel Mac?
Evidence supporting my questions is as follows. Here are the settings I've tried:
1. Python interpreter: Miniforge-arm64, so Python runs natively on the M1 Max chip (in Activity Monitor, the Kind of the python process is Apple); or Anaconda, so Python runs via Rosetta (the Kind of the python process is Intel).
2. NumPy source: conda install numpy (NumPy from the original conda-forge channel, or pre-installed with Anaconda); or Apple TensorFlow (with Python installed by Miniforge, I install TensorFlow directly and NumPy comes with it; NumPy installed this way is said to be optimized for Apple Silicon and faster). Here are the installation commands:
conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install tensorflow-metal
3. Run from: Terminal; or PyCharm (Apple Silicon version).
Here is the test code:
import time
import numpy as np
np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10
timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)
print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')
And here are the results (mean runtime in seconds):
+-----------------------------------+-----------------------+--------------------+
| Python installed by (run on) →    | Miniforge (native M1) | Anaconda (Rosetta) |
+----------------------+------------+------------+----------+----------+---------+
| NumPy installed by ↓ | Run from → | Terminal   | PyCharm  | Terminal | PyCharm |
+----------------------+------------+------------+----------+----------+---------+
| Apple TensorFlow                  | 4.19151    | 4.86248  | /        | /       |
+-----------------------------------+------------+----------+----------+---------+
| conda install numpy               | 4.29386    | 4.98370  | 4.10029  | 4.99271 |
+-----------------------------------+------------+----------+----------+---------+
This is quite slow. For comparison, the same code on my old MacBook Pro 2016 with the i5 chip takes 2.39917 s. Another post reports that on an M1 chip (not Pro or Max), miniforge + conda-installed NumPy takes 2.53214 s, and miniforge + Apple TensorFlow NumPy takes 1.00613 s. You may also try it on your own machine. Here are the CPU details:
My old i5:
$ sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Intel(R) Core(TM) i5-6360U CPU @ 2.00GHz
machdep.cpu.core_count: 2
My new M1 Max:
% sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Apple M1 Max
machdep.cpu.core_count: 10
I followed the instructions from the tutorials strictly, so why does all this happen? Is it because of flaws in my installation, or because of the M1 Max chip? Since my work relies heavily on local runs, local speed is very important to me.
Any suggestions for a possible solution, or data points from your own device, would be greatly appreciated :)
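A diagnostic worth running in each of these environments (a sketch, not part of the original benchmark) is to check the interpreter architecture and which BLAS/LAPACK backend NumPy actually links against, since np.linalg.svd time is dominated by that library:
import platform
import numpy as np

# 'arm64' means a native Apple Silicon interpreter; 'x86_64' means Rosetta.
print("interpreter architecture:", platform.machine())

# Prints the BLAS/LAPACK configuration NumPy was built with. A build linked
# against Apple's Accelerate framework generally behaves very differently from
# a generic OpenBLAS build for routines such as SVD.
np.show_config()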
Posted Last updated
.
Post not yet marked as solved
1 Reply
100 Views
Can someone tell me whether using copyrighted content for neural network training is infringement or fair use? For example: someone collects 100,000 superhero pictures from Google for training, and afterwards the neural network can create superhero pictures from a user's query. Is that infringement or fair use? Can the developer sell the generated pictures to users (or a subscription to the service)? Or does everyone use only public domain and open-source content for training?
Posted
by Dimbill.
Last updated
.
Post not yet marked as solved
0 Replies
134 Views
The documentation for MPSGraph has no information about what the classes and methods do. It only enumerates everything it has, without any explanation of what each item is or how it works. Why is that? https://developer.apple.com/documentation/metalperformanceshadersgraph The MPSGraph header files do contain some comments, though, so it seems like a documentation bug.
Posted
by abesmon.
Last updated
.
Post not yet marked as solved
0 Replies
124 Views
I'm trying to run the MPSGraph sample code from here: https://developer.apple.com/documentation/metalperformanceshadersgraph/adding_custom_functions_to_a_shader_graph and it's not working. It builds successfully, but after you press train (the play button), the program fails right after the first training iteration with errors like these:
-[MTLDebugCommandBuffer lockPurgeableObjects]:2103: failed assertion `MTLResource 0x600001693940 (label: (null)), referenced in cmd buffer 0x124015800 (label: (null)) is in volatile or empty purgeable state at commit'
-[MTLDebugCommandBuffer lockPurgeableObjects]:2103: failed assertion `MTLResource 0x600001693940 (label: (null)), referenced in cmd buffer 0x124015800 (label: (null)) is in volatile or empty purgeable state at commit'
It fails on commandBuffer.commit() in the runTrainingIterationBatch() method, as if something had already committed the operation (I've checked, and yes, the command buffer is already committed). But why does this happen in example code? I've tried wrapping the commit with a command buffer status check, which keeps it from failing, but then the program behaves incorrectly overall and doesn't compute the loss well. Everything is made worse by the MPSGraph documentation being empty: it contains only class and method names without any descriptions.
My environment: Xcode 13.4.1 (13F100), macOS 12.4, MacBook Pro 14" (M1 Pro, 2021, 16 GB). I also tried building for an iPhone 12 Pro Max on iOS 15.5 and as a Mac Catalyst application, and I get the same error everywhere.
Posted
by abesmon.
Last updated
.
Post not yet marked as solved
3 Replies
597 Views
I am training a model using tensorflow-metal, and model training (and the whole application) freezes up. The behavior is nondeterministic. I believe the problem is with Metal (1) because of the contents of the backtraces below, and (2) because when I run the same code on a machine with non-Metal TensorFlow (using a GPU), everything works fine. I can't share my code publicly, but I would be willing to share it with an Apple engineer privately over email if that would help. It's hard to create a minimal reproduction example since my program is somewhat complex and the bug is nondeterministic, but the bug does appear pretty reliably. It looks like the problem might be in some Metal Performance Shaders init code. The state of everything (backtraces, etc.) when the program freezes is attached. Backtraces
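One way to check the "it's Metal" hypothesis on the same machine (a sketch, not from this post) is to hide the GPU from TensorFlow before building the model and see whether the freeze disappears on a pure CPU run:
import tensorflow as tf

# Hide the Metal GPU from TensorFlow; this must run before any ops or tensors
# are created. If training then completes without freezing, the problem is
# more likely in the tensorflow-metal / MPS path than in the model code.
tf.config.set_visible_devices([], "GPU")
print("Visible devices:", tf.config.get_visible_devices())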
Posted
by andmis.
Last updated
.
Post not yet marked as solved
0 Replies
138 Views
We use dynamic input sizes for some use cases. When the compute unit mode is .all, there is a big difference in execution time when the dynamic input shape doesn't match the optimal shape. If we set the model's optimal input shape to 896x896 but run it with an input shape of 1024x768, the execution time is almost twice that of an 896x896 input. For example, a model with an 896x896 preferred input shape achieves inference in 66 ms with an 896x896 input, but only 117 ms with a 1024x768 input. In that case, to get the best inference performance we would need to switch between models depending on the input shape, which is not dynamic at all and is memory hungry. Is there a way to reduce the execution time when the shape falls outside the preferred shape range?
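If the set of input resolutions is known in advance, one option worth trying on the conversion side is to declare them as enumerated shapes so Core ML can prepare for each one, rather than relying on a single preferred shape. A hedged coremltools sketch follows; the tiny Torch model, tensor name, and file name are placeholders, and whether this closes the gap on a given device would need to be measured:
import torch
import coremltools as ct

# Tiny stand-in model; the real model comes from the application.
torch_model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).eval()
traced = torch.jit.trace(torch_model, torch.rand(1, 3, 896, 896))

# Enumerate the exact resolutions the app uses instead of one "optimal" shape.
shapes = ct.EnumeratedShapes(
    shapes=[[1, 3, 896, 896], [1, 3, 1024, 768]],
    default=[1, 3, 896, 896],
)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=shapes)],
    compute_units=ct.ComputeUnit.ALL,
    convert_to="mlprogram",
)
mlmodel.save("enumerated_shapes_model.mlpackage")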
Posted
by dbphr.
Last updated
.
Post not yet marked as solved
0 Replies
168 Views
We use several Core ML models in our Swift application. The memory footprint of these Core ML models ranges from 15 kB to 3.5 MB according to the Xcode Core ML utility tool. We observe a huge difference in loading time depending on the compute units selected to run the model. Here is a small sample of the code used to load the model:
let configuration = MLModelConfiguration()
// Here I use the .all compute units mode:
configuration.computeUnits = .all
let myModel = try! myCoremlModel(configuration: configuration).model
Here are the profiling results of this sample code for different model sizes, by targeted compute units:
Model-3.5-MB: computeUnits = .cpuAndGPU: 188 ms ⇒ 18 MB/s; computeUnits = .all or .cpuAndNeuralEngine on iOS 16: 4000 ms ⇒ 875 kB/s
Model-2.6-MB: computeUnits = .cpuAndGPU: 144 ms ⇒ 18 MB/s; computeUnits = .all or .cpuAndNeuralEngine on iOS 16: 1300 ms ⇒ 2 MB/s
Model-15-kB: computeUnits = .cpuAndGPU: 18 ms ⇒ 833 kB/s; computeUnits = .all or .cpuAndNeuralEngine on iOS 16: 700 ms ⇒ 22 kB/s
What explains the difference in loading time between computeUnits modes? Is there a way to reduce the loading time of the models when using the .all or .cpuAndNeuralEngine computeUnits mode?
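For comparison outside the app, the same load-time measurement can be scripted on a Mac with coremltools (a sketch; "MyModel.mlpackage" is a placeholder path, and macOS timings will not match iOS, but the relative effect of the compute-unit choice shows up the same way):
import time
import coremltools as ct

MODEL_PATH = "MyModel.mlpackage"  # placeholder path to a converted model

for units in (ct.ComputeUnit.CPU_ONLY, ct.ComputeUnit.CPU_AND_GPU, ct.ComputeUnit.ALL):
    start = time.perf_counter()
    _ = ct.models.MLModel(MODEL_PATH, compute_units=units)  # loading prepares the model for the requested units
    print(units, f"{time.perf_counter() - start:.3f} s")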
Posted
by dbphr.
Last updated
.
Post not yet marked as solved
6 Replies
1k Views
After installing the tensorflow-metal PluggableDevice according to Getting Started with tensorflow-metal PluggableDevice, I tested this DCGAN example: https://www.tensorflow.org/tutorials/generative/dcgan. Everything was working perfectly until I decided to upgrade macOS from 12.0.1 to 12.1. Before the upgrade, the final result after 50 epochs looked like picture 1 below; after the upgrade it looks like picture 2 below. I am using: TensorFlow 2.7.0, tensorflow-metal 0.3.0, Python 3.9. I hope this question will also help Apple improve the Metal PluggableDevice. I can't wait to use it in my research.
Posted Last updated
.
Post marked as solved
2 Replies
293 Views
Hello everyone. I found a problem in the TensorFlow built-in function tf.signal.stft: when I run the code below, it raises an error. The device is a MacBook Pro with the M1 Pro chip, running in JupyterLab. The problem does not occur on Linux with CUDA. Does anyone know how to fix it? Thanks.
Code:
import numpy as np
import tensorflow as tf
random_waveform = np.random.normal(size=(16000))
tf_waveform = tf.constant(random_waveform)
tf_stft_waveform = tf.signal.stft(tf_waveform, frame_length=255, frame_step=128)
Error message:
InvalidArgumentError Traceback (most recent call last)
Input In [1], in <cell line: 6>()
4 random_waveform = np.random.normal(size=(16000))
5 tf_waveform = tf.constant(random_waveform)
----> 6 tf_stft_waveform = tf.signal.stft(tf_waveform, frame_length=255, frame_step=128)
File ~/miniconda3/envs/AI/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
151 except Exception as e:
152 filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153 raise e.with_traceback(filtered_tb) from None
154 finally:
155 del filtered_tb
File ~/miniconda3/envs/AI/lib/python3.9/site-packages/tensorflow/python/framework/ops.py:7164, in raise_from_not_ok_status(e, name)
7162 def raise_from_not_ok_status(e, name):
7163 e.message += (" name: " + name if name is not None else "")
-> 7164 raise core._status_to_exception(e) from None
InvalidArgumentError: Multiple Default OpKernel registrations match NodeDef '{{node ZerosLike}}': 'op: "ZerosLike" device_type: "DEFAULT" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "y"' and 'op: "ZerosLike" device_type: "DEFAULT" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "y"' [Op:ZerosLike]
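A workaround worth trying (a sketch; whether it sidesteps the duplicate-kernel registration coming from the Metal plugin is an assumption to verify) is to pin the STFT computation to the CPU device:
import numpy as np
import tensorflow as tf

random_waveform = np.random.normal(size=(16000,)).astype(np.float32)
tf_waveform = tf.constant(random_waveform)

# Request CPU placement for this op; if the clash comes from the plugin's GPU
# kernel registrations, the CPU path may avoid it.
with tf.device("/CPU:0"):
    tf_stft_waveform = tf.signal.stft(tf_waveform, frame_length=255, frame_step=128)
print(tf_stft_waveform.shape)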
Posted Last updated
.
Post not yet marked as solved
0 Replies
256 Views
I'm trying to run a Python file in VS Code but getting the error mentioned in the title. I am basically trying to train a deep learning model by importing libraries like TensorFlow, NumPy, pandas, matplotlib, etc., but I only get the error Illegal instruction: 4 and nothing else. One more thing: the same code works fine on Windows. Please help.
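On a Mac, "Illegal instruction: 4" at startup often means one of the installed packages was built for the wrong architecture or with CPU instructions the machine doesn't support. A small sketch (assuming the crash happens during one of the imports) that narrows down which library triggers it by importing them one at a time:
import importlib
import traceback

# An illegal-instruction crash kills the process and cannot be caught, but the
# last library that printed "imported OK" before the crash identifies the culprit.
for name in ("numpy", "pandas", "matplotlib", "tensorflow"):
    try:
        module = importlib.import_module(name)
        print(f"{name} {getattr(module, '__version__', '?')} imported OK")
    except Exception:
        print(f"{name} failed to import:")
        traceback.print_exc()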
Posted
by himaws72.
Last updated
.
Post not yet marked as solved
0 Replies
254 Views
We are developing a simple GAN, and when training it, the convergence behavior of the discriminator is different when we use the GPU than when using only the CPU or even running in Colab. We've read a lot, but this is the only post that seems to describe similar behavior. Unfortunately, after updating to version 0.4 the problem persists.
My hardware/software: MacBook Pro, model MacBookPro18,2. Chip: Apple M1 Max. Cores: 10 (8 performance and 2 efficiency). Memory: 64 GB. Firmware: 7459.101.3. OS: Monterey 12.3.1. OS version: 7459.101.3. Python version 3.8, and the most relevant libraries from !pip freeze:
keras==2.8.0
Keras-Preprocessing==1.1.2
....
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow-datasets==4.5.2
tensorflow-docs @ git+https://github.com/tensorflow/docs@7d5ea2e986a4eae7573be3face00b3cccd4b8b8b
tensorflow-macos==2.8.0
tensorflow-metadata==1.7.0
tensorflow-metal==0.4.0
CODE TO REPRODUCE: The code does not fit within the maximum message size, so I've shared a Google Colab notebook at: https://colab.research.google.com/drive/1oDS8EV0eP6kToUYJuxHf5WCZlRL0Ypgn?usp=sharing You can easily see that the loss goes to 0 after 1 or 2 epochs when the GPU is enabled, but if the GPU is disabled everything is OK.
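Before comparing CPU and GPU runs, it can help to remove run-to-run randomness as a variable. A short sketch for TF 2.8 (whether the tensorflow-metal plugin fully honors deterministic ops is an assumption to verify):
import tensorflow as tf

# Seed the Python, NumPy and TensorFlow RNGs so both runs start from the same
# initial weights and data order.
tf.keras.utils.set_random_seed(42)

# Request deterministic kernels where available (added in TF 2.8).
tf.config.experimental.enable_op_determinism()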
Posted Last updated
.
Post not yet marked as solved
2 Replies
1.8k Views
I'm using my 2020 Mac mini with the M1 chip, and this is the first time I've tried to use it for convolutional neural network training. The problem: I installed Python (3.8.12) using miniforge3 and TensorFlow following this instruction, but I'm still facing a GPU problem when training a 3D U-Net. Here's part of my code; I'm hoping to receive some suggestions to fix this.
import tensorflow as tf
from tensorflow import keras
import json
import numpy as np
import pandas as pd
import nibabel as nib
import matplotlib.pyplot as plt
from tensorflow.keras import backend as K
# check available devices
def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]
print(get_available_devices())
Metal device set to: Apple M1
['/device:CPU:0', '/device:GPU:0']
2022-02-09 11:52:55.468198: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-09 11:52:55.468885: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
X_norm_with_batch_dimension = np.expand_dims(X_norm, axis=0)
#tf.device('/device:GPU:0')  # Have tried this line, doesn't work
#tf.debugging.set_log_device_placement(True)  # Have tried this line, doesn't work
patch_pred = model.predict(X_norm_with_batch_dimension)
InvalidArgumentError: 2 root error(s) found.
(0) INVALID_ARGUMENT: CPU implementation of Conv3D currently only supports the NHWC tensor format.
[[node model/conv3d/Conv3D (defined at /Users/mwshay/miniforge3/envs/tensor/lib/python3.8/site-packages/keras/layers/convolutional.py:231) ]]
[[model/conv3d/Conv3D/_4]]
(1) INVALID_ARGUMENT: CPU implementation of Conv3D currently only supports the NHWC tensor format.
[[node model/conv3d/Conv3D (defined at /Users/mwshay/miniforge3/envs/tensor/lib/python3.8/site-packages/keras/layers/convolutional.py:231) ]]
0 successful operations. 0 derived errors ignored.
The code runs on Google Colab but not locally on the Mac mini with Jupyter Notebook. The NHWC tensor format error might indicate that I'm using the CPU to execute the code instead of the GPU. Is there any way to get the GPU to train the network in TensorFlow?
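The error text says the Conv3D op ended up on the CPU kernel, which only supports channels-last data. A small sketch (the toy layer and random volume below are stand-ins for the real 3D U-Net) for confirming device placement and that the input really is NDHWC, i.e. (batch, depth, height, width, channels):
import numpy as np
import tensorflow as tf

# Print where every op is placed, so it is visible whether Conv3D runs on GPU or CPU.
tf.debugging.set_log_device_placement(True)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 32, 1)),
    tf.keras.layers.Conv3D(8, kernel_size=3, padding="same", data_format="channels_last"),
])

X_norm = np.random.rand(32, 32, 32, 1).astype("float32")
X_with_batch = np.expand_dims(X_norm, axis=0)  # shape (1, 32, 32, 32, 1)
patch_pred = model.predict(X_with_batch)
print(patch_pred.shape)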
Posted
by MW_Shay.
Last updated
.
Post not yet marked as solved
4 Replies
1.1k Views
I am using a Core ML model from https://github.com/PeterL1n/RobustVideoMatting. I have an M1 MacBook 13" with 16 GB and an M1 Max MacBook 16" with 64 GB. With computeUnits set to .all or left as the default, the M1 Max 16" is much slower than the M1 13": one prediction takes 0.202 s vs 0.155 s. With .cpuOnly, the M1 Max 16" is a little faster: 0.129 s vs 0.146 s. With .cpuAndGPU, the M1 Max 16" is much faster than the M1 13": 0.057 s vs 0.086 s. And when I use .all or the default, the M1 Max shows error messages like this:
H11ANEDevice::H11ANEDeviceOpen IOServiceOpen failed result= 0xe00002e2
H11ANEDevice::H11ANEDeviceOpen kH11ANEUserClientCommand_DeviceOpen call failed result=0xe00002bc
Error opening LB - status=0xe00002bc.. Skipping LB and retrying
But the M1 13" doesn't show any errors. So I want to know: is this a bug in Core ML or in the M1 Max? My code is like this:
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try rvm_mobilenetv3_1920x1080_s0_25_int8_ANE(configuration: config)
let image1 = NSImage(named: "test1")?.cgImage(forProposedRect: nil, context: nil, hints: nil)
let input = try? rvm_mobilenetv3_1920x1080_s0_25_int8_ANEInput(srcWith: image1!, r1i: MLMultiArray(), r2i: MLMultiArray(), r3i: MLMultiArray(), r4i: MLMultiArray())
_ = try? model.prediction(input: input!)
Posted
by Tinyfool.
Last updated
.
Post not yet marked as solved
1 Reply
435 Views
I need to build a model to add to my app and tried following the Apple docs here. No luck, because I get an error that is discussed in this thread on the forum. I'm still not clear on why the error is occurring and can't resolve it. Is Create ML inside Playgrounds still supported at all? I tried using the Create ML app that you can access through developer tools, but it just crashes my Mac (a 2017 MBP - is it just too much of a brick to use for ML at this point? I should think not, because I've recently built and trained relatively simple models using TensorFlow and Python on this machine, and the classifier I'm trying to make now is really simple and doesn't have a huge dataset).
Posted
by agaS95.
Last updated
.
Post not yet marked as solved
0 Replies
436 Views
I am using the default HelloPhotogrammetry app you made: https://developer.apple.com/documentation/realitykit/creating_a_photogrammetry_command-line_app/ My system originally did not meet the specs to run this command-line tool because of its GPU. To solve this I bought the Apple-supported Blackmagic eGPU so the graphics requirement could be met. Here is the error I get when I run it, despite the eGPU:
apply_selection_policy_once: prefer use of removable GPUs (via (null):GPUSelectionPolicy->preferRemovable)
I have deduced that the application running the tool needs this: https://developer.apple.com/documentation/bundleresources/information_property_list/gpuselectionpolicy I tried modifying Terminal's plist with the updated value, but had no luck. I believe the command-line tool built within Xcode needs the updated value; I need help with that aspect so the system will use the eGPU. I did create a property list within the macOS app and added GPUSelectionPolicy with preferRemovable, and I am still getting the same error as above. Please advise. Also, to note: I did try temporarily turning off Prefer External GPU for Terminal, and the photogrammetry processing did run, but it was taking a while (>30 minutes), so I ended up killing the task. Looking at Activity Monitor, I saw that my internal GPU was being used, not my eGPU, which is what I am trying to use. Previously, when I did not have the eGPU plugged in, I would get an error saying that my specs did not meet the criteria, so it was interesting to see that it now accepted that my Mac met the criteria (which it technically did); it just did the processing on the less powerful GPU.
Posted Last updated
.
Post not yet marked as solved
1 Replies
533 Views
I'm trying to implement the PyTorch custom layer grid_sampler (https://pytorch.org/docs/1.9.1/generated/torch.nn.functional.grid_sample.html) on the GPU. Both of its inputs, input and grid, can be 5-D. My implementation of encodeToCommandBuffer, an MLCustomLayer protocol method, is shown below. From my attempts so far, the values of both id<MTLTexture> input and id<MTLTexture> grid don't meet expectations. So I wonder: can an MTLTexture be used to store a 5-D input tensor as an input to encodeToCommandBuffer? Or can anybody show me how to use MTLTexture correctly here? Thanks a lot!
- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer
                       inputs:(NSArray<id<MTLTexture>> *)inputs
                      outputs:(NSArray<id<MTLTexture>> *)outputs
                        error:(NSError * _Nullable *)error {
    NSLog(@"Dispatching to GPU");
    NSLog(@"inputs count %lu", (unsigned long)inputs.count);
    NSLog(@"outputs count %lu", (unsigned long)outputs.count);
    id<MTLComputeCommandEncoder> encoder = [commandBuffer computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(encoder != nil);
    id<MTLTexture> input = inputs[0];
    id<MTLTexture> grid = inputs[1];
    id<MTLTexture> output = outputs[0];
    NSLog(@"inputs shape %lu, %lu, %lu, %lu", (unsigned long)input.width, (unsigned long)input.height, (unsigned long)input.depth, (unsigned long)input.arrayLength);
    NSLog(@"grid shape %lu, %lu, %lu, %lu", (unsigned long)grid.width, (unsigned long)grid.height, (unsigned long)grid.depth, (unsigned long)grid.arrayLength);
    if (encoder)
    {
        [encoder setTexture:input atIndex:0];
        [encoder setTexture:grid atIndex:1];
        [encoder setTexture:output atIndex:2];
        NSUInteger wd = grid_sample_Pipeline.threadExecutionWidth;
        NSUInteger ht = grid_sample_Pipeline.maxTotalThreadsPerThreadgroup / wd;
        MTLSize threadsPerThreadgroup = MTLSizeMake(wd, ht, 1);
        MTLSize threadgroupsPerGrid = MTLSizeMake((input.width + wd - 1) / wd, (input.height + ht - 1) / ht, input.arrayLength);
        [encoder setComputePipelineState:grid_sample_Pipeline];
        [encoder dispatchThreadgroups:threadgroupsPerGrid threadsPerThreadgroup:threadsPerThreadgroup];
        [encoder endEncoding];
    }
    else
        return NO;
    *error = nil;
    return YES;
}
Posted
by stx-000.
Last updated
.
Post not yet marked as solved
2 Replies
722 Views
When running the same code on my M1 Mac with tensorflow-metal vs in Google Colab, I see a problem with the results. The code: https://colab.research.google.com/drive/13GzSfToUvmmGHaROS-sGCu9mY1n_2FYf?usp=sharing
import tensorflow as tf
import numpy as np
import pandas as pd
# Setup model
input_shape = (10, 5)
model_tst = tf.keras.Sequential()
model_tst.add(tf.keras.Input(shape=input_shape))
model_tst.add(tf.keras.layers.LSTM(100, return_sequences=True))
model_tst.add(tf.keras.layers.Dense(2, activation="sigmoid"))
model_tst.summary()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
model_tst.compile(
    loss=loss,
    optimizer=optimizer,
    # metrics=[tf.keras.metrics.BinaryCrossentropy()]
    metrics=["mse"]
)
# Generate step data
random_input = np.ones((11, 10, 5))
random_input[:, 8:, :] = 99
# Predictions
random_output2 = model_tst.predict(random_input, batch_size=1)[0, :, :].reshape(10, 2)
random_output3 = model_tst.predict(random_input, batch_size=10)[0, :, :].reshape(10, 2)
# Compare results
diff2 = random_output3 - random_output2
pd.DataFrame(diff2).T
Output on Mac:
Output on Google Colab:
If I reduce the number of nodes in the LSTM, I can get the problem to disappear:
import tensorflow as tf
import numpy as np
import pandas as pd
# Setup model
input_shape = (10, 5)
model_tst = tf.keras.Sequential()
model_tst.add(tf.keras.Input(shape=input_shape))
model_tst.add(tf.keras.layers.LSTM(2, return_sequences=True))
model_tst.add(tf.keras.layers.Dense(2, activation="sigmoid"))
model_tst.summary()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
model_tst.compile(
    loss=loss,
    optimizer=optimizer,
    # metrics=[tf.keras.metrics.BinaryCrossentropy()]
    metrics=["mse"]
)
# Generate step data
random_input = np.ones((11, 10, 5))
random_input[:, 8:, :] = 99
# Predictions
random_output2 = model_tst.predict(random_input, batch_size=1)[0, :, :].reshape(10, 2)
random_output3 = model_tst.predict(random_input, batch_size=10)[0, :, :].reshape(10, 2)
# Compare results
diff2 = random_output3 - random_output2
pd.DataFrame(diff2).T
-> The outputs are the same in this case. I guess this has to do with how calculations are getting passed to Apple Silicon. Any debugging steps I should try to resolve this problem?
Info: I set up TensorFlow using the following steps: https://developer.apple.com/metal/tensorflow-plugin/ When running, I get output showing that the GPU plugin is being used.
Posted Last updated
.
Post marked as solved
3 Replies
804 Views
MLCustomLayer implementation always dispatches to CPU instead of GPU
Background: I am trying to run my CoreML model with a custom layer on the iPhone 13 Pro. My custom layer runs successfully on the CPU, however it still dispatches to the CPU instead of the phone's GPU, despite the encodeToCommandBuffer member function being defined in the application's binding class for the custom layer. I have been following the coremltools documentation's suggested Swift example to get this working, but note that my implementation is purely in Objective-C++. Despite reading the documentation in depth, I still have not come across any resolution to the problem. Any help looking into this issue (or perhaps even a bug in CoreML) would be much appreciated! Below, I provide a minimal example based on the Swift example mentioned above.
Implementation
My toy Objective-C++ implementation is based on the Swift example here. It implements the Swish activation function for both the CPU and GPU.
PyTorch model to CoreML MLModel transformation
For brevity, I will not define my toy PyTorch model, nor the Python bindings that allow the custom Swish layer to be scripted/traced and then converted to a CoreML MLModel, but I can provide these if necessary. Just note that the Python layer's name and bindings should match the name in the class defined below, i.e. ToySwish. To convert the scripted/traced PyTorch model (called torchscript_model in the listing below) to a CoreML MLModel, I use coremltools (from Python) and then save the model as follows;
input_shapes = [[1,64,256,256]]
mlmodel = coremltools.converters.convert(
    torchscript_model,
    source='pytorch',
    inputs=[coremltools.TensorType(name=f'input_{i}', shape=input_shape) for i, input_shape in enumerate(input_shapes)],
    add_custom_layers = True,
    minimum_deployment_target = coremltools.target.iOS14,
    compute_units = coremltools.ComputeUnit.CPU_AND_GPU,
)
mlmodel.save('toy_swish_model.mlmodel')
Metal shader
I use the same Metal shader function swish from Swish.metal here.
MLCustomLayer binding class for the Swish MLModel layer
I define an Objective-C++ class analogous to the Swift example. This class inherits from NSObject and the MLCustomLayer protocol, and follows the guidelines in the Apple documentation for integrating a CoreML MLModel with a custom layer.
This is defined as follows.
Class definition and resource setup;
#import <Foundation/Foundation.h>
#include <CoreML/CoreML.h>
#import <Metal/Metal.h>

@interface ToySwish : NSObject<MLCustomLayer>{}
@end

@implementation ToySwish{
    id<MTLComputePipelineState> swishPipeline;
}

- (instancetype) initWithParameterDictionary:(NSDictionary<NSString *,id> *)parameters error:(NSError *__autoreleasing _Nullable *)error{
    NSError* errorPSO = nil;
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLLibrary> defaultlibrary = [device newDefaultLibrary];
    id<MTLFunction> swishFunction = [defaultlibrary newFunctionWithName:@"swish"];
    swishPipeline = [device newComputePipelineStateWithFunction:swishFunction error:&errorPSO];
    assert(errorPSO == nil);
    return self;
}

- (BOOL) setWeightData:(NSArray<NSData *> *)weights error:(NSError *__autoreleasing _Nullable *) error{
    return YES;
}

- (NSArray<NSArray<NSNumber *> * > *) outputShapesForInputShapes:(NSArray<NSArray<NSNumber *> *> *)inputShapes error:(NSError *__autoreleasing _Nullable *) error{
    return inputShapes;
}
CPU compute method (this is only shown for completeness);
- (BOOL) evaluateOnCPUWithInputs:(NSArray<MLMultiArray *> *)inputs outputs:(NSArray<MLMultiArray *> *)outputs error:(NSError *__autoreleasing _Nullable *)error{
    NSLog(@"Dispatching to CPU");
    for(NSInteger i = 0; i < inputs.count; i++){
        NSInteger num_elems = inputs[i].count;
        float* input_ptr = (float *) inputs[i].dataPointer;
        float* output_ptr = (float *) outputs[i].dataPointer;
        for(int j = 0; j < num_elems; j++){
            output_ptr[j] = 1.0/(1.0 + exp(-input_ptr[j]));
        }
    }
    return YES;
}
Encode GPU commands to the command buffer; note that, according to the documentation, this command buffer should not be committed, as it is executed by CoreML after this method returns.
- (BOOL) encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer inputs:(NSArray<id<MTLTexture>> *)inputs outputs:(NSArray<id<MTLTexture>> *)outputs error:(NSError *__autoreleasing _Nullable *)error{
    NSLog(@"Dispatching to GPU");
    id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(computeEncoder != nil);
    for(int i = 0; i < inputs.count; i++){
        [computeEncoder setComputePipelineState:swishPipeline];
        [computeEncoder setTexture:inputs[i] atIndex:0];
        [computeEncoder setTexture:outputs[i] atIndex:1];
        NSUInteger w = swishPipeline.threadExecutionWidth;
        NSUInteger h = swishPipeline.maxTotalThreadsPerThreadgroup / w;
        MTLSize threadGroupSize = MTLSizeMake(w, h, 1);
        NSInteger groupWidth = (inputs[0].width + threadGroupSize.width - 1) / threadGroupSize.width;
        NSInteger groupHeight = (inputs[0].height + threadGroupSize.height - 1) / threadGroupSize.height;
        NSInteger groupDepth = (inputs[0].arrayLength + threadGroupSize.depth - 1) / threadGroupSize.depth;
        MTLSize threadGroups = MTLSizeMake(groupWidth, groupHeight, groupDepth);
        [computeEncoder dispatchThreads:threadGroups threadsPerThreadgroup:threadGroupSize];
        [computeEncoder endEncoding];
    }
    return YES;
}
Run inference for a given input
The MLModel is loaded and compiled in the application. I check to ensure that the model configuration's computeUnits is set to MLComputeUnitsAll as desired (this should allow dispatching of the MLModel's layers to the CPU, GPU and ANE).
I define an MLDictionaryFeatureProvider object called feature_provider from an NSMutableDictionary of input features (input tensors in this case), and then pass it to the predictionFromFeatures method of my loaded model, model, as follows;
@autoreleasepool {
    [model predictionFromFeatures:feature_provider error:error];
}
This computes a single forward pass of my model. When this executes, you can see that the 'Dispatching to CPU' string is printed instead of the 'Dispatching to GPU' string. This (along with the slow execution time) indicates the Swish layer is being run from the evaluateOnCPUWithInputs method, and thus on the CPU, instead of the GPU as expected. I am quite new to developing for iOS and to Objective-C++, so I might have missed something quite simple, but from reading the documentation and examples it is not at all clear to me what the issue is. Any help or advice would be really appreciated :)
Environment
Xcode 13.1
iPhone 13
iOS 15.1.1
iOS deployment target 15.0
Posted Last updated
.