Post not yet marked as solved
The documentation for MPSGraph has no information about what the classes and methods actually do. It only enumerates everything without any explanation of what it is or how it works. Why is that?
https://developer.apple.com/documentation/metalperformanceshadersgraph
The MPSGraph header files do contain some comments, though, so this looks like a documentation bug.
I'm trying to run the sample code for MPSGraph from https://developer.apple.com/documentation/metalperformanceshadersgraph/adding_custom_functions_to_a_shader_graph
and it's not working. It builds successfully, but after you press train (the play button), the program fails right after the first training iteration with an error like this:
-[MTLDebugCommandBuffer lockPurgeableObjects]:2103: failed assertion `MTLResource 0x600001693940 (label: (null)), referenced in cmd buffer 0x124015800 (label: (null)) is in volatile or empty purgeable state at commit'
It fails on commandBuffer.commit() in the runTrainingIterationBatch() method.
It behaves as if the command buffer were already committed (I checked, and it is). But why would such a thing happen in EXAMPLE CODE?
I tried wrapping the commit call in a command-buffer status check, which avoids the crash, but the program then behaves incorrectly overall and doesn't compute the loss properly.
Everything is made worse by the fact that the documentation for MPSGraph is empty: it contains only class and method names without any descriptions.
My env:
Xcode 13.4.1 (13F100)
macOS 12.4
MacBook Pro (m1 pro) 14' 2021 16gb
Tried building for iPhone 12 Pro Max / iOS 15.5 and as a Mac Catalyst application. Got the same error everywhere.
I am training a model using tensorflow-metal and model training (and the whole application) freezes up. The behavior is nondeterministic. I believe the problem is with Metal (1) because of the contents of the backtraces below, and (2) because when I run the same code on a machine with non-Metal TensorFlow (using a GPU), everything works fine.
I can't share my code publicly, but I would be willing to share it with an Apple engineer privately over email if that would help. It's hard to create a minimum reproduction example since my program is somewhat complex and the bug is nondeterministic. The bug does appear pretty reliably.
It looks like the problem might be in some Metal Performance Shaders init code.
The state of everything (backtraces, etc.) when the program freezes is attached.
Backtraces
We use dynamic input sizes for some use cases. When the compute unit mode is .all, there is a big difference in execution time when the dynamic input shape doesn't match the optimal shape. If we set the model's optimal input shape to 896x896 but run it with an input shape of 1024x768, execution is almost twice as slow as with an input of 896x896.
For example, a model configured with a preferred input shape of 896x896 achieves inference in 66 ms when the input shape is 896x896, but only 117 ms when the input shape is 1024x768.
In that case, to get the best inference performance we would need to switch from one model to another depending on the input shape, which is not dynamic at all and is memory-hungry. Is there a way to reduce the execution time when the shape is outside the preferred shape range?
We use several CoreML models in our Swift application. The memory footprint of these CoreML models ranges from 15 kB to 3.5 MB according to the Xcode CoreML utility tool. We observe a huge difference in loading time depending on the compute units selected to run the model.
Here is a small sample code used to load the model:
let configuration = MLModelConfiguration()
// Here I use the .all compute units mode:
configuration.computeUnits = .all
let myModel = try! myCoremlModel(configuration: configuration).model
Here are the profiling results of this sample code for different model sizes, as a function of the targeted compute units:
Model-3.5-MB :
computeUnits is .cpuAndGPU: 188 ms ⇒ 18 MB/s
computeUnits is .all or .cpuAndNeuralEngine on iOS16: 4000 ms ⇒ 875 kB/s
Model-2.6-MB:
computeUnits is .cpuAndGPU: 144 ms ⇒ 18 MB/s
computeUnits is .all or .cpuAndNeuralEngine on iOS16: 1300 ms ⇒ 2 MB/s
Model-15-kB:
computeUnits is .cpuAndGPU: 18 ms ⇒ 833 kB/s
computeUnits is .all or .cpuAndNeuralEngine on iOS16: 700 ms ⇒ 22 kB/s
What explains the difference in loading time as a function of the computeUnits mode? Is there a way to reduce the loading time of the models when using the .all or .cpuAndNeuralEngine computeUnits modes?
After installing the tensorflow-metal PluggableDevice according to Getting Started with tensorflow-metal PluggableDevice, I tested this DCGAN example: https://www.tensorflow.org/tutorials/generative/dcgan. Everything was working perfectly until I decided to upgrade macOS from 12.0.1 to 12.1. Before the upgrade, the final result after 50 epochs looked like picture1 below; after the upgrade it looks like picture2 below.
I am using:
TensorFlow 2.7.0
tensorflow-metal-0.3.0
python3.9
I hope this question will also help Apple to improve Metal PluggableDevice. I can't wait to use it in my research.
Hello everyone
I found a problem with a TF built-in function (tf.signal.stft).
When I run the code below, it causes a problem.
The device is a MacBook Pro with the M1 Pro chip, running in JupyterLab.
However, the problem does not occur on Linux with CUDA.
Does anyone know how to fix this problem?
Thanks.
code:
import numpy as np
import tensorflow as tf
random_waveform = np.random.normal(size=(16000))
tf_waveform = tf.constant(random_waveform)
tf_stft_waveform = tf.signal.stft(tf_waveform, frame_length=255, frame_step=128)
error message:
InvalidArgumentError Traceback (most recent call last)
Input In [1], in <cell line: 6>()
4 random_waveform = np.random.normal(size=(16000))
5 tf_waveform = tf.constant(random_waveform)
----> 6 tf_stft_waveform = tf.signal.stft(tf_waveform, frame_length=255, frame_step=128)
File ~/miniconda3/envs/AI/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
151 except Exception as e:
152 filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153 raise e.with_traceback(filtered_tb) from None
154 finally:
155 del filtered_tb
File ~/miniconda3/envs/AI/lib/python3.9/site-packages/tensorflow/python/framework/ops.py:7164, in raise_from_not_ok_status(e, name)
7162 def raise_from_not_ok_status(e, name):
7163 e.message += (" name: " + name if name is not None else "")
-> 7164 raise core._status_to_exception(e) from None
InvalidArgumentError: Multiple Default OpKernel registrations match NodeDef '{{node ZerosLike}}': 'op: "ZerosLike" device_type: "DEFAULT" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "y"' and 'op: "ZerosLike" device_type: "DEFAULT" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "y"' [Op:ZerosLike]
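As a sanity check while the Metal build is failing, the same transform can be computed with plain NumPy. This is a rough sketch only: it uses a periodic Hann window and an FFT length rounded up to the next power of two, which approximates tf.signal.stft's defaults but is not guaranteed to match it bit for bit.

```python
import numpy as np

def stft_numpy(x, frame_length=255, frame_step=128):
    # fft_length defaults to the next power of two >= frame_length,
    # mirroring tf.signal.stft's documented default
    fft_length = 1 << (frame_length - 1).bit_length()
    # periodic Hann window of length frame_length
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_length) / frame_length)
    n_frames = 1 + (len(x) - frame_length) // frame_step
    frames = np.stack([x[i * frame_step : i * frame_step + frame_length]
                       for i in range(n_frames)])
    return np.fft.rfft(frames * window, n=fft_length, axis=-1)

spec = stft_numpy(np.random.normal(size=16000))
print(spec.shape)  # (124, 129)
```

This at least lets the rest of a pipeline be exercised on the CPU until the tensorflow-metal kernel registration issue is resolved.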
I'm trying to run a Python file in VS Code but get the error mentioned in the title. I'm trying to train a deep learning model, importing libraries like tensorflow, numpy, pandas, matplotlib, etc. The only output I get is "Illegal instruction: 4" and nothing else. The same code works fine on Windows. Please help.
We are developing a simple GAN, and when training it, the convergence behavior of the discriminator is different when we use the GPU than when using only the CPU, or even when executing in Colab.
We've read a lot, but this is the only post that seems to describe similar behavior.
Unfortunately, after updating to version 0.4 the problem persists.
My hardware/software: MacBook Pro, model MacBookPro18,2. Chip: Apple M1 Max. Cores: 10 (8 performance and 2 efficiency). Memory: 64 GB. Firmware: 7459.101.3. OS: Monterey 12.3.1. OS version: 7459.101.3.
Python version 3.8; the most relevant libraries from pip freeze:
keras==2.8.0 Keras-Preprocessing==1.1.2 .... tensorboard==2.8.0 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorflow-datasets==4.5.2 tensorflow-docs @ git+https://github.com/tensorflow/docs@7d5ea2e986a4eae7573be3face00b3cccd4b8b8b tensorflow-macos==2.8.0 tensorflow-metadata==1.7.0 tensorflow-metal==0.4.0
##### CODE TO REPRODUCE ##### The code does not fit within the maximum message size, so I've shared a Google Colab notebook at:
https://colab.research.google.com/drive/1oDS8EV0eP6kToUYJuxHf5WCZlRL0Ypgn?usp=sharing
You can easily see that the loss goes to 0 after 1 or 2 epochs when the GPU is enabled, but with the GPU disabled everything is OK.
I tried to install dependencies with PDM for a project and found that TensorFlow does not detect the GPU of the M1 Pro. When I create a virtual environment with Poetry, I don't have this problem.
Any hint on how to solve this?
I'm using my 2020 Mac mini with the M1 chip, and this is the first time I've tried to use it for convolutional neural network training.
The problem: I installed Python (3.8.12) using miniforge3 and TensorFlow following this instruction, but I'm still facing a GPU problem when training a 3D U-Net.
Here's part of my code; I'm hoping for some suggestions to fix this.
import tensorflow as tf
from tensorflow import keras
import json
import numpy as np
import pandas as pd
import nibabel as nib
import matplotlib.pyplot as plt
from tensorflow.keras import backend as K
# check available devices
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())
Metal device set to: Apple M1
['/device:CPU:0', '/device:GPU:0']
2022-02-09 11:52:55.468198: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-09 11:52:55.468885: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
X_norm_with_batch_dimension = np.expand_dims(X_norm, axis=0)
#tf.device('/device:GPU:0') #Have tried this line doesn't work
#tf.debugging.set_log_device_placement(True) #Have tried this line doesn't work
patch_pred = model.predict(X_norm_with_batch_dimension)
InvalidArgumentError: 2 root error(s) found.
(0) INVALID_ARGUMENT: CPU implementation of Conv3D currently only supports the NHWC tensor format.
[[node model/conv3d/Conv3D
(defined at /Users/mwshay/miniforge3/envs/tensor/lib/python3.8/site-packages/keras/layers/convolutional.py:231)
]] [[model/conv3d/Conv3D/_4]]
(1) INVALID_ARGUMENT: CPU implementation of Conv3D currently only supports the NHWC tensor format.
[[node model/conv3d/Conv3D
(defined at /Users/mwshay/miniforge3/envs/tensor/lib/python3.8/site-packages/keras/layers/convolutional.py:231) ]]
0 successful operations.
0 derived errors ignored.
The code runs on Google Colab but fails on the Mac mini locally in a Jupyter notebook. The NHWC tensor format error suggests the code is being executed on the CPU instead of the GPU.
Is there any way to get TensorFlow to train the network on the GPU?
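The error message itself points at data layout: TensorFlow's CPU Conv3D kernel only accepts channels-last ("NDHWC") volumes. If the data happens to be channels-first, one hedged workaround is to transpose it before calling predict, illustrated here with NumPy (the shapes below are made up for illustration, not taken from the post):

```python
import numpy as np

# A hypothetical channels-first 3D volume: (batch, channels, depth, height, width)
x_ncdhw = np.zeros((1, 4, 16, 64, 64), dtype=np.float32)

# Move channels to the last axis: (batch, depth, height, width, channels),
# the NDHWC layout that TensorFlow's CPU Conv3D kernel expects
x_ndhwc = np.transpose(x_ncdhw, (0, 2, 3, 4, 1))
print(x_ndhwc.shape)  # (1, 16, 64, 64, 4)
```

This only sidesteps the CPU kernel's restriction; it does not explain why the prediction fell back to the CPU in the first place.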
I am using a CoreML model from https://github.com/PeterL1n/RobustVideoMatting.
I have an M1 MacBook 13" with 16 GB and an M1 Max MacBook 16" with 64 GB.
With computeUnits set to .all (or the default), the M1 Max 16" is much slower than the M1 13": one prediction takes 0.202 s vs. 0.155 s.
Using .cpuOnly, the M1 Max 16" is slightly faster: 0.129 s vs. 0.146 s.
Using .cpuAndGPU, the M1 Max 16" is much faster than the M1 13": 0.057 s vs. 0.086 s.
And when I use .all or the default, the M1 Max prints error messages like this:
H11ANEDevice::H11ANEDeviceOpen IOServiceOpen failed result= 0xe00002e2
H11ANEDevice::H11ANEDeviceOpen kH11ANEUserClientCommand_DeviceOpen call failed result=0xe00002bc
Error opening LB - status=0xe00002bc.. Skipping LB and retrying
But the M1 13" doesn't produce any errors.
So I want to know: is this a bug in CoreML or in the M1 Max?
My code is like this:
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try rvm_mobilenetv3_1920x1080_s0_25_int8_ANE(configuration: config)
let image1 = NSImage(named: "test1")?.cgImage(forProposedRect: nil, context: nil, hints: nil)
let input = try? rvm_mobilenetv3_1920x1080_s0_25_int8_ANEInput(srcWith:image1!, r1i: MLMultiArray(), r2i: MLMultiArray(), r3i: MLMultiArray(), r4i: MLMultiArray())
_ = try? model.prediction(input: input!)
I just got my new MacBook Pro with the M1 Max chip and am setting up Python. I've tried several combinations of settings to test speed, and now I'm quite confused. First, my questions:
Why is Python running natively on the M1 Max much (~100%) slower than on my old MacBook Pro 2016 with an Intel i5?
On the M1 Max, why is there no significant speed difference between a native run (via miniforge) and a run via Rosetta (via Anaconda), which is supposed to be ~20% slower?
On the M1 Max with a native run, why is there no significant speed difference between conda-installed NumPy and TensorFlow-installed NumPy, which is supposed to be faster?
On the M1 Max, why are runs in the PyCharm IDE consistently ~20% slower than runs from the terminal? This doesn't happen on my old Intel Mac.
Evidence supporting my questions is as follows:
Here are the settings I've tried:
1. Python installed by:
Miniforge-arm64, so that Python runs natively on the M1 Max chip. (Checked in Activity Monitor: the Kind of the python process is Apple.)
Anaconda: Python then runs via Rosetta. (Checked in Activity Monitor: the Kind of the python process is Intel.)
2. NumPy installed by:
conda install numpy: NumPy from the original conda-forge channel, or pre-installed with Anaconda.
Apple TensorFlow: with Python installed by miniforge, I directly install TensorFlow, and NumPy is installed along with it. NumPy installed this way is said to be optimized for Apple M1 and to be faster. Here are the installation commands:
conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install tensorflow-metal
3. Run from
Terminal.
PyCharm (Apple Silicon version).
Here is the test code:
import time
import numpy as np
np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10
timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)
print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')
and here are the results:
+-----------------------------------+-----------------------+--------------------+
| Python installed by (run on)→ | Miniforge (native M1) | Anaconda (Rosseta) |
+----------------------+------------+------------+----------+----------+---------+
| Numpy installed by ↓ | Run from → | Terminal | PyCharm | Terminal | PyCharm |
+----------------------+------------+------------+----------+----------+---------+
| Apple Tensorflow | 4.19151 | 4.86248 | / | / |
+-----------------------------------+------------+----------+----------+---------+
| conda install numpy | 4.29386 | 4.98370 | 4.10029 | 4.99271 |
+-----------------------------------+------------+----------+----------+---------+
This is quite slow. For comparison:
the same code on my old MacBook Pro 2016 with an i5 chip costs 2.39917 s.
another post reports that on an M1 chip (not Pro or Max), miniforge + conda-installed NumPy takes 2.53214 s, and miniforge + Apple TensorFlow NumPy takes 1.00613 s.
you may also try it on your own device.
Here are the CPU details:
My old i5:
$ sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Intel(R) Core(TM) i5-6360U CPU @ 2.00GHz
machdep.cpu.core_count: 2
My new M1 Max:
% sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Apple M1 Max
machdep.cpu.core_count: 10
I followed the instructions strictly from the tutorials, so why does all this happen? Is it because of installation flaws, or because of the M1 Max chip? Since my work relies heavily on local runs, local speed is very important to me. Any suggestions for a possible solution, or any data points from your own device, would be greatly appreciated :)
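Since np.linalg.svd time is dominated by the underlying BLAS/LAPACK, one thing worth checking in each environment (a debugging step, not a fix) is which backend each NumPy build links against:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build was linked against.
# Comparing this output across the miniforge, Anaconda, and Apple-TensorFlow
# environments shows whether a slow environment is using a generic backend
# instead of an Accelerate-optimized one.
np.show_config()
```

If the supposedly optimized environment reports the same backend as the slow one, the installation rather than the chip would be the likely culprit.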
I need to build a model to add to my app and tried following the Apple docs here.
No luck, because I get an error that is discussed in this thread on the forum. I'm still not clear on why the error occurs and can't resolve it.
I wonder if CreateML inside Playgrounds is still supported at all. I tried the Create ML app that you can access through developer tools, but it just crashes my Mac (a 2017 MBP - is it just too much of a brick to use for ML at this point? I should think not, because I've recently built and trained relatively simple models using TensorFlow + Python on this machine, and the classifier I'm trying to make now is really simple and doesn't have a huge dataset).
I am using the default HelloPhotogrammetry app you guys made: https://developer.apple.com/documentation/realitykit/creating_a_photogrammetry_command-line_app/
My system originally did not meet the specs to run this command-line tool because of a GPU limitation. To solve this I bought the Apple-supported Blackmagic eGPU. Here is the error I get when I run it, despite the eGPU: apply_selection_policy_once: prefer use of removable GPUs (via (null):GPUSelectionPolicy->preferRemovable)
I have deduced that the application running it needs this: https://developer.apple.com/documentation/bundleresources/information_property_list/gpuselectionpolicy
I tried modifying Terminal.plist with the updated value, but had no luck with it. I believe the command-line tool built in Xcode needs the updated value; I need help with that aspect so the system will use the eGPU.
I did create a property list within the macOS app and added GPUSelectionPolicy with preferRemovable, and I am still getting the same error as above. Please advise.
Also, to note: I did temporarily turn off Prefer External GPU for Terminal, and the photogrammetry processing did run, but it was taking a while (>30 minutes), so I killed the task. Activity Monitor showed that my internal GPU was being used, not the eGPU I'm trying to use. Previously, without the eGPU plugged in, I would get an error saying my specs did not meet the criteria, so it was interesting that it now considered my Mac to meet the criteria (which it technically did); it just did the processing on the less powerful GPU.
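For reference, the approach described above amounts to adding this key/value pair to the app bundle's Info.plist (the key and value are as documented on the GPUSelectionPolicy page linked above; whether a command-line tool or Terminal actually honors it is exactly the open question here):

```xml
<key>GPUSelectionPolicy</key>
<string>preferRemovable</string>
```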
I'm trying to implement the PyTorch custom layer grid_sampler (https://pytorch.org/docs/1.9.1/generated/torch.nn.functional.grid_sample.html) on the GPU. Both of its inputs, input and grid, can be 5-D. My implementation of encodeToCommandBuffer, the MLCustomLayer protocol method, is shown below. In my attempts so far, neither the id<MTLTexture> input nor the id<MTLTexture> grid contains the values I expect. So I wonder: can an MTLTexture be used to store a 5-D tensor as an input to encodeToCommandBuffer? Or can anybody show me how to use MTLTexture correctly here? Thanks a lot!
- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer
                       inputs:(NSArray<id<MTLTexture>> *)inputs
                      outputs:(NSArray<id<MTLTexture>> *)outputs
                        error:(NSError * _Nullable *)error {
    NSLog(@"Dispatching to GPU");
    NSLog(@"inputs count %lu", (unsigned long)inputs.count);
    NSLog(@"outputs count %lu", (unsigned long)outputs.count);
    id<MTLComputeCommandEncoder> encoder =
        [commandBuffer computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(encoder != nil);
    id<MTLTexture> input = inputs[0];
    id<MTLTexture> grid = inputs[1];
    id<MTLTexture> output = outputs[0];
    NSLog(@"inputs shape %lu, %lu, %lu, %lu", (unsigned long)input.width, (unsigned long)input.height, (unsigned long)input.depth, (unsigned long)input.arrayLength);
    NSLog(@"grid shape %lu, %lu, %lu, %lu", (unsigned long)grid.width, (unsigned long)grid.height, (unsigned long)grid.depth, (unsigned long)grid.arrayLength);
    if (encoder) {
        [encoder setTexture:input atIndex:0];
        [encoder setTexture:grid atIndex:1];
        [encoder setTexture:output atIndex:2];
        NSUInteger wd = grid_sample_Pipeline.threadExecutionWidth;
        NSUInteger ht = grid_sample_Pipeline.maxTotalThreadsPerThreadgroup / wd;
        MTLSize threadsPerThreadgroup = MTLSizeMake(wd, ht, 1);
        MTLSize threadgroupsPerGrid = MTLSizeMake((input.width + wd - 1) / wd,
                                                  (input.height + ht - 1) / ht,
                                                  input.arrayLength);
        [encoder setComputePipelineState:grid_sample_Pipeline];
        [encoder dispatchThreadgroups:threadgroupsPerGrid threadsPerThreadgroup:threadsPerThreadgroup];
        [encoder endEncoding];
    } else {
        return NO;
    }
    *error = nil;
    return YES;
}
When running the same code on my M1 Mac with tensorflow-metal vs. in Google Colab, I see a problem with the results.
The code: https://colab.research.google.com/drive/13GzSfToUvmmGHaROS-sGCu9mY1n_2FYf?usp=sharing
import tensorflow as tf
import numpy as np
import pandas as pd
# Setup model
input_shape = (10, 5)
model_tst = tf.keras.Sequential()
model_tst.add(tf.keras.Input(shape=input_shape))
model_tst.add(tf.keras.layers.LSTM(100, return_sequences=True))
model_tst.add(tf.keras.layers.Dense(2, activation="sigmoid"))
model_tst.summary()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
model_tst.compile(
loss=loss,
optimizer=optimizer,
# metrics=[tf.keras.metrics.BinaryCrossentropy()
metrics=["mse"
]
)
# Generate step data
random_input = np.ones((11, 10, 5))
random_input[:, 8:, :] = 99
# Predictions
random_output2 = model_tst.predict(random_input, batch_size=1)[0, :, :].reshape(10, 2)
random_output3 = model_tst.predict(random_input, batch_size=10)[0, :, :].reshape(10, 2)
# Compare results
diff2 = random_output3 - random_output2
pd.DataFrame(diff2).T
Output on Mac:
Output on Google Colab:
If I reduce the number of units in the LSTM, the problem disappears:
import tensorflow as tf
import numpy as np
import pandas as pd
# Setup model
input_shape = (10, 5)
model_tst = tf.keras.Sequential()
model_tst.add(tf.keras.Input(shape=input_shape))
model_tst.add(tf.keras.layers.LSTM(2, return_sequences=True))
model_tst.add(tf.keras.layers.Dense(2, activation="sigmoid"))
model_tst.summary()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
model_tst.compile(
loss=loss,
optimizer=optimizer,
# metrics=[tf.keras.metrics.BinaryCrossentropy()
metrics=["mse"
]
)
# Generate step data
random_input = np.ones((11, 10, 5))
random_input[:, 8:, :] = 99
# Predictions
random_output2 = model_tst.predict(random_input, batch_size=1)[0, :, :].reshape(10, 2)
random_output3 = model_tst.predict(random_input, batch_size=10)[0, :, :].reshape(10, 2)
# Compare results
diff2 = random_output3 - random_output2
pd.DataFrame(diff2).T
-> the outputs are the same in this case.
I guess this has to do with how calculations are passed to Apple silicon.
Any debugging steps I should try to resolve this problem?
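For what it's worth, on a deterministic CPU backend the batch size should not change the per-sample results at all. A small NumPy sketch of the same comparison methodology (with a made-up dense layer standing in for the model) illustrates the invariance that the Metal backend appears to violate:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2)).astype(np.float32)   # stand-in for the model's weights
x = np.ones((11, 10, 5), dtype=np.float32)       # same shape as the post's input

# "batch_size=1": process samples one at a time
out_b1 = np.stack([x[i] @ W for i in range(len(x))])
# "batch_size=11": process all samples at once
out_b11 = x @ W

# On a deterministic backend the two paths agree exactly
print(np.abs(out_b1 - out_b11).max())
```

If running the Keras comparison with the GPU hidden from TensorFlow makes the batch-size-1 and batch-size-10 predictions agree, that would localize the discrepancy to the Metal kernels rather than the model definition.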
Info:
I set up TensorFlow using the following steps: https://developer.apple.com/metal/tensorflow-plugin/
When running, I get output showing that the GPU plugin is being used.
MLCustomLayer implementation always dispatches to CPU instead of GPU
Background:
I am trying to run my CoreML model, which has a custom layer, on the iPhone 13 Pro. The custom layer runs successfully on the CPU, but it always dispatches to the CPU instead of the phone's GPU, despite the encodeToCommandBuffer method being defined in the application's binding class for the custom layer.
I have been following the CoreMLTools documentation's suggested Swift example to get this working, but note that my implementation is purely in Objective-C++.
Despite reading in depth into the documentation, I still have not come across any resolution to the problem. Any help looking into this issue (or perhaps even bug in CoreML) would be much appreciated!
Below, I provide a minimal example based off of the Swift example mentioned above.
Implementation
My toy Objective C++ implementation is based off of the Swift example here. This implements the Swish activation function for both the CPU and GPU.
PyTorch model to CoreML MLModel transformation
For brevity, I will not define my toy PyTorch model, nor the Python bindings to allow the custom Swish layer to be scripted/traced and then converted to a CoreML MLModel, but I can provide these if necessary. Just note that the Python layer's name and bindings should match the name in the class defined below, ie. ToySwish.
To convert the scripted/traced PyTorch model (called torchscript_model in the listing below) to a CoreML MLModel, I use CoreMLTools (from Python) and then save the model as follows;
input_shapes = [[1,64,256,256]]
mlmodel = coremltools.converters.convert(
torchscript_model,
source='pytorch',
inputs=[coremltools.TensorType(name=f'input_{i}', shape=input_shape) for i, input_shape in enumerate(input_shapes)],
add_custom_layers = True,
minimum_deployment_target = coremltools.target.iOS14,
compute_units = coremltools.ComputeUnit.CPU_AND_GPU,
)
mlmodel.save('toy_swish_model.mlmodel')
Metal shader
I use the same Metal shader function swish from Swish.metal here.
MLCustomLayer binding class for Swish MLModel layer
I define an analogous Objective-C++ class to the Swift example. This class inherits from NSObject and the MLCustomLayer protocol. This class follows the guidelines in the Apple documentation for integrating a CoreML MLModel with a custom layer. This is defined as follows;
Class definition and resource setup;
#import <Foundation/Foundation.h>
#include <CoreML/CoreML.h>
#import <Metal/Metal.h>
@interface ToySwish : NSObject<MLCustomLayer>{}
@end
@implementation ToySwish {
    id<MTLComputePipelineState> swishPipeline;
}

- (instancetype)initWithParameterDictionary:(NSDictionary<NSString *, id> *)parameters error:(NSError *__autoreleasing _Nullable *)error {
    NSError *errorPSO = nil;
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLLibrary> defaultLibrary = [device newDefaultLibrary];
    id<MTLFunction> swishFunction = [defaultLibrary newFunctionWithName:@"swish"];
    swishPipeline = [device newComputePipelineStateWithFunction:swishFunction error:&errorPSO];
    assert(errorPSO == nil);
    return self;
}

- (BOOL)setWeightData:(NSArray<NSData *> *)weights error:(NSError *__autoreleasing _Nullable *)error {
    return YES;
}

- (NSArray<NSArray<NSNumber *> *> *)outputShapesForInputShapes:(NSArray<NSArray<NSNumber *> *> *)inputShapes error:(NSError *__autoreleasing _Nullable *)error {
    return inputShapes;
}
CPU compute method (shown only for completeness):
- (BOOL)evaluateOnCPUWithInputs:(NSArray<MLMultiArray *> *)inputs outputs:(NSArray<MLMultiArray *> *)outputs error:(NSError *__autoreleasing _Nullable *)error {
    NSLog(@"Dispatching to CPU");
    for (NSInteger i = 0; i < inputs.count; i++) {
        NSInteger num_elems = inputs[i].count;
        float *input_ptr = (float *)inputs[i].dataPointer;
        float *output_ptr = (float *)outputs[i].dataPointer;
        for (int j = 0; j < num_elems; j++) {
            // Swish: x * sigmoid(x) = x / (1 + exp(-x))
            output_ptr[j] = input_ptr[j] / (1.0f + expf(-input_ptr[j]));
        }
    }
    return YES;
}
Encode GPU commands to the command buffer:
Note, according to documentation, this command buffer should not be committed, as it is executed by CoreML after this method returns.
- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer inputs:(NSArray<id<MTLTexture>> *)inputs outputs:(NSArray<id<MTLTexture>> *)outputs error:(NSError *__autoreleasing _Nullable *)error {
    NSLog(@"Dispatching to GPU");
    id<MTLComputeCommandEncoder> computeEncoder =
        [commandBuffer computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(computeEncoder != nil);
    for (int i = 0; i < inputs.count; i++) {
        [computeEncoder setComputePipelineState:swishPipeline];
        [computeEncoder setTexture:inputs[i] atIndex:0];
        [computeEncoder setTexture:outputs[i] atIndex:1];
        NSInteger w = swishPipeline.threadExecutionWidth;
        NSInteger h = swishPipeline.maxTotalThreadsPerThreadgroup / w;
        MTLSize threadGroupSize = MTLSizeMake(w, h, 1);
        NSInteger groupWidth = (inputs[0].width + threadGroupSize.width - 1) / threadGroupSize.width;
        NSInteger groupHeight = (inputs[0].height + threadGroupSize.height - 1) / threadGroupSize.height;
        NSInteger groupDepth = (inputs[0].arrayLength + threadGroupSize.depth - 1) / threadGroupSize.depth;
        // threadGroups holds threadgroup counts, so dispatchThreadgroups is the matching call
        MTLSize threadGroups = MTLSizeMake(groupWidth, groupHeight, groupDepth);
        [computeEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup:threadGroupSize];
        [computeEncoder endEncoding];
    }
    return YES;
}
Run inference for a given input
The MLModel is loaded and compiled in the application. I check that the model configuration's computeUnits is set to MLComputeUnitsAll as desired, which should allow the MLModel's layers to be dispatched to the CPU, GPU, and ANE.
I define a MLDictionaryFeatureProvider object called feature_provider from a NSMutableDictionary of input features (input tensors in this case), and then pass this to the predictionFromFeatures method of my loaded model model as follows;
@autoreleasepool {
[model predictionFromFeatures:feature_provider error:error];
}
This computes a single forward pass of my model. When it executes, the 'Dispatching to CPU' string is printed instead of 'Dispatching to GPU'. This (along with the slow execution time) indicates that the Swish layer is being run via the evaluateOnCPUWithInputs method, and thus on the CPU instead of the GPU as expected.
I am quite new to developing for iOS and to Objective-C++, so I might have missed something that is quite simple, however from reading the documentation and examples, it is not at all clear to me what the issue is. Any help or advice would be really appreciated :)
Environment
XCode 13.1
iPhone 13
iOS 15.1.1
iOS deployment target 15.0
SYSTEM:
MacBook Pro 14 (M1 Apple Silicon)
MacOS 12.0.1
DONE:
https://developer.apple.com/metal/tensorflow-plugin/
I have followed this for ARM M1 Apple silicon on my 14".
(I had Anaconda installed before; that may be causing the error, but I do NOT want to delete my Anaconda altogether.)
CODE
import tensorflow as tf
ERROR
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
~/miniforge3/lib/python3.9/site-packages/numpy/core/__init__.py in <module>
21 try:
---> 22 from . import multiarray
23 except ImportError as exc:
~/miniforge3/lib/python3.9/site-packages/numpy/core/multiarray.py in <module>
11
---> 12 from . import overrides
13 from . import _multiarray_umath
~/miniforge3/lib/python3.9/site-packages/numpy/core/overrides.py in <module>
6
----> 7 from numpy.core._multiarray_umath import (
8 add_docstring, implement_array_function, _get_implementing_args)
ImportError: dlopen(/Users/ps/miniforge3/lib/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-darwin.so, 0x0002): Library not loaded: @rpath/libcblas.3.dylib
Referenced from: /Users/ps/miniforge3/lib/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-darwin.so
Reason: tried: '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/python3.9/site-packages/numpy/core/../../../../libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/python3.9/site-packages/numpy/core/../../../../libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/bin/../lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/bin/../lib/libcblas.3.dylib' (no such file), '/usr/local/lib/libcblas.3.dylib' (no such file), '/usr/lib/libcblas.3.dylib' (no such file)
During handling of the above exception, another exception occurred:
ImportError Traceback (most recent call last)
/var/folders/yp/mq9ddgh54gjg2rp7mw_t015c0000gn/T/ipykernel_95218/3793406994.py in <module>
----> 1 import tensorflow as tf
~/miniforge3/lib/python3.9/site-packages/tensorflow/__init__.py in <module>
39 import sys as _sys
40
---> 41 from tensorflow.python.tools import module_util as _module_util
42 from tensorflow.python.util.lazy_loader import LazyLoader as _LazyLoader
43
~/miniforge3/lib/python3.9/site-packages/tensorflow/python/__init__.py in <module>
39
40 from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
---> 41 from tensorflow.python.eager import context
42
43 # pylint: enable=wildcard-import
~/miniforge3/lib/python3.9/site-packages/tensorflow/python/eager/context.py in <module>
28
29 from absl import logging
---> 30 import numpy as np
31 import six
32
~/miniforge3/lib/python3.9/site-packages/numpy/__init__.py in <module>
138 from . import _distributor_init
139
--> 140 from . import core
141 from .core import *
142 from . import compat
~/miniforge3/lib/python3.9/site-packages/numpy/core/__init__.py in <module>
46 """ % (sys.version_info[0], sys.version_info[1], sys.executable,
47 __version__, exc)
---> 48 raise ImportError(msg)
49 finally:
50 for envkey in env_added:
ImportError:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.9 from "/Users/ps/miniforge3/bin/python"
* The NumPy version is: "1.19.5"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: dlopen(/Users/ps/miniforge3/lib/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-darwin.so, 0x0002): Library not loaded: @rpath/libcblas.3.dylib
Referenced from: /Users/ps/miniforge3/lib/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-darwin.so
Reason: tried: '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/python3.9/site-packages/numpy/core/../../../../libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/python3.9/site-packages/numpy/core/../../../../libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/bin/../lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/lib/libcblas.3.dylib' (no such file), '/Users/ps/miniforge3/bin/../lib/libcblas.3.dylib' (no such file), '/usr/local/lib/libcblas.3.dylib' (no such file), '/usr/lib/libcblas.3.dylib' (no such file)
ML Compute APIs - https://developer.apple.com/documentation/mlcompute are in Swift. Are there C APIs for ML Compute?