Post not yet marked as solved
I just got my new MacBook Pro with an M1 Max chip and am setting up Python. I've tried several combinations of settings to test speed, and now I'm quite confused. First, my questions:
Why is Python running natively on the M1 Max dramatically (~100%) slower than on my old MacBook Pro 2016 with an Intel i5?
On the M1 Max, why is there no significant speed difference between running natively (via Miniforge) and running via Rosetta (via Anaconda), which is supposed to be ~20% slower?
On the M1 Max with a native run, why is there no significant speed difference between conda-installed NumPy and TensorFlow-installed NumPy, which is supposed to be faster?
On the M1 Max, why is running in the PyCharm IDE consistently ~20% slower than running from the terminal? This doesn't happen on my old Intel Mac.
Evidence supporting my questions is as follows:
Here are the settings I've tried:
1. Python installed by
Miniforge (arm64), so that Python runs natively on the M1 Max chip. (Checked in Activity Monitor: the Kind of the python process is Apple.)
Anaconda: Python then runs via Rosetta. (Checked in Activity Monitor: the Kind of the python process is Intel.)
2. Numpy installed by
conda install numpy: NumPy from the original conda-forge channel, or pre-installed with Anaconda.
Apple TensorFlow: with Python installed by Miniforge, I install TensorFlow directly, and NumPy is installed along with it. NumPy installed this way is said to be optimized for Apple M1 and should be faster. Here are the installation commands:
conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install tensorflow-metal
3. Run from
Terminal.
PyCharm (Apple Silicon version).
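As a quick way to double-check the first setting above (native vs Rosetta Python) without opening Activity Monitor, you can ask the interpreter itself. A small sketch; note that the sysctl.proc_translated key is macOS-specific and absent on Intel Macs and other platforms:

```python
import platform
import subprocess

# 'arm64' indicates a native Apple Silicon Python; a Python running
# under Rosetta 2 reports 'x86_64' even on an M1 machine.
print(platform.machine())

# On Apple Silicon macOS, sysctl.proc_translated is 1 when the current
# process is translated by Rosetta and 0 when native; the key does not
# exist on Intel Macs or non-macOS systems.
try:
    out = subprocess.run(['sysctl', '-n', 'sysctl.proc_translated'],
                         capture_output=True, text=True, check=True)
    print('Rosetta-translated:', out.stdout.strip() == '1')
except (FileNotFoundError, subprocess.CalledProcessError):
    print('sysctl.proc_translated not available on this system')
```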
Here is the test code:
import time
import numpy as np

np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10

timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)

print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')
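As a side note on the timing itself: time.time() tracks the wall clock and can jump if the system clock is adjusted, whereas time.perf_counter() is monotonic and high-resolution, which makes it the safer choice for micro-benchmarks. A minimal sketch of the same loop as a reusable helper (the name bench is my own):

```python
import time

def bench(fn, repeats=10):
    """Run fn() `repeats` times and return the mean wall time in seconds.

    time.perf_counter() is monotonic and high-resolution, so it is not
    affected by system clock adjustments the way time.time() can be.
    """
    costs = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        costs.append(time.perf_counter() - t0)
    return sum(costs) / len(costs)

print(f'{bench(lambda: sum(range(10_000))):.6f}s')
```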
and here are the results:
+-----------------------------------+-----------------------+-----------------------+
| Python installed by (run on) →    | Miniforge (native M1) | Anaconda (Rosetta)    |
+----------------------+------------+-----------+-----------+-----------+-----------+
| Numpy installed by ↓ | Run from → | Terminal  | PyCharm   | Terminal  | PyCharm   |
+----------------------+------------+-----------+-----------+-----------+-----------+
| Apple TensorFlow                  | 4.19151   | 4.86248   | /         | /         |
+-----------------------------------+-----------+-----------+-----------+-----------+
| conda install numpy               | 4.29386   | 4.98370   | 4.10029   | 4.99271   |
+-----------------------------------+-----------+-----------+-----------+-----------+
This is quite slow. For comparison,
running the same code on my old MacBook Pro 2016 with the i5 chip takes 2.39917s.
another post reports that on an M1 chip (not Pro or Max), miniforge + conda-installed NumPy takes 2.53214s, and miniforge + Apple TensorFlow NumPy takes 1.00613s.
you may also try it on your own machine.
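For anyone comparing the two NumPy installs, it may also help to check which BLAS/LAPACK backend each NumPy build is actually linked against: the conda-forge build typically links OpenBLAS, while an Accelerate (vecLib)-linked build is the one usually said to be fast on Apple Silicon. A small diagnostic sketch:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build links against
# (look for 'openblas' vs 'accelerate'/'vecLib' in the output).
np.show_config()
print('NumPy version:', np.__version__)
```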
Here is the CPU information details:
My old i5:
$ sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Intel(R) Core(TM) i5-6360U CPU @ 2.00GHz
machdep.cpu.core_count: 2
My new M1 Max:
% sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Apple M1 Max
machdep.cpu.core_count: 10
I followed the tutorial instructions strictly, so why does all of this happen? Is it due to flaws in my installation, or to the M1 Max chip itself? Since my work relies heavily on local runs, local speed is very important to me. Any suggestions for a possible solution, or any data points from your own device, would be greatly appreciated :)
Post not yet marked as solved
My MacBook Pro is heating up with a similar issue: the left-side cores are overloaded while the right-side cores are barely utilised.
Post not yet marked as solved
I am running a test model on my MBP M1 Pro and the GPU clock speed never goes above ~450 MHz (GPU cores are at 100%). Using other apps that peg the GPU, I can see the clock speed reach about 1.3 GHz.
Is this an issue with tf-metal, or am I doing something wrong?
Post not yet marked as solved
Hi all, I've spent some time experimenting with the BNNS (Accelerate) LSTM-related APIs lately. Despite a distinct lack of documentation (even though the headers contain quite a bit), I got most things to a point where I think I know what's going on, and I get the expected results.
However, one thing I have not been able to do is to get this working if inputSize != hiddenSize.
I am currently only concerned with a simple unidirectional LSTM with a single layer, but none of my permutations of the gate "iw_desc" matrices with various 2D layouts, or of reordering input-size/hidden-size, made any difference; ultimately BNNSDirectApplyLSTMBatchTrainingCaching always returns -1 as an indication of error.
Any help would be greatly appreciated.
PS: The bnns.h framework header claims that "When a parameter is invalid or an internal error occurs, an error message will be logged. Some combinations of parameters may not be supported. In that case, an info message will be logged." And yet, I've not been able to find any such messages via NSLog(), stderr, or Console. Is there a magic environment variable that I need to set to get more verbose logging?
MLCustomLayer implementation always dispatches to CPU instead of GPU
Background:
I am trying to run my CoreML model with a custom layer on the iPhone 13 Pro. My custom layer runs successfully on the CPU, but it always dispatches to the CPU instead of the phone's GPU, despite the encodeToCommandBuffer member function being defined in the application's binding class for the custom layer.
I have been following the CoreMLTools documentation's suggested Swift example to get this working, but note that my implementation is purely in Objective-C++.
Despite reading in depth into the documentation, I still have not come across any resolution to the problem. Any help looking into this issue (or perhaps even bug in CoreML) would be much appreciated!
Below, I provide a minimal example based off of the Swift example mentioned above.
Implementation
My toy Objective-C++ implementation is based on the Swift example here. It implements the Swish activation function for both the CPU and GPU.
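For reference, Swish is defined as swish(x) = x * sigmoid(x); a tiny plain-Python version one could use to sanity-check the layer's outputs (the function name is my own):

```python
import math

def swish(x):
    """Reference Swish activation: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + math.exp(-x)))

print(swish(0.0))  # sigmoid(0) = 0.5, so swish(0) = 0 * 0.5 = 0.0
print(swish(1.0))
```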
PyTorch model to CoreML MLModel transformation
For brevity, I will not define my toy PyTorch model, nor the Python bindings that allow the custom Swish layer to be scripted/traced and then converted to a CoreML MLModel, but I can provide these if necessary. Just note that the Python layer's name and bindings should match the name in the class defined below, i.e. ToySwish.
To convert the scripted/traced PyTorch model (called torchscript_model in the listing below) to a CoreML MLModel, I use CoreMLTools (from Python) and then save the model as follows:
input_shapes = [[1, 64, 256, 256]]
mlmodel = coremltools.converters.convert(
    torchscript_model,
    source='pytorch',
    inputs=[coremltools.TensorType(name=f'input_{i}', shape=input_shape)
            for i, input_shape in enumerate(input_shapes)],
    add_custom_layers=True,
    minimum_deployment_target=coremltools.target.iOS14,
    compute_units=coremltools.ComputeUnit.CPU_AND_GPU,
)
mlmodel.save('toy_swish_model.mlmodel')
Metal shader
I use the same Metal shader function swish from Swish.metal here.
MLCustomLayer binding class for Swish MLModel layer
I define an Objective-C++ class analogous to the Swift example. The class inherits from NSObject and adopts the MLCustomLayer protocol, following the guidelines in the Apple documentation for integrating a CoreML MLModel with a custom layer. It is defined as follows:
Class definition and resource setup:
#import <Foundation/Foundation.h>
#import <CoreML/CoreML.h>
#import <Metal/Metal.h>

@interface ToySwish : NSObject <MLCustomLayer>
@end

@implementation ToySwish {
    id<MTLComputePipelineState> swishPipeline;
}

- (instancetype)initWithParameterDictionary:(NSDictionary<NSString *, id> *)parameters
                                      error:(NSError *__autoreleasing _Nullable *)error {
    NSError *errorPSO = nil;
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLLibrary> defaultLibrary = [device newDefaultLibrary];
    id<MTLFunction> swishFunction = [defaultLibrary newFunctionWithName:@"swish"];
    swishPipeline = [device newComputePipelineStateWithFunction:swishFunction error:&errorPSO];
    assert(errorPSO == nil);
    return self;
}

- (BOOL)setWeightData:(NSArray<NSData *> *)weights
                error:(NSError *__autoreleasing _Nullable *)error {
    return YES;
}

- (NSArray<NSArray<NSNumber *> *> *)outputShapesForInputShapes:(NSArray<NSArray<NSNumber *> *> *)inputShapes
                                                         error:(NSError *__autoreleasing _Nullable *)error {
    return inputShapes;
}
CPU compute method (shown here only for completeness):
- (BOOL)evaluateOnCPUWithInputs:(NSArray<MLMultiArray *> *)inputs
                        outputs:(NSArray<MLMultiArray *> *)outputs
                          error:(NSError *__autoreleasing _Nullable *)error {
    NSLog(@"Dispatching to CPU");
    for (NSInteger i = 0; i < inputs.count; i++) {
        NSInteger num_elems = inputs[i].count;
        float *input_ptr = (float *)inputs[i].dataPointer;
        float *output_ptr = (float *)outputs[i].dataPointer;
        for (NSInteger j = 0; j < num_elems; j++) {
            output_ptr[j] = 1.0f / (1.0f + expf(-input_ptr[j]));
        }
    }
    return YES;
}
Encode GPU commands to the command buffer:
Note: according to the documentation, this command buffer should not be committed, as it is executed by CoreML after this method returns.
- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer
                       inputs:(NSArray<id<MTLTexture>> *)inputs
                      outputs:(NSArray<id<MTLTexture>> *)outputs
                        error:(NSError *__autoreleasing _Nullable *)error {
    NSLog(@"Dispatching to GPU");
    id<MTLComputeCommandEncoder> computeEncoder =
        [commandBuffer computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(computeEncoder != nil);
    for (NSUInteger i = 0; i < inputs.count; i++) {
        [computeEncoder setComputePipelineState:swishPipeline];
        [computeEncoder setTexture:inputs[i] atIndex:0];
        [computeEncoder setTexture:outputs[i] atIndex:1];
        NSInteger w = swishPipeline.threadExecutionWidth;
        NSInteger h = swishPipeline.maxTotalThreadsPerThreadgroup / w;
        MTLSize threadGroupSize = MTLSizeMake(w, h, 1);
        // Ceiling division: enough threadgroups to cover every texel.
        NSInteger groupWidth  = (inputs[0].width + threadGroupSize.width - 1) / threadGroupSize.width;
        NSInteger groupHeight = (inputs[0].height + threadGroupSize.height - 1) / threadGroupSize.height;
        NSInteger groupDepth  = (inputs[0].arrayLength + threadGroupSize.depth - 1) / threadGroupSize.depth;
        MTLSize threadGroups = MTLSizeMake(groupWidth, groupHeight, groupDepth);
        // threadGroups is a threadgroup count, so use dispatchThreadgroups:
        // (dispatchThreads: expects a total-thread grid size instead).
        [computeEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup:threadGroupSize];
    }
    // End encoding once, after all inputs have been encoded.
    [computeEncoder endEncoding];
    return YES;
}

@end
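The threadgroup-count arithmetic above is just ceiling division; expressed in Python for illustration:

```python
def ceil_div(n, d):
    """Ceiling division via the (n + d - 1) // d idiom, as used above to
    compute how many threadgroups are needed to cover each dimension."""
    return (n + d - 1) // d

# e.g. a 300-texel-wide texture with 32-thread-wide groups needs 10 groups
print(ceil_div(300, 32))
```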
Run inference for a given input
The MLModel is loaded and compiled in the application. I check that the model configuration's computeUnits are set to MLComputeUnitsAll, as desired (this should allow the MLModel's layers to be dispatched to the CPU, GPU and ANE).
I define an MLDictionaryFeatureProvider object called feature_provider from an NSMutableDictionary of input features (input tensors in this case), and then pass it to the predictionFromFeatures method of my loaded model as follows:
@autoreleasepool {
[model predictionFromFeatures:feature_provider error:error];
}
This computes a single forward pass of my model. When it executes, the 'Dispatching to CPU' string is printed instead of 'Dispatching to GPU'. This (along with the slow execution time) indicates that the Swish layer is being run via the evaluateOnCPUWithInputs method, and thus on the CPU, instead of on the GPU as expected.
I am quite new to developing for iOS and to Objective-C++, so I may have missed something quite simple; however, from reading the documentation and examples, it is not at all clear to me what the issue is. Any help or advice would be really appreciated :)
Environment
Xcode 13.1
iPhone 13
iOS 15.1.1
iOS deployment target 15.0
Post not yet marked as solved
Hi Apple devs, can you confirm whether macOS 10.13 supports Core ML inference on the GPU?
I ran a test on macOS 10.13.6 and found that my Core ML model inference only uses the CPU (BNNS), with no GPU usage.
Can anyone confirm this?
Thanks
Post not yet marked as solved
I tried Create ML to train on the MNIST dataset, which consists of very small images of the digits 0-9. It's the first time I've used Create ML, but its training speed is far too slow given what I've learnt: MNIST is a very small dataset.
I am using a MacBook Pro 2021, 16 inch, with M1 pro + 16GB ram + 1TB SSD.
I checked Activity Monitor and saw that CPU usage reaches 100%.
14 of 16 GB of memory are used: 2 GB for cache, with 12.5 GB of swap in use. Memory used by the MLRecipeExecutionService process is 19.55 GB; if I double-click to see the details, the Virtual Memory Size is 410 GB.
I ran sudo powermetrics and observed that GPU power is ~50-60 mW, which means the GPU is not being used for training.
When I checked disk usage in Activity Monitor, I saw that the MLRecipeExecutionService process had written 1.1 TB (Bytes Written). The entire MNIST dataset is only 17.5 MB.
I don't understand why it's so slow and why so many resources are used. Based on what I've learnt about machine learning, this is irregular.
Post not yet marked as solved
I am trying to train a pretrained network via transfer learning in Create ML with approx. 4500 images with bounding boxes. Create ML stopped after 3700 iterations, allocated all the memory, and did nothing. If I pause it, Create ML releases the RAM, and after that I can continue running the training. I have a Mac mini (64 GB RAM, eGPU Radeon 580 8 GB, i5 6-core), Big Sur, Create ML 3.0 (78.5).
Post not yet marked as solved
Hello everyone,
We are GPU developers who use PTX / GCN / RDNA ISA to develop our software.
Is there any asm-level reference available for the Neural Engine and GPU, so we can write our own custom device code and get it built and running?
Not sure if this is a normal request. I understand there are high-level libraries such as FFT / BLAS / Accelerate available, but we need to go lower in order to implement our own technology features, which solve some specific problems we need to address before rolling out our product for the macOS platform.
Post not yet marked as solved
Not long ago I updated to Xcode Version 13.0 (13A233), and I have noticed that when I do a clean or build of my project, the processor goes to 100% on all my cores. I have a 2.3 GHz 8-core Intel Core i9 processor:
Post not yet marked as solved
I have a model I developed in TensorFlow 2.3 and then converted to an MLModel with coremltools. I also reduced it to fp16. The model works great on most iOS devices, but on a few, particularly the iPhone 11 Pro Max (A2218), it gives NaN errors on the Neural Engine; if the same model is run on the CPU/GPU, there are no issues. I also tried the fp32 version of the model and got the same results: NaN on the Neural Engine, but working great with CPU/GPU. Thoughts or suggestions?
Post not yet marked as solved
I would like to generate and run an ML program inside an app.
I am familiar with coremltools and the MIL format; however, I can't seem to find any resources on how to generate mlmodel/mlpackage files using Swift on the device.
Is there any Swift equivalent of coremltools? Or is there a way to translate a MIL description of an ML program into an instance of MLModel? Or something similar.
Post not yet marked as solved
Using an LSTM model for finance predictions, I found these benchmark results:
TF 2.7 GPU - 188 Seconds (tensorflow-metal 0.1.2)
TF 2.5 GPU - 149 Seconds (tensorflow-metal 0.1.2)
The slowness is expected due to a small batch size.
TF 2.5 CPU - 6.91 Seconds
TF 2.5 CPU - 4.66 Seconds (added disable_eager_execution())
TF 2.7 CPU - 4.47 Seconds
So TF 2.7 (master) is about 4% faster using the CPU. The Metal plugin is way slower with TF 2.7, but at least it works to enable the GPU.
Apple should make the sources for tensorflow-metal available on git and keep them updated for each main TF release, like the current 2.6.
Post not yet marked as solved
Can you please publish the sources for tensorflow-metal on git? Thanks.
Post not yet marked as solved
My M1 MBP is not using the full GPU power when running TensorFlow. It's taking about 6 seconds per epoch, when for the same task others get 1 second per epoch.
When running the training, the Mac doesn't get hot and the fans don't rev.
Any advice?
Thanks, Logan
Hi, I would love to code for the Neural Engine on my MacBook Pro M1 2020.
Is there any low-level API to create my very own work-loads?
I am working with audio and MIDI, as well as sound synthesis and mixing.
Can I use the Neural Engine to offload the CPU? I am especially interested in parallelism using threads.
My programming languages of choice are ANSI C and Objective-C.
Post not yet marked as solved
ML Compute APIs - https://developer.apple.com/documentation/mlcompute are in Swift. Are there C APIs for ML Compute?
Post not yet marked as solved
Can I run inference on the new MacBook Pro with the M1 chip (Apple Silicon) using Keras models (and sometimes PyTorch)? These would be computer vision models, some with custom loss functions or metrics, trained on, let's say, Google Colab.
If I can perform inference, how do I do that?
Also, will the Neural Engine help while performing inference, and will it boost training if I have to train on the Mac?