ML Compute


Accelerate training and validation of neural networks using the CPU and GPUs.

ML Compute Documentation

Posts under ML Compute tag

38 Posts
Post not yet marked as solved
6 Replies
14k Views
I just got my new MacBook Pro with an M1 Max chip and am setting up Python. I've tried several combinations of settings to test speed, and now I'm quite confused. First, my questions:

1. Why does Python running natively on the M1 Max run greatly (~100%) slower than on my old MacBook Pro 2016 with an Intel i5?
2. On the M1 Max, why is there no significant speed difference between a native run (via Miniforge) and a run via Rosetta (via Anaconda), which is supposed to be ~20% slower?
3. On the M1 Max with a native run, why is there no significant speed difference between conda-installed NumPy and TensorFlow-installed NumPy, which is supposed to be faster?
4. On the M1 Max, why are runs in the PyCharm IDE consistently ~20% slower than runs from the terminal? This doesn't happen on my old Intel Mac.

Evidence supporting my questions follows. Here are the settings I've tried:

1. Python installed by Miniforge-arm64, so that Python runs natively on the M1 Max chip (in Activity Monitor, the Kind of the python process is Apple), versus Anaconda, where Python runs via Rosetta (in Activity Monitor, the Kind of the python process is Intel).
2. NumPy installed by conda install numpy (NumPy from the original conda-forge channel, or pre-installed with Anaconda), versus Apple TensorFlow: with Python installed by Miniforge, I install TensorFlow directly, and NumPy is installed along with it. NumPy installed this way is said to be optimized for Apple M1 and faster. Here are the installation commands:

conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install tensorflow-metal

3. Run from the Terminal, versus PyCharm (Apple Silicon version).
Here is the test code:

import time
import numpy as np

np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10

timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)

print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')

And here are the results (mean runtime in seconds):

+-----------------------------------+-----------------------+--------------------+
| Python installed by (run on) →    | Miniforge (native M1) | Anaconda (Rosetta) |
+----------------------+------------+------------+----------+----------+---------+
| NumPy installed by ↓ | Run from → | Terminal   | PyCharm  | Terminal | PyCharm |
+----------------------+------------+------------+----------+----------+---------+
| Apple TensorFlow                  | 4.19151    | 4.86248  | /        | /       |
+-----------------------------------+------------+----------+----------+---------+
| conda install numpy               | 4.29386    | 4.98370  | 4.10029  | 4.99271 |
+-----------------------------------+------------+----------+----------+---------+

This is quite slow. For comparison, the same code on my old MacBook Pro 2016 with an i5 chip takes 2.39917 s. Another post reports that on an M1 chip (not Pro or Max), Miniforge + conda-installed NumPy takes 2.53214 s, and Miniforge + Apple TensorFlow NumPy takes 1.00613 s. You may also try it on your own machine.

Here are the CPU details.

My old i5:

$ sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Intel(R) Core(TM) i5-6360U CPU @ 2.00GHz
machdep.cpu.core_count: 2

My new M1 Max:

% sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Apple M1 Max
machdep.cpu.core_count: 10

I followed the instructions from the tutorials strictly, so why is all of this happening? Is it because of flaws in my installation, or because of the M1 Max chip? Since my work relies heavily on local runs, local speed is very important to me.
Any suggestions for a possible solution, or any data points from your own device, would be greatly appreciated :)
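A side note on methodology (not part of the original post): time.time() has limited resolution and a single mean hides run-to-run variance; a small stdlib-only helper using timeit (which uses time.perf_counter under the hood) reports both mean and spread. The bench name and the example workload are just illustrations:

```python
import timeit
import statistics

def bench(fn, repeats=10):
    """Run fn `repeats` times and return (mean, stdev) of wall times in seconds."""
    times = timeit.repeat(fn, number=1, repeat=repeats)
    return statistics.mean(times), statistics.stdev(times)

# Example workload: a pure-Python loop (stand-in for the NumPy/SVD test above).
mean, stdev = bench(lambda: sum(i * i for i in range(100_000)))
print(f'mean {mean:.5f}s ± {stdev:.5f}s over 10 runs')
```

Reporting the spread alongside the mean makes it easier to tell a real ~20% IDE-vs-terminal difference from ordinary run-to-run noise.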
Posted
by
Post not yet marked as solved
2 Replies
545 Views
I am running a test model on my MBP with M1 Pro and the GPU clock speed never goes above ~450 MHz (GPU cores are at 100%). Using other apps that peg the GPU, I can see the clock speed is about 1.3 GHz. Is this an issue with tf-metal, or am I doing something wrong? FR
Posted
by
Post not yet marked as solved
0 Replies
311 Views
Hi all, I've spent some time experimenting with the BNNS (Accelerate) LSTM-related APIs lately, and despite a distinct lack of documentation (even though the headers have quite a few comments) I got most things to a point where I think I know what's going on, and I get the expected results. However, one thing I have not been able to do is get this working when inputSize != hiddenSize. I am currently only concerned with a simple unidirectional LSTM with a single layer, but none of my permutations of the gate "iw_desc" matrices with various 2D layouts and reorderings of input-size/hidden-size made any difference; ultimately BNNSDirectApplyLSTMBatchTrainingCaching always returns -1 as an indication of error. Any help would be greatly appreciated. PS: The bnns.h framework header claims that "When a parameter is invalid or an internal error occurs, an error message will be logged. Some combinations of parameters may not be supported. In that case, an info message will be logged." And yet, I've not been able to find any such messages logged via NSLog(), stderr, or Console. Is there a magic environment variable that I need to set to get more verbose logging?
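For reference (this is plain Python, not the BNNS API): mathematically nothing prevents inputSize != hiddenSize in an LSTM; the input weights just have shape (4*hiddenSize) x inputSize while the recurrent weights are (4*hiddenSize) x hiddenSize, which is a useful sanity check when arranging the gate weight descriptors. A minimal single-step sketch, with made-up names and the conventional i, f, g, o gate order:

```python
import math
import random

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step. W: (4*H) x I input weights, U: (4*H) x H recurrent
    weights, b: 4*H biases; gate order i, f, g, o."""
    H = len(h)
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    # pre-activations z = W @ x + U @ h + b
    z = [sum(W[r][k] * x[k] for k in range(len(x)))
         + sum(U[r][k] * h[k] for k in range(H))
         + b[r] for r in range(4 * H)]
    i = [sig(v) for v in z[0:H]]
    f = [sig(v) for v in z[H:2 * H]]
    g = [math.tanh(v) for v in z[2 * H:3 * H]]
    o = [sig(v) for v in z[3 * H:4 * H]]
    c_new = [f[k] * c[k] + i[k] * g[k] for k in range(H)]
    h_new = [o[k] * math.tanh(c_new[k]) for k in range(H)]
    return h_new, c_new

# inputSize (I) deliberately different from hiddenSize (H)
I, H = 3, 5
rnd = random.Random(0)
W = [[rnd.uniform(-1, 1) for _ in range(I)] for _ in range(4 * H)]
U = [[rnd.uniform(-1, 1) for _ in range(H)] for _ in range(4 * H)]
b = [0.0] * (4 * H)
h, c = lstm_cell([1.0] * I, [0.0] * H, [0.0] * H, W, U, b)
print(len(h), len(c))  # prints "5 5": both outputs have hiddenSize elements
```

Only the input-weight matrices depend on inputSize; the state, recurrent weights, and biases are all sized by hiddenSize alone.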
Posted
by
Post marked as solved
3 Replies
760 Views
MLCustomLayer implementation always dispatches to CPU instead of GPU

Background: I am trying to run my CoreML model with a custom layer on the iPhone 13 Pro. My custom layer runs successfully on the CPU, but it still dispatches to the CPU instead of the phone's GPU, despite the encodeToCommandBuffer member function being defined in the application's binding class for the custom layer. I have been following the CoreMLTools documentation's suggested Swift example to get this working, but note that my implementation is purely in Objective-C++. Despite reading the documentation in depth, I still have not come across any resolution to the problem. Any help looking into this issue (or perhaps even a bug in CoreML) would be much appreciated! Below, I provide a minimal example based on the Swift example mentioned above.

Implementation

My toy Objective-C++ implementation is based on the Swift example here. It implements the Swish activation function for both the CPU and GPU.

PyTorch model to CoreML MLModel transformation

For brevity, I will not define my toy PyTorch model, nor the Python bindings that allow the custom Swish layer to be scripted/traced and then converted to a CoreML MLModel, but I can provide these if necessary. Just note that the Python layer's name and bindings should match the name in the class defined below, i.e. ToySwish.
To convert the scripted/traced PyTorch model (called torchscript_model in the listing below) to a CoreML MLModel, I use CoreMLTools (from Python) and then save the model as follows:

input_shapes = [[1, 64, 256, 256]]
mlmodel = coremltools.converters.convert(
    torchscript_model,
    source='pytorch',
    inputs=[coremltools.TensorType(name=f'input_{i}', shape=input_shape)
            for i, input_shape in enumerate(input_shapes)],
    add_custom_layers=True,
    minimum_deployment_target=coremltools.target.iOS14,
    compute_units=coremltools.ComputeUnit.CPU_AND_GPU,
)
mlmodel.save('toy_swish_model.mlmodel')

Metal shader

I use the same Metal shader function swish from Swish.metal here.

MLCustomLayer binding class for the Swish MLModel layer

I define an Objective-C++ class analogous to the Swift example. The class inherits from NSObject and the MLCustomLayer protocol, and follows the guidelines in the Apple documentation for integrating a CoreML MLModel with a custom layer. It is defined as follows.

Class definition and resource setup:

#import <Foundation/Foundation.h>
#import <CoreML/CoreML.h>
#import <Metal/Metal.h>

@interface ToySwish : NSObject<MLCustomLayer>
@end

@implementation ToySwish{
  id<MTLComputePipelineState> swishPipeline;
}

- (instancetype)initWithParameterDictionary:(NSDictionary<NSString *, id> *)parameters error:(NSError *__autoreleasing _Nullable *)error{
  NSError *errorPSO = nil;
  id<MTLDevice> device = MTLCreateSystemDefaultDevice();
  id<MTLLibrary> defaultLibrary = [device newDefaultLibrary];
  id<MTLFunction> swishFunction = [defaultLibrary newFunctionWithName:@"swish"];
  swishPipeline = [device newComputePipelineStateWithFunction:swishFunction error:&errorPSO];
  assert(errorPSO == nil);
  return self;
}

- (BOOL)setWeightData:(NSArray<NSData *> *)weights error:(NSError *__autoreleasing _Nullable *)error{
  return YES;
}

- (NSArray<NSArray<NSNumber *> *> *)outputShapesForInputShapes:(NSArray<NSArray<NSNumber *> *> *)inputShapes error:(NSError *__autoreleasing _Nullable *)error{
  return inputShapes;
}

CPU compute method (shown only for completeness):

- (BOOL)evaluateOnCPUWithInputs:(NSArray<MLMultiArray *> *)inputs outputs:(NSArray<MLMultiArray *> *)outputs error:(NSError *__autoreleasing _Nullable *)error{
  NSLog(@"Dispatching to CPU");
  for(NSInteger i = 0; i < inputs.count; i++){
    NSInteger num_elems = inputs[i].count;
    float *input_ptr = (float *)inputs[i].dataPointer;
    float *output_ptr = (float *)outputs[i].dataPointer;
    for(NSInteger j = 0; j < num_elems; j++){
      // swish(x) = x * sigmoid(x)
      output_ptr[j] = input_ptr[j] / (1.0f + expf(-input_ptr[j]));
    }
  }
  return YES;
}

Encode GPU commands to the command buffer. Note that, according to the documentation, this command buffer should not be committed, as it is executed by CoreML after this method returns:

- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer inputs:(NSArray<id<MTLTexture>> *)inputs outputs:(NSArray<id<MTLTexture>> *)outputs error:(NSError *__autoreleasing _Nullable *)error{
  NSLog(@"Dispatching to GPU");
  id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
  assert(computeEncoder != nil);
  for(NSUInteger i = 0; i < inputs.count; i++){
    [computeEncoder setComputePipelineState:swishPipeline];
    [computeEncoder setTexture:inputs[i] atIndex:0];
    [computeEncoder setTexture:outputs[i] atIndex:1];
    NSInteger w = swishPipeline.threadExecutionWidth;
    NSInteger h = swishPipeline.maxTotalThreadsPerThreadgroup / w;
    MTLSize threadGroupSize = MTLSizeMake(w, h, 1);
    NSInteger groupWidth = (inputs[i].width + threadGroupSize.width - 1) / threadGroupSize.width;
    NSInteger groupHeight = (inputs[i].height + threadGroupSize.height - 1) / threadGroupSize.height;
    NSInteger groupDepth = (inputs[i].arrayLength + threadGroupSize.depth - 1) / threadGroupSize.depth;
    MTLSize threadGroups = MTLSizeMake(groupWidth, groupHeight, groupDepth);
    // threadGroups holds threadgroup counts, so dispatchThreadgroups is the right call
    // (dispatchThreads would expect the total grid size in threads)
    [computeEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup:threadGroupSize];
  }
  [computeEncoder endEncoding];
  return YES;
}

Run inference for a given input

The MLModel is loaded and compiled in the application. I check that the model configuration's computeUnits are set to MLComputeUnitsAll, which should allow the MLModel's layers to be dispatched to the CPU, GPU, and ANE. I define an MLDictionaryFeatureProvider object called feature_provider from an NSMutableDictionary of input features (input tensors in this case), and then pass it to the predictionFromFeatures method of my loaded model as follows:

@autoreleasepool {
  [model predictionFromFeatures:feature_provider error:error];
}

This computes a single forward pass of my model. When it executes, the 'Dispatching to CPU' string is printed instead of the 'Dispatching to GPU' string. This (along with the slow execution time) indicates that the Swish layer is being run from the evaluateOnCPUWithInputs method, and thus on the CPU, instead of on the GPU as expected. I am quite new to developing for iOS and to Objective-C++, so I might have missed something quite simple, but from reading the documentation and examples it is not at all clear to me what the issue is. Any help or advice would be really appreciated :)

Environment
Xcode 13.1
iPhone 13
iOS 15.1.1
iOS deployment target 15.0
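Incidentally, the grid-sizing arithmetic in encodeToCommandBuffer can be checked in isolation. This standalone sketch (not CoreML or Metal code, just the same ceil-division pattern; the threadgroups name is illustrative) shows how many threadgroups are needed to cover a texture of a given size:

```python
def threadgroups(width, height, array_length, tg_w, tg_h, tg_d=1):
    """Ceil-divide each texture dimension by the threadgroup size,
    mirroring the (n + size - 1) / size pattern in the encoder above."""
    ceil_div = lambda n, d: (n + d - 1) // d
    return (ceil_div(width, tg_w), ceil_div(height, tg_h), ceil_div(array_length, tg_d))

# e.g. a 256x256 texture with 16 array slices and a 32x32 threadgroup
print(threadgroups(256, 256, 16, 32, 32))  # prints (8, 8, 16)
```

Note that Metal's dispatchThreadgroups: expects these per-dimension group counts, whereas dispatchThreads: expects the total grid size in threads; mixing the two up is a common source of partially processed textures.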
Posted
by
Post not yet marked as solved
0 Replies
233 Views
Hi Apple devs, can you confirm whether macOS 10.13 supports Core ML inference on the GPU? I ran a test on macOS 10.13.6 and found that my Core ML model was inferring only on the CPU (BNNS), with no GPU use. Can anyone confirm this and give me an answer? Thanks
Posted
by
Post not yet marked as solved
0 Replies
428 Views
I tried Create ML to train on the MNIST dataset, which consists of very small images of the digits 0-9. This is the first time I've used Create ML, but its training speed is still far too slow given what I've learned: MNIST is a very small dataset. I am using a 16-inch MacBook Pro 2021 with an M1 Pro, 16 GB RAM, and a 1 TB SSD. In Activity Monitor the CPU reaches 100%; 14 of 16 GB of memory is used, with 2 GB for cache and 12.5 GB of swap used. Memory used by the MLRecipeExecutionService process is 19.55 GB, and if I double-click to see the details, the Virtual Memory Size is 410 GB. I ran sudo powermetrics and observed that GPU power is ~50-60 mW, which means the GPU is not being used for training. Checking disk usage in Activity Monitor, I saw that the MLRecipeExecutionService process had contributed 1.1 TB of bytes written, while the entire MNIST dataset is only 17.5 MB. I don't understand why it's so slow and why so many resources are used; based on what I've learned about machine learning, this is irregular.
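To put the post's numbers in proportion (simple arithmetic, taking 1 TB = 1024^2 MB): 1.1 TB written against a 17.5 MB dataset is a write amplification of roughly 66,000x, which supports the observation that something irregular is going on:

```python
dataset_mb = 17.5                 # size of the MNIST dataset, from the post
written_mb = 1.1 * 1024 * 1024    # 1.1 TB written by MLRecipeExecutionService, in MB
amplification = written_mb / dataset_mb
print(f'{amplification:,.0f}x')   # prints "65,910x"
```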
Posted
by
Post not yet marked as solved
0 Replies
244 Views
I am trying to train a pretrained network via transfer learning in Create ML with approx. 4,500 images with bounding boxes. Create ML stopped after 3,700 iterations, allocated all of the memory, and then did nothing. If I pause it, Create ML deallocates the RAM, and afterwards I can continue running the training. I have a Mac mini with 64 GB RAM, an eGPU Radeon 580 8 GB, a 6-core i5, Big Sur, and Create ML 3.0 (78.5).
Posted
by
Post not yet marked as solved
1 Reply
743 Views
Hello everyone, we are GPU developers who use the PTX / GCN / RDNA ISAs to develop our software. Is there any asm-level reference available for the Neural Engine and GPU, so that we can write our own device code and get it built and running? I'm not sure whether this is a normal request. I understand that there are high-level libraries such as FFT / BLAS / Accelerate available, but we need to go lower down in order to implement our own technology features, which solve some specific problems we need to resolve before rolling out our product for the macOS platform.
Posted
by
Post not yet marked as solved
0 Replies
238 Views
Not long ago I updated to Xcode Version 13.0 (13A233), and I have noticed that when I do a clean or build of my project the processor goes to 100% on all my cores. I have a 2.3 GHz 8-core Intel Core i9 processor:
Posted
by
Post not yet marked as solved
0 Replies
367 Views
I have a model I developed in TensorFlow 2.3 and then converted to an MLModel with the coreml tools. I also reduced it to fp16. The model works great on most iOS devices, but on a few, particularly the iPhone 11 Pro Max (A2218), it gives a NaN error on the Neural Engine; if the same model is run on the CPU/GPU, there are no issues. I also tried the fp32 version of the model, and it has the same result: NaN on the Neural Engine, but it works great with the CPU/GPU. Thoughts or suggestions?
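Not an answer, but one plausible mechanism (the Neural Engine is generally understood to compute in float16 even when the model weights are float32): any intermediate value whose magnitude exceeds the half-precision maximum of 65504 cannot be represented, and downstream arithmetic on the resulting infinities produces NaN. A quick stdlib-only check of which magnitudes survive half precision, where fits_fp16 is just an illustrative helper built on struct's 'e' (IEEE 754 half) format:

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def fits_fp16(x):
    """True if x packs into a finite float16 (struct's 'e' format)."""
    try:
        struct.pack('e', x)
        return True
    except OverflowError:
        return False

print(fits_fp16(65504.0))           # True: exactly the half-precision max
print(fits_fp16(70000.0))           # False: overflows half precision
print(float('inf') - float('inf'))  # nan: how overflow turns into NaN downstream
```

If that is the cause here, constraining activation ranges (e.g. normalizing inputs or clipping large intermediates) before conversion is a common workaround, though whether it applies depends on the model.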
Posted
by
Post not yet marked as solved
2 Replies
647 Views
I would like to generate and run an ML program inside an app. I have become familiar with coremltools and the MIL format, but I can't seem to find any resources on how to generate mlmodel/mlpackage files using Swift on the device. Is there any Swift equivalent of coremltools? Or is there a way to translate a MIL description of an ML program into an instance of MLModel? Or something similar.
Posted
by
Post not yet marked as solved
0 Replies
579 Views
Using an LSTM model for finance predictions, I found these benchmark results:

TF 2.7 GPU - 188 seconds (tensorflow-metal 0.1.2)
TF 2.5 GPU - 149 seconds (tensorflow-metal 0.1.2)

The slowness is expected due to a small batch size.

TF 2.5 CPU - 6.91 seconds
TF 2.5 CPU - 4.66 seconds (with disable_eager_execution() added)
TF 2.7 CPU - 4.47 seconds

So TF 2.7 (master) is about 4% faster on the CPU. The Metal plugin is much slower with TF 2.7, but at least it works to enable the GPU. Apple should make the sources for tensorflow-metal available on git and ensure it is updated regularly for each TF main release, like the current 2.6.
Posted
by
Post not yet marked as solved
1 Reply
653 Views
My M1 MBP is not using the full GPU power when running TensorFlow. It's taking about 6 seconds per epoch, when for the same task others get 1 second per epoch. When running the training, the Mac doesn't get hot and the fans don't rev. Any advice? Thanks, Logan
Posted
by
Post marked as solved
4 Replies
2.8k Views
Hi, I would love to code for the Neural Engine on my MacBook Pro M1 2020. Is there any low-level API to create my very own workloads? I am working with audio and MIDI, as well as sound synthesis and mixing. Can I use the Neural Engine to offload the CPU? I am especially interested in parallelism using threads. My programming languages of choice are ANSI C and Objective-C.
Posted
by
Post not yet marked as solved
4 Replies
1k Views
The ML Compute APIs (https://developer.apple.com/documentation/mlcompute) are in Swift. Are there C APIs for ML Compute?
Posted
by
Post not yet marked as solved
4 Replies
11k Views
Can I run inference on the new MacBook Pro with the M1 chip (Apple Silicon) using Keras models (and sometimes PyTorch)? These would be computer vision models; some might have custom loss functions or metrics, and they would have been trained on, let's say, Google Colab. If I can perform inference, how do I do it? Also, will the Neural Engine help while performing inference, or will it boost training if I have to train on the Mac?
Posted
by