ML Compute


Accelerate training and validation of neural networks using the CPU and GPUs.

ML Compute Documentation

Posts under ML Compute tag

38 Posts
Post not yet marked as solved
1 Reply
312 Views
So I've read the documentation, downloaded the Accelerate source, and created a simple example. I'm attempting to solve a system of two equations, 90x + 85y = 400 and y − x = 0. The result should be just greater than 2.25 for both x and y. What I get is [x, y] = [2.2857144, 205.7143]. I'm new to this, so I'm sure I've misread the docs, but I can't see where. Here is the code I modified for my experiment:

```swift
do {
    let aValues: [Float] = [85, 90,
                            1, -1]
    /// The _b_ in _Ax = b_.
    let bValues: [Float] = [400, 0]
    /// Call `nonsymmetric_general` to compute the _x_ in _Ax = b_.
    let x = nonsymmetric_general(a: aValues,
                                 dimension: 2,
                                 b: bValues,
                                 rightHandSideCount: 1)
    /// Calculate _b_ using the computed _x_.
    if let x = x {
        let b = matrixVectorMultiply(matrix: aValues,
                                     dimension: (m: 2, n: 2),
                                     vector: x)
        print("\nx = ", x)
        print("\nb =", b)
    }
}
```

What did I misunderstand? Thanks.
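The reported result is exactly what you get if the solver reads aValues in column-major order, i.e. solves the transpose of the matrix the poster intended. A quick NumPy check (a hypothetical reproduction of the two interpretations, not the Accelerate call itself) shows both answers:

```python
import numpy as np

# The coefficients as supplied in the post, one intended equation per row.
a_values = np.array([85.0, 90.0, 1.0, -1.0])
b = np.array([400.0, 0.0])

# Row-major (C order) reading, as the poster intended:
row_major = a_values.reshape(2, 2)              # [[85, 90], [ 1, -1]]
x_row = np.linalg.solve(row_major, b)
print(x_row)    # ~[2.2857, 2.2857], the expected answer

# Column-major (Fortran order) reading, i.e. the transpose:
col_major = a_values.reshape(2, 2, order="F")   # [[85,  1], [90, -1]]
x_col = np.linalg.solve(col_major, b)
print(x_col)    # ~[2.2857, 205.71], the answer the poster observed
```

Since the column-major reading reproduces [2.2857, 205.71], the likely fix is to lay out aValues in column-major order before passing it to the solver.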
Post not yet marked as solved
0 Replies
458 Views
The project is based on Python 3.8 and 3.9 and contains some C and C++ source. How can I do parallel computing on the CPU and GPU of the M1 Max? I bought the M1 Max Mac for its strong GPU to do quantitative finance, where speed is extremely important. Unfortunately, CUDA is not compatible with Mac.

1. Can Accelerate (for the CPU) and Metal (for the GPU) speed up an arbitrary source build like this?

Step 1: download the source from GitHub.
Step 2: create a file named "site.cfg" in the source directory with the content:

```
[accelerate]
libraries = Metal, Accelerate, vecLib
```

Step 3: in Terminal: NPY_LAPACK_Order=accelerate python3 setup.py build
Step 4: pip3 install . or python3 setup.py install (I am not sure which method to apply).

2. How compatible is this method? I need to speed up NumPy, pandas, and even an open-source project such as https://github.com/microsoft/qlib.
3. Please show me the code.
4. When compiling the C and C++ source, a lot of errors were reported. Which gcc and g++ should I choose? The default gcc installed by brew is 4.2.1, which doesn't work, and I even tried downloading gcc from ARM's official website without success. Any hint would be appreciated; this is urgent. Thanks so much.
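Whatever build route is used, it helps to verify afterwards which BLAS NumPy actually linked against and whether large matrix products are fast. A minimal check (the exact section names printed by show_config vary across NumPy versions):

```python
import time
import numpy as np

# Print the BLAS/LAPACK configuration NumPy was built with; after a
# successful Accelerate build this should mention Accelerate/vecLib.
np.show_config()

# A quick sanity benchmark: a large matmul runs through the linked BLAS,
# so timing it lets you compare builds.
n = 512
a = np.random.rand(n, n)
start = time.perf_counter()
c = a @ a
elapsed = time.perf_counter() - start
print(f"{n}x{n} matmul took {elapsed * 1e3:.1f} ms")
```

Note that this only exercises NumPy's BLAS path; pandas and projects like qlib benefit only where they call into NumPy, and Metal is not used by a NumPy build at all.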
Post not yet marked as solved
4 Replies
998 Views
I am using a Core ML model from https://github.com/PeterL1n/RobustVideoMatting. I have an M1 MacBook Pro 13" with 16 GB and an M1 Max MacBook Pro 16" with 64 GB. When computeUnits is .all or the default, the M1 Max 16" is much slower than the M1 13": one prediction takes 0.202 s versus 0.155 s. Using .cpuOnly, the M1 Max 16" is slightly faster: 0.129 s versus 0.146 s. Using .cpuAndGPU, the M1 Max 16" is much faster than the M1 13": 0.057 s versus 0.086 s. And when I use .all or the default, the M1 Max prints error messages like this:

```
H11ANEDevice::H11ANEDeviceOpen IOServiceOpen failed result= 0xe00002e2
H11ANEDevice::H11ANEDeviceOpen kH11ANEUserClientCommand_DeviceOpen call failed result=0xe00002bc
Error opening LB - status=0xe00002bc.. Skipping LB and retrying
```

The M1 13" doesn't produce any errors. So I want to know: is this a bug in Core ML or in the M1 Max? My code looks like this:

```swift
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try rvm_mobilenetv3_1920x1080_s0_25_int8_ANE(configuration: config)
let image1 = NSImage(named: "test1")?.cgImage(forProposedRect: nil, context: nil, hints: nil)
let input = try? rvm_mobilenetv3_1920x1080_s0_25_int8_ANEInput(srcWith: image1!,
                                                               r1i: MLMultiArray(),
                                                               r2i: MLMultiArray(),
                                                               r3i: MLMultiArray(),
                                                               r4i: MLMultiArray())
_ = try? model.prediction(input: input!)
```
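One thing worth ruling out when comparing compute-unit modes is measurement methodology: the first predictions after loading a Core ML model can include one-off device selection and compilation work. A small, generic timing helper (plain Python; `predict` is a placeholder for whatever invokes the model, e.g. through coremltools) that excludes warm-up runs:

```python
import time

def time_prediction(predict, n_warmup=3, n_runs=20):
    """Average wall-clock seconds per call of predict(), after warm-up.

    The first calls to a freshly loaded model may include one-off
    compilation and device-selection cost, so they are excluded.
    """
    for _ in range(n_warmup):
        predict()
    start = time.perf_counter()
    for _ in range(n_runs):
        predict()
    return (time.perf_counter() - start) / n_runs

# Usage with any zero-argument callable standing in for the model call:
t = time_prediction(lambda: sum(range(1000)))
print(f"avg {t * 1e6:.1f} us per call")
```

If the M1 Max is only slower in .all mode and logs ANE open failures, averaging steady-state runs like this helps separate "the ANE path is failing and falling back" from "one slow first prediction skewed the numbers".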
Post not yet marked as solved
0 Replies
328 Views
Hello! I'm having an issue with retrieving the trained weights from MLCLSTMLayer in ML Compute when training on a GPU. I maintain references to the input-weights, hidden-weights, and biases tensors and use the following code to extract the data post-training:

```swift
extension MLCTensor {
    func dataArray<Scalar>(as _: Scalar.Type) throws -> [Scalar] where Scalar: Numeric {
        let count = self.descriptor.shape.reduce(into: 1) { (result, value) in
            result *= value
        }
        var array = [Scalar](repeating: 0, count: count)
        // This *should* copy the latest data from the GPU to memory that's accessible by the CPU.
        self.synchronizeData()
        _ = try array.withUnsafeMutableBytes { (pointer) in
            guard let data = self.data else {
                throw DataError.uninitialized // A custom error that I declare elsewhere.
            }
            data.copyBytes(to: pointer)
        }
        return array
    }
}
```

The issue is that when I call dataArray(as:) on a weights or biases tensor for an LSTM layer that has been trained on a GPU, the values that it retrieves are the same as they were before training began. For instance, if I initialize the biases all to 0 and then train the LSTM layer on a GPU, the biases values seemingly remain 0 post-training, even though the reported loss values decrease as you would expect. This issue does not occur when training an LSTM layer on a CPU, and it also does not occur when training a fully-connected layer on a GPU. Since both types of layers work properly on a CPU but only MLCFullyConnectedLayer works properly on a GPU, it seems that the issue is a bug in ML Compute's GPU implementation of MLCLSTMLayer specifically. For reference, I'm testing my code on an M1 Max. Am I doing something wrong, or is this an actual bug that I should report in Feedback Assistant?
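The diagnostic logic in the post ("loss decreases but the copied parameters never move") can be captured as a small, framework-agnostic sanity check. This is a sketch in Python over parameter snapshots (e.g. the arrays produced by dataArray(as:) before and after a few training steps), not ML Compute API code:

```python
import numpy as np

def weights_changed(before, after):
    """Return True if any parameter array moved between two snapshots.

    `before` and `after` are lists of arrays snapshotted from the same
    tensors before and after a few training steps. If the training loss
    decreases but this returns False, the copy being read was never
    synchronized from the device, which matches the GPU-only symptom
    described in the post.
    """
    return any(not np.allclose(b, a) for b, a in zip(before, after))

# Example: a bias that stayed identically zero after "training" is the
# red flag; any movement at all means the sync worked.
print(weights_changed([np.zeros(4)], [np.zeros(4)]))               # False: suspicious
print(weights_changed([np.zeros(4)], [np.array([0., .1, 0., 0.])]))  # True: sync worked
```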
Post not yet marked as solved
2 Replies
667 Views
When running the same code on my M1 Mac with tensorflow-metal versus in a Google Colab, I see a problem with the results. The code: https://colab.research.google.com/drive/13GzSfToUvmmGHaROS-sGCu9mY1n_2FYf?usp=sharing

```python
import tensorflow as tf
import numpy as np
import pandas as pd

# Set up the model
input_shape = (10, 5)
model_tst = tf.keras.Sequential()
model_tst.add(tf.keras.Input(shape=input_shape))
model_tst.add(tf.keras.layers.LSTM(100, return_sequences=True))
model_tst.add(tf.keras.layers.Dense(2, activation="sigmoid"))
model_tst.summary()

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
model_tst.compile(loss=loss, optimizer=optimizer, metrics=["mse"])

# Generate step data
random_input = np.ones((11, 10, 5))
random_input[:, 8:, :] = 99

# Predictions at two different batch sizes
random_output2 = model_tst.predict(random_input, batch_size=1)[0, :, :].reshape(10, 2)
random_output3 = model_tst.predict(random_input, batch_size=10)[0, :, :].reshape(10, 2)

# Compare results
diff2 = random_output3 - random_output2
pd.DataFrame(diff2).T
```

The output on the Mac differs from the output on Google Colab (screenshots not included here). If I reduce the number of units in the LSTM from 100 to 2 and run the otherwise identical script, the problem disappears and the outputs are the same. I guess this has to do with how the calculations are getting passed to Apple silicon. Any debugging steps I should try to resolve this problem? Info: I set up TensorFlow following https://developer.apple.com/metal/tensorflow-plugin/, and the startup log output shows that the GPU plugin is being used.
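A compact way to quantify this comparison is a helper that reports the worst-case disagreement between predictions at different batch sizes. This is a framework-agnostic sketch (`predict(x, batch_size)` stands in for model.predict); on a correct backend the drift should be at the level of float32 rounding, around 1e-6, while values orders of magnitude larger point to a kernel bug:

```python
import numpy as np

def batch_size_drift(predict, x, batch_sizes=(1, 10)):
    """Max absolute difference between predictions at different batch sizes.

    `predict(x, batch_size)` is a stand-in for model.predict; the outputs
    for the same inputs should not depend on how they are batched.
    """
    outs = [np.asarray(predict(x, bs)) for bs in batch_sizes]
    return max(float(np.max(np.abs(outs[0] - o))) for o in outs[1:])

# A batch-size-independent function has zero drift:
drift = batch_size_drift(lambda x, bs: x * 2.0, np.ones((4, 3)))
print(drift)  # 0.0
```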
Post not yet marked as solved
1 Reply
495 Views
I'm trying to implement the PyTorch custom layer grid_sampler (https://pytorch.org/docs/1.9.1/generated/torch.nn.functional.grid_sample.html) on the GPU. Both of its inputs, input and grid, can be 5-D. My implementation of encodeToCommandBuffer, part of the MLCustomLayer protocol, is shown below. In my attempts so far, the values of id<MTLTexture> input and id<MTLTexture> grid don't meet expectations. So I wonder: can an MTLTexture be used to store a 5-D input tensor as an input to encodeToCommandBuffer? Or can anybody show me how to use MTLTexture correctly here? Thanks a lot!

```objc
- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer
                       inputs:(NSArray<id<MTLTexture>> *)inputs
                      outputs:(NSArray<id<MTLTexture>> *)outputs
                        error:(NSError * _Nullable *)error {
    NSLog(@"Dispatching to GPU");
    NSLog(@"inputs count %lu", (unsigned long)inputs.count);
    NSLog(@"outputs count %lu", (unsigned long)outputs.count);

    id<MTLComputeCommandEncoder> encoder = [commandBuffer
        computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(encoder != nil);

    id<MTLTexture> input = inputs[0];
    id<MTLTexture> grid = inputs[1];
    id<MTLTexture> output = outputs[0];
    NSLog(@"input shape %lu, %lu, %lu, %lu", (unsigned long)input.width,
          (unsigned long)input.height, (unsigned long)input.depth,
          (unsigned long)input.arrayLength);
    NSLog(@"grid shape %lu, %lu, %lu, %lu", (unsigned long)grid.width,
          (unsigned long)grid.height, (unsigned long)grid.depth,
          (unsigned long)grid.arrayLength);

    if (encoder) {
        [encoder setTexture:input atIndex:0];
        [encoder setTexture:grid atIndex:1];
        [encoder setTexture:output atIndex:2];

        NSUInteger wd = grid_sample_Pipeline.threadExecutionWidth;
        NSUInteger ht = grid_sample_Pipeline.maxTotalThreadsPerThreadgroup / wd;
        MTLSize threadsPerThreadgroup = MTLSizeMake(wd, ht, 1);
        MTLSize threadgroupsPerGrid = MTLSizeMake((input.width + wd - 1) / wd,
                                                  (input.height + ht - 1) / ht,
                                                  input.arrayLength);
        [encoder setComputePipelineState:grid_sample_Pipeline];
        [encoder dispatchThreadgroups:threadgroupsPerGrid
                threadsPerThreadgroup:threadsPerThreadgroup];
        [encoder endEncoding];
    } else {
        return NO;
    }
    *error = nil;
    return YES;
}
```
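When debugging a custom grid_sample kernel, it helps to have a trusted CPU reference to compare the GPU output against on tiny inputs. Below is a minimal NumPy sketch of the 2-D bilinear case with align_corners=True and zero padding (the 5-D case extends the same weighting to a third axis); it is an illustration for validation, not PyTorch's actual implementation:

```python
import numpy as np

def grid_sample_2d(img, grid):
    """NumPy reference for 2-D bilinear grid sampling.

    img:  (H, W) float array.
    grid: (Hg, Wg, 2) of (x, y) coords in [-1, 1] (align_corners=True);
          out-of-range samples contribute zero (zero padding).
    """
    H, W = img.shape
    # Map normalized coordinates to pixel coordinates.
    x = (grid[..., 0] + 1) * (W - 1) / 2
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0

    def at(yy, xx):
        # Fetch pixels, substituting 0 outside the image.
        valid = (yy >= 0) & (yy < H) & (xx >= 0) & (xx < W)
        out = np.zeros_like(x, dtype=img.dtype)
        out[valid] = img[yy[valid], xx[valid]]
        return out

    return ((1 - wx) * (1 - wy) * at(y0, x0) + wx * (1 - wy) * at(y0, x1)
            + (1 - wx) * wy * at(y1, x0) + wx * wy * at(y1, x1))

# An identity grid should reproduce the image exactly.
img = np.array([[0.0, 1.0], [2.0, 3.0]])
gx, gy = np.meshgrid(np.linspace(-1, 1, 2), np.linspace(-1, 1, 2))
grid = np.stack([gx, gy], axis=-1)
print(grid_sample_2d(img, grid))
```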
Post not yet marked as solved
0 Replies
384 Views
I am using the default HelloPhotogrammetry app from https://developer.apple.com/documentation/realitykit/creating_a_photogrammetry_command-line_app/. My system originally did not meet the specs because of a GPU requirement, so I bought the Apple-supported Blackmagic eGPU. Here is the error I get when I run the tool, despite the eGPU:

apply_selection_policy_once: prefer use of removable GPUs (via (null):GPUSelectionPolicy->preferRemovable)

I have deduced that the application running the tool needs this key: https://developer.apple.com/documentation/bundleresources/information_property_list/gpuselectionpolicy. I tried modifying Terminal's plist to the updated value, but had no luck. I believe the command-line tool built in Xcode needs the updated value; I need help with that so the system can use the eGPU. I did create a property list within the macOS app and added GPUSelectionPolicy with preferRemovable, and I still get the same error. Please advise.

Also, note that I tried temporarily turning off "Prefer External GPU" for Terminal, and the photogrammetry processing did run, but it was taking a while (over 30 minutes), so I killed the task. Activity Monitor showed my internal GPU being used, not the eGPU I'm trying to use. Previously, without the eGPU plugged in, I got an error saying my specs did not meet the criteria, so it was interesting that it now accepted my Mac (which technically qualified); it just did the processing on the less powerful GPU.
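For reference, the key described in the linked GPUSelectionPolicy documentation belongs in the Info.plist of the process that does the GPU work (for a plain command-line tool, that typically means embedding an Info.plist into the binary via the linker, since the tool has no bundle). A sketch of the fragment, assuming the key is added to the tool's own Info.plist rather than Terminal's:

```xml
<!-- Hypothetical Info.plist fragment for the command-line tool;
     key and value taken from the GPUSelectionPolicy documentation. -->
<key>GPUSelectionPolicy</key>
<string>preferRemovable</string>
```

Whether the Photogrammetry session honors this for its internal GPU selection is not confirmed by the documentation, so treat this as a thing to try rather than a known fix.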
Post not yet marked as solved
2 Replies
1.7k Views
I'm using my 2020 Mac mini with the M1 chip, and this is my first time using it for convolutional neural network training. I installed Python (3.8.12) and TensorFlow using Miniforge3, following the official instructions, but I'm still facing a GPU problem when training a 3D U-Net. Here's part of my code; I'm hoping for suggestions on how to fix this.

```python
import tensorflow as tf
from tensorflow import keras
import json
import numpy as np
import pandas as pd
import nibabel as nib
import matplotlib.pyplot as plt
from tensorflow.keras import backend as K
from tensorflow.python.client import device_lib  # needed for the check below

# Check available devices
def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())
```

Output:

```
Metal device set to: Apple M1
['/device:CPU:0', '/device:GPU:0']
2022-02-09 11:52:55.468198: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-09 11:52:55.468885: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
```

```python
X_norm_with_batch_dimension = np.expand_dims(X_norm, axis=0)
# tf.device('/device:GPU:0')                   # Tried this line; it doesn't work
# tf.debugging.set_log_device_placement(True)  # Tried this line; it doesn't work
patch_pred = model.predict(X_norm_with_batch_dimension)
```

```
InvalidArgumentError: 2 root error(s) found.
(0) INVALID_ARGUMENT: CPU implementation of Conv3D currently only supports the NHWC tensor format.
[[node model/conv3d/Conv3D (defined at /Users/mwshay/miniforge3/envs/tensor/lib/python3.8/site-packages/keras/layers/convolutional.py:231) ]]
[[model/conv3d/Conv3D/_4]]
(1) INVALID_ARGUMENT: CPU implementation of Conv3D currently only supports the NHWC tensor format.
[[node model/conv3d/Conv3D (defined at /Users/mwshay/miniforge3/envs/tensor/lib/python3.8/site-packages/keras/layers/convolutional.py:231) ]]
0 successful operations. 0 derived errors ignored.
```

The code runs on Google Colab but can't run locally on the Mac mini in a Jupyter notebook. The NHWC tensor format error suggests the op was placed on the CPU instead of the GPU. Is there any way to get TensorFlow to train the network on the GPU?
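The error text says the CPU Conv3D kernel only accepts channels-last (NDHWC) data. If the data pipeline produces channels-first arrays, one workaround (a sketch, assuming 5-D volumes and a model built with Keras' default data_format='channels_last') is to transpose before feeding the model:

```python
import numpy as np

def to_channels_last(batch):
    """Convert a 5-D volume batch from NCDHW to NDHWC.

    The NHWC/NDHWC layout is what the CPU Conv3D kernel in the error
    message supports, and it is also Keras' default data_format.
    """
    return np.transpose(batch, (0, 2, 3, 4, 1))

# (N=1, C=4, D=8, H=8, W=8) -> (1, 8, 8, 8, 4)
print(to_channels_last(np.zeros((1, 4, 8, 8, 8))).shape)
```

If the data is already channels-last, the error instead means the Conv3D op fell back to the CPU because the Metal plugin could not place it on the GPU; in that case the layout change will not help and the placement itself is the issue.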
Post not yet marked as solved
1 Reply
383 Views
I need to build a model to add to my app and tried following the Apple docs here. No luck: I get an error that is discussed in this thread on the forum. I'm still not clear on why the error occurs and can't resolve it. I wonder whether Create ML inside Playgrounds is still supported at all. I tried the Create ML app that you can access through the developer tools, but it just crashes my Mac (a 2017 MBP; is it just too much of a brick to use for ML at this point? I should think not, because I've recently built and trained relatively simple models using TensorFlow + Python on this machine, and the classifier I'm trying to make now is really simple and doesn't have a huge dataset).
Post not yet marked as solved
3 Replies
513 Views
I am training a model using tensorflow-metal, and model training (and the whole application) freezes up. The behavior is nondeterministic. I believe the problem is with Metal (1) because of the contents of the backtraces below, and (2) because when I run the same code on a machine with non-Metal TensorFlow (using a GPU), everything works fine. I can't share my code publicly, but I would be willing to share it with an Apple engineer privately over email if that would help. It's hard to create a minimal reproduction example since my program is somewhat complex and the bug is nondeterministic, though it does appear pretty reliably. It looks like the problem might be in some Metal Performance Shaders init code. The state of everything (backtraces, etc.) when the program freezes is attached. Backtraces
Post not yet marked as solved
0 Replies
212 Views
We are developing a simple GAN, and when training it, the convergence behavior of the discriminator differs when we use the GPU compared to using only the CPU or running in Colab. We've read a lot, but only one post seems to describe similar behavior. Unfortunately, after updating to tensorflow-metal 0.4, the problem persists.

My hardware/software: MacBook Pro, model MacBookPro18,2, Apple M1 Max chip, 10 cores (8 performance and 2 efficiency), 64 GB memory, firmware 7459.101.3, macOS Monterey 12.3.1. Python 3.8, with the most relevant libraries from pip freeze:

```
keras==2.8.0
Keras-Preprocessing==1.1.2
...
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow-datasets==4.5.2
tensorflow-docs @ git+https://github.com/tensorflow/docs@7d5ea2e986a4eae7573be3face00b3cccd4b8b8b
tensorflow-macos==2.8.0
tensorflow-metadata==1.7.0
tensorflow-metal==0.4.0
```

The code to reproduce does not fit within the message size limit, so I've shared a Google Colab notebook: https://colab.research.google.com/drive/1oDS8EV0eP6kToUYJuxHf5WCZlRL0Ypgn?usp=sharing. You can easily see that the loss goes to 0 after 1 or 2 epochs when the GPU is enabled, but with the GPU disabled everything is OK.
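For context on why "loss goes to 0" is the alarming symptom here: with binary cross-entropy, a healthy GAN discriminator stays near log(2) ≈ 0.693 (it cannot reliably tell real from fake), while a loss near 0 means it classifies everything confidently, after which the generator stops receiving useful gradients. A small NumPy illustration of the two regimes (not the notebook's code, just the arithmetic behind the diagnosis):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with predictions clipped away from 0/1."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

y_true = np.array([1.0, 0.0])  # one real sample, one fake sample

# Healthy equilibrium: the discriminator is maximally uncertain.
print(binary_cross_entropy(y_true, np.array([0.5, 0.5])))      # ~0.693 = log(2)

# Collapsed regime: confidently correct on everything; loss ~0,
# the symptom seen after 1-2 epochs with the GPU enabled.
print(binary_cross_entropy(y_true, np.array([0.999, 0.001])))  # ~0.001
```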
Post not yet marked as solved
0 Replies
205 Views
I'm trying to run a Python file in VS Code but get the error mentioned in the title. I'm basically trying to train a deep learning model, importing libraries like TensorFlow, NumPy, pandas, Matplotlib, etc. The only output I get is "Illegal instruction: 4", nothing else. One more thing: the same code works fine on Windows. Please help.
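On Apple silicon, "Illegal instruction: 4" commonly means a binary wheel built for the wrong CPU (e.g. stock x86_64 TensorFlow) is being executed. A quick check worth running in the same VS Code interpreter (a diagnostic sketch; the expected values are what you would hope to see on an M-series Mac, not guaranteed output on any machine):

```python
import platform
import sys

# On Apple silicon you want 'arm64' here; 'x86_64' means the interpreter
# is an Intel build running under Rosetta, and arm64-only packages such
# as tensorflow-macos/tensorflow-metal will not work with it.
print(platform.machine())
print(sys.version)
```

If the machine reports x86_64, recreate the environment with an arm64 Python (e.g. via Miniforge) and install tensorflow-macos instead of the stock tensorflow wheel.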
Post marked as solved
2 Replies
233 Views
Hello everyone. I found a problem in the TensorFlow built-in function tf.signal.stft: the code below fails. The device is a MacBook Pro with an M1 Pro chip, running in JupyterLab. The problem does not occur on Linux with CUDA. Does anyone know how to fix it? Thanks.

Code:

```python
import numpy as np
import tensorflow as tf

random_waveform = np.random.normal(size=(16000))
tf_waveform = tf.constant(random_waveform)
tf_stft_waveform = tf.signal.stft(tf_waveform, frame_length=255, frame_step=128)
```

Error message:

```
InvalidArgumentError                      Traceback (most recent call last)
Input In [1], in <cell line: 6>()
      4 random_waveform = np.random.normal(size=(16000))
      5 tf_waveform = tf.constant(random_waveform)
----> 6 tf_stft_waveform = tf.signal.stft(tf_waveform, frame_length=255, frame_step=128)

File ~/miniconda3/envs/AI/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File ~/miniconda3/envs/AI/lib/python3.9/site-packages/tensorflow/python/framework/ops.py:7164, in raise_from_not_ok_status(e, name)
   7162 def raise_from_not_ok_status(e, name):
   7163   e.message += (" name: " + name if name is not None else "")
-> 7164   raise core._status_to_exception(e) from None

InvalidArgumentError: Multiple Default OpKernel registrations match NodeDef '{{node ZerosLike}}': 'op: "ZerosLike" device_type: "DEFAULT" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "y"' and 'op: "ZerosLike" device_type: "DEFAULT" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "y"' [Op:ZerosLike]
```
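As a stopgap while the op fails on the Metal backend, the same framing-plus-FFT can be computed on the CPU with NumPy. This sketch is not numerically identical to tf.signal.stft's defaults (TF uses a periodic Hann window and rounds fft_length up to a power of two; this uses a symmetric window and fft_length = frame_length), but it produces an STFT of the same frame layout for the parameters in the post:

```python
import numpy as np

def stft_numpy(signal, frame_length=255, frame_step=128):
    """CPU short-time Fourier transform: Hann-windowed frames + rfft."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // frame_step
    frames = np.stack([signal[i * frame_step : i * frame_step + frame_length]
                       for i in range(n_frames)])
    return np.fft.rfft(frames * window, n=frame_length)

spec = stft_numpy(np.random.normal(size=16000))
print(spec.shape)  # (124, 128): 124 frames, 128 frequency bins
```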
Post not yet marked as solved
0 Replies
123 Views
We use several Core ML models in our Swift application. The memory footprint of these models ranges from 15 kB to 3.5 MB according to the Xcode Core ML utility tool. We observe a huge difference in loading time depending on the type of compute units selected to run the model. Here is a small code sample used to load a model:

```swift
let configuration = MLModelConfiguration()
// Here I use the .all compute units mode:
configuration.computeUnits = .all
let myModel = try! myCoremlModel(configuration: configuration).model
```

Here are the profiling results of this sample for different model sizes, by targeted compute units:

Model-3.5-MB:
- computeUnits = .cpuAndGPU: 188 ms ⇒ 18 MB/s
- computeUnits = .all, or .cpuAndNeuralEngine on iOS 16: 4000 ms ⇒ 875 kB/s

Model-2.6-MB:
- computeUnits = .cpuAndGPU: 144 ms ⇒ 18 MB/s
- computeUnits = .all, or .cpuAndNeuralEngine on iOS 16: 1300 ms ⇒ 2 MB/s

Model-15-kB:
- computeUnits = .cpuAndGPU: 18 ms ⇒ 833 kB/s
- computeUnits = .all, or .cpuAndNeuralEngine on iOS 16: 700 ms ⇒ 22 kB/s

What explains the difference in loading time between computeUnits modes? Is there a way to reduce model loading time when using the .all or .cpuAndNeuralEngine compute units modes?
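The posted throughput figures can be re-derived from the sizes and times, and the pattern itself is informative: even the 15 kB model takes 700 ms in the Neural Engine modes, so the slow path looks dominated by a large per-model fixed cost (plausibly ahead-of-time compilation/specialization for the ANE at load time) rather than by data transfer that scales with model size. A quick check of the arithmetic:

```python
# Throughputs implied by the load times in the post (size in MB, time in ms).
for label, size_mb, ms in [("3.5 MB / .cpuAndGPU", 3.5, 188),
                           ("3.5 MB / .all",       3.5, 4000),
                           ("2.6 MB / .cpuAndGPU", 2.6, 144),
                           ("2.6 MB / .all",       2.6, 1300)]:
    print(f"{label}: {size_mb / (ms / 1000):.2f} MB/s")
```

The .cpuAndGPU rows come out near a constant 18 MB/s, while the .all rows vary wildly, which is what you would expect if a fixed compilation cost, not bandwidth, dominates the ANE load path.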
Post not yet marked as solved
0 Replies
102 Views
We use dynamic input sizes for some use cases. When the compute unit mode is .all, there is a strong difference in execution time when the dynamic input shape doesn't match the optimal shape. If we set the model's optimal input shape to 896x896 but run it with an input shape of 1024x768, execution is almost twice as slow as with an 896x896 input. For example, a model with an 896x896 preferred input shape achieves inference in 66 ms with an 896x896 input, but only 117 ms with a 1024x768 input. In that case, to achieve the best performance at inference time we would need to switch between models depending on the input shape, which is not dynamic at all and is memory-greedy. Is there a way to reduce the execution time for shapes outside the preferred shape range?
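One detail worth noting: the slower shape is not a bigger workload. Counting pixels shows 1024x768 is actually about 2% smaller than 896x896, so the 66 ms vs 117 ms gap cannot be explained by input size and is consistent with the model having been specialized for the preferred shape:

```python
# Pixel counts for the two input shapes in the post.
preferred = 896 * 896    # 802,816 pixels at the preferred shape (66 ms)
off_shape = 1024 * 768   # 786,432 pixels at the slower shape (117 ms)
print(preferred, off_shape, off_shape / preferred)  # ratio ~0.98
```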
Post not yet marked as solved
0 Replies
72 Views
I'm trying to run the sample code for MPSGraph from https://developer.apple.com/documentation/metalperformanceshadersgraph/adding_custom_functions_to_a_shader_graph, and it's not working. It builds successfully, but after you press Train (the play button), the program fails right after the first training iteration with errors like this:

```
-[MTLDebugCommandBuffer lockPurgeableObjects]:2103: failed assertion `MTLResource 0x600001693940 (label: (null)), referenced in cmd buffer 0x124015800 (label: (null)) is in volatile or empty purgeable state at commit'
```

It fails on commandBuffer.commit() in the runTrainingIterationBatch() method, as if something had already committed the operation (I've checked, and yes, the command buffer is already committed). But why would such a thing happen in example code? I tried wrapping the commit with a command-buffer status check, which avoids the crash, but the program then works incorrectly overall and doesn't compute the loss well. Everything is made worse by the MPSGraph documentation being empty: it contains only class and method names without any descriptions.

My environment: Xcode 13.4.1 (13F100), macOS 12.4, 14" MacBook Pro (M1 Pro, 2021, 16 GB). I also tried building for iPhone 12 Pro Max (iOS 15.5) and as a Mac Catalyst application; I got the same error everywhere.
Post not yet marked as solved
0 Replies
74 Views
The documentation for MPSGraph has no information about class and method functionality; it only enumerates everything without any explanation of what things are or how they work. Why is that? https://developer.apple.com/documentation/metalperformanceshadersgraph There are some comments in the MPSGraph header files, though, so this looks like a documentation bug.