Post not yet marked as solved
So I've read the documentation, downloaded the Accelerate source, and created a simple example.
I'm attempting to solve a system of two equations,
90x+85y=400, and
y-x=0.
The result should be just greater than 2.25 for both x and y. What I get is [x,y]=[2.2857144, 205.7143].
I'm new to this, so I'm sure I've misread the docs, but I can't see where.
Here is the code I modified to do my experiment.
do {
    let aValues: [Float] = [85, 90,
                            1, -1]
    /// The _b_ in _Ax = b_.
    let bValues: [Float] = [400, 0]
    /// Call `nonsymmetric_general` to compute the _x_ in _Ax = b_.
    let x = nonsymmetric_general(a: aValues,
                                 dimension: 2,
                                 b: bValues,
                                 rightHandSideCount: 1)
    /// Calculate _b_ using the computed _x_.
    if let x = x {
        let b = matrixVectorMultiply(matrix: aValues,
                                     dimension: (m: 2, n: 2),
                                     vector: x)
        /// Prints _b_ in _Ax = b_ using the computed _x_.
        print("\nx = ", x)
        print("\nb =", b)
    }
}
What did I misunderstand?
Thanks
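For reference, the same system can be checked with NumPy (assuming NumPy is available). Note that A is laid out row by row as [90, 85, -1, 1], which differs from the aValues order in the snippet above; if nonsymmetric_general expects row-major input, that ordering may be the discrepancy:

```python
import numpy as np

# 90x + 85y = 400 and -x + y = 0, one equation per row.
A = np.array([[90.0, 85.0],
              [-1.0, 1.0]])
b = np.array([400.0, 0.0])

x = np.linalg.solve(A, b)
print(x)  # both components equal 400/175, roughly 2.2857
```

Both unknowns come out just above 2.25, matching the expectation stated above.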
The project is based on Python 3.8 and 3.9 and contains some C and C++ source.
How can I do parallel computing on the CPU and GPU of the M1 Max?
Indeed, I bought a Mac with the M1 Max for its strong GPU to do quantitative finance, where speed is extremely important. Unfortunately, CUDA is not compatible with Mac.
Please show me how to do it, thanks.
1. Can Accelerate (for the CPU) and Metal (for the GPU) speed up any source by building like this?
Step 1: download the source from GitHub.
Step 2: create a file named "site.cfg" in the source directory and add the content: [accelerate] libraries=Metal, Accelerate, vecLib
Step 3: in Terminal: NPY_LAPACK_Order=accelerate python3 setup.py build
Step 4: pip3 install . or python3 setup.py install? (I am not sure which method to apply.)
2. How compatible is this method? I need to speed up NumPy, pandas, and even an open-source project such as https://github.com/microsoft/qlib
3. Please just show me the code.
4. When compiling the C and C++ source, a lot of errors were reported. Which gcc and g++ should I choose? The default gcc installed by brew is 4.2.1, which does not work, and I even tried downloading gcc from the official ARM website, still with no luck. Please give me a hint.
Thanks so much; this is urgent.
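However the build goes, you can check which BLAS/LAPACK backend an installed NumPy actually linked against and roughly sanity-check its speed. This only assumes NumPy is importable and says nothing about Metal:

```python
import time

import numpy as np

# Print the BLAS/LAPACK configuration NumPy was built with; an
# Accelerate-backed build mentions "accelerate" or "vecLib" here.
np.show_config()

# Rough sanity benchmark: a large matmul exercises the linked BLAS.
a = np.random.rand(1000, 1000)
t0 = time.perf_counter()
c = a @ a
print(f"1000x1000 matmul took {time.perf_counter() - t0:.3f} s")
```

On a machine where NumPy picked up an optimized BLAS, the matmul typically completes in well under a second; a fallback reference BLAS is noticeably slower.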
I am using a CoreML model from https://github.com/PeterL1n/RobustVideoMatting.
I have an M1 MacBook Pro 13" with 16 GB and an M1 Max MacBook Pro 16" with 64 GB.
When "computeUnits" is .all or the default, the M1 Max 16" is much slower than the M1 13": one prediction takes 0.202 s vs 0.155 s.
Using .cpuOnly, the M1 Max 16" is a little faster: 0.129 s vs 0.146 s.
Using .cpuAndGPU, the M1 Max 16" is much faster than the M1 13": 0.057 s vs 0.086 s.
And when I use .all or the default, the M1 Max prints error messages like this:
H11ANEDevice::H11ANEDeviceOpen IOServiceOpen failed result= 0xe00002e2
H11ANEDevice::H11ANEDeviceOpen kH11ANEUserClientCommand_DeviceOpen call failed result=0xe00002bc
Error opening LB - status=0xe00002bc.. Skipping LB and retrying
But the M1 13" doesn't show any errors.
So I want to know: is this a bug in CoreML or in the M1 Max?
My code is like this:
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try rvm_mobilenetv3_1920x1080_s0_25_int8_ANE(configuration: config)
let image1 = NSImage(named: "test1")?.cgImage(forProposedRect: nil, context: nil, hints: nil)
let input = try? rvm_mobilenetv3_1920x1080_s0_25_int8_ANEInput(srcWith:image1!, r1i: MLMultiArray(), r2i: MLMultiArray(), r3i: MLMultiArray(), r4i: MLMultiArray())
_ = try? model.prediction(input: input!)
Hello! I’m having an issue with retrieving the trained weights from MLCLSTMLayer in ML Compute when training on a GPU. I maintain references to the input-weights, hidden-weights, and biases tensors and use the following code to extract the data post-training:
extension MLCTensor {
    func dataArray<Scalar>(as _: Scalar.Type) throws -> [Scalar] where Scalar: Numeric {
        let count = self.descriptor.shape.reduce(into: 1) { (result, value) in
            result *= value
        }
        var array = [Scalar](repeating: 0, count: count)
        self.synchronizeData() // This *should* copy the latest data from the GPU to memory that's accessible by the CPU
        _ = try array.withUnsafeMutableBytes { (pointer) in
            guard let data = self.data else {
                throw DataError.uninitialized // A custom error that I declare elsewhere
            }
            data.copyBytes(to: pointer)
        }
        return array
    }
}
The issue is that when I call dataArray(as:) on a weights or biases tensor for an LSTM layer that has been trained on a GPU, the values that it retrieves are the same as they were before training began. For instance, if I initialize the biases all to 0 and then train the LSTM layer on a GPU, the biases values seemingly remain 0 post-training, even though the reported loss values decrease as you would expect.
This issue does not occur when training an LSTM layer on a CPU, and it also does not occur when training a fully-connected layer on a GPU. Since both types of layers work properly on a CPU but only MLCFullyConnectedLayer works properly on a GPU, it seems that the issue is a bug in ML Compute’s GPU implementation of MLCLSTMLayer specifically.
For reference, I’m testing my code on M1 Max.
Am I doing something wrong, or is this an actual bug that I should report in Feedback Assistant?
When running the same code on my M1 Mac with tensorflow-metal vs in a Google Colab, I see a problem with the results.
The code: https://colab.research.google.com/drive/13GzSfToUvmmGHaROS-sGCu9mY1n_2FYf?usp=sharing
import tensorflow as tf
import numpy as np
import pandas as pd
# Setup model
input_shape = (10, 5)
model_tst = tf.keras.Sequential()
model_tst.add(tf.keras.Input(shape=input_shape))
model_tst.add(tf.keras.layers.LSTM(100, return_sequences=True))
model_tst.add(tf.keras.layers.Dense(2, activation="sigmoid"))
model_tst.summary()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
model_tst.compile(
    loss=loss,
    optimizer=optimizer,
    # metrics=[tf.keras.metrics.BinaryCrossentropy()],
    metrics=["mse"],
)
# Generate step data
random_input = np.ones((11, 10, 5))
random_input[:, 8:, :] = 99
# Predictions
random_output2 = model_tst.predict(random_input, batch_size=1)[0, :, :].reshape(10, 2)
random_output3 = model_tst.predict(random_input, batch_size=10)[0, :, :].reshape(10, 2)
# Compare results
diff2 = random_output3 - random_output2
pd.DataFrame(diff2).T
Output on Mac:
Output on Google Colab:
If I reduce the number of nodes in the LSTM I can get the problem to disappear:
import tensorflow as tf
import numpy as np
import pandas as pd
# Setup model
input_shape = (10, 5)
model_tst = tf.keras.Sequential()
model_tst.add(tf.keras.Input(shape=input_shape))
model_tst.add(tf.keras.layers.LSTM(2, return_sequences=True))
model_tst.add(tf.keras.layers.Dense(2, activation="sigmoid"))
model_tst.summary()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
model_tst.compile(
    loss=loss,
    optimizer=optimizer,
    # metrics=[tf.keras.metrics.BinaryCrossentropy()],
    metrics=["mse"],
)
# Generate step data
random_input = np.ones((11, 10, 5))
random_input[:, 8:, :] = 99
# Predictions
random_output2 = model_tst.predict(random_input, batch_size=1)[0, :, :].reshape(10, 2)
random_output3 = model_tst.predict(random_input, batch_size=10)[0, :, :].reshape(10, 2)
# Compare results
diff2 = random_output3 - random_output2
pd.DataFrame(diff2).T
-> The outputs are the same in this case.
I guess this has to do with how the calculations are dispatched to Apple silicon.
Any debugging steps I should try to resolve this problem?
Info:
I set up TensorFlow using the following steps: https://developer.apple.com/metal/tensorflow-plugin/
When running, I get output showing that the GPU plugin is being used.
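As background (not a fix): batched GPU execution can change the order of floating-point accumulation, which legitimately perturbs float32 results in the last bits. A NumPy-only sketch of that effect:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

# Sum the same values in two different orders, rounding to float32 each step.
forward = np.float32(0.0)
for v in x:
    forward += v
backward = np.float32(0.0)
for v in x[::-1]:
    backward += v

# The two sums are close but typically not bit-identical.
print(forward, backward, abs(float(forward) - float(backward)))
```

Differences far beyond this last-bit level, like the ones shown above, usually point at a real bug rather than accumulation order, so comparing against CPU-only output is a reasonable next step.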
I'm trying to implement the PyTorch custom layer grid_sampler (https://pytorch.org/docs/1.9.1/generated/torch.nn.functional.grid_sample.html) on the GPU. Both of its inputs, input and grid, can be 5-D. My implementation of encodeToCommandBuffer, the MLCustomLayer protocol function, is shown below. In my attempts so far, the values of both id<MTLTexture> input and id<MTLTexture> grid don't meet expectations. So I wonder: can MTLTexture be used to store a 5-D input tensor as an input to encodeToCommandBuffer? Or can anybody show me how to use MTLTexture correctly here? Thanks a lot!
- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer
                       inputs:(NSArray<id<MTLTexture>> *)inputs
                      outputs:(NSArray<id<MTLTexture>> *)outputs
                        error:(NSError * _Nullable *)error {
    NSLog(@"Dispatching to GPU");
    NSLog(@"inputs count %lu", (unsigned long)inputs.count);
    NSLog(@"outputs count %lu", (unsigned long)outputs.count);
    id<MTLComputeCommandEncoder> encoder = [commandBuffer
        computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(encoder != nil);
    id<MTLTexture> input = inputs[0];
    id<MTLTexture> grid = inputs[1];
    id<MTLTexture> output = outputs[0];
    NSLog(@"inputs shape %lu, %lu, %lu, %lu", (unsigned long)input.width,
          (unsigned long)input.height, (unsigned long)input.depth, (unsigned long)input.arrayLength);
    NSLog(@"grid shape %lu, %lu, %lu, %lu", (unsigned long)grid.width,
          (unsigned long)grid.height, (unsigned long)grid.depth, (unsigned long)grid.arrayLength);
    if (encoder) {
        [encoder setTexture:input atIndex:0];
        [encoder setTexture:grid atIndex:1];
        [encoder setTexture:output atIndex:2];
        NSUInteger wd = grid_sample_Pipeline.threadExecutionWidth;
        NSUInteger ht = grid_sample_Pipeline.maxTotalThreadsPerThreadgroup / wd;
        MTLSize threadsPerThreadgroup = MTLSizeMake(wd, ht, 1);
        MTLSize threadgroupsPerGrid = MTLSizeMake((input.width + wd - 1) / wd,
                                                  (input.height + ht - 1) / ht,
                                                  input.arrayLength);
        [encoder setComputePipelineState:grid_sample_Pipeline];
        [encoder dispatchThreadgroups:threadgroupsPerGrid threadsPerThreadgroup:threadsPerThreadgroup];
        [encoder endEncoding];
    } else {
        return NO;
    }
    *error = nil;
    return YES;
}
I am using the default HelloPhotogrammetry app you guys made: https://developer.apple.com/documentation/realitykit/creating_a_photogrammetry_command-line_app/
My system originally did not meet the spec requirements to run this command-line tool because of a GPU limitation. To solve this I bought the Apple-supported Blackmagic eGPU. Here is the error I get when I run it despite the eGPU: apply_selection_policy_once: prefer use of removable GPUs (via (null):GPUSelectionPolicy->preferRemovable)
I have deduced that the application running it needs this: https://developer.apple.com/documentation/bundleresources/information_property_list/gpuselectionpolicy
I tried modifying Terminal.plist with the updated value, but had no luck. I believe the command-line tool within Xcode needs the updated value; I need help with that aspect so the system will use the eGPU.
I did create a property list within the macOS app and added GPUSelectionPolicy with preferRemovable, and I am still getting the same error as above. Please advise.
Also, to note: I temporarily turned off Prefer External GPU for Terminal, and the Photogrammetry processing did run, but it was taking a while (over 30 minutes), so I ended up killing the task. Activity Monitor showed my internal GPU being used, not the eGPU I am trying to use. Previously, without the eGPU plugged in, I would get an error saying my specs did not meet the criteria, so it was interesting that it now considered my Mac qualified (which it technically was) but did the processing on the less powerful GPU.
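For reference, the documented way to express that policy in a bundle's Info.plist is a single key/value pair. (The tricky part for a command-line tool is that the key must live in the bundle actually hosting the process, which a bare executable does not have.)

```xml
<key>GPUSelectionPolicy</key>
<string>preferRemovable</string>
```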
I'm using my 2020 Mac mini with the M1 chip, and this is the first time I've tried to use it for convolutional neural network training.
So the problem is: I installed Python (3.8.12) using miniforge3 and TensorFlow following this instruction, but I'm still facing a GPU problem when training a 3D U-Net.
Here's part of my code; I'm hoping to receive some suggestions to fix this.
import tensorflow as tf
from tensorflow import keras
from tensorflow.python.client import device_lib  # needed by get_available_devices below
import json
import numpy as np
import pandas as pd
import nibabel as nib
import matplotlib.pyplot as plt
from tensorflow.keras import backend as K

# Check available devices
def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())
Metal device set to: Apple M1
['/device:CPU:0', '/device:GPU:0']
2022-02-09 11:52:55.468198: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-09 11:52:55.468885: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
X_norm_with_batch_dimension = np.expand_dims(X_norm, axis=0)
# tf.device('/device:GPU:0')  # Tried this line; doesn't work
# tf.debugging.set_log_device_placement(True)  # Tried this line; doesn't work
patch_pred = model.predict(X_norm_with_batch_dimension)
InvalidArgumentError: 2 root error(s) found.
(0) INVALID_ARGUMENT: CPU implementation of Conv3D currently only supports the NHWC tensor format.
[[node model/conv3d/Conv3D
(defined at /Users/mwshay/miniforge3/envs/tensor/lib/python3.8/site-packages/keras/layers/convolutional.py:231)
]] [[model/conv3d/Conv3D/_4]]
(1) INVALID_ARGUMENT: CPU implementation of Conv3D currently only supports the NHWC tensor format.
[[node model/conv3d/Conv3D
(defined at /Users/mwshay/miniforge3/envs/tensor/lib/python3.8/site-packages/keras/layers/convolutional.py:231) ]]
0 successful operations.
0 derived errors ignored.
The code runs on Google Colab but fails locally on the Mac mini in a Jupyter notebook. The NHWC tensor format error suggests the code is executing on the CPU instead of the GPU.
Is there any way to get TensorFlow to train the network on the GPU?
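Separately from forcing the GPU, the NHWC/NDHWC complaint can also surface when the input is channels-first. A NumPy-only sketch of moving a 5-D volume from channels-first (N, C, D, H, W) to the channels-last layout (N, D, H, W, C) that the CPU Conv3D kernel expects (the array here is a stand-in, not the poster's X_norm):

```python
import numpy as np

# A dummy batch: 1 volume, 2 channels, 16^3 voxels, channels-first.
x_first = np.zeros((1, 2, 16, 16, 16), dtype=np.float32)

# Move the channel axis to the end: (N, C, D, H, W) -> (N, D, H, W, C).
x_last = np.transpose(x_first, (0, 2, 3, 4, 1))
print(x_last.shape)  # (1, 16, 16, 16, 2)
```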
I need to build a model to add to my app and tried following the Apple docs here.
No luck, because I get an error that is discussed in this thread on the forum. I'm still not clear on why the error occurs and can't resolve it.
I wonder if CreateML inside Playgrounds is still supported at all. I tried the Create ML app available through the developer tools, but it just crashes my Mac. (It's a 2017 MBP; is it simply too underpowered for ML at this point? I'd think not, since I've recently built and trained relatively simple models with TensorFlow and Python on this machine, and the classifier I'm trying to make now is really simple and doesn't have a huge dataset.)
I am training a model using tensorflow-metal and model training (and the whole application) freezes up. The behavior is nondeterministic. I believe the problem is with Metal (1) because of the contents of the backtraces below, and (2) because when I run the same code on a machine with non-Metal TensorFlow (using a GPU), everything works fine.
I can't share my code publicly, but I would be willing to share it with an Apple engineer privately over email if that would help. It's hard to create a minimum reproduction example since my program is somewhat complex and the bug is nondeterministic. The bug does appear pretty reliably.
It looks like the problem might be in some Metal Performance Shaders init code.
The state of everything (backtraces, etc.) when the program freezes is attached.
Backtraces
I tried to install the dependencies for a project with PDM and found that TensorFlow does not detect the GPU of the M1 Pro. When I create a virtual environment with Poetry, I do not have the same problem.
Any hint on how to solve this?
We are developing a simple GAN, and when training it, the convergence behavior of the discriminator differs when we use the GPU compared with CPU-only execution or even running in Colab.
We've read a lot, but this is the only post that seems to describe similar behavior.
Unfortunately, after updating to version 0.4 the problem persists.
My hardware/software: MacBook Pro, model MacBookPro18,2. Chip: Apple M1 Max. Cores: 10 (8 performance and 2 efficiency). Memory: 64 GB. Firmware: 7459.101.3. OS: Monterey 12.3.1. OS version: 7459.101.3.
Python version 3.8; the most relevant libraries from !pip freeze:
keras==2.8.0 Keras-Preprocessing==1.1.2 .... tensorboard==2.8.0 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorflow-datasets==4.5.2 tensorflow-docs @ git+https://github.com/tensorflow/docs@7d5ea2e986a4eae7573be3face00b3cccd4b8b8b tensorflow-macos==2.8.0 tensorflow-metadata==1.7.0 tensorflow-metal==0.4.0
##### CODE TO REPRODUCE ##### The code does not fit in this message's size limit, so I've shared a Google Colab notebook at:
https://colab.research.google.com/drive/1oDS8EV0eP6kToUYJuxHf5WCZlRL0Ypgn?usp=sharing
You can easily see that the loss goes to 0 after 1 or 2 epochs when the GPU is enabled, but if the GPU is disabled everything is OK.
I'm trying to run a Python file in VS Code but I get the error mentioned in the title. I'm basically trying to train a deep learning model, importing libraries like TensorFlow, NumPy, pandas, matplotlib, etc. The only output is "Illegal instruction: 4", nothing else. One more thing: the same code works fine on Windows. Please help.
Hello everyone
I found a problem with the TensorFlow built-in function tf.signal.stft.
When I run the code below, it raises an error.
The device is a MacBook Pro with an M1 Pro chip, running in JupyterLab.
However, the problem does not occur on Linux with CUDA.
Does anyone know how to fix the problem ?
Thanks.
code:
import numpy as np
import tensorflow as tf
random_waveform = np.random.normal(size=(16000))
tf_waveform = tf.constant(random_waveform)
tf_stft_waveform = tf.signal.stft(tf_waveform, frame_length=255, frame_step=128)
error message:
InvalidArgumentError Traceback (most recent call last)
Input In [1], in <cell line: 6>()
4 random_waveform = np.random.normal(size=(16000))
5 tf_waveform = tf.constant(random_waveform)
----> 6 tf_stft_waveform = tf.signal.stft(tf_waveform, frame_length=255, frame_step=128)
File ~/miniconda3/envs/AI/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
151 except Exception as e:
152 filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153 raise e.with_traceback(filtered_tb) from None
154 finally:
155 del filtered_tb
File ~/miniconda3/envs/AI/lib/python3.9/site-packages/tensorflow/python/framework/ops.py:7164, in raise_from_not_ok_status(e, name)
7162 def raise_from_not_ok_status(e, name):
7163 e.message += (" name: " + name if name is not None else "")
-> 7164 raise core._status_to_exception(e) from None
InvalidArgumentError: Multiple Default OpKernel registrations match NodeDef '{{node ZerosLike}}': 'op: "ZerosLike" device_type: "DEFAULT" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "y"' and 'op: "ZerosLike" device_type: "DEFAULT" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "y"' [Op:ZerosLike]
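As a CPU-side workaround while debugging, the same framing can be sketched in NumPy. This is a rough equivalent, not bit-identical to tf.signal.stft (which also pads the FFT length up to a power of two by default):

```python
import numpy as np

def stft_np(x, frame_length=255, frame_step=128):
    """Naive STFT: split into overlapping frames, window, then real FFT."""
    n_frames = 1 + (len(x) - frame_length) // frame_step
    frames = np.stack([x[i * frame_step : i * frame_step + frame_length]
                       for i in range(n_frames)])
    window = np.hanning(frame_length)  # tf.signal.stft defaults to a Hann window
    return np.fft.rfft(frames * window, axis=-1)

waveform = np.random.default_rng(0).normal(size=16000)
spec = stft_np(waveform)
print(spec.shape)  # (124, 128): 124 frames, 128 rFFT bins per length-255 frame
```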
We use several CoreML models in our Swift application. The memory footprint of these CoreML models ranges from 15 kB to 3.5 MB according to the Xcode CoreML utility tool. We observe a huge difference in loading time depending on the compute units selected to run the model.
Here is a small sample code used to load the model:
let configuration = MLModelConfiguration()
// Here I use the .all compute units mode:
configuration.computeUnits = .all
let myModel = try! myCoremlModel(configuration: configuration).model
Here are the profiling results of this sample code for different model sizes as a function of the targeted compute units:
Model-3.5-MB :
computeUnits is .cpuAndGPU: 188 ms ⇒ 18 MB/s
computeUnits is .all or .cpuAndNeuralEngine on iOS16: 4000 ms ⇒ 875 kB/s
Model-2.6-MB:
computeUnits is .cpuAndGPU: 144 ms ⇒ 18 MB/s
computeUnits is .all or .cpuAndNeuralEngine on iOS16: 1300 ms ⇒ 2 MB/s
Model-15-kB:
computeUnits is .cpuAndGPU: 18 ms ⇒ 833 kB/s
computeUnits is .all or .cpuAndNeuralEngine on iOS16: 700 ms ⇒ 22 kB/s
What explains the difference in loading time across computeUnits modes? Is there a way to reduce the loading time of the models when using the .all or .cpuAndNeuralEngine computeUnits modes?
We use dynamic input sizes for some use cases. When the compute unit mode is .all, there is a strong difference in execution time if the dynamic input shape doesn't match the optimal shape. If we set the model's optimal input shape to 896x896 but run it with an input shape of 1024x768, execution is almost twice as slow as with 896x896.
For example, a model configured with a preferred input shape of 896x896 can achieve inference in 66 ms when the input shape is 896x896, but only 117 ms when the input shape is 1024x768.
In that case, to achieve the best inference performance we would need to switch between models depending on the input shape, which is not dynamic at all and is memory-hungry. Is there a way to reduce the execution time for shapes outside the preferred shape range?
I'm trying to run sample code for MPS graph, which I got here: https://developer.apple.com/documentation/metalperformanceshadersgraph/adding_custom_functions_to_a_shader_graph
And it's not working. It builds successfully, but after you press train (the play button), the program fails right after the first training iteration with errors like this:
-[MTLDebugCommandBuffer lockPurgeableObjects]:2103: failed assertion `MTLResource 0x600001693940 (label: (null)), referenced in cmd buffer 0x124015800 (label: (null)) is in volatile or empty purgeable state at commit'
It fails on commandBuffer.commit() in the runTrainingIterationBatch() method.
It's as if something had already committed the command buffer (I checked, and yes, it is already committed). But why would that happen in example code?
I tried wrapping the commit in a command-buffer status check, which avoids the crash, but then the program behaves incorrectly overall and doesn't compute the loss well.
Making things worse, the documentation for MPS Graph is empty! It contains only class and method names without any descriptions.
My env:
Xcode 13.4.1 (13F100)
macOS 12.4
MacBook Pro (m1 pro) 14' 2021 16gb
I tried building for an iPhone 12 Pro Max on iOS 15.5 and as a Mac Catalyst application, and got the same error everywhere.
The documentation for MPS Graph has no information about class and method functionality. It only enumerates everything without any explanation of what each thing is or how it works. Why?
https://developer.apple.com/documentation/metalperformanceshadersgraph
However, the MPSGraph header files do contain some comments, so this seems like a documentation bug.