Post not yet marked as solved
Hello! I’m having an issue with retrieving the trained weights from MLCLSTMLayer in ML Compute when training on a GPU. I maintain references to the input-weights, hidden-weights, and biases tensors and use the following code to extract the data post-training:
extension MLCTensor {
    func dataArray<Scalar>(as _: Scalar.Type) throws -> [Scalar] where Scalar: Numeric {
        let count = self.descriptor.shape.reduce(into: 1) { (result, value) in
            result *= value
        }
        var array = [Scalar](repeating: 0, count: count)
        self.synchronizeData() // This *should* copy the latest data from the GPU to memory that’s accessible by the CPU
        _ = try array.withUnsafeMutableBytes { (pointer) -> Int in
            guard let data = self.data else {
                throw DataError.uninitialized // A custom error that I declare elsewhere
            }
            return data.copyBytes(to: pointer)
        }
        return array
    }
}
The issue is that when I call dataArray(as:) on a weights or biases tensor for an LSTM layer that has been trained on a GPU, the values that it retrieves are the same as they were before training began. For instance, if I initialize the biases all to 0 and then train the LSTM layer on a GPU, the biases values seemingly remain 0 post-training, even though the reported loss values decrease as you would expect.
This issue does not occur when training an LSTM layer on a CPU, and it also does not occur when training a fully-connected layer on a GPU. Since both types of layers work properly on a CPU but only MLCFullyConnectedLayer works properly on a GPU, it seems that the issue is a bug in ML Compute’s GPU implementation of MLCLSTMLayer specifically.
For reference, I’m testing my code on an M1 Max.
Am I doing something wrong, or is this an actual bug that I should report in Feedback Assistant?
Post not yet marked as solved
The project is based on Python 3.8 and 3.9 and contains some C and C++ source.
How can I do parallel computing on the CPU and GPU of an M1 Max?
Indeed, I bought the M1 Max Mac for its strong GPU to do quantitative finance, where speed is extremely important. Unfortunately, CUDA is not compatible with the Mac.
Show me how to do it, thanks.
Can Accelerate (for the CPU) and Metal (for the GPU) speed up any source by building like this:
Step 1: Download the source from GitHub.
Step 2: Create a file named "site.cfg" in the source directory and add the content: [accelerate] libraries=Metal, Accelerate, vecLib
Step 3: In Terminal: NPY_LAPACK_Order=accelerate python3 setup.py build
Step 4: pip3 install . or python3 setup.py install? (I am not sure which method to apply.)
2. How compatible is this method? I need to speed up numpy, pandas, and even an open-source project such as https://github.com/microsoft/qlib
3. Just show me the code.
4. When compiling the C and C++ source, a lot of errors were reported. Which gcc and g++ should I choose? The default gcc installed by brew is 4.2.1, which does not work, and I even tried downloading gcc from the official ARM website, still without success. Give me a hint.
Thanks so much. This is urgent.
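As a quick sanity check after such a build, one can ask NumPy which BLAS/LAPACK libraries it was linked against (a generic sketch; it does not assume the Accelerate build succeeded):

```python
import numpy as np

# Print the build configuration; an Accelerate-backed build mentions
# "accelerate" or "vecLib" in the BLAS/LAPACK sections.
np.show_config()

# Regardless of backend, verify that linear algebra produces correct results.
a = np.array([[2.0, 0.0], [0.0, 3.0]])
b = np.array([4.0, 9.0])
x = np.linalg.solve(a, b)
print(x)  # [2. 3.]
```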
Post not yet marked as solved
So I've read the documentation, downloaded the Accelerate source, and created a simple example.
I'm attempting to solve a system of two equations,
90x+85y=400, and
y-x=0.
The result should be just greater than 2.25 for both x and y. What I get is [x,y]=[2.2857144, 205.7143].
I'm new to this, so I'm sure I've misread the docs, but I can't see where.
Here is the code I modified to do my experiment.
do {
    let aValues: [Float] = [85, 90,
                            1, -1]
    /// The _b_ in _Ax = b_.
    let bValues: [Float] = [400, 0]
    /// Call `nonsymmetric_general` to compute the _x_ in _Ax = b_.
    let x = nonsymmetric_general(a: aValues,
                                 dimension: 2,
                                 b: bValues,
                                 rightHandSideCount: 1)
    /// Calculate _b_ using the computed _x_.
    if let x = x {
        let b = matrixVectorMultiply(matrix: aValues,
                                     dimension: (m: 2, n: 2),
                                     vector: x)
        /// Prints _b_ in _Ax = b_ using the computed _x_: `~[70, 160, 250]`.
        print("\nx = ", x)
        print("\nb =", b)
    }
}
What did I misunderstand?
Thanks
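A possible explanation, sketched with NumPy rather than the Accelerate sparse API: the intended matrix gives x = y = 400/175 ≈ 2.2857, while reading the flat array [85, 90, 1, -1] in column-major order yields a different matrix that reproduces the exact [2.2857144, 205.7143] output reported above, suggesting the coefficients were laid out in the wrong order for the solver.

```python
import numpy as np

# Intended system: 90x + 85y = 400 and -x + y = 0.
a_intended = np.array([[90.0, 85.0], [-1.0, 1.0]])
b = np.array([400.0, 0.0])
print(np.linalg.solve(a_intended, b))  # ~[2.2857, 2.2857]

# Interpreting the flat array [85, 90, 1, -1] in column-major order
# gives the matrix [[85, 1], [90, -1]] -- and reproduces the puzzling result.
a_column_major = np.array([85.0, 90.0, 1.0, -1.0]).reshape(2, 2, order="F")
print(np.linalg.solve(a_column_major, b))  # ~[2.2857, 205.7143]
```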
Post not yet marked as solved
I am running a test model on my MBP M1 Pro and the GPU clock speed never goes above ~450 MHz (GPU cores at 100%). Using other apps that peg the GPU, I can see the clock speed is about 1.3 GHz.
Is this an issue with tf-metal, or am I doing something wrong?
Post not yet marked as solved
Can I run inference on the new MacBook Pro with the M1 chip (Apple Silicon) using Keras models (and sometimes PyTorch)? These would be computer vision models; some might have custom loss functions or metrics, and they would have been trained on, let's say, Google Colab.
If I can perform inference, how do I do that?
Also, will the Neural Engine help while performing inference, and will it boost training if I have to train on the Mac?
Post not yet marked as solved
My MacBook Pro is heating up: the left-side cores are overloaded while the right-side cores are underutilised.
Post not yet marked as solved
Hi all, I've spent some time experimenting with the BNNS (Accelerate) LSTM-related APIs lately, and despite a distinct lack of documentation (though the headers contain quite a bit), I got most things to the point where I think I know what's going on and I get the expected results.
However, one thing I have not been able to do is get this working when inputSize != hiddenSize.
I am currently only concerned with a simple unidirectional LSTM with a single layer, but none of my permutations of the gate "iw_desc" matrices with various 2D layouts and input-size/hidden-size reorderings made any difference; ultimately, BNNSDirectApplyLSTMBatchTrainingCaching always returns -1 as an indication of error.
Any help would be greatly appreciated.
PS: The bnns.h framework header claims that "When a parameter is invalid or an internal error occurs, an error message will be logged. Some combinations of parameters may not be supported. In that case, an info message will be logged." And yet, I've not been able to find any such messages logged via NSLog(), stderr, or Console. Is there a magic environment variable I need to set to get more verbose logging?
Post not yet marked as solved
Hi Apple devs, can you confirm whether macOS 10.13 supports Core ML inference on the GPU?
I ran a test on macOS 10.13.6 and found that my Core ML model infers using only the CPU (BNNS), with no GPU involved.
Can anyone confirm this and give me an answer?
Thanks
Post not yet marked as solved
I tried Create ML to train on the MNIST dataset, which has very small images of the digits 0-9. It's the first time I've used Create ML, but its training speed is still far too slow given that, from what I've learnt, MNIST is a very small dataset.
I am using a 16-inch MacBook Pro 2021 with an M1 Pro, 16 GB RAM, and a 1 TB SSD.
I checked Activity Monitor and saw that CPU usage reaches 100%.
14 of 16 GB of memory are used, with 2 GB for cache and 12.5 GB of swap. Memory used by the MLRecipeExecutionService process is 19.55 GB; double-clicking for details shows a Virtual Memory Size of 410 GB.
I ran sudo powermetrics and observed that GPU power is ~50-60 mW, which means the GPU is not being used for training.
When I checked Disk usage in Activity Monitor, I saw that the MLRecipeExecutionService process had contributed 1.1 TB of Bytes Written. The entire MNIST dataset is only 17.5 MB.
I don't understand why it's so slow and why so many resources are used. Based on what I've learnt about machine learning, this is irregular.
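A quick back-of-the-envelope calculation using the figures above shows just how extreme the disk write amplification is:

```python
# Figures reported above: ~1.1 TB written vs. a 17.5 MB dataset.
bytes_written = 1.1e12   # 1.1 TB
dataset_size = 17.5e6    # 17.5 MB

amplification = bytes_written / dataset_size
print(f"~{amplification:,.0f}x the dataset size written to disk")  # ~62,857x
```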
Post not yet marked as solved
I am trying to train a pretrained network via transfer learning in Create ML with approximately 4,500 images with bounding boxes. Create ML stops after 3,700 iterations, allocates all available memory, and does nothing. If I pause it, Create ML releases the RAM, and after that I can resume training. I have a Mac mini (64 GB RAM, 6-core i5), an eGPU Radeon 580 8 GB, Big Sur, and Create ML 3.0 (78.5).
Post not yet marked as solved
Hello everyone,
We are GPU developers who use PTX / GCN / RDNA ISA to develop our software.
Is there any assembly-level reference available for the Neural Engine and GPU, so we can write our own device code and get it built and running?
Not sure if this is a normal request. I understand there are high-level libraries such as FFT / BLAS / Accelerate available, but we need to go lower in order to implement our own technology features and solve some specific problems before rolling out our product for the Mac OS X platform.
Post not yet marked as solved
Not long ago I updated to Xcode 13.0 (13A233), and I have noticed that when I do a clean or build of my project the processor goes to 100% on all cores. I have a 2.3 GHz 8-core Intel Core i9 processor.
Post not yet marked as solved
I have a model I developed in TensorFlow 2.3 and then converted to an MLModel with coremltools. I also reduced it to fp16. The model works great on most iOS devices, but on a few, particularly the iPhone 11 Pro Max (A2218), it gives a NaN error on the Neural Engine; if the same model is run on the CPU/GPU, there are no issues. I also tried the fp32 version of the model, with the same result: NaN on the Neural Engine but working great on the CPU/GPU. Thoughts or suggestions?
Hi, I would love to code with the Neural Engine on my MacBook Pro M1 2020.
Is there any low-level API to create my very own workloads?
I am working with audio and MIDI, as well as sound synthesis and mixing.
Can I use the Neural Engine to offload the CPU? I am especially interested in parallelism using threads.
My programming languages of choice are ANSI C and Objective-C.
Post not yet marked as solved
I would like to generate and run an ML program inside an app.
I am familiar with coremltools and the MIL format; however, I can't seem to find any resources on how to generate mlmodel/mlpackage files using Swift on the device.
Is there any Swift equivalent of coremltools? Or is there a way to translate the MIL description of an ML program into an instance of MLModel, or something similar?
Post not yet marked as solved
Using an LSTM model for finance predictions, I found these benchmark results:
TF 2.7 GPU - 188 seconds (tensorflow-metal 0.1.2)
TF 2.5 GPU - 149 seconds (tensorflow-metal 0.1.2)
TF 2.5 CPU - 6.91 seconds
TF 2.5 CPU - 4.66 seconds (with disable_eager_execution())
TF 2.7 CPU - 4.47 seconds
The GPU slowness is expected due to the small batch size.
So TF 2.7 (master) is about 4% faster using the CPU. The Metal plugin is far slower with TF 2.7, but at least it works to enable the GPU.
Apple should make the sources for tensorflow-metal available on GitHub and keep them updated regularly for each main TF release (currently 2.6).
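The ~4% figure can be checked with a quick calculation from the best CPU timings above:

```python
# Best CPU epoch times reported above (seconds).
tf25_cpu = 4.66  # TF 2.5 with disable_eager_execution()
tf27_cpu = 4.47  # TF 2.7 (master)

speedup_pct = (tf25_cpu - tf27_cpu) / tf25_cpu * 100
print(f"TF 2.7 CPU is ~{speedup_pct:.1f}% faster")  # ~4.1%
```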
Post not yet marked as solved
Can you please publish the sources for tensorflow-metal on GitHub? Thanks.
Post not yet marked as solved
My M1 MBP is not using the full GPU power when running TensorFlow. It's taking about 6 seconds per epoch, while others get 1 second per epoch for the same task.
When running the training, the Mac doesn't get hot and the fans don't rev.
Any advice?
Thanks, Logan