Problem
I am trying to use the jax.numpy.einsum function (https://jax.readthedocs.io/en/latest/_autosummary/jax.numpy.einsum.html). However, for some subscripts, this seems to fail.
Hardware
Apple M1 Max, 32GB RAM
Steps to Reproduce
Follow the installation steps from https://developer.apple.com/metal/jax/:
conda create -n 'jax_metal_demo' python=3.11
conda activate jax_metal_demo
python -m pip install numpy wheel ml-dtypes==0.2.0
python -m pip install jax-metal
Save the following code in a file called minimal_example.py
import numpy as np
from jax import device_put
import jax.numpy as jnp
np.random.seed(0)
a = np.random.rand(11, 12, 13, 11, 12)
b = np.random.rand(11, 12, 13)
subscripts = 'ijklm,ijk->lmk'
# intended result
print(np.einsum(subscripts, a, b))
# will cause crash
a, b = device_put(a), device_put(b)
print(jnp.einsum(subscripts, a, b))
Run the code:
python minimal_example.py
Output
I was expecting the intended result from the NumPy call; instead I got:
Platform 'METAL' is experimental and not all JAX functionality may be correctly supported!
2024-02-12 16:45:34.684973: W pjrt_plugin/src/mps_client.cc:563] WARNING: JAX Apple GPU support is experimental and not all JAX functionality is correctly supported!
Metal device set to: Apple M1 Max
systemMemory: 32.00 GB
maxCacheSize: 10.67 GB
Traceback (most recent call last):
File "/Users/linus/workspace/minimal_example.py", line 15, in <module>
print(jnp.einsum(subscripts, a, b))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/linus/miniforge3/envs/jax_metal_demo/lib/python3.11/site-packages/jax/_src/numpy/lax_numpy.py", line 3369, in einsum
return _einsum_computation(operands, contractions, precision, # type: ignore[operator]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/linus/miniforge3/envs/jax_metal_demo/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: UNKNOWN: /Users/linus/workspace/minimal_example.py:15:6: error: failed to legalize operation 'mhlo.dot_general'
print(jnp.einsum(subscripts, a, b))
^
/Users/linus/workspace/minimal_example.py:15:6: note: see current operation: %0 = "mhlo.dot_general"(%arg1, %arg0) {dot_dimension_numbers = #mhlo.dot<lhs_batching_dimensions = [2], rhs_batching_dimensions = [2], lhs_contracting_dimensions = [0, 1], rhs_contracting_dimensions = [0, 1]>, precision_config = [#mhlo<precision DEFAULT>, #mhlo<precision DEFAULT>]} : (tensor<11x12x13xf32>, tensor<11x12x13x11x12xf32>) -> tensor<13x11x12xf32>
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
Conclusion
I would greatly appreciate any ideas for workarounds.
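One workaround sketch for this particular contraction, under the assumption that the failure is specific to the mhlo.dot_general lowering: rewrite the einsum as a broadcasted multiply plus a reduction, which avoids that op entirely. einsum_ijklm_ijk_to_lmk is a hypothetical helper written only for the subscripts above:
import jax.numpy as jnp

def einsum_ijklm_ijk_to_lmk(a, b):
    # result[l, m, k] = sum over i, j of a[i, j, k, l, m] * b[i, j, k]
    out = jnp.sum(a * b[:, :, :, None, None], axis=(0, 1))  # shape (k, l, m)
    return jnp.transpose(out, (1, 2, 0))                    # reorder to (l, m, k)
On the CPU this agrees with np.einsum(subscripts, a, b); note that the broadcasted product materializes an array the size of a, so it trades memory for compatibility.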
tensorflow-metal
TensorFlow accelerates machine learning model training with Metal on Mac GPUs.
Posts under the tensorflow-metal tag:
MacBook Pro M2 Max / 64 GB / macOS 13.2.1 (22D68)
import tensorflow as tf

def runMnist(device = '/device:CPU:0'):
    with tf.device(device):
        #tf.config.set_default_device(device)
        mnist = tf.keras.datasets.mnist
        (x_train, y_train), (x_test, y_test) = mnist.load_data()
        x_train, x_test = x_train / 255.0, x_test / 255.0
        model = tf.keras.models.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10)
        ])
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        model.compile(optimizer='adam',
                      loss=loss_fn,
                      metrics=['accuracy'])
        model.fit(x_train, y_train, epochs=10)

runMnist(device = '/device:CPU:0')
runMnist(device = '/device:GPU:0')
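Before comparing the two runs, a small sanity-check sketch using standard TensorFlow APIs (not part of the original post) can confirm that a Metal GPU is actually visible and show where each op executes:
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))  # expect one METAL PluggableDevice
tf.debugging.set_log_device_placement(True)    # log the device chosen for each op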
Hi, I have an issue with jax.numpy.linalg.inv(a).
import jax.numpy as jnp  # import added so the snippet is self-contained
import jax.numpy.linalg as jnpl

B = jnp.identity(2)
jnpl.inv(B)
Throws the following error:
XlaRuntimeError: UNKNOWN: /var/folders/pw/wk5rfkjj6qggqp8r8zb2bw8w0000gn/T/ipykernel_34334/2572982404.py:9:0: error: failed to legalize operation 'mhlo.triangular_solve'
/var/folders/pw/wk5rfkjj6qggqp8r8zb2bw8w0000gn/T/ipykernel_34334/2572982404.py:9:0: note: called from
/var/folders/pw/wk5rfkjj6qggqp8r8zb2bw8w0000gn/T/ipykernel_34334/2572982404.py:9:0: note: see current operation: %120 = "mhlo.triangular_solve"(%42#4, %119) {left_side = true, lower = true, transpose_a = #mhlo<transpose NO_TRANSPOSE>, unit_diagonal = true} : (tensor<2x2xf32>, tensor<2x2xf32>) -> tensor<2x2xf32>
Any ideas what could be the issue or how to solve it?
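One idea, assuming the CPU backend is still available alongside Metal: commit the operands to the CPU so the triangular solve runs there instead of failing to legalize on the GPU. A minimal sketch:
import jax
import jax.numpy as jnp

cpu = jax.devices('cpu')[0]               # assumes the CPU backend is present
B = jax.device_put(jnp.identity(2), cpu)  # commit B to the CPU
print(jnp.linalg.inv(B))                  # the solve now runs on the CPU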
I haven't used the GPU implementation for over a year now due to constant issues (I use tf.config.set_visible_devices([], 'GPU') to use the CPU only).
I have also had a couple of issues with model convergence using GPU, however this issue seems more prominent, and possibly unrelated.
Here is an example of code that causes a memory leak using the GPU (I cannot link the dataset, but it is called Text classification documentation, by TANISHQ DUBLISH, on Kaggle).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
df = pd.read_csv('df_file.csv')
df.head()
train_df = df.sample(frac=0.7, random_state=42)
val_df = df.drop(train_df.index).sample(frac=0.5, random_state=42)
test_df = df.drop(train_df.index).drop(val_df.index)
train_dataset = tf.data.Dataset.from_tensor_slices((train_df['Text'].values, train_df['Label'].values)).batch(32).prefetch(tf.data.AUTOTUNE)
val_dataset = tf.data.Dataset.from_tensor_slices((val_df['Text'].values, val_df['Label'].values)).batch(32).prefetch(tf.data.AUTOTUNE)
test_dataset = tf.data.Dataset.from_tensor_slices((test_df['Text'].values, test_df['Label'].values)).batch(32).prefetch(tf.data.AUTOTUNE)
text_vectorizer = tf.keras.layers.TextVectorization(max_tokens=100_000, output_mode='int', output_sequence_length=1000, pad_to_max_tokens=True)
text_vectorizer.adapt(train_df['Text'].values)
embedding = tf.keras.layers.Embedding(input_dim=len(text_vectorizer.get_vocabulary()), output_dim=128, input_length=1000)
inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
x = tf.keras.layers.LSTM(64)(x)
outputs = tf.keras.layers.Dense(5, activation='softmax')(x)
model_2 = tf.keras.Model(inputs, outputs, name='model_2_lstm')
model_2.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(), optimizer=tf.keras.optimizers.legacy.Adam(), metrics=['accuracy'])
model_2_history = model_2.fit(train_dataset, epochs=50, validation_data=val_dataset, callbacks=[
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint(model_2.name, save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', patience=5, verbose=1)
])
On an Apple M1 with Ventura 13.6.
I followed the steps on the Get started with tensorflow-metal page here:
https://developer.apple.com/metal/tensorflow-plugin/
python3 -m venv ~/venv-metal
source ~/venv-metal/bin/activate
python -m pip install -U pip
python -m pip install tensorflow
python -m pip install tensorflow-metal
With a clean start I also tried pinning the version:
python -m pip install tensorflow==2.13.0
where the install reported: Successfully installed tensorflow-metal-1.0.0
The table here suggested this should work.
https://pypi.org/project/tensorflow-metal/
But I got the same error...
Running Python code without the tensorflow import was not a problem. I found forums describing a similar error on M1 Macs, but none of the proposed solutions worked.
Are there suggested steps to get the Get Started tutorial working?
Kia ora,
Been having heaps of trouble recently trying to get TensorFlow working, it just suddenly stopped and the kernel would just crash every time I try to import tf.
I've tried just about everything eg. fresh install of python, reinstalling Xcode dev tools
Below are the relevant lines of pip freeze; using Python 3.10.13, btw.
tensorboard==2.15.1
tensorboard-data-server==0.7.2
tensorboard-plugin-wit==1.8.1
tensorflow==2.15.0
tensorflow-estimator==2.15.0
tensorflow-io-gcs-filesystem==0.34.0
tensorflow-macos==2.15.0
tensorflow-metal==0.5.0
Below is the cell in question that is killing the kernel
import tensorflow as tf
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dense, Flatten, InputLayer, BatchNormalization, Dropout
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers.legacy import Adam
I'll be around all day so if you have anything that can help, I'll be sure to give it a go as soon as you post it and get back to you!
Looking forward to your replies.
Nga mihi,
Kane
I did a clean install of Python (v. 3.10), then TensorFlow & tensorflow-metal, following exactly the process stated on Apple's plugin support page. Now, every time I run ANY Python code with TensorFlow, it crashes at the model.fit instruction. It does not matter what I feed into it, even code that used to run perfectly on my previous MacBook (Intel)... I've researched ad nauseam for answers, but Apple washes its hands, stating that it is TensorFlow's problem, and TensorFlow does the same. Fact is that exactly the same code runs flawlessly on my Windows NVIDIA PC setup.
I purchased the M3 laptop with the hope of being able to train my neural networks "on the go"... now I've lost $5,000 USD, I can't make it work, and it's a total disaster.
I am extremely competent in Python development and have been developing neural networks for years. So if you are going to comment, please avoid suggestions like "check your Python version" etc. This is DEFINITIVELY due to the M3 Mac; the exact same setup works fine on an M1 Ultra Mac Studio. It is just not portable...
Does anyone have any specific advice on how to make a proper setup of TensorFlow for the Mac M3?
Running grouped convolutions on an M2 with the metal plugin I get an error; example code below.
Using TF 2.11 and no metal plugin I get:
import tensorflow as tf
tf.keras.layers.Conv1D(5,1,padding="same", kernel_initializer="ones", groups=5)(tf.ones((1,1,5)))
# displays
<tf.Tensor: shape=(1, 1, 5), dtype=float32, numpy=array([[[1., 1., 1., 1., 1.]]], dtype=float32)>
On TF2.14 with the plugin I received
import tensorflow as tf
tf.keras.layers.Conv1D(5,1,padding="same", kernel_initializer="ones", groups=5)(tf.ones((1,1,5)))
# displays
...
NotFoundError: Exception encountered when calling layer 'conv1d_3' (type Conv1D).
could not find registered platform with id: 0x104d8f6f0 [Op:__inference__jit_compiled_convolution_op_78]
Call arguments received by layer 'conv1d_3' (type Conv1D):
• inputs=tf.Tensor(shape=(1, 1, 5), dtype=float32)
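A possible workaround sketch, under the assumption that only the jit-compiled grouped kernel is missing: emulate the grouped convolution by splitting the channel axis, running ungrouped Conv1D layers per group, and concatenating. grouped_conv1d is a hypothetical helper, not a Keras API:
import tensorflow as tf

def grouped_conv1d(x, filters, kernel_size, groups, **kwargs):
    # split the channel axis into groups and convolve each slice separately
    splits = tf.split(x, groups, axis=-1)
    convs = [tf.keras.layers.Conv1D(filters // groups, kernel_size, **kwargs)(s)
             for s in splits]
    return tf.concat(convs, axis=-1)

# reproduces the TF 2.11 result: [[[1., 1., 1., 1., 1.]]]
print(grouped_conv1d(tf.ones((1, 1, 5)), filters=5, kernel_size=1, groups=5,
                     padding="same", kernel_initializer="ones"))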
TensorFlow-Metal training gives an increasing loss in a CNN.
But the same code runs correctly after pip uninstall tensorflow-metal.
Hi,
there seems to be a difference in behavior when running inference on a trained Keras model using the model __call__ method vs. using the predict or predict_on_batch methods. This only happens when using the GPU for inference, and it seems that for certain sequences of operations and float types the 'relu' activation doesn't work as expected and appears to do nothing.
I can replicate the problem with the following code (it only fails with the 'relu' activation and the tf.float16 and tf.float32 types, while it works fine with tf.float64).
import tensorflow as tf
import numpy as np

DATA_LENGTH = 16
DENSE_WIDTH = 16
BATCH_SIZE = 8
DTYPE = tf.float32
ACTIVATION = 'relu'

def TestModel():
    inputs = tf.keras.Input(DATA_LENGTH, dtype=DTYPE)
    u = tf.keras.layers.Dense(DENSE_WIDTH, activation=ACTIVATION, dtype=DTYPE)(inputs)
    # u = tf.maximum(u, 0.0)
    output = u*tf.constant(1.0, dtype=DTYPE)
    model = tf.keras.Model(inputs, output, name="TestModel")
    return model

model = TestModel()
model.compile()
x = np.random.uniform(size=(BATCH_SIZE, DATA_LENGTH)).astype(DTYPE.as_numpy_dtype)

with tf.device('/GPU:0'):
    out_gpu_call = model(x, training=False)
    out_gpu_predict = model.predict_on_batch(x)

with tf.device('/CPU:0'):
    out_cpu_call = model(x, training=False)
    out_cpu_predict = model.predict_on_batch(x)

print(f'\nDTYPE {DTYPE}, ACTIVATION: {ACTIVATION}')
print("\tMean Abs. Difference GPU (__call__ vs. predict):", np.mean(np.abs(out_gpu_call - out_gpu_predict)))
print("\tMean Abs. Difference CPU (__call__ vs. predict):", np.mean(np.abs(out_cpu_call - out_cpu_predict)))
print("\tMean Abs. Difference GPU-CPU __call__:", np.mean(np.abs(out_gpu_call - out_cpu_call)))
print("\tMean Abs. Difference GPU-CPU predict():", np.mean(np.abs(out_gpu_predict - out_cpu_predict)))
The code above produces for example the following output:
DTYPE <dtype: 'float32'>, ACTIVATION: relu
Mean Abs. Difference GPU (__call__ vs. predict): 0.1955472
Mean Abs. Difference CPU (__call__ vs. predict): 0.0
Mean Abs. Difference GPU-CPU __call__: 1.3573299e-08
Mean Abs. Difference GPU-CPU predict(): 0.1955472
And the results for the GPU are:
out_gpu_call
<tf.Tensor: shape=(8, 16), dtype=float32, numpy=
array([[0.1496982 , 0. , 0. , 0.73772687, 0.26131183,
0.27757105, 0. , 0. , 0. , 0. ,
0. , 0.4164225 , 1.0367445 , 0. , 0.5860609 ,
0. ], ...
out_gpu_predict
array([[ 1.49698198e-01, -3.48425686e-01, -2.44667321e-01,
7.37726867e-01, 2.61311829e-01, 2.77571052e-01,
-2.26729304e-01, -1.06500387e-01, -3.66294265e-01,
-2.93850392e-01, -4.51043218e-01, 4.16422486e-01,
1.03674448e+00, -1.39347658e-01, 5.86060882e-01,
-2.05334812e-01], ...
Upon inspection of the results, it seems that the problem is that the 'relu' activation is not setting the values < 0 to 0 when calling predict_on_batch.
When uncommenting the # u = tf.maximum(u, 0.0) line after the Dense layer, there is no difference between the two calls (as should be expected).
It also happens that removing the multiplication by a constant after the Dense layer, output = u*tf.constant(1.0, dtype=DTYPE), makes the problem disappear (even when leaving the # u = tf.maximum(u, 0.0) line commented). A workaround sketch based on this observation follows the setup details below.
This is running with the following setup:
MacBook Pro, Apple M2 Max chip, macOS Sonoma 14.2
tf version 2.15.0
tensorflow-metal 1.1.0
Python 3.10.13
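As noted above, a minimal workaround sketch is to drop the fused 'relu' from the Dense layer and clamp explicitly, so predict_on_batch on the Metal GPU matches __call__. This reuses DATA_LENGTH, DENSE_WIDTH, and DTYPE from the snippet; TestModelWorkaround is a hypothetical variant, not part of the post:
def TestModelWorkaround():
    inputs = tf.keras.Input(DATA_LENGTH, dtype=DTYPE)
    u = tf.keras.layers.Dense(DENSE_WIDTH, dtype=DTYPE)(inputs)  # no activation here
    u = tf.maximum(u, 0.0)  # explicit relu clamp, per the observation above
    output = u * tf.constant(1.0, dtype=DTYPE)
    return tf.keras.Model(inputs, output, name="TestModelWorkaround")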
Hello
I use Mac Pro M2 16GB
This is my code. It is very basic:
from tensorflow.keras.models import Sequential  # imports assumed for this snippet
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(units=50, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, batch_size=16)

train_predict = model.predict(X_train)
test_predict = model.predict(X_test)
train_predict = scaler.inverse_transform(train_predict)
y_train = scaler.inverse_transform(y_train)
test_predict = scaler.inverse_transform(test_predict)
y_test = scaler.inverse_transform(y_test)
When I try to execute this code, Anaconda gives the following error:
I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M2
I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
I can't find any solution. Could you help me?
Thank you
Hello,
I got a brand new MacBook M3 Pro and am trying to configure TensorFlow with GPU support. I followed the instructions provided at https://developer.apple.com/metal/tensorflow-plugin/ step by step. Unfortunately, even after creating/recreating/installing/uninstalling TensorFlow, the problem is not getting resolved, as Python crashes. I cannot get past that point to try a Jupyter notebook. Here is the error as well as the versions in the "tf" environment. I already spent the entire Saturday on this and so far no progress. Can someone tell me what is going on?
Python 3.11.7 (main, Dec 4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
import tensorflow as tf
2024-01-07 11:44:04.893581: F tensorflow/c/experimental/stream_executor/stream_executor.cc:743] Non-OK-status: stream_executor::MultiPlatformManager::RegisterPlatform( std::move(cplatform)) status: INTERNAL: platform is already registered with name: "METAL"
[1] 1797 abort /opt/homebrew/bin/python3
❯ python -m pip list | grep tensorflow
tensorflow 2.15.0
tensorflow-estimator 2.15.0
tensorflow-io-gcs-filesystem 0.34.0
tensorflow-macos 2.15.0
tensorflow-metal 1.1.0
❯ python --version
Python 3.11.7
OS is Sonoma 14.2.1
Thanks
Sohail
Here is my environment:
python==3.9.0
tensorflow==2.9.0
os==Sonoma 14.2 (23C64)
Error:
Translated Report (Full Report Below)
Process: Python [10330]
Path: /Library/Frameworks/Python.framework/Versions/3.9/Resources/Python.app/Contents/MacOS/Python
Identifier: org.python.python
Version: 3.9.0 (3.9.0)
Code Type: X86-64 (Translated)
Parent Process: Python [8039]
Responsible: Terminal [779]
User ID: 501
Date/Time: 2023-12-30 22:31:38.4916 +0530
OS Version: macOS 14.2 (23C64)
Report Version: 12
Anonymous UUID: F7E462E7-6380-C3DA-E2EC-5CF01A61D195
Sleep/Wake UUID: 50F32A2D-8CFA-4117-8048-D9CF76E24F26
Time Awake Since Boot: 29000 seconds
Time Since Wake: 2193 seconds
System Integrity Protection: enabled
Notes:
PC register does not match crashing frame (0x0 vs 0x10CEDE6D9)
Crashed Thread: 0 Dispatch queue: com.apple.main-thread
Exception Type: EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Termination Reason: Namespace SIGNAL, Code 4 Illegal instruction: 4
Terminating Process: exc handler [10330]
Error Formulating Crash Report:
PC register does not match crashing frame (0x0 vs 0x10CEDE6D9)
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 _cpu_feature_guard.so 0x10cede6d9 _GLOBAL__sub_I_cpu_feature_guard.cc + 9
1 dyld 0x2026a3fca invocation function for block in dyld4::Loader::findAndRunAllInitializers(dyld4::RuntimeState&) const::$_0::operator()() const + 182
2 dyld 0x2026e5584 invocation function for block in dyld3::MachOAnalyzer::forEachInitializer(Diagnostics&, dyld3::MachOAnalyzer::VMAddrConverter const&, void (unsigned int) block_pointer, void const*) const + 133
3 dyld 0x2026d9913 invocation function for block in dyld3::MachOFile::forEachSection(void (dyld3::MachOFile::SectionInfo const&, bool, bool&) block_pointer) const + 543
4 dyld 0x20268707f dyld3::MachOFile::forEachLoadCommand(Diagnostics&, void (load_command const*, bool&) block_pointer) const + 249
5 dyld 0x2026d8adc dyld3::MachOFile::forEachSection(void (dyld3::MachOFile::SectionInfo const&, bool, bool&) block_pointer) const + 176
6 dyld 0x2026db104 dyld3::MachOFile::forEachInitializerPointerSection(Diagnostics&, void (unsigned int, unsigned int, bool&) block_pointer) const + 116
7 dyld 0x2026e52ba dyld3::MachOAnalyzer::forEachInitializer(Diagnostics&, dyld3::MachOAnalyzer::VMAddrConverter const&, void (unsigned int) block_pointer, void const*) const + 390
8 dyld 0x2026a0cfc dyld4::Loader::findAndRunAllInitializers(dyld4::RuntimeState&) const + 222
9 dyld 0x2026a65cb dyld4::JustInTimeLoader::runInitializers(dyld4::RuntimeState&) const + 21
10 dyld 0x2026a0ef1 dyld4::Loader::runInitializersBottomUp(dyld4::RuntimeState&, dyld3::Array<dyld4::Loader const*>&) const + 181
11 dyld 0x2026a4040 dyld4::Loader::runInitializersBottomUpPlusUpwardLinks(dyld4::RuntimeState&) const::$_1::operator()() const + 98
12 dyld 0x2026a0f87 dyld4::Loader::runInitializersBottomUpPlusUpwardLinks(dyld4::RuntimeState&) const + 93
13 dyld 0x2026bdc65 dyld4::APIs::dlopen_from(char const*, int, void*) + 935
14 _ctypes.cpython-39-darwin.so 0x10ac20962 py_dl_open + 162
15 Python 0x10b444f2d cfunction_call + 125
16 Python 0x10b40625d _PyObject_MakeTpCall + 365
17 Python 0x10b4dc8fc call_function + 876
18 Python 0x10b4d9e2b _PyEval_EvalFrameDefault + 25371
19 Python 0x10b4dd563 _PyEval_EvalCode + 2611
20 Python 0x10b4069b1 _PyFunction_Vectorcall + 289
21 Python 0x10b4060b5 _PyObject_FastCallDictTstate + 293
22 Python 0x10b406c98 _PyObject_Call_Prepend + 152
23 Python 0x10b4601e5 slot_tp_init + 165
24 Python 0x10b45b699 type_call + 345
...
Thread 1:: com.apple.rosetta.exceptionserver
0 runtime 0x7ff7fffaf294 0x7ff7fffab000 + 17044
Thread 2:: /Reaper
0 ??? 0x7ff8aa35ea78 ???
1 libsystem_kernel.dylib 0x7ff819da46fa kevent + 10
2 libzmq.5.dylib 0x10bf038f6 zmq::kqueue_t::loop() + 278
3 libzmq.5.dylib 0x10bf31a59 zmq::worker_poller_base_t::worker_routine(void) + 25
4 libzmq.5.dylib 0x10bf7854c thread_routine(void*) + 300
5 libsystem_pthread.dylib 0x7ff819ddf202 _pthread_start + 99
6 libsystem_pthread.dylib 0x7ff819ddabab thread_start + 15
Thread 3:: /0
0 ??? 0x7ff8aa35ea78 ???
1 libsystem_kernel.dylib 0x7ff819da46fa kevent + 10
2 libzmq.5.dylib 0x10bf038f6 zmq::kqueue_t::loop() + 278
3 libzmq.5.dylib 0x10bf31a59 zmq::worker_poller_base_t::worker_routine(void) + 25
4 libzmq.5.dylib 0x10bf7854c thread_routine(void*) + 300
5 libsystem_pthread.dylib 0x7ff819ddf202 _pthread_start + 99
6 libsystem_pthread.dylib 0x7ff819ddabab thread_start + 15
Thread 4:
0 ??? 0x7ff8aa35ea78 ???
1 libsystem_kernel.dylib 0x7ff819da46fa kevent + 10
2 select.cpython-39-darwin.so 0x10ab95dc3 select_kqueue_control + 915
3 Python 0x10b40f11f method_vectorcall_FASTCALL + 335
4 Python 0x10b4dc86c call_function + 732
5 Python 0x10b4d9d72 _PyEval_EvalFrameDefault + 25186
6 Python 0x10b4dd563 _PyEval_EvalCode + 2611
7 Python 0x10b4069b1 _PyFunction_Vectorcall + 289
...
I'm using an anaconda environment
Tensorflow-macos 2.15
Keras 2.15
Python 3.11.5
macOS m2 14.1
I guess the problem is with PyCharm, because the code works; the error is: Cannot find reference 'keras' in 'imported module tensorflow | __init__.py'.
Previously I built a model on simple MNIST and it works, but I have the same problem there.
I have tried different references and versions of Python. I've changed environments at least 3 times and it doesn't work.
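A commonly suggested import style that PyCharm's static analysis tends to resolve (a sketch; whether it silences the warning depends on the TF/Keras versions installed):
from tensorflow import keras         # instead of referencing tensorflow.keras directly
from tensorflow.keras import layers  # may still be flagged on some versions

model = keras.Sequential([layers.Dense(10)])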
Hello,
I followed the instructions provided here: https://developer.apple.com/metal/tensorflow-plugin/ and while trying to run the example I am getting following error:
NotFoundError: dlopen(/Users/nedimhadzic/venv-metal/lib/python3.11/site-packages/tensorflow-plugins/libmetal_plugin.dylib, 0x0006): Symbol not found: __ZN10tensorflow16TensorShapeProtoC1ERKS0_
Referenced from: <C62E0AB4-567E-3E14-8F96-9F07A746C4DC> /Users/nedimhadzic/venv-metal/lib/python3.11/site-packages/tensorflow-plugins/libmetal_plugin.dylib
Expected in: <FFF31651-3926-3E79-A442-143B7156FB13> /Users/nedimhadzic/venv-metal/lib/python3.11/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
tensorflow: 2.15.0
tensorflow-metal: 1.0.0
macos: 14.2.1
Intel CPU and AMD Radeon Pro 5500M
Any idea?
Regards,
Nedim
Hi,
I think many of us would love to be able to use our GPUs for JAX on the new Apple Silicon devices, but currently the jax-metal plugin is, for all intents and purposes, broken. Is it still under active development? Is there a planned release for a new version?
thanks!
Running the sample Python keras-ocr example on M3 Max returns incorrect results if tensorflow-metal is installed.
Code Example: https://keras-ocr.readthedocs.io/en/latest/examples/using_pretrained_models.html
Note: https://upload.wikimedia.org/wikipedia/commons/e/e8/FseeG2QeLXo.jpg not found. Line commented out.
Without tensorflow-metal (Correct results):
['toodstande', 's', 'somme', 'srny', 'squadron', 'ds', 'quentn', 'snhnen', 'bnpnone', 'sasne', 'taing', 'yeoms', 'sry', 'the', 'royal', 'wessex', 'yeomanry', 'regiment', 'yeomanry', 'wests', 'south', 'the', 'now', 'recruiting', 'arm', 'blon', 'wxybsqipsacomodn', 'email', '438300', '01722']
['banana', 'union', 'no', 'no', 'software', 'patents']
With tensorflow-metal (Incorrect results):
['sddoooo', '', 'eamnooss', 'xynrr', 'daanues', 'idd', 'innee', 'iiiinus', 'tnounppanab', 'inla', 'ppnt', 'mmnooexyy', 'yyr', 'ehhtt', 'laayvyoorr', 'xeseww', 'rinamoevy', 'tnemiger', 'yrnamoey', 'sstseww', 'htuwlos', 'fefeahit', 'wwoniia', 'turceedrr', 'ymmrira', 'atate', 'prasbyxwr', 'liamme', '00338803144', '22277100']
['annnaab', 'noolinnu', 'oon', 'oon', 'wttffoos', 'sttneettaap']
Logs: With tensorflow-metal (Incorrect results)
(.venv) <REDACTED> % pip3 install -U tensorflow-metal
Collecting tensorflow-metal
Using cached tensorflow_metal-1.1.0-cp311-cp311-macosx_12_0_arm64.whl.metadata (1.2 kB)
Requirement already satisfied: wheel~=0.35 in ./.venv/lib/python3.11/site-packages (from tensorflow-metal) (0.42.0)
Requirement already satisfied: six>=1.15.0 in ./.venv/lib/python3.11/site-packages (from tensorflow-metal) (1.16.0)
Using cached tensorflow_metal-1.1.0-cp311-cp311-macosx_12_0_arm64.whl (1.4 MB)
Installing collected packages: tensorflow-metal
Successfully installed tensorflow-metal-1.1.0
(.venv) <REDACTED> % python3 keras-ocr-bug.py
Looking for <REDACTED>/.keras-ocr/craft_mlt_25k.h5
2023-12-16 22:05:05.452493: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M3 Max
2023-12-16 22:05:05.452532: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 64.00 GB
2023-12-16 22:05:05.452545: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 24.00 GB
2023-12-16 22:05:05.452591: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-12-16 22:05:05.452609: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
WARNING:tensorflow:From <REDACTED>/.venv/lib/python3.11/site-packages/tensorflow/python/util/dispatch.py:1260: resize_bilinear (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.image.resize(...method=ResizeMethod.BILINEAR...)` instead.
Looking for <REDACTED>/.keras-ocr/crnn_kurapan.h5
2023-12-16 22:05:07.526354: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
1/1 [==============================] - 1s 855ms/step
2/2 [==============================] - 1s 140ms/step
['sddoooo', '', 'eamnooss', 'xynrr', 'daanues', 'idd', 'innee', 'iiiinus', 'tnounppanab', 'inla', 'ppnt', 'mmnooexyy', 'yyr', 'ehhtt', 'laayvyoorr', 'xeseww', 'rinamoevy', 'tnemiger', 'yrnamoey', 'sstseww', 'htuwlos', 'fefeahit', 'wwoniia', 'turceedrr', 'ymmrira', 'atate', 'prasbyxwr', 'liamme', '00338803144', '22277100']
['annnaab', 'noolinnu', 'oon', 'oon', 'wttffoos', 'sttneettaap']
Logs: Valid results, without tensorflow-metal
(.venv) <REDACTED> % pip3 uninstall tensorflow-metal
Found existing installation: tensorflow-metal 1.1.0
Uninstalling tensorflow-metal-1.1.0:
Would remove:
<REDACTED>/.venv/lib/python3.11/site-packages/tensorflow-plugins/*
<REDACTED>/.venv/lib/python3.11/site-packages/tensorflow_metal-1.1.0.dist-info/*
Proceed (Y/n)? Y
Successfully uninstalled tensorflow-metal-1.1.0
(.venv) <REDACTED> % python3 keras-ocr-bug.py
Looking for <REDACTED>/.keras-ocr/craft_mlt_25k.h5
WARNING:tensorflow:From <REDACTED>/.venv/lib/python3.11/site-packages/tensorflow/python/util/dispatch.py:1260: resize_bilinear (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.image.resize(...method=ResizeMethod.BILINEAR...)` instead.
Looking for <REDACTED>/.keras-ocr/crnn_kurapan.h5
1/1 [==============================] - 7s 7s/step
2/2 [==============================] - 1s 71ms/step
['toodstande', 's', 'somme', 'srny', 'squadron', 'ds', 'quentn', 'snhnen', 'bnpnone', 'sasne', 'taing', 'yeoms', 'sry', 'the', 'royal', 'wessex', 'yeomanry', 'regiment', 'yeomanry', 'wests', 'south', 'the', 'now', 'recruiting', 'arm', 'blon', 'wxybsqipsacomodn', 'email', '438300', '01722']
['banana', 'union', 'no', 'no', 'software', 'patents']
My environment: Tensorflow: 2.14, tf-metal: 1.1, M3 Max
I am working on a GAN full of residual sums and concatenations. It trains correctly using the CPU only. However, if I enable the GPU, it causes:
oc("mps_slice_1"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/d615290d-668b-11ee-9734-0697ca55970a/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":359:0)): error: 'mps.slice' op failed: length value 32 does not fit within the dimension size (33) with start value (32)
/AppleInternal/Library/BuildRoots/d615290d-668b-11ee-9734-0697ca55970a/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:2133: failed assertion `Error: MLIR pass manager failed'
Some customizations I guess might be related to the error:
tf.bitwise.bitwise_xor, tf.concat, tf.pad in custom layers
numpy.random in train steps.
Another debugging hint I found is that the "32" is the number of channels of my model's conv layers, and it changes as I change the number of channels.
Does anyone know what is wrong? Thank you so much.
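A stopgap sketch, based only on the observation above that CPU-only training works: hide the GPU from TensorFlow until the mps.slice legalization issue is resolved.
import tensorflow as tf

tf.config.set_visible_devices([], 'GPU')  # force the (working) CPU path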
x = tf.Variable(tf.ones(3))
x[1].assign(5)
Above code results in:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation ResourceStridedSliceAssign: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=1 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
ResourceStridedSliceAssign: CPU
_Arg: GPU CPU
Colocation members, user-requested devices, and framework assigned devices, if any:
ref (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ResourceStridedSliceAssign (ResourceStridedSliceAssign) /job:localhost/replica:0/task:0/device:GPU:0
Op: ResourceStridedSliceAssign
Node attrs: ellipsis_mask=0, Index=DT_INT32, T=DT_FLOAT, shrink_axis_mask=1, end_mask=0, begin_mask=0, new_axis_mask=0
Registered kernels:
device='XLA_CPU_JIT'; Index in [DT_INT32, DT_INT64]; T in [DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT16, DT_INT8, DT_COMPLEX64, DT_INT64, DT_BOOL, DT_QINT8, DT_QUINT8, DT_QINT32, DT_BFLOAT16, DT_UINT16, DT_COMPLEX128, DT_HALF, DT_UINT32, DT_UINT64, DT_FLOAT8_E5M2, DT_FLOAT8_E4M3FN, DT_INT4, DT_UINT4]
device='DEFAULT'; T in [DT_INT32]
device='CPU'; T in [DT_UINT64]
device='CPU'; T in [DT_INT64]
device='CPU'; T in [DT_UINT32]
device='CPU'; T in [DT_UINT16]
device='CPU'; T in [DT_INT16]
device='CPU'; T in [DT_UINT8]
device='CPU'; T in [DT_INT8]
device='CPU'; T in [DT_INT32]
device='CPU'; T in [DT_HALF]
device='CPU'; T in [DT_BFLOAT16]
device='CPU'; T in [DT_FLOAT]
device='CPU'; T in [DT_DOUBLE]
device='CPU'; T in [DT_COMPLEX64]
device='CPU'; T in [DT_COMPLEX128]
device='CPU'; T in [DT_BOOL]
device='CPU'; T in [DT_STRING]
device='CPU'; T in [DT_RESOURCE]
device='CPU'; T in [DT_VARIANT]
device='CPU'; T in [DT_QINT8]
device='CPU'; T in [DT_QUINT8]
device='CPU'; T in [DT_QINT32]
device='CPU'; T in [DT_FLOAT8_E5M2]
device='CPU'; T in [DT_FLOAT8_E4M3FN]
[[{{node ResourceStridedSliceAssign}}]] [Op:ResourceStridedSliceAssign] name: strided_slice/_assign
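Two workaround sketches (assumptions on my part, not confirmed fixes), given that no Metal GPU kernel is registered for ResourceStridedSliceAssign:
import tensorflow as tf

# 1) Pin the variable to the CPU so the in-place assign has a kernel.
with tf.device('/CPU:0'):
    x = tf.Variable(tf.ones(3))
x[1].assign(5)  # the assign colocates with the CPU-resident resource

# 2) Stay on the GPU but avoid in-place assignment: scatter into a plain tensor.
y = tf.tensor_scatter_nd_update(tf.ones(3), indices=[[1]], updates=[5.0])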
I am starting to regret my MacBook purchase. There are so many issues with tensorflow-metal:
ADAM is slow
Inconsistent values compared to the CPU
And now this. I saw a post regarding this, but it was a year old. So, MacBooks are not even good for learning anymore?
I created a macOS 14 VM using https://github.com/s-u/macosvm which uses the Virtualization Framework. I want to check if I can use paravirtualized graphics for tensorflow workloads.
I followed the steps from https://developer.apple.com/metal/tensorflow-plugin/ but when I run the script from step 4 (Verify), I get a segmentation fault (see below).
Did anyone try to get this kind of GPU compute in a VM and succeed?
/Users/teuf/venv-metal/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
warnings.warn(
2023-11-20 07:41:11.723578: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple Paravirtual device
2023-11-20 07:41:11.723620: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 10.00 GB
2023-11-20 07:41:11.723626: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 0.50 GB
2023-11-20 07:41:11.723700: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-11-20 07:41:11.723968: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
zsh: segmentation fault python3 ./tensorflow-test.py
Thread 0 Crashed:: Dispatch queue: metal gpu stream
0 MPSCore 0x1999598f8 MPSDevice::GetMPSLibrary_DoNotUse(MPSLibraryInfo const*) + 92
1 MPSCore 0x19995c544 0x199927000 + 218436
2 MPSCore 0x19995c908 0x199927000 + 219400
3 MetalPerformanceShadersGraph 0x1fb696a58 0x1fb583000 + 1129048
4 MetalPerformanceShadersGraph 0x1fb6f0cc8 0x1fb583000 + 1498312
5 MetalPerformanceShadersGraph 0x1fb6ef2dc 0x1fb583000 + 1491676
6 MetalPerformanceShadersGraph 0x1fb717ea0 0x1fb583000 + 1658528
7 MetalPerformanceShadersGraph 0x1fb717ce4 0x1fb583000 + 1658084
8 MetalPerformanceShadersGraph 0x1fb6edaac 0x1fb583000 + 1485484
9 MetalPerformanceShadersGraph 0x1fb7a85e0 0x1fb583000 + 2250208
10 MetalPerformanceShadersGraph 0x1fb7a79f0 0x1fb583000 + 2247152
11 MetalPerformanceShadersGraph 0x1fb6602b4 0x1fb583000 + 905908
12 MetalPerformanceShadersGraph 0x1fb65f7b0 0x1fb583000 + 903088
13 libmetal_plugin.dylib 0x1156dfdcc invocation function for block in metal_plugin::runMPSGraph(MetalStream*, MPSGraph*, NSDictionary*, NSDictionary*) + 164
14 libdispatch.dylib 0x18e79b910 _dispatch_client_callout + 20
15 libdispatch.dylib 0x18e7aacc4 _dispatch_lane_barrier_sync_invoke_and_complete + 56
16 libmetal_plugin.dylib 0x1156dfd14 metal_plugin::runMPSGraph(MetalStream*, MPSGraph*, NSDictionary*, NSDictionary*) + 108
17 libmetal_plugin.dylib 0x115606634 metal_plugin::MPSStatelessRandomUniformOp<float>::ProduceOutput(metal_plugin::OpKernelContext*, metal_plugin::Tensor*) + 876
18 libmetal_plugin.dylib 0x115607620 metal_plugin::MPSStatelessRandomOpBase::Compute(metal_plugin::OpKernelContext*) + 620
19 libmetal_plugin.dylib 0x1156061f8 void metal_plugin::ComputeOpKernel<metal_plugin::MPSStatelessRandomUniformOp<float>>(void*, TF_OpKernelContext*) + 44
20 libtensorflow_framework.2.dylib 0x10b807354 tensorflow::PluggableDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) + 148
21 libtensorflow_framework.2.dylib 0x10b7413e0 tensorflow::(anonymous namespace)::SingleThreadedExecutorImpl::Run(tensorflow::Executor::Args const&) + 2100
22 libtensorflow_framework.2.dylib 0x10b70b820 tensorflow::FunctionLibraryRuntimeImpl::RunSync(tensorflow::FunctionLibraryRuntime::Options, unsigned long long, absl::lts_20230125::Span<tensorflow::Tensor const>, std::__1::vector<tensorflow::Tensor, std::__1::allocator<tensorflow::Tensor>>*) + 420
23 libtensorflow_framework.2.dylib 0x10b715668 tensorflow::ProcessFunctionLibraryRuntime::RunMultiDeviceSync(tensorflow::FunctionLibraryRuntime::Options const&, unsigned long long, std::__1::vector<std::__1::variant<tensorflow::Tensor, tensorflow::TensorShape>, std::__1::allocator<std::__1::variant<tensorflow::Tensor, tensorflow::TensorShape>>>*, std::__1::function<absl::lts_20230125::Status (tensorflow::ProcessFunctionLibraryRuntime::ComponentFunctionData const&, tensorflow::ProcessFunctionLibraryRuntime::InternalArgs*)>) const + 1336
24 libtensorflow_framework.2.dylib 0x10b71a8a4 tensorflow::ProcessFunctionLibraryRuntime::RunSync(tensorflow::FunctionLibraryRuntime::Options const&, unsigned long long, absl::lts_20230125::Span<tensorflow::Tensor const>, std::__1::vector<tensorflow::Tensor, std::__1::allocator<tensorflow::Tensor>>*) const + 848
25 libtensorflow_cc.2.dylib 0x2801b5008 tensorflow::KernelAndDeviceFunc::Run(tensorflow::ScopedStepContainer*, tensorflow::EagerKernelArgs const&, std::__1::vector<std::__1::variant<tensorflow::Tensor, tensorflow::TensorShape>, std::__1::allocator<std::__1::variant<tensorflow::Tensor, tensorflow::TensorShape>>>*, tsl::CancellationManager*, std::__1::optional<tensorflow::EagerFunctionParams> const&, std::__1::optional<tensorflow::ManagedStackTrace> const&, tsl::CoordinationServiceAgent*) + 572
26 libtensorflow_cc.2.dylib 0x28016613c tensorflow::EagerKernelExecute(tensorflow::EagerContext*, absl::lts_20230125::InlinedVector<tensorflow::TensorHandle*, 4ul, std::__1::allocator<tensorflow::TensorHandle*>> const&, std::__1::optional<tensorflow::EagerFunctionParams> const&, tsl::core::RefCountPtr<tensorflow::KernelAndDevice> const&, tensorflow::GraphCollector*, tsl::CancellationManager*, absl::lts_20230125::Span<tensorflow::TensorHandle*>, std::__1::optional<tensorflow::ManagedStackTrace> const&) + 452
27 libtensorflow_cc.2.dylib 0x2801708ec tensorflow::ExecuteNode::Run() + 396
28 libtensorflow_cc.2.dylib 0x2801b0118 tensorflow::EagerExecutor::SyncExecute(tensorflow::EagerNode*) + 244
29 libtensorflow_cc.2.dylib 0x280165ac8 tensorflow::(anonymous namespace)::EagerLocalExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) + 2580
30 libtensorflow_cc.2.dylib 0x2801637a8 tensorflow::DoEagerExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) + 416
31 libtensorflow_cc.2.dylib 0x2801631e8 tensorflow::EagerOperation::Execute(absl::lts_20230125::Span<tensorflow::AbstractTensorHandle*>, int*) + 132