tensorflow-metal

TensorFlow accelerates machine learning model training with Metal on Mac GPUs.

tensorflow-metal Documentation

Posts under tensorflow-metal tag

249 Posts
Post not yet marked as solved
3 Replies
141 Views
Hi all. I'm trying to run the introductory STS (structural time series) example from TensorFlow Probability: https://github.com/tensorflow/probability/blob/main/tensorflow_probability/examples/jupyter_notebooks/Structural_Time_Series_Modeling_Case_Studies_Atmospheric_CO2_and_Electricity_Demand.ipynb The notebook raises an unimplemented error when calculating the loss curve; everything else seems to work. Has anybody gotten this intro example to work? Thank you.
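For anyone hitting the same error: a hedged workaround (not a fix) that has helped with similar Metal-plugin kernel gaps is to hide the GPU before any op runs, so the whole notebook falls back to CPU kernels. `tf.config.set_visible_devices` is the standard TF2 API for this; the ImportError guard below is only so the sketch runs even without TensorFlow installed.

```python
# Sketch of a CPU-only fallback for the STS notebook. Run this first,
# before any TensorFlow op has executed; hiding the GPU later has no effect.
try:
    import tensorflow as tf
    tf.config.set_visible_devices([], "GPU")  # hide the Metal GPU
    visible = [d.device_type for d in tf.config.get_visible_devices()]
except ImportError:
    visible = ["CPU"]  # TensorFlow not installed; the sketch still runs

# With the GPU hidden, tfp.vi.fit_surrogate_posterior should only see
# CPU kernels, sidestepping the unimplemented GPU op.
assert "GPU" not in visible
```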
Posted Last updated
.
Post not yet marked as solved
1 Reply
149 Views
I am trying to run the notebook https://www.tensorflow.org/text/tutorials/text_classification_rnn from the TensorFlow website. The code uses LSTM and Bidirectional layers. With the GPU enabled, training takes 56 minutes per epoch; using only the CPU it takes 264 seconds per epoch. I am using a MacBook Pro 14" (10 CPU cores, 16 GPU cores) with tensorflow-macos 2.8 and tensorflow-metal 0.5.0, and I face the same problem with tensorflow-macos 2.9. My environment has:

tensorflow-macos 2.8.0
tensorflow-metal 0.5.0
tensorflow-text 2.8.1
tensorflow-datasets 4.6.0
tensorflow-deps 2.8.0
tensorflow-hub 0.12.0
tensorflow-metadata 1.8.0

When I use CNNs, the GPU is fully utilized and 3-4 times faster than the CPU alone. Any idea where the problem is with LSTM and Bidirectional layers?
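Until the recurrent-layer slowdown is resolved, one pragmatic option the numbers above suggest is to keep recurrent models on the CPU while leaving convolutional ones on the GPU. A small illustrative sketch (the routing rule and names are ours, not from the post; the device strings are standard TensorFlow ones):

```python
# Illustrative policy: recurrent layers are much slower on the Metal GPU
# in this report, so route them to the CPU and keep conv/dense on the GPU.
def device_for(model_kind, gpu_friendly=("cnn", "dense")):
    """Return the tf.device string to train `model_kind` under."""
    return "/gpu:0" if model_kind in gpu_friendly else "/cpu:0"

# Hypothetical usage (model/dataset names are placeholders):
# with tf.device(device_for("rnn")):      # resolves to "/cpu:0"
#     model.fit(train_dataset, epochs=10)

assert device_for("rnn") == "/cpu:0"
assert device_for("cnn") == "/gpu:0"
```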
Post not yet marked as solved
5 Replies
727 Views
Hi, I have started experimenting with using my MBP with M1 Pro (10 CPU cores / 16 GPU cores) for TensorFlow. Two things were odd/noteworthy.

First, I've compared training models in a TensorFlow environment with tensorflow-metal, running the code either under with tf.device('gpu:0'): or with tf.device('cpu:0'):, as well as in an environment without the tensorflow-metal plugin. Specifying the CPU as the device in the tf-metal environment almost always leads to much longer training times than specifying the GPU, but also than running in the standard (non-metal) environment. Also, the GPU was running at quite high power despite TF being told to use the CPU. Is this intended or expected behaviour? If so, it would be preferable to use the non-metal environment when not benefitting from a GPU.

Secondly, at small batch sizes, the GPU power shown in system stats increases with the batch size, as expected. However, when changing the batch size from 9 to 10 (this appears to be a hard step specifically at this number), GPU power drops by about half and training time doubles. Increasing the batch size beyond 10 again leads to a gradual increase in GPU power; on my model, the GPU power seen at batch size 9 is reached again only at about batch size 50, making GPU acceleration at batch sizes from 10 to about 50 rather useless. I've noticed this behaviour on several models, which makes me suspect it is a general tf-metal behaviour. As a result, I've only been able to benefit from GPU acceleration at a batch size of 9, or above 100. Once again, is this intended or to be expected?
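To make the batch-size step reproducible for others, a minimal timing harness helps. `time_run` below is a generic sketch (not from the original post) that can wrap a `model.fit` call once per batch size:

```python
import time

def time_run(fn, repeats=3):
    """Best wall-clock time of fn() over `repeats` runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical usage against a compiled Keras model `model` on data (x, y):
# for batch in (8, 9, 10, 16, 32, 64, 128):
#     with tf.device("/gpu:0"):
#         t = time_run(lambda: model.fit(x, y, batch_size=batch,
#                                        epochs=1, verbose=0), repeats=1)
#     print(f"batch={batch}: {t:.1f}s")
```

Plotting seconds per epoch against batch size on both devices would show the 9-to-10 cliff directly.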
Posted by iDron.
Post not yet marked as solved
4 Replies
465 Views
First of all: as I understand this is a problem related to TensorFlow Addons, I've been in contact with the tfa developers (https://github.com/tensorflow/addons/issues/2578). The issue only happens on M1, so they think it has to do with Apple's tensorflow-metal. I've been getting spurious errors while doing model.fit with the Lookahead optimizer (I'm fine-tuning with big datasets, and my code just breaks while fitting to different files, in a non-reproducible way: each time I run it, it breaks on a different file and on different operations). These errors are undoubtedly related to the Lookahead optimizer. Let me try to explain this new info in a clear manner. I've tried two different versions of tf + tfaddons (conda environments), but I got the same type of errors, probably more frequently with the pylast environment:

pylast: tensorflow-macos 2.9.0, tensorflow-metal 0.5.0, tensorflow-addons 0.17.0
py39deps26-source: tensorflow-macos 2.6.0, tensorflow-metal 0.2.0, tensorflow-addons 0.15.0.dev0

The base code is always the same. I use tf.config.set_soft_device_placement(True) and also with tf.device('/cpu:0'): in every call to TensorFlow, otherwise I get errors. As explained before, my code just loads a model and fine-tunes it on each file of a dataset.
Here is an example error output (obtained with the pylast conda environment):

File "/Users/machine/Projects/finetune-asp/src/finetune_IMR2020.py", line 138, in finetune_dataset_db
    history = model.fit(ft, steps_per_epoch=len(ft), epochs=ft_cfg["num_epochs"], shuffle=True,
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'Lookahead/Lookahead/update_64/mul_11' defined at (most recent call last):
File "/Users/machine/Projects/finetune-asp/src/finetune_IMR2020.py", line 138, in finetune_dataset_db
    history = model.fit(ft, steps_per_epoch=len(ft), epochs=ft_cfg["num_epochs"], shuffle=True,
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1409, in fit
    tmp_logs = self.train_function(iterator)
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1051, in train_function
    return step_function(self, iterator)
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1040, in step_function
    outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1030, in run_step
    outputs = model.train_step(data)
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 893, in train_step
    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 539, in minimize
    return self.apply_gradients(grads_and_vars, name=name)
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/lookahead.py", line 104, in apply_gradients
    return super().apply_gradients(grads_and_vars, name, **kwargs)
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 678, in apply_gradients
    return tf.__internal__.distribute.interim.maybe_merge_call(
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 723, in _distributed_apply
    update_op = distribution.extended.update(
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 706, in apply_grad_to_update_var
    update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/lookahead.py", line 130, in _resource_apply_dense
    train_op = self._optimizer._resource_apply_dense(
File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/rectified_adam.py", line 249, in _resource_apply_dense
    coef["r_t"] * m_corr_t / (v_corr_t + coef["epsilon_t"]),

Node: 'Lookahead/Lookahead/update_64/mul_11'
Incompatible shapes: [0] vs. [5,40,20]
    [[{{node Lookahead/Lookahead/update_64/mul_11}}]] [Op:__inference_train_function_30821]

and another error output
Posted by mrt77.
Post not yet marked as solved
1 Reply
267 Views
I already installed the latest TensorFlow version using the documentation given (link). But when I try to run a notebook with the command "%tensorflow_version 2.x", I get the error "UsageError: Line magic function %tensorflow_version not found.". Please tell me, what should I do?
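`%tensorflow_version` is a Google Colab-only line magic; it is not part of IPython or TensorFlow itself, so it is expected to be missing in a local notebook. Locally, you just import the pip-installed package and check its version. A small sketch (the helper name is ours):

```python
# In Colab, `%tensorflow_version 2.x` selects a preinstalled TF build.
# Locally there is nothing to select: the pip-installed version is used.
import re

def major_version(version_string):
    """Extract the major version from a string like '2.9.2'."""
    match = re.match(r"(\d+)\.", version_string)
    return int(match.group(1)) if match else None

# In a local notebook you would instead do:
# import tensorflow as tf
# print(tf.__version__)                    # e.g. '2.9.2'
# assert major_version(tf.__version__) == 2

assert major_version("2.9.2") == 2
```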
Posted by 006.
Post not yet marked as solved
9 Replies
2.7k Views
I'm using a MacBook Air 13" and pursuing an Artificial Intelligence course. I'm facing a huge problem with Jupyter notebook after installing TensorFlow: the kernel keeps dying, and I have literally tried the solutions in every article/resource on Google. Nothing seems to fix the issue. It began only when I started to run code for a convolutional neural network. Please help me fix this issue and understand why it isn't getting fixed. At the moment, I can only think of trading my MacBook for a Windows laptop, but that would be very hard as I have never had hands-on time with one. Hope to hear back soon. Thanks, Keshav Lal Seth
Post not yet marked as solved
0 Replies
106 Views
I am trying to run distributed training with TF-Metal and two M1 Ultra Studios. After connecting them via a Thunderbolt cable, I can see the other device in "About This Mac", but TensorFlow doesn't pick its GPU up. Command I am using:

mirrored_strategy = tf.distribute.MirroredStrategy()

Output:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)

TF-Metal version: 2.9.2
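This output is expected: tf.distribute.MirroredStrategy only replicates across GPUs visible to the local machine, and a second Studio over Thunderbolt is a separate host. Spanning two machines requires tf.distribute.MultiWorkerMirroredStrategy plus a TF_CONFIG environment variable on each worker. A hedged sketch (the host names and port are hypothetical):

```python
# Build the TF_CONFIG JSON that MultiWorkerMirroredStrategy reads. Each
# machine runs the same script with its own `index` (0 or 1 here).
import json
import os

def make_tf_config(workers, index):
    """TF_CONFIG payload for worker `index` of a multi-worker cluster."""
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    })

os.environ["TF_CONFIG"] = make_tf_config(
    ["studio-a.local:12345", "studio-b.local:12345"], index=0)

# Then, on each machine:
# import tensorflow as tf
# strategy = tf.distribute.MultiWorkerMirroredStrategy()

config = json.loads(os.environ["TF_CONFIG"])
assert config["task"]["index"] == 0
```

Whether the Metal plugin supports collective ops across workers is a separate question; this sketch only shows the standard multi-machine setup that MirroredStrategy alone cannot provide.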
Post marked as solved
2 Replies
140 Views
I recently wrote some code for a basic GAN (I am learning about neural networks, so not an expert) and got very strange results. Unable to debug, I tested someone else's code that I know works, and still got the same results. When running a GAN to generate digits (from the MNIST dataset), the images produced each epoch are identical to each other and don't resemble digits at all. An example of the images produced can be seen below. Rerunning the same code on Google Colab, and on my machine locally with standard TensorFlow (i.e. without the Metal plugin), gives the expected results of images resembling digits. The code used to test this can be found here: https://github.com/PacktPublishing/Deep-Learning-with-TensorFlow-2-and-Keras/blob/master/Chapter%206/VanillaGAN.ipynb I am using these versions of the relevant software: tensorflow-metal 0.5.0; tensorflow-macos 2.9.2; macOS Monterey 12.3. I would be grateful if Apple engineers could advise, or give a timeframe for a solution, please.
Posted by 09jtip.
Post not yet marked as solved
1 Reply
123 Views
I am using an M1 MacBook Air (2020), macOS Monterey 12.4. I have been trying long and hard to install the n2v package: https://github.com/juglab/n2v. I first created a virtual env using conda and installed TensorFlow using https://developer.apple.com/metal/tensorflow-plugin/. Afterwards I did pip install n2v, which led to its own set of problems; the main issues were imagecodecs and h5py. I resolved those and managed to install n2v, but there are these two conflicts:

n2v 0.3.1 requires keras<2.4.0,>=2.1.1, but you have keras 2.9.0 which is incompatible.
tensorflow-macos 2.9.2 requires keras<2.10.0,>=2.9.0rc0, but you have keras 2.3.1 which is incompatible.

As you can see, this puts me in a tight spot. When I first installed TensorFlow using the link, my TensorFlow version was 2.9.2 and keras was 2.9.0, but n2v 0.3.1 is not compatible with keras 2.9.0. That requires me to downgrade keras, but then TensorFlow wouldn't work. I have been trying to figure out how to solve this for days. The tricky part about resolving these issues is that any help I find online must apply to a macOS context; using the MacBook really complicates things, as my understanding is that I cannot install TensorFlow in the normal way but have to use tensorflow-macos. Please help! Any help is greatly appreciated. Thank you very much!
Post not yet marked as solved
3 Replies
510 Views
I've been using tensorflow-metal together with JupyterLab. Sometimes during training, the notebook stops printing training progress. The training process seems dead, as interrupting the kernel gets no response; I have to restart the kernel and train again. The problem doesn't always occur, and I couldn't tell what the cause was, until recently, when I started using tensorflow-probability: now I can reproduce the problem 100% on my machine. Here is the demo:

import numpy as np
import tensorflow as tf
#tf.config.set_visible_devices([], 'GPU')
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
from tensorflow_probability import sts

STEPS = 10000
BATCH = 64

noise = np.random.random(size=(BATCH, STEPS)) * 0.5
signal = np.sin((
    np.broadcast_to(np.arange(STEPS), (BATCH, STEPS)) / (10 + np.random.random(size=(BATCH, 1)))
    + np.random.random(size=(BATCH, 1))
) * np.pi * 2)
data = noise + signal
data = data.astype(np.float32)  # float64 would fail under GPU training, no idea why

def build_model(observed):
    season = sts.Seasonal(
        num_seasons=10,
        num_steps_per_season=1,
        observed_time_series=observed,
    )
    model = sts.Sum([
        season,
    ], observed_time_series=observed)
    return model

model = build_model(data)
variational_posteriors = sts.build_factored_surrogate_posterior(model=model)
loss_curve = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=model.joint_distribution(observed_time_series=data).log_prob,
    surrogate_posterior=variational_posteriors,
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    num_steps=5,
)
print('loss', loss_curve)

After starting the demo with python demo.py, I can observe the python process running, consuming CPU and GPU. Then the CPU and GPU usage drops to zero, and it never prints anything. The process doesn't respond to ctrl+c, and I have to force-kill it. I used Activity Monitor to sample the "dead" process.
It shows a lot of threads waiting, including the main thread:

...
+ 2228 _pthread_cond_wait (in libsystem_pthread.dylib) + 1228 [0x180659808]
  + 2228 __psynch_cvwait (in libsystem_kernel.dylib) + 8 [0x1806210c0]

and some Metal threads:

...
+ 2228 tensorflow::PluggableDeviceContext::CopyDeviceTensorToCPU(tensorflow::Tensor const*, absl::lts_20210324::string_view, tensorflow::Device*, tensorflow::Tensor*, std::__1::function<void (tensorflow::Status const&)>) (in _pywrap_tensorflow_internal.so) + 152 [0x28006290c]
  + 2228 tensorflow::PluggableDeviceUtil::CopyPluggableDeviceTensorToCPU(tensorflow::Device*, tensorflow::DeviceContext const*, tensorflow::Tensor const*, tensorflow::Tensor*, std::__1::function<void (tensorflow::Status const&)>) (in _pywrap_tensorflow_internal.so) + 320 [0x2800689bc]
    + 2228 stream_executor::Stream::ThenMemcpy(void*, stream_executor::DeviceMemoryBase const&, unsigned long long) (in _pywrap_tensorflow_internal.so) + 116 [0x286f0b08c]
      + 2228 stream_executor::(anonymous namespace)::CStreamExecutor::Memcpy(stream_executor::Stream*, void*, stream_executor::DeviceMemoryBase const&, unsigned long long) (in _pywrap_tensorflow_internal.so) + 128 [0x2816595c8]
        + 2228 metal_plugin::memcpy_dtoh(SP_Device const*, SP_Stream_st*, void*, SP_DeviceMemoryBase const*, unsigned long long, TF_Status*) (in libmetal_plugin.dylib) + 444 [0x126acc224]
          + 2228 ??? (in AGXMetalG13X) load address 0x1c5cd0000 + 0x1c5ad8 [0x1c5e95ad8]
            + 2228 -[IOGPUMetalBuffer initWithDevice:pointer:length:options:sysMemSize:gpuAddress:args:argsSize:deallocator:] (in IOGPU) + 332 [0x19ac3ae3c]
              + 2228 -[IOGPUMetalResource initWithDevice:remoteStorageResource:options:args:argsSize:] (in IOGPU) + 476 [0x19ac469f8]
                + 2228 IOGPUResourceCreate (in IOGPU) + 224 [0x19ac4c970]
                  + 2228 IOConnectCallMethod (in IOKit) + 236 [0x183104bc4]
                    + 2228 io_connect_method (in IOKit) + 440 [0x183104da8]
                      + 2228 mach_msg (in libsystem_kernel.dylib) + 76 [0x18061dd00]
                        + 2228 mach_msg_trap (in libsystem_kernel.dylib) + 8 [0x18061d954]

I'm no expert, but it looks like there is a deadlock. Training on the CPU works (by uncommenting the tf.config.set_visible_devices line in the demo). Here is my configuration:

MacBook Pro (14-inch, 2021), Apple M1 Pro, 32 GB
macOS 12.2.1 (21D62)
tensorflow-deps 2.8.0
tensorflow-macos 2.8.0
tensorflow-metal 0.4.0
tensorflow-probability 0.16.0
Posted by wangcheng.
Post not yet marked as solved
6 Replies
967 Views
After installing the tensorflow-metal PluggableDevice according to "Getting Started with tensorflow-metal PluggableDevice", I tested this DCGAN example: https://www.tensorflow.org/tutorials/generative/dcgan. Everything was working perfectly until I decided to upgrade macOS from 12.0.1 to 12.1. Before, the final result after 50 epochs was like picture 1 below; after the upgrade it is like picture 2 below. I am using: TensorFlow 2.7.0, tensorflow-metal 0.3.0, Python 3.9. I hope this question will also help Apple improve the Metal PluggableDevice. I can't wait to use it in my research.
Post not yet marked as solved
2 Replies
392 Views
I'm trying to get TensorFlow with Metal support running on my iMac (2017, Radeon Pro 580) following these instructions. However, simply importing tensorflow (import tensorflow) results in the following error, with the Python console crashing:

2022-05-27 11:46:12.419950: F tensorflow/c/experimental/stream_executor/stream_executor.cc:808] Non-OK-status: stream_executor::MultiPlatformManager::RegisterPlatform( std::move(cplatform)) status: INTERNAL: platform is already registered with name: "METAL"
Abort trap: 6

Versions: macOS 12.3, Python 3.8.13, tensorflow-macos 2.9.0, tensorflow-metal 0.5.0
Posted by TechTobi.
Post not yet marked as solved
2 Replies
319 Views
Training Top2Vec with embedding_batch_size=256 crashed my Mac.

OS X 12.3.1, tensorflow_macos 2.8.0, tensorflow_metal 0.4.0, Anaconda Python 3.8.5

% pip show tensorflow_macos
WARNING: Ignoring invalid distribution -umpy (/Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages)
Name: tensorflow-macos
Version: 2.8.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, keras-preprocessing, libclang, numpy, opt-einsum, protobuf, setuptools, six, tensorboard, termcolor, tf-estimator-nightly, typing-extensions, wrapt
Required-by:

% pip show tensorflow_metal
WARNING: Ignoring invalid distribution -umpy (/Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages)
Name: tensorflow-metal
Version: 0.4.0
Summary: TensorFlow acceleration for Mac GPUs.
Home-page: https://developer.apple.com/metal/tensorflow-plugin/
Author:
Author-email:
License: MIT License. Copyright © 2020-2021 Apple Inc. All rights reserved.
Location: /Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages
Requires: six, wheel
Required-by:

To train the model with embedding_model="universal-sentence-encoder", you'll have to download universal-sentence-encoder_4.

top2vec_trained = Top2Vec(
    documents=papers_filtered_df.text.tolist(),
    split_documents=True,
    embedding_batch_size=256,
    embedding_model="universal-sentence-encoder",
    use_embedding_model_tokenizer=True,
    embedding_model_path="/Users/davidlaxer/Downloads/universal-sentence-encoder_4",
    workers=8,
)

Here's the project: https://github.com/ddangelov/Top2Vec
Here's the Jupyter notebook: https://github.com/ddangelov/Top2Vec/blob/master/notebooks/CORD-19_top2vec.ipynb
You'll have to load the COVID-19 data set from Kaggle here: https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge

I set the filter size to 1,000:

def filter_short(papers_df):
    papers_df["token_counts"] = papers_df["text"].str.split().map(len)
    papers_df = papers_df[papers_df.token_counts > 1000].reset_index(drop=True)
    papers_df.drop('token_counts', axis=1, inplace=True)
    return papers_df

Panic trace:

panic(cpu 8 caller 0xffffff8020d449ad): userspace watchdog timeout: no successful checkins from WindowServer in 120 seconds
service: logd, total successful checkins since wake (127621 seconds ago): 12763, last successful checkin: 0 seconds ago
service: WindowServer, total successful checkins since wake (127621 seconds ago): 12751, last successful checkin: 120 seconds ago
service: remoted, total successful checkins since wake (127621 seconds ago): 12763, last successful checkin: 0

[Trace](https://developer.apple.com/forums/content/attachment/d17c2c9b-569b-4c1a-9c61-892ced7f785b)
Posted by dbl001.
Post not yet marked as solved
1 Reply
162 Views
Hello everyone! I recently tried installing TensorFlow following this guide: https://developer.apple.com/metal/tensorflow-plugin/ on my M1 Pro MacBook running Monterey 12.4. However, I'm faced with the following error message when importing:

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tensorflow/python/framework/dtypes.py:29, in <module>
     26 from tensorflow.python.lib.core import _pywrap_bfloat16
     27 from tensorflow.python.util.tf_export import tf_export
---> 29 _np_bfloat16 = _pywrap_bfloat16.TF_bfloat16_type()
     32 @tf_export("dtypes.DType", "DType")
     33 class DType(_dtypes.DType):
     34   """Represents the type of the elements in a `Tensor`.
     35
     36   `DType`'s are used to specify the output data type for operations which
    (...)
     46   See `tf.dtypes` for a complete list of `DType`'s defined.
     47   """

TypeError: Unable to convert function return value to a Python type! The signature was () -> handle

I've checked that in my env tensorflow-deps is at version 2.9.0, tensorflow-macos at 2.9.2, and tensorflow-metal at 0.5.0, with numpy at its latest version, 1.22.4. Does anyone know what's up?
Posted by bckhm.
Post not yet marked as solved
1 Reply
240 Views
Error is here:

InvalidArgumentError: Cannot assign a device for operation model_1/conv2d_1/Conv2D/ReadVariableOp: Could not satisfy explicit device specification '' because the node {{colocation_node model_1/conv2d_1/Conv2D/ReadVariableOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
ResourceApplyAdaMax: CPU
ReadVariableOp: GPU CPU
_Arg: GPU CPU
Colocation members, user-requested devices, and framework assigned devices, if any:
model_1_conv2d_1_conv2d_readvariableop_resource (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
adamax_adamax_update_resourceapplyadamax_m (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
adamax_adamax_update_resourceapplyadamax_v (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
model_1/conv2d_1/Conv2D/ReadVariableOp (ReadVariableOp)
Adamax/Adamax/update/ResourceApplyAdaMax (ResourceApplyAdaMax) /job:localhost/replica:0/task:0/device:GPU:0
    [[{{node model_1/conv2d_1/Conv2D/ReadVariableOp}}]] [Op:__inference_train_function_5897]
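The message says that under the Metal plugin, ResourceApplyAdaMax only has a CPU kernel while its colocation group was pinned to the GPU. A hedged workaround that has helped with similar colocation failures is enabling soft device placement before building the model, so TensorFlow may fall back to the CPU for the unsupported op:

```python
# Enable soft placement early, before the model/optimizer are constructed.
# The import is guarded only so this sketch runs where TensorFlow is absent.
try:
    import tensorflow as tf
    tf.config.set_soft_device_placement(True)
    soft = tf.config.get_soft_device_placement()
except ImportError:
    soft = True  # no TensorFlow available; nothing to configure

assert soft is True
```

If soft placement does not help, swapping Adamax for an optimizer whose update op does have a Metal GPU kernel might; that is untested here.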
Posted by Blooo513.
Post not yet marked as solved
1 Reply
244 Views
Dear developers, I am encountering a bug when using tensorflow-metal. When I do the following:

random_1 = tf.random.Generator.from_seed(42)
random_1 = random_1.normal(shape=(3, 2))
random_1

I get the following error:

NotFoundError: No registered 'RngReadAndSkip' OpKernel for 'GPU' devices compatible with node {{node RngReadAndSkip}}. Registered: device='CPU' [Op:RngReadAndSkip]

But it works fine when creating random tensors on the CPU, like the following:

with tf.device('/cpu:0'):
    random_1 = tf.random.Generator.from_seed(42)
    random_1 = random_1.normal(shape=(3, 2))
random_1
Posted by Leozz99.
Post not yet marked as solved
1 Reply
157 Views
Hi, I am using a Mac with M1 Pro. I want to use RandomCrop from tensorflow.keras.layers, but during training I get the error below. If I understood correctly, it seems that RngReadAndSkip is not implemented for the GPU.

InvalidArgumentError: Cannot assign a device for operation model/data_augmentation/random_crop/cond/model_data_augmentation_random_crop_cond_input_1/_6: Could not satisfy explicit device specification '' because the node {{colocation_node model/data_augmentation/random_crop/cond/model_data_augmentation_random_crop_cond_input_1/_6}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
RngReadAndSkip: CPU
Identity: GPU CPU
Switch: GPU CPU
_Arg: GPU CPU
Colocation members, user-requested devices, and framework assigned devices, if any:
model_data_augmentation_random_crop_cond_input_1 (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
model/data_augmentation/random_crop/cond/model_data_augmentation_random_crop_cond_input_1/_6 (Switch)
Func/model/data_augmentation/random_crop/cond/then/_0/input/_47 (Identity)
model/data_augmentation/random_crop/cond/then/_0/model/data_augmentation/random_crop/cond/stateful_uniform/RngReadAndSkip (RngReadAndSkip)
Func/model/data_augmentation/random_crop/cond/else/_1/input/_52 (Identity)

Python version:

$ python --version --version
Python 3.8.13 (default, Mar 28 2022, 06:13:39) [Clang 12.0.0 ]

Libraries used:

$ conda list | grep tensorflow
tensorflow-addons     0.17.0   pypi_0   pypi
tensorflow-deps       2.9.0    0        apple
tensorflow-estimator  2.9.0    pypi_0   pypi
tensorflow-macos      2.9.2    pypi_0   pypi
tensorflow-metal      0.5.0    pypi_0   pypi

Is there any workaround? Or anything I can do to help fix this? Thanks
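Since RngReadAndSkip only registers a CPU kernel, one hedged workaround is to keep the stateful random augmentation off the GPU entirely, e.g. by applying it in the tf.data input pipeline instead of inside the GPU-placed model. The helper below is a sketch; `augment` is a hypothetical stand-in for the RandomCrop layer:

```python
# Map a (possibly stateful-RNG) augmentation over a dataset on the CPU,
# so only the model's own ops are placed on the Metal GPU.
def augment_on_cpu(dataset, augment):
    """Return `dataset` with `augment` applied to inputs under /cpu:0."""
    def _apply(x, y):
        import tensorflow as tf  # deferred: only needed when the map runs
        with tf.device("/cpu:0"):
            return augment(x, training=True), y
    return dataset.map(_apply)

# Hypothetical usage:
# crop = tf.keras.layers.RandomCrop(height=28, width=28)
# train_ds = augment_on_cpu(train_ds, crop)
# model.fit(train_ds, epochs=10)   # model runs on the GPU, crop on the CPU
```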
Posted by EdoAbati.
Post not yet marked as solved
1 Reply
145 Views
I am training a model using tensorflow-metal and having a training-deadlock issue similar to https://developer.apple.com/forums/thread/703081. The following is minimal code to reproduce the problem:

import tensorflow as tf

#dev = '/cpu:0'
dev = '/gpu:0'
epochs = 1000
batch_size = 32
hidden = 128

mnist = tf.keras.datasets.mnist
train, _ = mnist.load_data()
x_train, y_train = train[0] / 255.0, train[1]

with tf.device(dev):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)

Test configuration:

MacBook Air M1
macOS 12.4
tensorflow-deps 2.9
tensorflow-macos 2.9.2
tensorflow-metal 0.5.0

With this configuration and the above code, training stops in the middle of the 27th epoch (100% of the time, as far as I have tested). Interestingly, the problem cannot be reproduced if I change any of the following:

switch the GPU to the CPU
remove the Dropout layers
downgrade tensorflow-metal to 0.4
Posted by masa6s.