tensorflow-metal problem with loss function

Question

Created Sep ’21

When I use the standard TensorFlow installation, training (fit) works well. I get:

2021-09-03 15:36:26.802170: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/20
1424/1424 [==============================] - 656s 460ms/step - loss: 0.0173 - binary_accuracy: 0.9913 - accuracy: 0.6272 - val_loss: 0.0198 - val_binary_accuracy: 0.9933 - val_accuracy: 0.6535
Epoch 2/20
1424/1424 [==============================] - 660s 464ms/step - loss: 0.0173 - binary_accuracy: 0.9913 - accuracy: 0.6250 - val_loss: 0.0218 - val_binary_accuracy: 0.9932 - val_accuracy: 0.6450
Epoch 3/20
1424/1424 [==============================] - 637s 447ms/step - loss: 0.0174 - binary_accuracy: 0.9913 - accuracy: 0.6224 - val_loss: 0.0204 - val_binary_accuracy: 0.9932 - val_accuracy: 0.6451
Epoch 4/20
1424/1424 [==============================] - 633s 444ms/step - loss: 0.0173 - binary_accuracy: 0.9913 - accuracy: 0.6244 - val_loss: 0.0237 - val_binary_accuracy: 0.9931 - val_accuracy: 0.6211
Epoch 5/20
1424/1424 [==============================] - 616s 433ms/step - loss: 0.0173 - binary_accuracy: 0.9913 - accuracy: 0.6243 - val_loss: 0.0198 - val_binary_accuracy: 0.9934 - val_accuracy: 0.6487

When I train the same model on the same training set using an installation with tensorflow-macos + tensorflow-metal, I get:

2021-09-03 21:59:18.973547: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/20
2021-09-03 21:59:20.298178: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
175/175 [==============================] - ETA: 0s - loss: nan - binary_accuracy: 0.9814 - accuracy: 1.7905e-04
2021-09-03 21:59:50.110699: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
175/175 [==============================] - 34s 162ms/step - loss: nan - binary_accuracy: 0.9814 - accuracy: 1.7905e-04 - val_loss: nan - val_binary_accuracy: 0.9910 - val_accuracy: 0.0000e+00
Epoch 2/20
175/175 [==============================] - 28s 158ms/step - loss: nan - binary_accuracy: 0.9867 - accuracy: 0.0000e+00 - val_loss: nan - val_binary_accuracy: 0.9910 - val_accuracy: 0.0000e+00
Epoch 3/20
175/175 [==============================] - 28s 158ms/step - loss: nan - binary_accuracy: 0.9867 - accuracy: 0.0000e+00 - val_loss: nan - val_binary_accuracy: 0.9910 - val_accuracy: 0.0000e+00
Epoch 4/20
175/175 [==============================] - 28s 158ms/step - loss: nan - binary_accuracy: 0.9867 - accuracy: 0.0000e+00 - val_loss: nan - val_binary_accuracy: 0.9910 - val_accuracy: 0.0000e+00
Epoch 5/20
175/175 [==============================] - 28s 161ms/step - loss: nan - binary_accuracy: 0.9867 - accuracy: 0.0000e+00 - val_loss: nan - val_binary_accuracy: 0.9910 - val_accuracy: 0.0000e+00

The problem is that both loss and val_loss are reported as "nan" in every epoch.
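For anyone trying to narrow this down, a minimal sketch of how the per-epoch metrics could be scanned for NaN. It relies on the fact that Keras `fit` returns a `History` object whose `.history` attribute is a dict of per-epoch lists. The helper name `nan_metrics` and the hand-copied sample dict are illustrative assumptions, not part of the original report.

```python
import math


def nan_metrics(history):
    """Map each metric name to the first 1-based epoch whose value is NaN.

    `history` is a dict of per-epoch lists, shaped like the `.history`
    attribute of the object returned by Keras `model.fit(...)`.
    Metrics that never become NaN are omitted from the result.
    """
    report = {}
    for name, values in history.items():
        for epoch, value in enumerate(values, start=1):
            if math.isnan(value):
                report[name] = epoch
                break
    return report


# Hand-copied from the first two epochs of the tensorflow-metal log above
# (illustrative only; a real run would pass `model.fit(...).history`).
metal_history = {
    "loss": [float("nan"), float("nan")],
    "binary_accuracy": [0.9814, 0.9867],
    "val_loss": [float("nan"), float("nan")],
    "val_binary_accuracy": [0.9910, 0.9910],
}

print(nan_metrics(metal_history))
```

With a real model, one possible next step (an assumption on my part, not something verified on this machine) is to repeat the same `fit` call inside `with tf.device("/CPU:0"):` and compare the two reports, or to pass `tf.keras.callbacks.TerminateOnNaN()` in `callbacks` so training stops at the first NaN batch.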

Answer 1

Krzysztof_Malicki OP

Sep ’21

I am adding the information generated by the module on startup:

Init Plugin
Init Graph Optimizer
Init Kernel

Using TensorFlow version 2.5.0

2021-09-03 17:40:33.259162: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Metal device set to: AMD Radeon R9 M290X

systemMemory: 32.00 GB
maxCacheSize: 1.00 GB

2021-09-03 17:40:33.259801: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-09-03 17:40:33.260023: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
