GPU much slower than CPU for LSTMs and bidirectional in TensorFlow 2.8

I am trying to run the notebook https://www.tensorflow.org/text/tutorials/text_classification_rnn from the TensorFlow website.

The code uses LSTM and Bidirectional layers.

With the GPU enabled, training takes 56 minutes/epoch.

Using only the CPU, it takes 264 seconds/epoch.

I am using a MacBook Pro 14 (10 CPU cores, 16 GPU cores) with tensorflow-macos 2.8 and tensorflow-metal 0.5.0. I face the same problem with tensorflow-macos 2.9 too.

My environment has:

tensorflow-macos          2.8.0
tensorflow-metal          0.5.0
tensorflow-text           2.8.1
tensorflow-datasets       4.6.0
tensorflow-deps           2.8.0
tensorflow-hub            0.12.0
tensorflow-metadata       1.8.0

When I use CNNs, the GPU is fully utilized and 3-4 times faster than the CPU alone.

Any idea where the problem is when using LSTM and Bidirectional layers?
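For reference, this is roughly how I time the two configurations. Hiding the GPU from TensorFlow before any op runs forces everything onto the CPU (a hypothetical minimal sketch, not the tutorial code itself; the model here is a stand-in):

```python
import tensorflow as tf

# Workaround sketch: hide the Metal GPU so the model trains on the CPU only.
# This must run before any op has touched the GPU.
tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices('GPU'))  # -> [] (no GPU visible)

# Stand-in model with the same layer types as the tutorial.
model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# Dummy data just to exercise a training step on the CPU.
x = tf.random.normal([32, 20, 16])
y = tf.random.normal([32, 1])
model.fit(x, y, epochs=1, verbose=0)
```

With the `set_visible_devices` line commented out, the same script runs on the GPU (when tensorflow-metal is installed), which is how I got the per-epoch numbers above.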

  • Post a bug report on the TensorFlow GitHub project

  • Same issue here: link


Replies

Hi @vasileiosgk

Thanks for reporting the problem and providing a script to produce it. I'll take a look at the issue.

My initial read is that the Bidirectional LSTM kernel ends up falling back to the unfused, Python-level implementation of the op, which is unfortunately intolerably slow when called with a pluggable device (GPU) at the moment. However, the kernel here should be taking the faster path, since it satisfies the "cuDNN conditions" described on the TensorFlow documentation page for the op, which allow the fused implementation to be used. So this looks like a bug on the tensorflow-metal side. I'll update here once I've confirmed this to be the case.
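For anyone following along, the "cuDNN conditions" referred to above are listed in the `tf.keras.layers.LSTM` documentation: the fused path is eligible only when `activation='tanh'`, `recurrent_activation='sigmoid'`, `recurrent_dropout=0`, `unroll=False`, `use_bias=True`, and inputs are unmasked or right-padded. A minimal sketch (my own illustration, not the tutorial code) of an eligible layer versus one that forces the slow generic path:

```python
import tensorflow as tf

# All the conditions above are the Keras defaults, so a plain LSTM is
# eligible for the fused implementation:
fast_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))

# Changing any one condition (here, recurrent_activation) forces the
# generic per-timestep implementation, which is much slower:
slow_lstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, recurrent_activation='relu'))

x = tf.random.normal([8, 20, 32])  # (batch, timesteps, features)
print(fast_lstm(x).shape)          # (8, 128): 64 units per direction, concatenated
```

The bug being described is that even the eligible (default-configured) layer is hitting the slow path under the Metal pluggable device.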