TensorFlow hangs in session.run

This is a new neural model I implemented, and I want to train it. It is a modification of an existing attention-based encoder-decoder model, where everything works fine.

In the new model, session.run just hangs and makes no progress at all. I also cannot interrupt it; the hang is inside the TensorFlow C++ code.
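
For reference, this is roughly how I inspect the hang (a sketch; this only shows the Python-side stacks, while the native C++ frames need something like lldb -p <pid> and then bt all):

    import faulthandler
    import signal

    # Register a handler: sending SIGUSR1 to the process dumps all Python
    # thread stacks, even while session.run is blocked in C++ code.
    faulthandler.register(signal.SIGUSR1)  # trigger via: kill -USR1 <pid>

    # Alternatively, dump automatically if the process is stuck for 10 minutes.
    faulthandler.dump_traceback_later(timeout=600, repeat=True)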

This seems to be specific to Mac M1 hardware. I cannot reproduce the problem on other hardware or in other environments.
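
One way to check whether the tensorflow-metal GPU plugin is involved is to hide the GPU and run CPU-only (a sketch; this must run before anything initializes the GPU):

    import tensorflow as tf

    # Hide the Metal GPU so everything runs on CPU. If the hang then goes
    # away, the tensorflow-metal plugin is likely involved.
    tf.config.set_visible_devices([], "GPU")
    print(tf.config.get_visible_devices())  # should list only CPU devices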

I already posted this elsewhere, but it was suggested to also post it here.

So far I don't have a minimal example, and generating one would be quite a big effort, as this is a very complex model. But here are some relevant details:

  • This is based on RETURNN.
  • We still use graph mode.
  • I tested both with control flow v1 (calling tf.compat.v1.disable_control_flow_v2) and with control flow v2. It hangs in both cases. (See the sketch after this list.)
  • I tested using tfdbg, i.e. tf.debugging.experimental.enable_dump_debug_info. It then crashes with a segfault.
  • I get a number of other warnings, which are maybe related. See below.
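
For reference, roughly how I toggle these settings (a sketch; the dump path and the exact debug-mode arguments here are just examples):

    import tensorflow as tf

    # Control flow v1 vs v2 (we use graph mode); it hangs in both cases:
    tf.compat.v1.disable_control_flow_v2()   # force control flow v1
    # tf.compat.v1.enable_control_flow_v2()  # or force control flow v2

    # Debug dumps via tfdbg; enabling this crashes with a segfault for me:
    tf.debugging.experimental.enable_dump_debug_info(
        "/tmp/tfdbg_dump",  # placeholder dump root
        tensor_debug_mode="FULL_HEALTH",
        circular_buffer_size=-1,
    )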

To reproduce:

With control flow v2:

2023-02-17 10:02:03.997491: W tensorflow/core/common_runtime/type_inference.cc:339] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_INT32
    }
  }
}
 is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_FLOAT
    }
  }
}

	while inferring type of node 'output/rec/while/body/_38/output/rec/prev_target_embed_moved_input/cond/output/_1608'

2023-02-17 10:34:46.595736: W tensorflow/c/c_api.cc:291] Operation '{name:'global_step' id:1961 op device:{requested: '/device:CPU:0', assigned: ''} def:{{{node global_step}} = VarHandleOp[_class=["loc:@global_step"], _has_manual_control_dependencies=true, allowed_devices=[], container="", dtype=DT_INT64, shape=[], shared_name="global_step", _device="/device:CPU:0"]()}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
2023-02-17 10:35:56.799620+0100 python3[5197:2744697] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
...
2023-02-17 10:36:01.801307+0100 python3[5197:2744697] Execution of the command buffer was aborted due to an error during execution. Ignored (for causing prior/excessive GPU errors) (00000004:kIOGPUCommandBufferCallbackErrorSubmissionsIgnored)
...

(Related: https://github.com/tensorflow/tensorflow/issues/57052)

With control flow v1:

2023-02-17 10:10:01.733679: W tensorflow/c/c_api.cc:291] Operation '{name:'global_step' id:1528 op device:{requested: '/device:CPU:0', assigned: ''} def:{{{node global_step}} = VarHandleOp[_class=["loc:@global_step"], _has_manual_control_dependencies=true, allowed_devices=[], container="", dtype=DT_INT64, shape=[], shared_name="global_step", _device="/device:CPU:0"]()}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
2023-02-17 10:10:14.257716+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.257754+0100 python3[3727:2732582] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258366+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258504+0100 python3[3727:2732582] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258541+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258587+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:19.258726+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:19.258784+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)