This is a new neural model I implemented, and I want to train it. It is a modification of an existing attention-based encoder-decoder model, for which everything works fine.
In the new model, it just hangs in session.run and does not do anything. I also cannot interrupt it. It hangs inside the TensorFlow C++ code.
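One way to at least see which Python frame is blocked while the process hangs is the standard-library faulthandler module, whose watchdog thread dumps all Python tracebacks after a timeout even when the main thread is stuck in native code. The following is only a minimal sketch: the helper name hang_watchdog and the timeout value are made up for illustration and are not part of RETURNN or TensorFlow, and it shows only Python-level frames, not the C++ stack.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump the tracebacks of all Python threads every `timeout_s` seconds
    for as long as the wrapped block is still running.

    `hang_watchdog` is a hypothetical helper name, not an existing API.
    `file` must be a real file object with a file descriptor.
    """
    if file is None:
        file = sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        # Stop the watchdog once the block finished (or raised).
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the blocking call:
#
#     with hang_watchdog(60):
#         session.run(fetches, feed_dict=feed_dict)
```

If the dump shows the main thread inside session.run, that would at least confirm that the hang is below the Python layer; a native debugger would still be needed to inspect the C++ side.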
This seems to be specific to Mac M1 hardware. I cannot reproduce the problem on other hardware or in other environments.
I already posted this elsewhere, but it was suggested that I also post it here.
So far I don't have a minimal example, and generating one would be quite a big effort, as this is a very complex model. But here are some relevant details:
- This is based on RETURNN.
- We still use graph-mode.
- I tested both with control flow v1 (calling disable_control_flow_v2) and control flow v2. It hangs in both cases.
- I tested using tfdbg or enable_dump_debug_info. It then crashes with a segfault.
- I get a number of other warnings, which are maybe related. See below.
To reproduce:
- Code: https://github.com/rwth-i6/i6_experiments/blob/81bcef39b5829aa43b84bcab4b4fa03f82fc3bc5/users/zeyer/experiments/exp2023_02_16_chunked_attention/demo_returnn_config.py
- Check out the i6_experiments repo, commit 81bcef39b5829aa43b84bcab4b4fa03f82fc3bc5.
- Check out RETURNN, commit 2ed598443f22de42599a0fee9bc43fbb5e0abec2.
- Run:
python3 returnn/rnn.py i6_experiments/users/zeyer/experiments/exp2023_02_16_chunked_attention/demo_returnn_config.py
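Because the hung process does not react to a keyboard interrupt, it may be more convenient to launch the command above from a small wrapper that kills it after a timeout. This is only a sketch using the standard-library subprocess module: run_with_timeout is a hypothetical helper name, the timeout is an arbitrary choice, and a process blocked inside the GPU driver may not always be killable this way.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run `cmd` as a child process.

    Returns the exit code, or None if the child was still running after
    `timeout_s` seconds. `run_with_timeout` is a hypothetical helper name.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return None


# Hypothetical usage with the command from the steps above:
#
#     cmd = [sys.executable, "returnn/rnn.py",
#            "i6_experiments/users/zeyer/experiments/"
#            "exp2023_02_16_chunked_attention/demo_returnn_config.py"]
#     code = run_with_timeout(cmd, timeout_s=600)
#     print("timed out" if code is None else f"exit code {code}")
```

A None result then distinguishes a hang from a crash such as the segfault mentioned above, which would show up as a nonzero (typically negative) exit code.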
With control flow v2:
2023-02-17 10:02:03.997491: W tensorflow/core/common_runtime/type_inference.cc:339] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
type_id: TFT_PRODUCT
args {
type_id: TFT_TENSOR
args {
type_id: TFT_INT32
}
}
}
is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
type_id: TFT_PRODUCT
args {
type_id: TFT_TENSOR
args {
type_id: TFT_FLOAT
}
}
}
while inferring type of node 'output/rec/while/body/_38/output/rec/prev_target_embed_moved_input/cond/output/_1608'
2023-02-17 10:34:46.595736: W tensorflow/c/c_api.cc:291] Operation '{name:'global_step' id:1961 op device:{requested: '/device:CPU:0', assigned: ''} def:{{{node global_step}} = VarHandleOp[_class=["loc:@global_step"], _has_manual_control_dependencies=true, allowed_devices=[], container="", dtype=DT_INT64, shape=[], shared_name="global_step", _device="/device:CPU:0"]()}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
2023-02-17 10:35:56.799620+0100 python3[5197:2744697] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
...
2023-02-17 10:36:01.801307+0100 python3[5197:2744697] Execution of the command buffer was aborted due to an error during execution. Ignored (for causing prior/excessive GPU errors) (00000004:kIOGPUCommandBufferCallbackErrorSubmissionsIgnored)
...
(Related: https://github.com/tensorflow/tensorflow/issues/57052)
With control flow v1:
2023-02-17 10:10:01.733679: W tensorflow/c/c_api.cc:291] Operation '{name:'global_step' id:1528 op device:{requested: '/device:CPU:0', assigned: ''} def:{{{node global_step}} = VarHandleOp[_class=["loc:@global_step"], _has_manual_control_dependencies=true, allowed_devices=[], container="", dtype=DT_INT64, shape=[], shared_name="global_step", _device="/device:CPU:0"]()}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
2023-02-17 10:10:14.257716+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.257754+0100 python3[3727:2732582] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258366+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258504+0100 python3[3727:2732582] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258541+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258587+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:19.258726+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:19.258784+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)