This does not seem to be affecting the training, but it looks important (though I have no idea how to read it):
Error: command buffer exited with error status.
The Metal Performance Shaders operations encoded on it may not have completed.
Error:
(null)
Internal Error (0000000e:Internal Error)
<AGXG13XFamilyCommandBuffer: 0x29b027b50>
label = <none>
device = <AGXG13XDevice: 0x12da25600>
name = Apple M1 Max
commandQueue = <AGXG13XFamilyCommandQueue: 0x106477000>
label = <none>
device = <AGXG13XDevice: 0x12da25600>
name = Apple M1 Max
retainedReferences = 1
This happens while training a "heavy" model on a "heavy" dataset, so it may be related to a memory issue, but I have no idea how to address it.
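One rough way to probe the memory hypothesis is to estimate how much memory a single input batch occupies, then try a smaller batch size and see whether the error still appears. The sketch below is plain Python. The function names `batch_bytes` and `max_batch_size`, the float32 assumption (4 bytes per element), and the example shapes are all hypothetical. It counts only the input tensor, not weights, activations, or gradients, so the real footprint will be larger.

```python
from math import prod


def batch_bytes(batch_size, sample_shape, bytes_per_element=4):
    """Bytes needed to hold one input batch (float32 assumed by default)."""
    return batch_size * prod(sample_shape) * bytes_per_element


def max_batch_size(budget_bytes, sample_shape, bytes_per_element=4):
    """Largest batch size whose input tensor fits within budget_bytes."""
    per_sample = prod(sample_shape) * bytes_per_element
    return budget_bytes // per_sample


# Hypothetical sequence input: 200 time steps, 64 features per step.
sample_shape = (200, 64)
print("bytes for a batch of 512:", batch_bytes(512, sample_shape))
print("batch size fitting in 64 MiB:", max_batch_size(64 * 1024**2, sample_shape))
```

If the error goes away at a smaller batch size, that would point toward memory pressure. If it persists, the cause is probably elsewhere.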
Hi, I have the same error on an M2 Max with TensorFlow, in an LSTM model:
The Metal Performance Shaders operations encoded on it may not have completed.
Error:
(null)
Internal Error (0000000e:Internal Error)
<AGXG14XFamilyCommandBuffer: 0x5cbea68f0>
label = <none>
device = <AGXG14CDevice: 0x13385a200>
name = Apple M2 Max
commandQueue = <AGXG14XFamilyCommandQueue: 0x1422ab400>
label = <none>
device = <AGXG14CDevice: 0x13385a200>
name = Apple M2 Max
retainedReferences = 1