Div calculation issue in Metal

Hi all. I've been writing various compute functions in Metal. In the kernel below, the / operation shows an accuracy issue that + and * do not.

The kernel divides a tensor of shape [n, x, y] by a scalar of shape [1]. If I change the operator in the kernel to * or +, the results match numpy and torch exactly, but with / the mean differs by more than 1e-5. (For reference, the kernel is based on the Metal kernel code in llama.cpp.)

kernel void kernel_div_single_f16(
        device const half * src0,
        device const half * src1,
        device       half * dst,
        constant  int64_t & ne00,
        constant  int64_t & ne01,
        constant  int64_t & ne02,
        constant  int64_t & ne03,
        uint3 tgpig[[threadgroup_position_in_grid]],
        uint3 tpitg[[thread_position_in_threadgroup]],
        uint3   ntg[[threads_per_threadgroup]]) {
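    // One threadgroup per row: the threadgroup position selects the row.
    const int64_t i03 = tgpig.z;
    const int64_t i02 = tgpig.y;
    const int64_t i01 = tgpig.x;

    // Element offset of the start of this row in a contiguous tensor.
    const uint offset = i03*ne02*ne01*ne00 + i02*ne01*ne00 + i01*ne00;

    // Threads of the threadgroup stride across the row.
    for (int i0 = tpitg.x; i0 < ne00; i0 += ntg.x) {
        // Half-precision division by the single scalar in src1.
        dst[offset + i0] = src0[offset+i0] / *src1;
    }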
}
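
For anyone who wants to reproduce the comparison on the host side, here is a minimal NumPy sketch of the check described above. It is not part of the original code: the input shape, the random data, the divisor value, and the helper name mean_abs_diff are all hypothetical. The reciprocal-multiply variant is included only to illustrate how a different way of evaluating a half-precision division can shift the result. Whether the Metal compiler does anything like this on this hardware is an assumption, not something established here.

```python
import numpy as np


def mean_abs_diff(result, reference):
    """Mean absolute difference, accumulated in float64 so the metric
    itself does not add half-precision rounding error."""
    a = result.astype(np.float64)
    b = reference.astype(np.float64)
    return float(np.mean(np.abs(a - b)))


# Hypothetical input of shape [n, x, y] and a hypothetical scalar divisor.
rng = np.random.default_rng(0)
src0 = rng.uniform(-1.0, 1.0, size=(4, 8, 16)).astype(np.float16)
src1 = np.float16(3.0)

# Reference: divide in float64, then round to half once.
reference = (src0.astype(np.float64) / np.float64(src1)).astype(np.float16)

# Plain float16 division as NumPy evaluates it.
numpy_div = src0 / src1

# Illustration only (assumption): multiply by a reciprocal that has
# already been rounded to half precision.
recip = np.float16(1.0) / src1
recip_mul = src0 * recip

print("numpy /          vs reference:", mean_abs_diff(numpy_div, reference))
print("x * (1/s) (half) vs reference:", mean_abs_diff(recip_mul, reference))
```

Replacing the synthetic numpy_div array with the buffer read back from the Metal kernel gives the same metric for the GPU result, which makes it easier to see whether the gap looks like ordinary half-precision rounding or something larger.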

My machine is a MacBook Pro (16-inch, 2021) / macOS 12.5 / Apple M1 Pro.

Are there any known issues related to division in Metal? Thanks in advance for your reply.
