Div calculation issue in metal

Hi all. I've been writing various compute functions in Metal. In the kernel below, the / operation shows an accuracy issue that + and * do not.

This function divides a tensor of shape [n, x, y] by a scalar of shape [1]. If I change the operator in the function below from / to * or +, the results match numpy and torch exactly, but with / the mean differs by more than 1e-5. (For reference, the kernel was written with reference to the Metal kernel code in llama.cpp.)

kernel void kernel_div_single_f16(
        device const half * src0,
        device const half * src1,
        device       half * dst,
        constant  int64_t & ne00,
        constant  int64_t & ne01,
        constant  int64_t & ne02,
        constant  int64_t & ne03,
        uint3 tgpig[[threadgroup_position_in_grid]],
        uint3 tpitg[[thread_position_in_threadgroup]],
        uint3   ntg[[threads_per_threadgroup]]) {
    const int64_t i03 = tgpig.z;
    const int64_t i02 = tgpig.y;
    const int64_t i01 = tgpig.x;

    const uint offset = i03*ne02*ne01*ne00 + i02*ne01*ne00 + i01*ne00;

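    // Each thread in the threadgroup strides across one row; src1 holds the single scalar divisor.
    for (int i0 = tpitg.x; i0 < ne00; i0 += ntg.x) {
        dst[offset + i0] = src0[offset + i0] / *src1;
    }
}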

My machine is a MacBook Pro (16-inch, 2021) / macOS 12.5 / Apple M1 Pro.

Are there any known issues related to division? Thanks in advance for your reply.

Hello @Chans0,

Thanks for reaching out about this!

I believe this may be related to fast math, which is generally enabled by default when you compile your kernel.

Try disabling fast math to see if you get the "correct" results.

To do that, set "Enable Fast Math" to "No" under the Metal Compiler - Build Options for your project.
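
To illustrate why division in particular could drift while + and * match, here is a minimal CPU-side sketch in Python. It assumes, purely for illustration, that a relaxed math mode lowers x / y to x * (1 / y) with each intermediate rounded to half precision; I have not confirmed that this is what the Metal compiler actually emits for this kernel. The helper names (to_half, div_exact, div_via_reciprocal, mean_abs_diff) are hypothetical and only the standard library is used.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def div_exact(a: float, b: float) -> float:
    """Reference: divide in double precision, then round once to half."""
    return to_half(to_half(a) / to_half(b))

def div_via_reciprocal(a: float, b: float) -> float:
    """Assumed relaxed lowering: round 1/b to half, multiply, round again."""
    recip = to_half(1.0 / to_half(b))
    return to_half(to_half(a) * recip)

def mean_abs_diff(values, divisor: float) -> float:
    """Mean absolute difference between the two division strategies."""
    diffs = [abs(div_exact(v, divisor) - div_via_reciprocal(v, divisor))
             for v in values]
    return sum(diffs) / len(diffs)

# 5 / 3 lands on different half-precision neighbours depending on the path.
print(div_exact(5.0, 3.0), div_via_reciprocal(5.0, 3.0))
# prints: 1.6669921875 1.666015625

# Averaged over a small range of inputs the two paths still disagree.
print(mean_abs_diff([float(v) for v in range(1, 65)], 3.0))
```

A single half-precision step near 1.0 is about 1e-3, so a mean difference above 1e-5 is plausible if only some elements round differently. If you compile the kernel at runtime from source rather than through Xcode, the corresponding switch should be the `fastMathEnabled` property of `MTLCompileOptions`.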

Best regards,

