ML Systems Review

Apple M4 Max NPU: First Benchmarks on Transformer Inference

The M4 Max pairs a 38 TOPS Neural Engine with CoreML 9 and quiet MLX 0.22 support for INT8 quantised paths. We ran ViT-L/16 forward passes against it and compared throughput per watt with the M3 Max and an RTX 4090.

Benchmarks
By Lukas Berg, MS · Reviewed by Dr. Nadia Volkov, PhD
6 min read

Apple's M4 Max, released in the refreshed MacBook Pro line in March 2026, is the first Apple silicon part to clear the 38 TOPS mark on its Neural Engine, up from roughly 18 TOPS on the M3 Max generation. Apple's product marketing has been characteristically unspecific about what that number means under load. We ran a small but careful set of transformer-inference benchmarks against the new part, using CoreML 9 and the MLX 0.22 release, and compared the results with the M3 Max and with an Nvidia RTX 4090 driven by TensorRT 10.

The headline is that on ViT-L/16 at batch size 1, the M4 Max NPU reaches 11.4 ms per forward pass in FP16 and 7.2 ms in INT8, delivering roughly 2.7 TFLOPS per watt on the sustained INT8 path. The RTX 4090 is faster in absolute terms, at 2.9 ms per forward pass in INT8, but draws about 340 W under sustained load, yielding around 0.51 TFLOPS per watt on the same workload. The M4 Max's efficiency advantage is, as expected, substantial. The less obvious result is how close the M4 Max now sits to the 4090 on small-batch latency, which matters for interactive workloads.

Setup and methodology

The benchmark target was ViT-L/16 (307M parameters), an uncontroversial reference architecture for vision backbones. We used the official timm weights, exported to CoreML via coremltools 8.1 for the Apple path and to ONNX for TensorRT on the 4090. Images were 224×224 RGB, batch size 1 unless noted, with a warm cache; measurements are medians over 200 forward passes, with inter-quartile ranges reported where relevant.
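For the Apple path, the export we describe amounts to tracing the timm checkpoint and converting it with coremltools. A minimal sketch, assuming current timm and coremltools APIs; the input name and output filename are ours rather than anything canonical:

```python
# Sketch of the Apple-side export: timm weights -> traced TorchScript ->
# CoreML FP16 package targeting the Neural Engine.
import coremltools as ct
import timm
import torch

model = timm.create_model("vit_large_patch16_224", pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    compute_precision=ct.precision.FLOAT16,   # the default ANE path
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # keep work off the GPU
)
mlmodel.save("vit_l16_fp16.mlpackage")  # illustrative filename
```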

The Apple path was measured in two configurations: CoreML with FP16 weights (the default ANE path) and CoreML with linearly quantised INT8 weights using Apple's weight-compression tooling. We also ran the same model through MLX 0.22, which now exposes the Neural Engine directly for supported operators, as a cross-check. MLX numbers were within 4% of CoreML on every configuration we tried, a noticeable improvement over MLX 0.18's NPU path.
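The INT8 configuration used coremltools' weight-compression entry points. A minimal sketch, assuming the coremltools 8.x optimize API and the FP16 package produced above:

```python
# Sketch of the INT8 weight path via coremltools' optimize APIs.
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

mlmodel = ct.models.MLModel("vit_l16_fp16.mlpackage")

op_config = OpLinearQuantizerConfig(
    mode="linear_symmetric",    # symmetric weights, zero-point fixed at 0
    dtype="int8",
    granularity="per_channel",  # one scale per output channel
)
config = OptimizationConfig(global_config=op_config)

mlmodel_int8 = linear_quantize_weights(mlmodel, config=config)
mlmodel_int8.save("vit_l16_int8.mlpackage")
```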

Power measurement on the M4 Max used powermetrics, sampled at 500 ms intervals and averaged over the benchmark run. The 4090 was measured via nvidia-smi power-draw queries at the same cadence. Neither measurement is laboratory-grade, but the relative gap is large enough that minor error bars do not move the conclusion.
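On the 4090 side, the averaging reduces to polling nvidia-smi at a fixed cadence and taking the mean over the window. A rough sketch; the sample count and function name are illustrative:

```python
# Rough sketch of the 4090 power measurement: poll nvidia-smi at a fixed
# cadence (matching the 500 ms powermetrics interval) and average.
import statistics
import subprocess

def mean_gpu_power(n_samples: int = 120, interval_ms: int = 500) -> float:
    proc = subprocess.Popen(
        ["nvidia-smi", "--query-gpu=power.draw",
         "--format=csv,noheader,nounits", "-lms", str(interval_ms)],
        stdout=subprocess.PIPE, text=True,
    )
    readings = [float(proc.stdout.readline()) for _ in range(n_samples)]
    proc.terminate()
    return statistics.mean(readings)  # watts

# Run concurrently with the benchmark loop, e.g.:
# print(f"avg draw: {mean_gpu_power():.0f} W")
```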

Results

Hardware            Precision   Latency (ms)   Throughput (img/s)   Avg power (W)   TFLOPS/W
M3 Max NPU          FP16        18.9           53                   9.1             1.4
M3 Max NPU          INT8        12.4           81                   8.4             2.2
M4 Max NPU          FP16        11.4           88                   10.2            2.0
M4 Max NPU          INT8        7.2            139                  10.8            2.7
M4 Max (MLX 0.22)   INT8        7.5            133                  10.9            2.6
RTX 4090            FP16        4.1            244                  332             0.47
RTX 4090            INT8        2.9            345                  340             0.51

ViT-L/16, batch size 1, 224×224 input. Median of 200 runs. Power is sustained average over the benchmark window.
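The latencies were collected with a loop of the shape below. A sketch rather than our exact harness, assuming the FP16 package and input name from the export step; note that MLModel.predict timings include a little Python dispatch overhead:

```python
# Minimal latency harness: warm-up, then median and IQR over 200 passes.
import time

import coremltools as ct
import numpy as np

model = ct.models.MLModel("vit_l16_fp16.mlpackage")
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

for _ in range(20):                    # warm the cache before measuring
    model.predict({"image": x})

times_ms = []
for _ in range(200):
    t0 = time.perf_counter()
    model.predict({"image": x})
    times_ms.append((time.perf_counter() - t0) * 1e3)

q1, med, q3 = np.percentile(times_ms, [25, 50, 75])
print(f"median {med:.1f} ms  (IQR {q1:.1f}-{q3:.1f} ms)")
```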

INT8 quantisation on the M4 Max delivered a 1.6x latency improvement over FP16, a larger gain than the 1.5x we measured on the M3 Max. We attribute this partly to Apple's updated matmul units — CoreML 9's release notes reference redesigned INT8 matmul paths — and partly to better dispatch of mixed-precision operators in the 9.0 compiler. Accuracy degradation from the INT8 path was negligible on ImageNet-1k validation: top-1 dropped from 85.2% to 85.0% on the same checkpoint.

What the numbers mean

The M4 Max NPU is now a credible target for on-device ViT-class inference. 7.2 ms for a ViT-L/16 forward pass at under 11 W opens up a set of interactive use cases — real-time object search in photo libraries, live segmentation overlays, on-device retrieval-augmented generation over image corpora — that were previously on the edge of practicality for consumer hardware. The efficiency gap against the 4090 is roughly 5x in favour of the M4 Max on a sustained-TFLOPS-per-watt basis. The 4090 remains the right choice when absolute latency matters more than power, but the set of workloads where that is true has narrowed.

Two caveats. First, these are batch-size-1 numbers; the 4090 scales to larger batches far better than the NPU, so any bulk-inference workload will look different. Second, the INT8 path on Apple silicon requires a symmetric-weight, channel-wise quantisation scheme that not every pretrained checkpoint tolerates cleanly. Teams accustomed to Nvidia's calibration tooling will find Apple's equivalent thinner, though MLX 0.22 adds a reasonable calibration helper that we expect to iterate quickly.
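For teams vetting a checkpoint against that constraint, the scheme itself is cheap to prototype. A toy numpy sketch of symmetric, per-output-channel INT8 weight quantisation and its reconstruction error:

```python
# Toy illustration of symmetric per-channel INT8 weight quantisation:
# one scale per output channel, zero-point fixed at 0.
import numpy as np

def quantise_symmetric_per_channel(w: np.ndarray):
    # w: (out_channels, in_features) weight matrix
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 4096)).astype(np.float32)
q, scale = quantise_symmetric_per_channel(w)
w_hat = q.astype(np.float32) * scale

# Checkpoints with heavy per-channel outliers show large errors here,
# which is the failure mode described above.
print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```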

A fuller set of benchmarks — including CLIP ViT-H/14, Whisper large-v3, and a small language model — is on the way. For now, the headline stands: the M4 Max has closed a meaningful fraction of the gap to discrete GPUs on transformer inference latency, and has extended its lead on efficiency.
