This seems odd to me. I have never seen obfuscation techniques in first party Apple software - certainly not in Espresso or ANECompiler and overall nowhere at all except in media DRM components (FairPlay).
Apple are really the major OS company _without_ widespread use of a first party obfuscator; Microsoft have WarBird and Google have PairIP.
I went digging down the rabbit hole over the last 6 hours on what compute around training can be extracted from M4/M5 Neural Engine chips:
- was able to offload @karpathy's nanoGPT training run (partially) on Apple Neural Engine.
- moved the Classifier & Softmax layers directly onto the ANE - Classifier is 10x faster, and Softmax is 34x faster
- fixed memory exhaustion: original repo had an ARC memory leak that capped training at ~119 compile loads per process.
- patched the C-bridge, allowing continuous, stable training
Why does Apple want to make this hardware hard to access?
What actual benefits do they get?
I guess they can have their own models run faster than the competition on their hardware? But as far as I can tell they don't even really have anything on the ANE that consumers use, and local LLMs are taking off on Macs and could really benefit from this.
I suspect main benefits are they have no need to maintain the hardware or software for any longer than it makes sense for their own needs, and don't have to handhold users through a constantly evolving minefield of performance and technical capabilities.
> Throughout this series, “we” refers to maderix (human) and Claude Opus 4.6 (by Anthropic) working as a pair. The reverse engineering, benchmarking, and training code were developed collaboratively
Sure, "collaboratively." Why would I ever trust a vibe coded analysis? How do I, a non expert in this niche, know that Opus isn't pulling a fast one on both of us? LLMs write convincing bullshit that even fools experts. Have you manually verified each fact in this piece? I doubt it. Thanks for the disclaimer, it saved me from having to read it.
Humans also write endless amounts of convincing bullshit, and have done since time immemorial. False papers and faked results have been a growing scourge in academia before LLMs were a thing, and that's just counting the intentional fraud - the reproducibility crisis in science, especially medical and psychological science, affects even the best designed and well intentioned of studies.
Humans also make mistakes and assumptions while reverse engineering, so it will always need more engineers to go through the results and test things.
But not 38 TOPS that Apple claims, with the weak explanation of
> Apple’s “38 TOPS INT8” is computed as 19 TFLOPS FP16 × 2, following the industry convention of counting INT8 operations as 2× the FP16 rate. But the hardware doesn’t actually execute INT8 operations twice as fast.
Why would Apple follow that convention when the hardware explicitly doesn't? It seems like a more straight-faced lie than I expect from Apple.
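For reference, the arithmetic behind the quoted figure is easy to write down. This is only a sketch of the convention as the article describes it; the function name and the `int8_factor` parameter are made up for illustration and do not come from Apple or the article.

```python
def nominal_int8_tops(fp16_tflops, int8_factor=2):
    """Headline INT8 figure under the convention that INT8 counts as 2x FP16.

    int8_factor is a hypothetical knob: 2 is the marketing convention,
    1 models hardware that runs INT8 no faster than FP16.
    """
    return fp16_tflops * int8_factor


print(nominal_int8_tops(19))                 # 19 * 2 = 38, the headline number
print(nominal_int8_tops(19, int8_factor=1))  # 19, if INT8 gets no speedup
```

If the article is right that INT8 does not run twice as fast, the second call is the more honest peak, and the gap between the two numbers is exactly what is being complained about here.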
Can someone help me understand when these neural engines kick in in open source software?
I typically use python ML libraries like lightgbm, sklearn, xgboost etc.
I also use numpy for large correlation matrices, covariance etc.
Are these operations accelerated? Is there a simple way to benchmark?
I see a lot of benchmarks on what look like C functions, but today in my jobs I rely on higher level libraries. I don't know if they perform any better on Apple HW, and unless they have a flag like use_ane I'm inclined to think they don't.
Of course chatgpt suggested I benchmark an Intel Mac vs. newer apple silicon. Thanks chatgpt, there's a reason people still hate AI.
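On the "simple way to benchmark" question, one option is to time the exact operations you care about on each machine. Below is a minimal sketch, assuming NumPy is installed; the helper names (`best_time`, `matmul_gflops`, `bench_numpy`) and the matrix sizes are arbitrary choices for illustration. It measures whatever CPU BLAS your NumPy build links against, and `np.show_config()` tells you which one that is. Nothing in it touches the ANE.

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def matmul_gflops(n, seconds):
    """Throughput of an (n x n) @ (n x n) matmul, counted as 2*n**3 operations."""
    return 2 * n**3 / seconds / 1e9


def bench_numpy(n=2048, n_samples=5000, n_features=1000):
    """Time a float32 matmul and a correlation matrix. Requires NumPy."""
    import numpy as np  # imported here so the helpers above work without NumPy

    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n), dtype=np.float32)
    b = rng.standard_normal((n, n), dtype=np.float32)
    x = rng.standard_normal((n_samples, n_features))

    t_mm = best_time(lambda: a @ b)
    t_corr = best_time(lambda: np.corrcoef(x, rowvar=False))

    np.show_config()  # prints which BLAS/LAPACK this NumPy build was linked against
    return {"matmul_gflops": matmul_gflops(n, t_mm), "corrcoef_seconds": t_corr}
```

Running `bench_numpy()` on two machines gives directly comparable numbers for your own workload shapes. Taking the best of several runs rather than the mean reduces noise from warm-up and background load.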
> when these neural engines kick in in open source software?
It mostly doesn't because NPUs are bespoke and vendor-specific (which incents neglect by software devs working on open source numerics and ML/AI infrastructure), and the Apple ANE is no exception. Part of this effort is most likely about fixing that for the specific case of the Apple ANE.
Part of which effort? Is the reverse engineering in the blog article meant to make it usable?
I just think: great, it seems like I'm paying for a hardware accelerator that makes Siri go faster. And I have used Siri on my laptop exactly 0 times in the last infinite years.
It also makes a lot of really useful features like on-device OCR, captions, voice isolation, temporal antialiasing in MetalFX, an enormous host of things in the Apple pro apps, etc. work
Much of this information we already knew the very basics of from documentation of the M1/M2 ANE as accessed via bare-metal from Asahi Linux, but it's nice to see confirmation and it being explored in further depth. Note that according to OP Parts 1/2 for very large matmuls CoreML adds little to no overhead compared to the lower-level interface, so there seems to be plenty of scope for supporting ANE for prefill in local AI frameworks. Decode is generally memory-bandwidth limited unless context is very large, and the ANE requires special handling (converting from matmul to 1x1 convolution as described here is wasteful of memory bandwidth, as is potentially dequantizing to INT8/FP16 in memory) so it's less of a clear win.
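To make the matmul-to-1x1-convolution point concrete, here is a small pure-Python sketch (no framework and no ANE API; all function names are illustrative). It shows that a matmul over `[tokens][C_in]` data gives the same numbers as a 1x1 convolution over a channel-first `[C_in][1][tokens]` tensor, and that getting there requires re-laying-out both the input and the weights.

```python
def matmul(x, w):
    """x: [S][C_in], w: [C_in][C_out] -> [S][C_out]."""
    c_in, c_out = len(w), len(w[0])
    return [[sum(row[c] * w[c][o] for c in range(c_in)) for o in range(c_out)]
            for row in x]


def conv1x1(inp, kernel):
    """inp: [C_in][H][W], kernel: [C_out][C_in] (1x1 spatial dims dropped)
    -> [C_out][H][W]. Each output pixel mixes channels at the same position."""
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    return [[[sum(kernel[o][c] * inp[c][i][j] for c in range(c_in))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(kernel))]


def matmul_via_conv1x1(x, w):
    """Compute x @ w by reshaping into a channel-first 1x1 convolution."""
    s, c_in, c_out = len(x), len(x[0]), len(w[0])
    # Layout change 1: tokens-major [S][C_in] -> channel-first [C_in][1][S].
    inp = [[[x[t][c] for t in range(s)]] for c in range(c_in)]
    # Layout change 2: weights [C_in][C_out] -> kernel [C_out][C_in].
    kernel = [[w[c][o] for c in range(c_in)] for o in range(c_out)]
    out = conv1x1(inp, kernel)  # [C_out][1][S]
    # Layout change 3: back to [S][C_out].
    return [[out[o][0][t] for o in range(c_out)] for t in range(s)]


x = [[1, 2], [3, 4], [5, 6]]
w = [[1, 0, 2], [0, 1, 3]]
print(matmul_via_conv1x1(x, w) == matmul(x, w))  # True
```

The three layout changes are extra copies through memory. In a bandwidth-bound decode step that kind of traffic is the cost being described above, though how much of it a real compiler can fold away is not something this sketch can show.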
* They haven’t said the source isn’t available to them, just that the closed nature of the ANE means they can’t use it in OSS.
* They’ve repeated constantly that it can’t do backprop and isn’t useful for most MLX use cases.
And really, ANE isn’t even that interesting for MLX; it’s a limited-resource, power-efficient inference engine for smallish edge models. If you want to use it you can use the Apple APIs, which while limited are generally “shaped” like what you’d want to do anyway. Almost every “biggish” CPU has one of these now and Apple don’t want to give away the specifics of theirs (even though it’s been pretty thoroughly RE’d by real REs and re-summarized by Claude, like this article).
actually, it really is not necessarily a 'hardware company' thing. ive been in 'hardware companies' where the rtl was just as available for viewing as the rest of the firmware/software.
in big hardware companies, things start getting siloed, but that probably has more to do with big companies (seemingly invariably) operating as a union of fiefdoms (dunbar-number-ification?)
I'm not op but I don't think op meant to shame, I understand the construction "tell me you're... without telling me" as a way to highlight that something is unexpected to people who haven't done something, that is that something is particularly unintuitive without some special experience.
> It's insane that the source code of ANE is not available even to the MLX team
no it's not insane - it's completely mundane policy. that's my point - that you're calling something out as insane with exactly zero experience (which is the actually insane thing...).
on that line of argument, nobody would have ever called out the emperor for not wearing any clothes, civilians would not go to peace protests, and nobody would ever improve things by looking at something from another angle.
This is a completely asinine take - you're not observing the emperor with no clothes here - you're completely outside the kingdom hypothesizing that the emperor has no clothes. To wit: you don't actually know that the ANE "source" isn't available to MLX. Hint: it probably is but there's just red tape involved.
The recent news is that Apple is supposedly replacing the Core ML framework with an updated version that will make it easier to integrate third party LLMs into your apps.
> the company is also planning a few other software-based AI upgrades, including a new framework called Core AI. The idea is to replace the long-existing Core ML with something a bit more modern.
Is it really worth having separate GPU and NE? Seems redundant and weird compared to what Nvidia is doing, i.e. "GPUs are good NEs", or is that not really true?
No, GPUs are not what you'd design for neural networks from first principles. They were adopted for that because they offered far more parallelism than general purpose CPUs, not because they're ideal. That's why Google et al. designed TPUs that have a very different internal structure.
Most TPU designs have been based around systolic arrays, which for matrix ops have a quadratic speedup. A typical design is a 128x128 array of MAC units. You shift weights along one dimension, inputs along the other. It takes 128 cycles to shift a full matrix input in, then 128 cycles to shift the answer back out, but during those 256 cycles you got 16,384 MAC operations done, for a factor of 64 speedup.
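The numbers above can be checked with a few lines. This is a sketch of the simplified cost model in this comment only (n cycles to load, n cycles to drain, one MAC per cell), not a model of any real TPU; the function name is made up.

```python
def systolic_speedup(n):
    """Speedup of an n x n systolic array over one MAC per cycle,
    using the simplified model: n cycles in, n cycles out, n*n MACs done."""
    macs = n * n       # one multiply-accumulate per cell
    cycles = 2 * n     # shift the inputs in, then shift the results out
    return macs / cycles


print(128 * 128)              # 16384 MAC operations
print(systolic_speedup(128))  # 64.0
```

Under this model the speedup is n/2, so it grows with the side length of the array while the number of MAC units grows with its square.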
The other big appeal of this design is it's way simpler than GPUs. The memory access patterns are predictable, there's no threads or thread divergence, etc. So it can be way more efficient in silicon, not just in area but especially in power efficiency.
There's other ideas for architectures besides this basic systolic array idea. If you want to learn about them, a good place would be the HotChips presentations of the last few years: https://hc2025.hotchips.org and similar domain names for prior years.
What’s the intent of pointing out the presumed provenance in writing, now that LLMs are ubiquitous?
Is it like one of those “Morning” nods, where two people cross paths and acknowledge that it is in fact morning? Or is there an unstated preference being communicated?
Is there any real concern behind LLMs writing a piece, or is the concern that the human didn’t actually guide it? In other words, is the spirit of such comments really about LLM writing, or is it about human diligence?
That begs another question: does LLM writing expose anything about the diligence of the human, outside of when it’s plainly incorrect? If an LLM generates a boringly correct report - what does that tell us about the human behind that LLM?
We've got about a year before so many people are interacting with LLMs on a daily basis that its style starts to reverse infect human speech and writing
Great insight – Would you like to try and identify some specific "AI-isms" that you've noticed creeping into your own writing or your colleagues' emails lately?
Gawd Damn LISTICLES!!!! And all of those articles that list in bullet points at the top of the article the summary of the article. And all of those people saying they don't want to read exposition, just give me the bullet points.
It's already happened to me. I've started to have dreams where instead of some sort of interpersonal struggle the entire dream is just a chatbot UI viewport and I'm arguing with an LLM streaming the responses in. Which is super trippy when I become aware it's a dream. In the old days I'd dream about playing chess against myself and lose, which was quite a bizarre feeling because my brain was running both players. But that's totally normal compared to having my brain pretend to be an LLM inside a dream.
Also the Prior Art section, which has telltale repetition of useless verbs like "documenting," "providing insight into," and "confirming" on each line. This was definitely AI-written, at least in part.
Below are the items from that section. How should they be written to not look like an AI?
> hollance/neural-engine — Matthijs Hollemans’ comprehensive community documentation of ANE behavior, performance characteristics, and supported operations. The single best existing resource on ANE.
> mdaiter/ane — Early reverse engineering with working Python and Objective-C samples, documenting the ANECompiler framework and IOKit dispatch.
> eiln/ane — A reverse-engineered Linux driver for ANE (Asahi Linux project), providing insight into the kernel-level interface.
> apple/ml-ane-transformers — Apple’s own reference implementation of transformers optimized for ANE, confirming design patterns like channel-first layout and 1×1 conv preference.
Reverse Engineering with AI is only going to get better. I have seen some crazy things friends of mine have done with Claude alone. Let's just say SaaS isn't the only industry that could one day suffer.
I remember the good old days when Apple was desperate for developers and produced great documentation and there were a lot of great 3rd-party books too. You can't just give out awards in hopes that someone will make that great app.
I have always wondered if the neural engine could be used for training - pretty excited for part 3 of this to see if the juice is actually worth the squeeze
For me, what AI brings is augmented humans. Just as we don't calculate on paper anymore, what is the reason for doing things by hand when a machine is X times better?
Want to code by hand, as artisans of old? Suit yourself.
If you strip away the branding, Apple has and continues to ship a ton of algorithms that likely use the ANE and end users can use CoreML to do the same.
Just some things that people will likely take for granted that IIRC Apple have said use the ANE or at least would likely benefit from it: object recognition, subject extraction from images and video, content analysis, ARKit, spam detection, audio transcription.
Don’t forget FaceID and many of the image manipulation features.
And while everyone else went to more powerful giant LLMs, Apple moved most of Siri from the cloud to your device. Though they do use both (which you can see when Siri corrects itself during transcription—you get the local Siri version corrected later by the cloud version).
I just wanted to say that you’ve done an excellent job and am looking forward to the 3rd installment.