When an AI Model Solves College-Level Math and Physics — On a Phone
This morning I came across a model called Nanbeige4.1-3B, and what began as simple curiosity quickly became something more significant.
I loaded an already 4-bit quantized version and ran it locally on a phone. No GPU, no cloud support, no hidden infrastructure — just a compact reasoning model operating entirely at the edge.
So I decided to push it.
I started with classical mechanics: acceleration, force, friction on an incline. The model worked through them cleanly and correctly. Then I stepped into calculus and gave it a differential equation. It immediately recognized the structure, chose the proper method, carried the mathematics through without confusion, and verified the result.
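To give a sense of the level involved, a typical incline problem of this kind (my own illustration here, not the exact prompt I used) reduces to

$$a = g(\sin\theta - \mu_k \cos\theta),$$

the acceleration of a block sliding down a plane inclined at angle $\theta$ with kinetic friction coefficient $\mu_k$. With $\theta = 30^\circ$, $\mu_k = 0.2$ and $g = 9.8\ \text{m/s}^2$, that works out to $a \approx 3.2\ \text{m/s}^2$.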
It did not behave like a model trying to sound intelligent. It behaved like a system trained to solve problems.
And it was doing this on a phone.
For a long time, we have associated serious reasoning in AI with massive models and enormous compute. Capability was supposed to live inside data centers. Bigger models were expected to mean smarter systems.
But watching Nanbeige4.1-3B handle college-level math and physics forces a rethink of that assumption. Intelligence is not only expanding — it is compressing. Better training and sharper reasoning alignment are allowing smaller models to operate far beyond what their size once suggested.
When structured problem-solving runs locally on pocket hardware, the implications are larger than they first appear. Experimentation becomes personal. Engineers can explore ideas without waiting on infrastructure. Students can access serious analytical capability from a device they already carry. Builders are no longer required to send every complex task into the cloud.
The center of gravity begins to shift — away from centralized compute and toward the individual.
What makes moments like this easy to miss is that they rarely arrive with fanfare. There is no dramatic announcement when efficiency crosses a threshold. One day you simply notice that a small model is solving problems you would comfortably place in an early college classroom, and the old belief that intelligence must be enormous starts to feel outdated.
Teams behind models like Nanbeige4.1-3B may not yet have widespread recognition, but progress often enters quietly before its consequences become obvious.
We often imagine the future of AI belonging to giant systems. Yet models like this hint at another direction — one where powerful intelligence is not something distant in a data center, but something that runs beside you.
Sometimes technological change is loud.
And sometimes you just realize that an AI model solving college-level math and physics is sitting comfortably in your pocket.
The model's responses are here. Sorry for the unformatted dump; the model itself produces formatted responses, including rendered math equations:
https://fate-stingray-0b3.notion.site/AI-model-Nanbeige4-1-3B-3043b975deec80118f3cc323ada9c1ff
The model used https://huggingface.co/Edge-Quant/Nanbeige4.1-3B-Q4_K_M-GGUF
The previous version 4 was so good that it stayed on my computer until today. Usually I just test small models out of curiosity and then delete them.
How do you run llama.cpp on a phone? What phone do you have, and what are its specs?
If you have a reasonably powerful Android phone, start by installing Termux, a terminal emulator.
Next, use llama.cpp to run the model locally on your phone. You will need the model in GGUF format.
After that, run the server and chat with the model through its web interface.
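Roughly, the steps look like this (a sketch, assuming the prebuilt llama.cpp package in Termux and the quantized GGUF linked above; check the repo's file list for the exact filename):

```sh
# Inside Termux (install Termux itself first, ideally from F-Droid)
pkg update && pkg upgrade
pkg install llama-cpp wget        # prebuilt llama-cli / llama-server binaries

# Download a GGUF model; the filename below is an assumption, verify it in the repo
wget https://huggingface.co/Edge-Quant/Nanbeige4.1-3B-Q4_K_M-GGUF/resolve/main/nanbeige4.1-3b-q4_k_m.gguf

# Serve it, then open http://127.0.0.1:8080 in the phone's browser to chat
llama-server -m nanbeige4.1-3b-q4_k_m.gguf -c 4096 --port 8080
```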
Let me know if you need more details.
Thank you. Trying the Falcon Tiny 90M Instruct Q4 quant right now.
Testing on an old phone.
Let us know how it goes.
Is it possible that --top_k should be set to 0? When I tried 40 or 20, it was thinking a lot just to answer "Hello!". Once I tried --top_k 0, the thinking was much shorter.
I have an old phone and it did not work (bus error). I compiled llama.cpp with the right flags, but I still get this error. The build itself seems fine: I can run the applications (llama-cli, llama-server, etc.) with the -h help flag, but as soon as I give them a model, I get the error. There is not much information on it, so I can't figure out what is causing it. There is enough RAM available: over 1 GB free, 3 GB cached, and only 3 GB of the 6 GB total in use.
My phone is an Infinix Zero X (NEO).
On my computer the model consumes negligible RAM (so little I can't even see exactly how much); I'm talking about Falcon H1 Tiny (90M) at Q4. However, I've noticed a pattern with H1 models: they tend to consume a lot of RAM because of the hybrid SSM/attention architecture. Perhaps someone should try a pure transformer architecture instead of a hybrid; Llama 3.2 1B might be a good candidate.
@LLMToaster I don't think you are supposed to compile llama.cpp on the phone. Just run "pkg install llama-cpp" inside Termux, and then it should work.
The model is great, and its tool-calling capabilities are also good considering its size, but it seems to think for too long and generates too many thinking traces before answering.
@urtuuuu I guess yes, you should set it to 0. Transformers and many other engines don't actually use top_k at all, which is equivalent to 0 in llama.cpp (the default there is 40, IIRC). If the authors don't specify it, I always force it to 0.
Same for min_p, actually: the llama.cpp default is 0.05, but other engines don't use it at all, so I also set it to 0 unless specified.
Forget what I said, I just saw that Transformers sets it to 50! But that's an exception; other engines really do set it to 0 or just don't use it at all.
And since their example was with Transformers directly, I guess 50 is actually the value to set!
Transformers: https://deepwiki.com/search/what-are-the-default-values-fo_3f851ce7-de7e-49bf-b119-43fefd2db2e1?mode=fast
Sorry for answering too quickly
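For reference, the flags being discussed go on the llama.cpp command line like this (just a sketch; the model filename is a placeholder, and --top-k / --min-p are the documented spellings of these options):

```sh
# Override llama.cpp's sampler defaults to match the discussion above:
#   --top-k 50  -> Transformers' default, instead of llama.cpp's 40
#   --min-p 0   -> disables min-p (llama.cpp's default is 0.05)
llama-cli -m nanbeige4.1-3b-q4_k_m.gguf --top-k 50 --min-p 0 -p "Hello!"

# The same sampling flags work when launching the web UI
llama-server -m nanbeige4.1-3b-q4_k_m.gguf --top-k 50 --min-p 0 --port 8080
```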
--top_k 40 seems to work fine
It worked, thanks. However, why doesn't the build from source work? BTW, the speed (tokens/s) is too low for a tiny model (90M params at Q4): only 10 tokens/s on llama-server and 30 tokens/s through llama-cli (still slower than expected).
Isn't there a way to use the GPU on a mobile chip? (But then again, that would need a build from source, right?) @urtuuuu @ImadSaddik
@LLMToaster I'm pretty sure you didn't build it for arm64. If you want to build llama.cpp with Vulkan support inside Termux, you need to have glslc compiled for arm64 first before building llama.cpp. That said, it's definitely more pain than it's worth. And in most cases GPU (on android) is slower than just pure CPU inference, see https://github.com/ggml-org/llama.cpp/discussions/9464 or https://github.com/ggml-org/llama.cpp/issues/6337. The reason you are only getting 30 tokens/s with that model is that you are memory bandwidth bound, not compute bound. For meaningful inference on Android, you'd need a high-end phone with a strong NPU.
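If anyone does want to try the source build anyway, a quick sanity check and a plain CPU build look roughly like this (a sketch; package names are what Termux usually calls them):

```sh
# Confirm the environment is actually arm64
uname -m                          # should print aarch64 on a 64-bit ARM phone

# CPU-only build inside Termux; skip Vulkan, which is rarely worth it on Android
pkg install clang cmake git libcurl
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```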



