Things move at lightning speed in AI Land. On Friday, a software developer named Georgi Gerganov created a tool called “llama.cpp” that can run Meta’s new GPT-3-class large language model, LLaMA, locally on a Mac laptop. Soon after, people figured out how to make LLaMA run on Windows as well. Then someone showed it running on a Pixel 6 phone. Next came a Raspberry Pi (albeit running very slowly).
If this continues, we may be looking at a pocket-sized ChatGPT competitor before we know it.
But let’s back up a minute, because we’re not quite there yet. (At least not today, as in literally today, March 13, 2023.) What will arrive next week, no one knows.
Since ChatGPT’s launch, some people have been frustrated by the AI model’s built-in limitations, which prevent it from discussing topics OpenAI deems sensitive. Thus began – in some circles – the dream of an open source Large Language Model (LLM) that anyone could run locally without censorship and without paying API fees to OpenAI.
There are open source solutions (like GPT-J) but they require a lot of GPU RAM and disk space. Other open-source alternatives have failed to offer GPT-3-level performance on readily available consumer-level hardware.
Enter LLaMA, an LLM available in parameter sizes from 7B to 65B (that’s “B” as in “billion parameters,” which are floating-point numbers stored in matrices that represent what the model “knows”). LLaMA made a heady claim: that its smaller models could match OpenAI’s GPT-3, the foundational model that powers ChatGPT, in the quality and speed of its output. There was just one problem – Meta released the LLaMA code as open source, but it held back the “weights” (the trained “knowledge” stored in a neural network), making them available only to qualified researchers.
Fly at the speed of LLaMA
Meta’s restrictions on LLaMA didn’t last long: on March 2, someone leaked the LLaMA weights on BitTorrent. Since then, development around LLaMA has exploded. Independent AI researcher Simon Willison has compared this situation to the release of Stable Diffusion, an open-source image synthesis model launched last August. Here’s what he wrote in a post on his blog:
It feels to me like that Stable Diffusion moment in August started the whole new wave of interest in Generative AI – which was then kicked into high gear with the release of ChatGPT in late November.
That Stable Diffusion moment is happening again right now, for large language models – the technology behind ChatGPT itself. This morning I ran a GPT-3 class language model on my own laptop for the first time!
AI stuff was weird. It gets even weirder.
Typically, running GPT-3 requires multiple data center-class A100 GPUs (and the weights for GPT-3 aren’t public anyway), but LLaMA made waves for being able to run on a single beefy consumer GPU. And now, with optimizations that reduce model size using a technique called quantization, LLaMA can run on an M1 Mac or a smaller Nvidia consumer GPU.
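To see why shrinking the bits per parameter matters, here is a rough back-of-the-envelope sketch of weight storage for the model sizes mentioned above. The 16-bit baseline and the helper name `weight_gigabytes` are assumptions made for illustration; the estimate ignores activations, caches, and any per-block metadata a real quantization format stores, so actual memory use would be somewhat higher.

```python
# Rough weight-storage estimate: parameters x bits per parameter.
# Assumption: a 16-bit baseline versus 4-bit quantized weights.
# Ignores activations, caches, and quantization metadata.

def weight_gigabytes(num_params: float, bits_per_param: int) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Parameter counts come from the article; bit widths are illustrative.
for name, params in [("7B", 7e9), ("65B", 65e9)]:
    for bits in (16, 4):
        size = weight_gigabytes(params, bits)
        print(f"LLaMA {name} at {bits}-bit: ~{size:.1f} GB of weights")
```

Under these assumptions, dropping from 16 bits to 4 bits cuts weight storage to a quarter, which is roughly the difference between needing a large GPU and fitting in a laptop's memory.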
Things move so fast that it is sometimes difficult to keep up with the latest developments. (Regarding the rate of progression of AI, another AI reporter told Ars, “It’s like those videos of dogs where you dump a box of tennis balls on them. [They] don’t know where to hunt first and get lost in the confusion.”)
For example, here is a list of notable LLaMA-related events, based on a timeline that Willison laid out in a Hacker News comment:
- February 24, 2023: Meta AI announces LLaMA.
- March 2, 2023: Someone publishes the LLaMA models via BitTorrent.
- March 10, 2023: Georgi Gerganov creates llama.cpp that runs on an M1 Mac.
- March 11, 2023: Artem Andreenko runs LLaMA 7B (slowly) on a Raspberry Pi 4 with 4 GB RAM, at 10 sec/token.
- March 12, 2023: LLaMA 7B runs on NPX, a node.js execution tool.
- March 13, 2023: Someone gets llama.cpp to work on a Pixel 6 phone, also very slowly.
- March 13, 2023: Stanford releases Alpaca 7B, an instruction-tuned version of LLaMA 7B that “behaves similarly to OpenAI’s text-davinci-003” but runs on much less powerful hardware.
After obtaining the LLaMA weights ourselves, we followed Willison’s instructions and got the 7B-parameter version working on an M1 MacBook Air, where it runs at a reasonable speed. You invoke it as a script on the command line with a prompt, and LLaMA will do its best to complete it in a sensible way.
The question that remains is how much quantization affects the quality of the output. In our tests, LLaMA 7B, reduced to 4-bit quantization, was very impressive for something running on a MacBook Air, but still not at the level you might expect from ChatGPT. It’s entirely possible that better prompting techniques would lead to better results.
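The precision that quantization gives up can be illustrated with a small round trip. The sketch below is a simplified block-wise scheme (one shared scale per block, signed 4-bit integers) and is not claimed to be the exact format llama.cpp uses; the function names, the block size of 32, and the randomly generated "weights" are all made up for illustration.

```python
import random

def quantize_block(values, bits=4):
    """Quantize floats to signed integers that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quants]

random.seed(0)
# Hypothetical weights: 32 small random values standing in for one block.
weights = [random.gauss(0.0, 0.02) for _ in range(32)]

scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale per step: {scale:.5f}")
print(f"largest round-trip error: {max_error:.5f}")
```

In this scheme each restored value can be off by up to half a quantization step, so every weight carries a small error. How much those errors add up to a visible drop in output quality is the open question raised above.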
Optimizations and fine-tuning also come quickly when everyone has the code and weights in hand, although LLaMA is still saddled with some fairly restrictive terms of use. Today’s release of Alpaca by Stanford shows that fine-tuning (additional training with a specific goal in mind) can improve performance, and it’s still early days after the release of LLaMA.
As of this writing, running LLaMA on a Mac remains a fairly technical exercise. You must have Python and Xcode installed and be familiar with working on the command line. Willison has good step-by-step instructions for anyone who wants to try. But that could soon change as developers keep building on it.
As for the impact of this technology in the wild, no one knows yet. While some worry about the impact of AI as a tool for spamming and misinformation, Willison says, “It won’t go undetected, so I think our priority should be finding the most constructive ways in which it can be used.”
For now, our only guarantee is that things will change quickly.