GPT under $100,000?

For the last several years, we’ve been hearing about how much it costs to build ever larger language models.  Today, training a state-of-the-art language model requires approaching a trillion trillion (10^24) arithmetic operations involving hundreds of billions of parameters.  Doing the math, assuming a decent, if older, GPU such as an A100, you come up with how many GPU-years this computation will take.  Then you figure out how many GPUs you need given how many days you have to complete the computation.  For example, Meta recently published that training a 65 billion parameter version of the LLaMA model using over a trillion tokens of text took approximately 21 days on roughly 2,000 such GPUs.  That’s almost exactly 1 million hours of GPU time, which can be had for less than $1,000,000.
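Here is a rough sketch of that arithmetic.  The GPU count (2,048) and the 21 days are Meta’s figures; the hourly rate is purely an assumption for illustration, since actual rental prices vary by provider and commitment.

```python
# Back-of-envelope GPU-hours for the 65B LLaMA training run.

def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours when num_gpus run around the clock for the given days."""
    return num_gpus * days * 24


def rental_cost(hours: float, dollars_per_gpu_hour: float) -> float:
    """Cost of renting that many GPU-hours at a flat hourly rate."""
    return hours * dollars_per_gpu_hour


hours = gpu_hours(2048, 21)

# ASSUMPTION: a hypothetical flat rate per A100-hour, not a quoted price.
assumed_rate = 0.95

print(f"{hours:,.0f} GPU-hours")
print(f"${rental_cost(hours, assumed_rate):,.0f} at ${assumed_rate:.2f} per GPU-hour")
```

Any rate under about a dollar per GPU-hour keeps the total under $1 million.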

So, for $1 million, given a few decent machine learning folks, you could replicate a state-of-the-art language model or build your own, tweaked to perform better in your domain, as Bloomberg has done for financial markets.  Expect to see much more of this from many corners, especially in healthcare, various areas of the life sciences, and others.

I would like to save the $1 million and start with the 30 or 65 billion parameter LLaMA model rather than train it from scratch.  Unfortunately, Meta is not forthcoming with model weights for LLaMA beyond 13 billion parameters.  The 13 billion parameter model is impressive enough.  The 7B model is not capable enough for me. The 65 billion parameter model would be better, but not twice as good.  The 30B parameter model is in the sweet spot.

Note that if you’re fine with a smaller pre-trained language model, you could try Stability AI’s StableLM. These are the folks who brought you Stable Diffusion. They promise to eventually release, for any use including commercial, language models up to 65B. When available, the 15B model may be a good option. For now, I’d like to stick with LLaMA because of some of its significant algorithmic improvements.

Although available, even the 13 billion parameter model is not openly licensed.  As is frustratingly common, the model weights are licensed only for research, not for commercial purposes.  Meta invites commercial inquiry but, regrettably, based on my experience, is not eager to respond.  So, if you want to use LLaMA commercially, you may have to train your own model.  So, let’s talk budgets.

Training a 13 billion parameter model will cost about 20% of training a 65 billion parameter version.  That will cost less than $200,000.  You might be able to cut that cost by half, maybe more, however.  It’s a little dicey, but you can cut back on the training data.  Google’s excellent results from Chinchilla teach us to balance model size with the amount of training data.  OK, but the truth is that you can get over 90% of a language model’s final performance with less than half a trillion tokens of training data.
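The budget math can be sketched as follows, under the simplifying assumption that cost scales linearly with both parameter count and the number of training tokens.  The $1 million baseline for the 65B model comes from the figures above; the function name and the one-half token fraction are just illustrative.

```python
# Rough cost scaling, ASSUMING cost is linear in parameters and in tokens.

def scaled_cost(base_cost: float, base_params: float, params: float,
                token_fraction: float = 1.0) -> float:
    """Scale a known training cost to a different model size and data budget."""
    return base_cost * (params / base_params) * token_fraction


BASE_COST = 1_000_000   # approximate cost of the 65B run, from the text
BASE_PARAMS = 65e9

full_data = scaled_cost(BASE_COST, BASE_PARAMS, 13e9)
half_data = scaled_cost(BASE_COST, BASE_PARAMS, 13e9, token_fraction=0.5)

print(f"13B, full training data: about ${full_data:,.0f}")
print(f"13B, half the tokens:    about ${half_data:,.0f}")
```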

If you can afford it, you can avoid cutting back on the training data by taking your time.  Your language model will be pretty good in 30% of the time Meta took and you can just let it improve over time.  That is, start using the model and keep training it, replacing the one you’re using every once in a while.  This is viable even if you perform fine-tuning (and even reinforcement learning) because the relative costs for such tuning are quite small versus pre-training.

The bottom line here is that you can build your own 13 billion parameter LLaMA for less than $100,000.  If you’re going to do millions of transactions, you might not be able to afford not to go in this direction!

LLaMA is essentially an improved version of Open AI’s GPT.  LLaMA benefits from various algorithmic improvements since GPT-3 was released a couple of years ago.  Recently, Open AI introduced Instruct GPT, which follows instructions, and Chat GPT, which holds conversations.  And GPT has advanced to version 4.

LLaMA is GPT without the instruction following or conversational abilities.  These are easy, and inexpensive, to add, however.  Consider instruction following, for example.  Researchers from Stanford generated tens of thousands of simple instructions and results using Open AI’s text-davinci-003 (a GPT-3.5 model) and trained the 7 billion parameter version of LLaMA with them.  The dataset is relatively simple, and I thought weak, but it was remarkably effective.  I was quite surprised how well the model follows instructions given only that simple, synthetic dataset.

On the other hand, it’s not all that surprising, given that we have seen much transfer of learned representations in vision.  The ease of improvement here is simply because any decent generative language model will quickly adapt to using its representation to new linguistic sequences, such as those involving instructions.  It doesn’t have to construct much new representation to do so.

The resulting language model is dubbed Alpaca, a cute play on words.  Well, now there is Vicuna!  Vicuna takes the 13 billion parameter LLaMA and approaches ever more closely Open AI’s state-of-the-art performance, according to the researchers from CMU, Stanford, and the University of California at Berkeley and San Diego.

Look them up.  It’s stunning how easily they compete with Google and approach Open AI.  And the training cost to improve LLaMA to “within 10%” of GPT was less than $1,000.  More and more is happening on this front.  For more, see Microsoft’s DeepSpeed-Chat (which may seem odd given their investments in Open AI, the company).

Will Oligarchs Own the Future of AI?

Here we go again.  We were set back a bit this morning (two weeks ago now) by a recent TechCrunch article about Anthropic, perhaps the most inspiring company touting safe AI.  They seek to raise billions to compete with not-so-Open AI.  Before commenting on the article, how about a little context?

Open AI, the company, was founded in 2015 precisely to, as stated on Wikipedia today, “freely collaborate” […and make] its patents and research open to the public.  Things began to change in 2019 as Open AI transitioned to a for-profit corporation.  Ultimately, this year, Open AI has written that they will no longer “share [any] details for commercial and other reasons”.

Last year, in a podcast with Sam Altman and Reid Hoffman, the CEO of Open AI suggested that only a handful of companies could provide the foundational AI models on which everyone else will build “the middle layer”.  This suggests that foundational AI will simply become part of “Big Tech”, raising familiar questions of who owns the future.

Open AI has been stingy about being open since it initially refused to share the GPT-2 model in 2019.  It eventually did so, arguably due to pressure from the AI community.  Open AI has not shared any subsequent model.

Open AI (little ‘o’; not the company) does not mean commercial AI becomes impractical.  The intent of open AI is to keep fundamentals of nature, like math, electricity, and fire (including nuclear power), from becoming private property at the expense of society.  Making a living harnessing them and applying them innovatively should remain fair game.

Our entire tech industry is built on shared intellectual property, most notably the open-source software movement.  Without open-source, much of modern life would be stuck decades in the past.  All the progress in machine learning over the last few decades would have been impractical without open-source operating systems and programming languages, such as Linux and Python, in particular.

AI models are a little different.  They have two critical parts.  One is the source code that implements them.  Typically, this is Python code which runs on tens to thousands of GPUs, which are massively parallel matrix manipulating machines.  Essentially, given data, the algorithms written in Python adjust the matrices until the error in predicting things about the training data is minimized (or nearly so).
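As a toy illustration of “adjust the parameters until the error is minimized”, here is a minimal gradient descent sketch in plain Python.  It fits a single weight and bias rather than large matrices on GPUs, and the data, learning rate, and step count are made up for the example; real training code differs in scale and detail, not in the basic idea.

```python
# Toy gradient descent: fit y = w * x + b by minimizing mean squared error.

def train(xs, ys, lr=0.05, steps=2000):
    """Return (w, b) after repeatedly stepping against the error gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Nudge each parameter in the direction that reduces the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = train(xs, ys)
print(f"learned w = {w:.3f}, b = {b:.3f}")
```

The learned values of w and b are the “weights” of this tiny model, in the same sense as the matrix contents discussed next.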

This second part, the resulting contents of the matrices after training (a.k.a., the model weights), is where the controversy of open AI started with GPT-2.  Through GPT-3, Open AI was quite good about publishing details of the algorithms used in its models.  The AI community easily replicated such models, with various modifications.  Fair enough, that’s one degree of open AI.  Many believe it’s not enough.

Better is the general, open-source attitude among AI researchers, including many with commercial affiliation, and especially the Hugging Face community sponsored by Meta.  But having the source code of a model is not “democratic” enough.  Here, democratic means practically available to anyone and everyone.  Practically available to everyone requires both the source code of a model and the weights resulting from its training to be available.

Just a few details on the models and their weights.  The transformer architecture has been refined significantly but remains basically unchanged over the last 5 years.  The source code for producing the state of the art is widely available and gradually evolving as techniques improve.  In order to produce a state-of-the-art model, massive amounts of data are needed.  Whether training a language model from text or a multi-modal model with text, images, etc., we have enough readily available data to approach the state-of-the-art results democratically.  Where it becomes less democratic is the cost of computing the model weights given the training data.

The amount of computation required to train a model is (naively) proportional to the number of training iterations times the size of the model.  For the most part, the amount of computation is proportional to the number of parameters in the model.  The weights are simply the values of those parameters after optimizing the model by training it with the data.
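A common way to put numbers on this proportionality is the rule of thumb of roughly 6 floating point operations per parameter per training token.  That constant is an assumption brought in here, not something stated above, and the 1 trillion token budget is a round number for illustration.

```python
# Rough training compute: FLOPs ~ 6 * parameters * tokens.
# ASSUMPTION: the factor of 6 is a widely used approximation, not an exact count.

def training_flops(params: float, tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Approximate total floating point operations for one training run."""
    return flops_per_param_token * params * tokens


TOKENS = 1e12  # illustrative budget of one trillion training tokens

for name, params in [("13B", 13e9), ("65B", 65e9)]:
    print(f"{name}: about {training_flops(params, TOKENS):.1e} FLOPs")
```

Under these assumptions a 13B model lands near 10^23 operations, and a 65B model is five times that.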

Table stakes for a good language model, which generally requires over 10 billion parameters, are 10^23 floating point operations.  For example, Google’s Chinchilla proved a 70B parameter model superior in many regards to models many times its size.  Meta’s more recent LLaMA benefits from additional improvements.  Its largest, Chinchilla-scale model was trained on 2,048 A100 GPUs on over 1 trillion tokens of text in 21 days.  At arm’s-length, on-demand prices last year, this would cost roughly $1 million.

Commercial Open AI would have us believe that this is just the tip of an iceberg.  That $1 million today will be $1 billion tomorrow.  Open AI would have us believe that we can’t afford to keep up as they build models 10 to 100 times larger.  Well, the jury is out.  There have already been models with 3 to 10 times as many parameters as GPT-3 which have fizzled quickly.  But the intent is clear.

Unfortunately, according to the article, inspiring Anthropic now aspires to be one of the oligarchs of AI.  The article states that Anthropic’s investor pitch deck claims, “We believe that companies that train the best 2025/26 models will be too far ahead for anyone to catch up in subsequent cycles.”  It goes on to assert that AI will automate large swaths of the economy in very few years.  Such hyperbole may further inflame unfortunate calls for a moratorium on AI.

We like Anthropic’s approach to Constitutional AI and are big fans of continuous, self-supervised learning, as well as reinforcement learning from human feedback.  These aspects have materially advanced the safety and the instruction-following and conversational abilities of language models recently.  But these techniques require an order of magnitude less compute than the brute force pre-training discussed above.

Anthropic thinks building a model with “tens of thousands” of GPUs will produce magic.  We’ll see.  There are a few stubborn facts in the way.  One problem is that increasing the size of a model and the amount of training data must be balanced.  Loosely speaking, one cannot just double the number of parameters without doubling the amount of training data.  One problem with doubling the amount of training data is that we are running out of data.  Another is that we are already approaching the asymptotes that we can get from numbers of parameters and amounts of training data. 
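To see how quickly the data requirement grows, here is a small sketch that assumes the often-quoted Chinchilla rule of thumb of roughly 20 training tokens per parameter.  The ratio is an approximation brought in for illustration, and the model sizes are arbitrary examples.

```python
# Balanced training data under an ASSUMED ratio of ~20 tokens per parameter.

TOKENS_PER_PARAM = 20.0  # approximate rule of thumb, not an exact law


def balanced_tokens(params: float,
                    tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Training tokens needed to keep data in balance with model size."""
    return params * tokens_per_param


for params in (13e9, 70e9, 140e9, 700e9):
    tokens = balanced_tokens(params)
    print(f"{params / 1e9:.0f}B parameters -> {tokens / 1e12:.1f}T tokens")
```

Doubling the parameters doubles the tokens, so a model ten times larger wants ten times the text, which is where running out of data starts to bite.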

The gist of the learning curves for models of more than a few billion parameters is that the inflection point of diminishing returns is passed quickly, somewhere between 10 and 100 billion tokens of training data.  After “just” a few hundred billion training tokens, a model with 30 to 120 billion parameters begins to look asymptotically close to “fitting” the training data.  And the 30 billion parameter model fits the data over 95% as well as the model 4 times its size.

Whether or not size matters, other innovations are coming into focus now that we have sufficient scale.  Hopefully, we can avoid Big Tech, including Open AI and Anthropic, owning our future by more openly sharing models, including their weights.  If not, we can expect the innovations and advances in AI to slow as proprietary interests slow the exchange and experimentation that has produced staggering advances in the last decade.  Either way, if the limits of scale alone are indeed near, it’s not the end of the world.

Simon Says

Some folks use the term “automatic speech recognition”, ASR.  I don’t like the separation between recognition and understanding, but that’s where the technology stands.

The term ASR encourages thinking about spoken language at a technical level in which purely inductive techniques are used to generate text from an audio signal (which is hopefully some recorded speech!).

As you may know, I am very interested in what many in ASR consider “downstream” natural language tasks.  Nonetheless, I’ve been involved with speech since Carnegie Mellon in the eighties.  During Haley Systems, I hired one of the Sphinx fellows who integrated Microsoft and IBM speech products with our natural language understanding software.  Now I’m working on spoken-language understanding again…

Most common approaches to ASR these days involve deep learning, such as Baidu’s DeepSpeech.  If your notion of deep learning means lots of matrix algebra more than necessarily neural networks, then Kaldi is also in the running, but it dates to 2011.  Kaldi is an evolution from the hidden Markov model toolkit, HTK (once owned by Microsoft).  Hidden Markov models (HMMs) were the basis of most speech recognition systems dating back to the eighties or so, including Sphinx.  All of these are open source and freely licensed.

As everyone knows, ASR performance has improved dramatically in the last 10 years. The primary metric for ASR performance is “word error rate” (WER).  Most folks think of WER as the percentage of words incorrectly recognized, although it’s not that simple.  WER can be more than 1 (e.g., if you come up with a sentence given only noise!).  Here is a comparison published in 2011.

Today, Google, Amazon, Microsoft and others have WER under 10% in many cases. To get there, it takes some talent and thousands of hours of training data.  Google is best, Alexa is close, and Microsoft lags a bit in 3rd place, according to an article summarizing Vocalize.io results.
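To make the word error rate concrete, here is a minimal sketch using the usual definition: the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer’s output, divided by the number of words in the reference.  The whitespace tokenization and the example sentences are simplifying assumptions; real scoring tools also normalize case, punctuation, and so on.

```python
# Word error rate via word-level edit (Levenshtein) distance.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the hat sat"))
print(word_error_rate("hi", "hello there my friend"))
```

The second call shows why WER is not simply a percentage of words missed: with a one-word reference and a four-word hypothesis, the insertions alone push it above 1.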
