Small Language Models
When Large Language Models are just too large
While Big Tech companies battle it out building the largest language models, there’s a whole thread of work on small language models that’s getting very interesting.
The size of language models is usually given by the number of parameters. Google’s BERT, released back in 2018, was one of the first ‘large’ language models. BERT’s base model has 110 million parameters, while its large model has 340 million. BERT Large takes up about 1.5GB of storage on disk: bigger than the models that came before it, but not so unwieldy that it was difficult to use.
Since then, models have got much larger. OpenAI’s GPT-3 has 175 billion parameters, taking up 800GB on disk. Only a handful of organisations have the capacity to train models this large. GPT-3 is also closed source, so the only way to use or finetune it is via OpenAI’s web interface or API. You can’t download it and run your own experiments, which makes it difficult to use for research.
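To get a feel for where these storage numbers come from, here’s a minimal back-of-the-envelope sketch. It assumes 4 bytes per parameter (32-bit floats); half-precision or quantised checkpoints would be proportionally smaller, and real file formats add some overhead, so treat the results as rough approximations rather than exact figures.

```python
# Back-of-the-envelope estimate of checkpoint size from parameter count.
# Assumes 4 bytes per parameter (32-bit floats); half-precision or
# quantised checkpoints would be proportionally smaller.

def approx_size_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate model size on disk in gigabytes."""
    return num_params * bytes_per_param / 1e9

print(f"BERT Large (340M): ~{approx_size_gb(340e6):.1f} GB")  # ~1.4 GB
print(f"GPT-3 (175B): ~{approx_size_gb(175e9):.0f} GB")       # ~700 GB
```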
Another model to know about is Meta’s Llama-2. Importantly, this is an open source model that’s freely available to build on. It has versions ranging from 7B to 70B parameters — smaller than GPT-3 but quite a bit bigger than BERT.
Roughly, 10B parameters is considered the cutoff for a “small” language model, and Llama-2’s 7B model fits neatly in this category. While still not small enough to run on an average laptop, it’s far easier to work with than its larger cousins.
So, how to build these more compact models?
The first observation is that there’s a relationship between the ideal amount of training data and the number of model parameters. Some of the very largest models are trained on relatively little data for the number of parameters they have. DeepMind’s Chinchilla model exploits this relationship: it’s a 70B model, but it outperforms larger models that were trained on less data. Llama-2 takes the idea further, pairing more training data with even fewer parameters, ranging from 7B to 70B. Training smaller models on larger amounts of data is one effective way to reduce their size while keeping good performance (a rough sketch of the data-to-parameters rule of thumb follows below).
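Chinchilla’s headline finding is often summarised as a rule of thumb of roughly 20 training tokens per parameter for compute-optimal training. The sketch below is an approximation of that heuristic, not the paper’s full analysis; the exact ratio depends on the compute budget.

```python
# Rough "compute-optimal" training data estimate using the widely quoted
# Chinchilla rule of thumb of ~20 training tokens per parameter.
# The exact ratio depends on the compute budget; 20 is an approximation.

def chinchilla_optimal_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate number of training tokens for a compute-optimal model."""
    return num_params * tokens_per_param

for name, params in [("7B", 7e9), ("70B", 70e9)]:
    tokens = chinchilla_optimal_tokens(params)
    print(f"{name}: ~{tokens / 1e12:.2f} trillion tokens")
# 7B  -> ~0.14 trillion tokens
# 70B -> ~1.40 trillion tokens
```

For comparison, Llama-2’s models were reportedly trained on around 2 trillion tokens, well beyond this ratio for the smaller 7B model.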
A second idea is to use adaptation techniques to finetune a small LM in innovative ways. Stanford’s Alpaca is a finetuned version of the original LLaMA 7B, while Microsoft’s Orca-2 is finetuned from Llama-2. Alpaca used instruction tuning with a set of 52k instructions that Stanford subsequently open-sourced. Orca-2 used explanation tuning, an approach similar to instruction tuning but with detailed reasoning included in the prompts and answers to push the model to reason effectively. In their experiments, Alpaca performed favourably compared to GPT-3, while Orca-2 performed similarly to or better than models 5–10 times its size. We’re likely to see much more research on how to build datasets and finetune small models.
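To make instruction tuning concrete, here’s a minimal sketch of turning one training example into a prompt/response pair for supervised finetuning. The field names follow the instruction/input/output layout of the Alpaca dataset, but the template wording is illustrative rather than the exact prompt Alpaca uses.

```python
# Sketch: format one instruction-tuning example into a prompt/response pair.
# Field names follow the Alpaca-style instruction/input/output layout;
# the template text itself is illustrative, not the exact Alpaca template.

example = {
    "instruction": "Summarise the following paragraph in one sentence.",
    "input": "Small language models are far cheaper to train and run...",
    "output": "Small language models trade raw scale for efficiency.",
}

def build_prompt(ex: dict) -> str:
    prompt = f"### Instruction:\n{ex['instruction']}\n\n"
    if ex.get("input"):
        prompt += f"### Input:\n{ex['input']}\n\n"
    prompt += "### Response:\n"
    return prompt

prompt = build_prompt(example)
target = example["output"]
# During finetuning, the model learns to generate `target` given `prompt`.
print(prompt + target)
```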
A third idea is to use LLMs to generate training data, and then use that data to train the smaller models. Alpaca and Orca-2 both use LLMs to generate their finetuning data. TinyStories (a family of models under 35M parameters) and Microsoft’s phi-1.5 (1.3B parameters) are two more small models trained from scratch on synthetic data: LLM-generated children’s stories and “textbook-like data” respectively. Models trained on the TinyStories data can generate fluent and grammatically correct English text, despite being orders of magnitude smaller than other models. Phi-1.5 performs similarly to the larger Llama-2 7B on common sense reasoning tasks, perhaps because the textbook-like training data contains exercises and other examples of reasoning. Building small models for specific tasks like these makes it much easier to examine how the models work and to observe how their capabilities develop through training.
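The synthetic-data recipe is simple in outline: prompt a capable LLM for many short, constrained examples and collect the results as a training corpus. The sketch below shows the general shape, loosely inspired by the TinyStories setup rather than copied from it; `call_llm` is a hypothetical helper standing in for whichever LLM API you have access to.

```python
import json
import random

# Sketch of generating a synthetic training corpus with a larger LLM,
# loosely inspired by the TinyStories recipe (short stories, simple words).
# `call_llm` is a hypothetical placeholder for your LLM provider's API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your LLM provider of choice.")

WORDS = ["dog", "river", "ball", "moon", "friend", "garden"]

def make_prompt() -> str:
    # Constrain each generation with a few required words for diversity.
    w1, w2, w3 = random.sample(WORDS, 3)
    return (
        "Write a short story (3-5 sentences) that a young child could "
        f"understand, using only simple words. Include the words: {w1}, {w2}, {w3}."
    )

def generate_corpus(n_examples: int, path: str = "tiny_stories.jsonl") -> None:
    with open(path, "w") as f:
        for _ in range(n_examples):
            prompt = make_prompt()
            story = call_llm(prompt)
            f.write(json.dumps({"prompt": prompt, "story": story}) + "\n")
```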
If capable small LMs can be built, then they’re going to have a big impact.
Practically, small models are far easier to work with. They don’t need a hefty compute setup, so researchers can iterate quickly on ideas, and smaller organisations can realistically build their own models rather than relying on the large ones. Because they’re easier to handle, it’s also feasible to spend effort probing small LMs to better understand what’s going on inside them.
Small models also have the potential to be trained for good performance on very specific tasks. Plus they can more easily be built and finetuned in privacy-sensitive domains like healthcare.
Aside from their computational cost, large models also carry an environmental cost, both in training and in use. Smaller models can cut that cost: perhaps not in training, since small models may simply be trained on more data, but certainly at inference time, when each call to a smaller model requires less computation.
Finally, the impact on regulation and governance remains to be seen. Governments are still figuring out how to regulate general models like LLMs, and size is a factor in their thinking. The US, for example, recently released an executive order that uses the computational power needed to train a model as a basis for deciding whether it has potentially malicious capabilities. In a world where smaller models thrive, that approach may need to be rethought.
Whatever happens with LLMs, some of today’s most interesting research is being done in the context of small language models. It’s worth paying attention!
Further Reading
Mini-Giants: “Small” Language Models and Open Source Win-Win
I work with companies building AI technology. Get in touch to explore how we could work together.