Making LLMs Work
Prompt Engineering, Instruction Tuning, RLHF & other ways to make LLMs work
LLMs are everywhere, and organisations are busy figuring out how best to work with them. This post looks at some of the ways to adapt and use general-purpose LLMs for your own tasks.
First, some useful background. The starting point for any LLM is a base model that predicts the next word, like a smart autocomplete. These models learn from text data, and today’s LLMs have learnt from huge swathes of the internet. One of the big advantages of autocomplete models is that the only data needed to train them is ordinary, unannotated text. This is known as unsupervised learning. Contrast this with supervised learning, where text has to be annotated with labels, a much more expensive and time-consuming job. Supervised learning datasets might include text that’s labelled for a specific task, like sentiment analysis, translation to a different language, or how similar two sentences are. Unsupervised datasets can be much larger than supervised ones, and this is why you hear about LLMs being trained on billions of words of text.
Now, back to using LLMs. If you’ve interacted with ChatGPT, you’ll have a sense that it’s doing more than just autocomplete. Companies building LLMs have usually done some more work to make their models generally useful. But those publicly available LLMs still know nothing about your organisation and the work you do. Broadly, methods for tuning LLMs to your organisation split into two groups: prompting and fine-tuning.
Prompting is something of an art form, with much ink spilled about Prompt Engineering. There’s lots of advice out there, including prompt chaining, adding specific examples in the prompt, and breaking tasks down. It’s an iterative process to find good prompts that work for your organisation. There’s even work that had LLMs themselves try prompt engineering! It turns out that prompting some LLMs to “take a deep breath” is an effective technique to improve their performance.
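To make that concrete, here’s a rough sketch of a few-shot prompt that puts specific examples in the prompt. The task, categories and messages are invented purely for illustration; you’d send the resulting string to whichever LLM API your organisation uses.

```python
# A sketch of a few-shot prompt. The classification task, categories and
# example messages below are made up for illustration.
prompt = """You are a support assistant for a software company.
Classify each customer message as BUG, BILLING or FEATURE REQUEST.

Message: "I was charged twice this month."
Category: BILLING

Message: "The export button crashes the app."
Category: BUG

Message: "It would be great to have a dark mode."
Category: FEATURE REQUEST

Message: "My invoice shows the wrong company name."
Category:"""

# Send `prompt` to your LLM of choice; the model continues the pattern
# set by the examples and completes the final category.
print(prompt)
```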
However, prompts in current LLMs have a length limit. That’s a problem when you want to use LLMs to work with information that’s somewhere in your organisation’s vast store of documents: you can’t just copy and paste all your company documents into a single prompt. This is where Retrieval Augmented Generation (RAG) comes in. RAG adds a first step (“retrieval”) that finds the documents relevant to a query in your database, and then inserts them into the LLM prompt (“augmentation”). This combination of an external LLM with your internal knowledge store can be a powerful one.
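As a minimal sketch of the idea (not a production setup), the snippet below uses the sentence-transformers library to retrieve the most relevant document and paste it into the prompt. The documents, model name and prompt template are placeholders; a real system would chunk your documents and store the embeddings in a vector database.

```python
# A minimal RAG sketch: retrieve the most relevant document, then augment the
# prompt with it. Documents, model name and template are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am-5pm GMT, Monday to Friday.",
    "Enterprise customers get a dedicated account manager.",
]  # stand-ins for your organisation's documents

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def build_rag_prompt(question: str, top_k: int = 1) -> str:
    # Retrieval: find the document(s) most similar to the question.
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec
    top_docs = [documents[i] for i in np.argsort(scores)[::-1][:top_k]]
    # Augmentation: put the retrieved text into the LLM prompt.
    context = "\n".join(top_docs)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

print(build_rag_prompt("How long do customers have to return a product?"))
```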
Prompt Engineering is the easiest way to get started using LLMs in your organisation as there’s no need to do anything to the model. RAG requires more Software Engineering work to get the database of your documents up and running, but there are tools out there to make this part easier.
Prompting can only get you so far though. Fine-tuning methods go further by updating the LLMs themselves to be more relevant for your organisation.
Fine-tuning typically means continuing to train a general model on a domain-specific dataset so that it works better in that domain. For example, a general LLM may be fine-tuned on transcripts of financial earnings calls, to make a model that’s more useful for financial applications. Or it might be fine-tuned on scientific research papers, to create a model that can assist in scientific research. In the context of LLMs this is still an unsupervised way of training, as the model learns next-word prediction from plain text data. The result, though, is a model with much more knowledge about the particular domain you’re interested in.
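For a sense of what this looks like in practice, here’s a hedged sketch of continued next-word-prediction training with the Hugging Face transformers library. The base model (“gpt2”), the data file (“earnings_calls.txt”) and the hyperparameters are placeholders, not recommendations.

```python
# A sketch of domain fine-tuning: continue next-word-prediction training on
# your own plain text. Model name, file path and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # stand-in for whichever base LLM you use
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Plain, unlabelled domain text, e.g. earnings-call transcripts, one per line.
dataset = load_dataset("text", data_files={"train": "earnings_calls.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-llm", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```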
Transfer learning is closely related to fine-tuning but usually involves taking a model that has been trained for one purpose, and reusing it for another (i.e. transferring the learning between the two tasks). A model that’s been trained for predicting the next word (an LLM) might easily be repurposed to do a specific language task like sentiment analysis or sentence similarity. Transfer learning takes advantage of the patterns that the base model has learnt about (in this case) language. Practically, the last layer of the underlying neural network model is often replaced by a new layer that’s designed for the second task, and the model is fine-tuned using a supervised dataset that’s been collected and labelled for the second task. This usually gives better results than just training a model for the second task from scratch.
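Here’s a minimal sketch of that “replace the last layer” step, again using transformers: loading a pre-trained model with a fresh classification head gives you a model ready to be fine-tuned on labelled sentiment data. The model name and label count are assumptions for the example.

```python
# A sketch of transfer learning: reuse a pre-trained language model with a new
# classification head for sentiment analysis. The head is freshly initialised
# and the model is then fine-tuned on a labelled dataset.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "distilbert-base-uncased"  # placeholder pre-trained model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# The new head is untrained at this point: fine-tune on (text, label) pairs,
# e.g. with the same Trainer API as in the fine-tuning example above.
inputs = tokenizer("Great product, would buy again!", return_tensors="pt")
logits = model(**inputs).logits  # one score per sentiment class
```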
Instruction tuning creates a set of prompts and good answers for those prompts, and fine-tunes the LLM on these. This is a supervised task, because the set of prompts and answers has to be curated. Thus the data is more expensive to obtain, but the end result will be a model that has learnt to follow instructions in a conversational style. Instruction tuning is one difference between a model that can follow instructions like ChatGPT, and one that just autocompletes text. Stanford’s Alpaca dataset is an example of an instruction tuning dataset that people have used with LLMs.
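Here’s a rough illustration of what a single instruction-tuning record can look like, loosely following the Alpaca-style instruction/input/output format; the record itself is made up. Each record is rendered into one training string, and the LLM is fine-tuned to produce the response.

```python
# A sketch of instruction-tuning data (Alpaca-style fields); this record is
# invented. Each record becomes one training string for fine-tuning.
record = {
    "instruction": "Summarise the customer complaint in one sentence.",
    "input": "The delivery was three days late and the box arrived damaged.",
    "output": "The customer's order arrived late and damaged.",
}

def format_example(r: dict) -> str:
    # Render the record into a single prompt/response training string.
    return (f"### Instruction:\n{r['instruction']}\n\n"
            f"### Input:\n{r['input']}\n\n"
            f"### Response:\n{r['output']}")

print(format_example(record))
```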
Fine-tuning, transfer learning & instruction tuning have similarities. When the base model is large and your fine-tuning dataset is small, updating all of the model’s parameters is expensive and can easily lead to poor results such as overfitting. Parameter-efficient fine-tuning (PEFT) methods get around this by training only a small number of extra parameters while leaving the base model largely untouched. To successfully do fine-tuning, transfer learning or instruction tuning you also need to be collecting the right data, have someone in your organisation with the right software and ML skills, and ideally access to enough computing power for training.
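As an example of the PEFT idea, here’s a short sketch using LoRA via the peft library: the base model’s weights are frozen and only small adapter matrices are trained. The model name and LoRA settings are placeholders.

```python
# A sketch of parameter-efficient fine-tuning with LoRA via the peft library:
# only small adapter matrices are trained; the base model stays frozen.
# Model name and hyperparameters are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# The wrapped model can then be trained with the usual Trainer loop from the
# fine-tuning example above.
```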
Reinforcement learning from human feedback (RLHF) is the last method in this post. We would like to include feedback from people interacting with an LLM about how good its answers are — that intuitively seems like a good signal that can be used to improve an LLM. But, people find it very difficult to score answers from an LLM on a scale of 1–5. There’s a lot of subjectivity and context that comes into deciding whether an answer is appropriate or not. In general, people can rank answers more easily, and say whether one is better than another. RLHF has people rank different answers from models, and uses that as the feedback for updating the base model. RLHF is very powerful, but is used less often as it can be tricky to get right.
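To give a flavour of how those rankings are used, here’s a simplified sketch of the reward-modelling step: a model is trained to score the human-preferred answer higher than the rejected one. The model and data here are invented for illustration, and the later step of optimising the LLM against the reward model (e.g. with PPO) isn’t shown.

```python
# A sketch of the reward-modelling step in RLHF: given a prompt and two answers
# where humans preferred one, train a model to score the preferred answer
# higher. Model choice and data are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1)  # outputs a single scalar score

prompt = "Explain our refund policy."
chosen = "You can return items within 30 days for a full refund."
rejected = "Refunds are complicated, please read the website."

def score(answer: str) -> torch.Tensor:
    inputs = tokenizer(prompt, answer, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits.squeeze()

# Pairwise ranking loss: push the preferred answer's score above the other's.
loss = -torch.nn.functional.logsigmoid(score(chosen) - score(rejected))
loss.backward()  # one step; in practice you loop over many human rankings
```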
As with everything in the AI world, this is a fast evolving topic. Researchers are actively finding new ways to make LLMs work across a vast range of tasks. Whether you use prompt engineering alone, dive into fine-tuning LLMs, or use a combination of approaches, this post gives you a survey of the current methods and some pointers to get started.
I work with companies building AI technology. Get in touch to explore how we could work together.