Think about what happens whenever you use a large language model (LLM), be it GPT-4, Claude, LLaMA, or [your preferred model]. You submit a question or request to the LLM of choice, and the model responds. When a query is sent directly to a foundational model, without any intermediate step or engineering, the process is called zero-shot prompting: the model receives the user's query, processes it using only the knowledge it acquired during training, and returns a response.
Recently reported results, such as GPT-4 passing the Bar Exam and posting strong scores on the LSAT, GRE, and SAT, are all examples of the zero-shot, or lower-bound, capabilities of such models. In other words, these scores are a floor: much better results can be achieved by layering clever engineering techniques on top of these base capabilities. We have previously highlighted what is arguably the simplest such technique, prompt engineering (what we called half-shot prompting). More sophisticated approaches, however, are also possible.
An Overview of Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation (RAG) adds a layer to this process. In RAG, a query does not go directly to an LLM; instead, it is first used to search a large corpus of text that is external to the LLM and to retrieve the most relevant information. That information is then coupled with the original query and fed to the LLM. Because an additional step is inserted before the LLM is queried, we refer to RAG as single-shot prompting.
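The retrieve-then-augment flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular production system: it uses a toy in-memory corpus and a simple bag-of-words cosine similarity as the retriever (real RAG systems typically use embeddings and a vector index), the function names `retrieve`, `zero_shot_prompt`, and `rag_prompt` are hypothetical, and the actual LLM call is omitted. It contrasts the zero-shot prompt (the query alone) with the RAG prompt (retrieved passages plus the query).

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return up to k corpus passages most similar to the query (score > 0)."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(doc))), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def zero_shot_prompt(query: str) -> str:
    # Zero-shot: the user's query is sent to the model unchanged.
    return query


def rag_prompt(query: str, corpus: list[str], k: int = 2) -> str:
    # RAG: retrieved passages from the external corpus are coupled with the query.
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus, k))
    return f"Use the following context to answer.\nContext:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    # Toy corpus standing in for the external knowledge source (illustrative only).
    corpus = [
        "A contract requires offer, acceptance, and consideration.",
        "The statute of limitations for written contracts varies by state.",
        "Basil and tomatoes pair well in vegetarian pasta.",
    ]
    query = "What does a contract require?"
    print(zero_shot_prompt(query))
    print(rag_prompt(query, corpus))
```

Only the prompt differs between the two modes; the model itself is untouched, which is why this kind of layer can sit on top of any LLM. Swapping the bag-of-words scorer for an embedding model would change the retrieval quality but not the overall structure.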
Importantly, the retrieved information comes from a source external to the foundational model, so RAG can be implemented on top of any LLM without retraining the model. This is especially important because LLMs are trained on a wide variety of data, not all of which is high quality (most LLMs are trained on the internet…including sites such as Reddit); retrieval lets the model draw on a curated, trusted source at query time.
To understand the power of RAG and compare it to prompt engineering, consider the following analogy. Imagine you are a chef who has been asked to make a special dish at a customer’s request. Prompt engineering is like the customer carefully phrasing her request to the chef. Perhaps she asks the chef to make a vegetarian pasta, or even goes so far as to name the specific ingredients she wants included in her meal: tomatoes, basil, extra virgin olive oil. The clearer and more precise the request, the more likely the chef is to understand what the customer wants and to make something that fulfills her desires.
In contrast, RAG is like having a waiter who takes the customer’s order and, upon receiving it, goes to the pantry, searches through the ingredients, and presents the chef with all of the items relevant to creating a meal that fits the customer’s request. If the customer asks for a vegetarian pasta dish, the waiter would collect and present pastas, vegetables, cheeses, olive oils, and so on.
Both prompt engineering and RAG are methods for increasing the likelihood that the LLM will provide the best possible response: the most accurate, thorough, and helpful one. But while manual prompt engineering relies on the user’s ability to give the right input, RAG goes a step further. It acts as a systematic approach to prompt engineering, algorithmically enhancing the user’s input to make a precise and relevant response more likely.
To Riches, Beyond RAG
While RAG enhances the adaptability and efficiency of LLMs, it cannot overcome the core limitations of the underlying foundational models. And let’s be clear: today’s foundational models, while amazing in their broad capabilities, are still limited. Maximizing the capabilities of generative AI requires not only guiding LLMs with prompt engineering or RAG, but actually altering the large language models at their core. As we discussed previously, methods such as legal embeddings are important for supporting tasks such as retrieval augmentation, but we can and should go beyond RAG to achieve better results. A large corpus of legal information, such as our Kelvin Legal DataPack, can be used to fine-tune an existing foundational model for a specific use case. Ultimately, the Kelvin Legal DataPack can support the building of a legal-specific foundational model, an actual ‘LegalGPT’.