All Models Great and Small
On why larger models are not always better
Large language models are large - composed of billions or trillions of parameters and trained on billions or trillions of tokens. Yet even among LLMs there is a continuum of scales and underlying approaches to model construction.
A team charged with building a model faces two big questions. First, how much input data does it need to learn from? And second, how large a model should one render for the task at hand? It’s like figuring out how many books to read, and how difficult those books should be, to become really good at understanding and using language.
It turns out that you might not need an enormous model if you can give it a large amount of good data and train it on that data more than once. This idea is flipping the script on what we thought worked best and is showing us a whole new way to make these models both smart and efficient.
Thesis One: Better Performance Requires Bigger Models (Scale Is All You Need)
Historically, large language models have been developed under the axiom that “better requires bigger” - to increase the performance of a model, you must increase its size. This belief stemmed from the observation that larger models, that is, models with more parameters, consistently delivered improvements in performance, understanding, and generation capabilities across a variety of tasks. It has its roots in the early days of natural language processing and has been reinforced over time, as each successive leap in model size - from millions to billions of parameters - brought notable improvements in linguistic understanding, contextual nuance, and task generalization. The journey from pioneering models like OpenAI’s GPT-1 to the behemoths of today, such as GPT-4, Gemini, or Llama, has seemed to confirm the notion that with greater scale comes enhanced capability.
Thesis Two: Optimal Ratio of Training Data to Model Size (Chinchilla Is All You Need)
More recently, in an effort to optimize the efficiency of LLMs, researchers have studied the scaling laws of large language models. Scaling laws describe how the performance of generative AI models improves as training data (tokens) and/or model size (parameters) increases. Accurate scaling laws let researchers better predict how LLMs will behave as they grow or shrink, and thus guide their design and optimization.
In an effort to find the optimal ratio of model size to training data (i.e., parameters to tokens) for a given compute budget, researchers [Hoffmann et al. (2022)] trained over 400 language models of varying complexity - ranging from 70 million to over 16 billion parameters, on 5 to 500 billion tokens of training data - and concluded that for compute-optimal training, model size and the amount of training data should be scaled together. More specifically, for every parameter in a model, there should be approximately 20 tokens in the training data. This means that training a large language model with one billion parameters calls for a dataset of around 20 billion tokens. This scaling law - the Chinchilla scaling law - contrasts with previous scaling laws, which emphasized increasing model size as the primary path to improving performance in language models.
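To make the rule of thumb concrete, here is a minimal sketch in Python. It assumes only the approximate 20-tokens-per-parameter heuristic, not the full fitted loss curves from Hoffmann et al. (2022), and the parameter counts below are chosen purely for illustration:

```python
# Back-of-the-envelope Chinchilla heuristic: ~20 training tokens per parameter.
# A rule of thumb only, not the fitted scaling curves from Hoffmann et al. (2022).

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens for a model."""
    return n_params * tokens_per_param

for n_params in (70e6, 1e9, 16e9, 70e9):
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:>6.2f}B params -> ~{tokens / 1e9:,.0f}B tokens")
```

Running this reproduces the intuition above: a one billion parameter model is matched with roughly 20 billion training tokens, while a 70 billion parameter model (roughly Chinchilla’s size) calls for about 1.4 trillion.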
Thesis Three: Smaller Models Can Outperform Larger Models (Distillation and Multiple Epochs Are What You Need)
Recent work has called the Chinchilla scaling law paradigm into question. Several recent projects have trained LLMs on far more tokens than their parameter counts would suggest. For example, consider two versions of LLaMA - a 13 billion parameter model (LLaMA-13B) and a 65 billion parameter model (LLaMA-65B). After being trained on roughly a trillion tokens or more - far beyond what the Chinchilla scaling law recommends for models of their size - LLaMA-13B outperformed GPT-3 (175B) on most benchmarks, and LLaMA-65B was competitive with Chinchilla-70B and PaLM-540B.
Researchers have also found that smaller models improve their performance with multiple epochs, that is, with multiple passes over the training data. In their 2024 work, Zhang et al. introduce TinyLlama, a 1.1 billion parameter language model pretrained on approximately 1 trillion tokens over roughly 3 epochs. Using the architecture and tokenizer of Llama 2, TinyLlama not only achieves superior computational efficiency but also exhibits outstanding performance on a variety of downstream tasks, outclassing open-source models of similar size. As Zhang et al. note, “Touvron et al. (2023) demonstrates that smaller models, when trained with more data, can match or even outperform their larger counterparts… [and] Thaddée (2023) suggest that existing scaling laws (Hoffmann et al., 2022) may not predict accurately in situations where smaller models are trained for longer periods.” (TinyLlama: An Open-Source Small Language Model) TinyLlama highlights the potential of skewing the ratio of tokens to parameters while exploring the benefit of training over multiple epochs.
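As a rough illustration of how multiple epochs change the effective tokens-to-parameters ratio, the sketch below compares TinyLlama and LLaMA-13B against the Chinchilla heuristic. The token and epoch counts are approximate figures from the papers discussed above, and counting repeated epochs as additional effective tokens is a simplification:

```python
# Illustrative comparison of effective tokens-per-parameter ratios.
# Treats repeated epochs as additional effective training tokens (a simplification).

def tokens_per_param(n_params: float, unique_tokens: float, epochs: float = 1.0) -> float:
    """Effective training tokens seen per parameter (unique tokens x epochs)."""
    return (unique_tokens * epochs) / n_params

# TinyLlama: ~1.1B parameters, ~1 trillion unique tokens, ~3 epochs
print(f"TinyLlama : ~{tokens_per_param(1.1e9, 1e12, epochs=3):,.0f} tokens per parameter")
# LLaMA-13B: 13B parameters, ~1 trillion tokens, single pass
print(f"LLaMA-13B : ~{tokens_per_param(13e9, 1e12):,.0f} tokens per parameter")
print("Chinchilla heuristic: ~20 tokens per parameter")
```

Both models sit far above the 20-to-1 heuristic, which is precisely the regime the Chinchilla analysis was not designed to predict.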
In a similar vein, a research team at Microsoft has developed a series of Small Language Models (SLMs) named “Phi,” which have shown excellent performance on various benchmarks [Gunasekar et al. (2023)]. The initial model in this series, Phi-1, contains 1.3 billion parameters and set new standards in Python coding tasks, particularly on the HumanEval and MBPP benchmarks, making it the best among SLMs of similar size. Phi-2 is a more advanced model with 2.7 billion parameters, trained on 1.4 trillion tokens drawn from a blend of synthetic and web datasets. Despite its compact size, Phi-2 showcases exceptional reasoning and language understanding capabilities, outperforming models up to 25 times its size on various benchmarks - a remarkable demonstration of the efficiency and effectiveness of SLMs.
Kelvin Legal Large Language Model (KL3M) (Careful Rendering of High-Quality Input Data Is What You Need)
Our Kelvin Legal Large Language Model (KL3M) builds on many of the ideas above and adds a variety of wrinkles to the mix. We begin with the Kelvin Legal Data Pack - a proprietary dataset that now contains over 2 trillion tokens of legal, financial, and general domain text. We render down this training data with a custom pipeline that continuously scores and filters content. The result is 350 billion high-value input tokens, which we pipe into our 1.7B parameter model - a ratio of more than 200 tokens per parameter.
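The scoring and filtering details of the Kelvin pipeline are proprietary, but the general shape of a score-and-filter pass is straightforward to sketch. The snippet below is purely hypothetical: filter_corpus, toy_score, and the 0.8 threshold are stand-ins for illustration and do not reflect KL3M’s actual rendering logic.

```python
# Hypothetical score-and-filter pass over raw documents.
# NOT the actual KL3M rendering pipeline; the scorer and threshold are placeholders.

from typing import Callable, Iterable, Iterator

def filter_corpus(
    docs: Iterable[str],
    quality_score: Callable[[str], float],
    threshold: float = 0.8,
) -> Iterator[str]:
    """Yield only documents whose quality score clears the threshold."""
    for doc in docs:
        if quality_score(doc) >= threshold:
            yield doc

def toy_score(doc: str) -> float:
    """Toy scorer: fraction of characters that are letters or whitespace."""
    clean = sum(ch.isalpha() or ch.isspace() for ch in doc)
    return clean / max(len(doc), 1)

corpus = [
    "Section 1. Definitions. In this Agreement, the following terms apply.",
    "%%%### boilerplate navigation junk ###%%%",
]
print(list(filter_corpus(corpus, toy_score)))  # keeps only the contract-like text
```

Whatever the specific scorer, the design choice is the same: spend effort curating a smaller pool of high-value tokens rather than simply enlarging the model.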
The initial results for KL3M are very promising in terms of both perplexity and toxicity scores. The combination of data quality, multiple epochs, and our proprietary rendering process, together with some other exciting items in the works, is going to yield an even better generation of KL3M models. Namely, we are actively scaling KL3M to a 7B mixture-of-experts (MoE) model, and we will continue to climb the scaling ladder in 2024. Stay tuned for more soon!
Jessica Mefford Katz, PhD
Jessica is a Co-Founding Partner and a Vice President at 273 Ventures.
Jessica holds a Ph.D. in Analytic Philosophy and applies the formal logic and rigorous frameworks of the field to technology and data science. She is passionate about assisting teams and individuals in leveraging data and technology to more accurately inform decision-making.
Would you like to learn more about the AI-enabled future of legal work? Send your questions to Jessica by email or LinkedIn.