Why Data Matters

Garbage in, garbage out. You are what you eat. You reap what you sow.

You might remember these phrases from your childhood (or even use them with your own children today). And while you probably don’t associate them with AI, these old adages have never been more true than when it comes to training and using large language models.

At each step in the process of training and tuning a model, data plays a critical role. If “good” training data is not used during the initial development of a foundation model, then models typically struggle to properly “understand” or generate language.

If “good” training data is not used during the subsequent retraining or fine-tuning of a model, then the results can be just as severe. Even well-trained foundation models can suffer from scary-sounding phenomena like catastrophic forgetting, model autophagy disorder (MAD), or other forms of “model collapse” - all of which essentially mean that bad data can ruin a good model.

Finally, if “good” data is not used during the development of retrieval-augmented generation (RAG) applications, then even the best models will produce confused, incomplete, or misleading outputs.

Only when good data is used throughout the training process can we expect any chance of producing complete, accurate outputs - i.e., reduced risks of omission or hallucination. What constitutes “good” data, however, varies based on the use case, preferences, and regulatory requirements of the beholder. For example, a consumer using a chatbot to come up with pet names will have a very different baseline than an attorney looking to redline contracts or draft a motion.

Training

Model training has historically required significant resources: vast amounts of training data, hundreds of GPUs, and thousands of hours of human effort to label the training or preference data. Improvements in hardware and software capabilities are changing the calculus of training a model from scratch, but the limiting factor in many situations is still the training data itself. Many models, both open source and closed, are trained on similar data sets, typically including material from Wikipedia, Stack Exchange, GitHub, arXiv, Common Crawl, Books3, PubMed, and more.

One important implication of this shared training data is that, in the event that data from a specific source is tainted and no longer usable (e.g., due to copyright infringement lawsuits), numerous models are exposed to the same risk. Some model users may interpret this as an inevitable risk that has to be accepted, others may have “backup” models for business continuity planning, and still others may opt to use smaller models trained only on public domain material.

For users who are training models, the options are similar: purchase the rights to copyrighted content, accept the legal and regulatory risks, or identify alternative data sources (like our Kelvin Legal Data Pack).

Tuning

Fine-Tuning

Fine-tuning is the process of further training a model for specific use cases, tasks, or domains. Fine-tuning begins with an existing model and attempts to “teach” the model using additional training data, typically resulting in substantial changes to the parameters of the model. In practice, fine-tuning can take many forms; sometimes only small portions of a model are retrained, and in other cases nearly all of a model’s parameters are altered. In general, fine-tuning requires far fewer resources than the original training of a foundation model, both in terms of compute and training data.

For the legal industry, fine-tuning is considered by many to be the most effective way to build deeper domain knowledge or stylistic behavior into a general-purpose language model like GPT or Llama 2.

Delta-Tuning

Although delta-tuning is a more recent technique, the concept itself is essentially small-scale fine-tuning. Where fine-tuning might result in changing all parameters in a model, delta-tuning limits changes to a specific subset of the neural network. Delta-tuning is an active area of research, but it’s clear that it offers a much faster, more cost-effective route to adapting foundation models.

Similar to the larger-scale fine-tuning process, delta-tuning requires the use of additional training data. Ideally, the training data is as closely related to the desired domain and/or task at hand as possible; for example, with a model that will be used exclusively for labor and employment matters, the training data should include employment agreements, stock option plans, labor laws and rules, related case law, and other employment material.
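To make the idea of "changing only a subset of parameters" concrete, here is a minimal sketch in plain Python. It is a toy, not how any production system or library implements delta-tuning: the two-parameter linear model, the function name delta_tune, and the data points are all assumptions made up for illustration. The "base model" has a weight w and a bias b; only the names listed in trainable are updated, while everything else stays frozen.

```python
def delta_tune(params, trainable, data, lr=0.1, steps=200):
    """Gradient descent on the toy model y = w*x + b.

    Only the parameter names listed in `trainable` are updated;
    every other parameter keeps its base value (it is "frozen").
    """
    params = dict(params)  # copy, so the base model is left untouched
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error for the toy linear model
        grads = {
            "w": sum(2 * (params["w"] * x + params["b"] - y) * x for x, y in data) / n,
            "b": sum(2 * (params["w"] * x + params["b"] - y) for x, y in data) / n,
        }
        for name in trainable:  # frozen parameters are simply skipped
            params[name] -= lr * grads[name]
    return params


# Hypothetical base model and domain-specific data (generated from y = 2x + 1)
base = {"w": 2.0, "b": 0.0}
domain_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

tuned = delta_tune(base, trainable={"b"}, data=domain_data)
```

In this toy run the weight is untouched and only the bias moves toward the value the domain data implies. The same pattern, at vastly larger scale, is why the quality and relevance of the additional training data matters so much: the small set of tuned parameters absorbs whatever the data teaches it.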

Retrieval Augmentation

Retrieval augmentation - also known as retrieval-augmented generation (RAG) - is a process in which a model’s “internal knowledge” is combined with external sources of information to support question-answering or text generation tasks.

Typically, retrieval augmentation relies on a database of source or reference material, such as federal statutes or court opinions. When a user asks a question or a workflow generates a query, this query is first used to retrieve relevant statutes or opinions. Sometimes, this might happen through simple keyword search, but in many cases, more sophisticated techniques such as embeddings or knowledge graphs are used to retrieve the most relevant source material.
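The retrieval step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: word-count vectors stand in for learned embeddings, the document IDs and texts are invented placeholders rather than real policies or laws, and the function names (vectorize, cosine, retrieve) are hypothetical. A real system would typically use a trained embedding model and a vector database instead.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Word-count vector; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the IDs of the k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(
        documents,
        key=lambda doc_id: cosine(q, vectorize(documents[doc_id])),
        reverse=True,
    )
    return ranked[:k]


# Invented placeholder documents, not real legal or policy text
documents = {
    "policy-1": "Employees accrue vacation leave each month of service.",
    "policy-2": "Expense reports must be submitted within thirty days.",
    "rule-7": "Overtime pay applies to hours worked beyond forty per week.",
}

top_ids = retrieve("How is overtime pay calculated for hours beyond forty?", documents)
```

Even in this toy, the retriever can only surface what is in the collection: if the relevant document is missing, outdated, or wrong, no ranking method will fix it.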

Once this source material is identified, it is then combined with the user’s original query to generate a new prompt for a large language model. When this source material is of high quality, such techniques can dramatically reduce the likelihood of hallucination and increase the completeness and accuracy of generated output.
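As a rough sketch of this "combine" step, the Python below assembles retrieved passages and the user's question into a single prompt. The template wording, the function name build_prompt, and the sample passage are assumptions for illustration only; actual applications vary widely in how they format sources and instructions.

```python
def build_prompt(question, passages):
    """Combine retrieved passages with the user's question into one prompt."""
    # Number each passage so the model (and the reader) can cite it
    sources = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


# Invented placeholder passage, not real legal text
prompt = build_prompt(
    "When does overtime pay apply?",
    ["Overtime pay applies to hours worked beyond forty per week."],
)
```

The resulting string is what would be sent to the language model. Because the passages are pasted in verbatim, any error in the source material flows directly into the prompt.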

While such retrieval-based techniques do not typically rely on retraining or fine-tuning a foundation model, they do typically rely heavily on high-quality data for three purposes: source material curation, embedding model training, and knowledge graph extraction. Therefore, high-quality, domain-specific data is essential for the successful implementation of retrieval-augmented workflows or applications. Notably, the best data for such systems is often a combination of internal organizational knowledge like policies or FAQs with external public information like laws and rules.

Whether data is used for training foundation models, downstream tuning, or in the vector databases and embedding models of a retrieval-augmented approach, it’s clear that the quality of data is critical to the safety and success of an AI system. Organizations that aren’t careful about what they put into such systems shouldn’t be surprised by what comes out.

Jillian Bommarito, CPA, CIPP/US/E

Jillian is a Co-Founding Partner at 273 Ventures, where she helps ensure that Kelvin is developed and implemented in a way that is secure and compliant.

Jillian is a Certified Public Accountant and a Certified Information Privacy Professional with specializations in the United States and Europe. She has over 15 years of experience in the legal and accounting industries.
