How Demonstration Datasets and RLHF Drive the Success of LLMs

Over the past few years, chat engines like ChatGPT, Gemini, and others have transformed how we interact with artificial intelligence (AI). Based on large language models (LLMs), these systems have unlocked new capabilities in text generation, translation, and writing computer code. You could even get them to set up a workout routine tailored to your specific needs.

However, the most common use case for LLMs today is as conversational chatbots driving customer engagement. You’ve seen them pop up in ecommerce, banking, and a host of other everyday apps with a “How may I help you?” message. Of course, there is no “one size fits all” option here. For it to work across different businesses and industries, a typical LLM like Llama needs to be adapted to address domain-specific queries. This process is known as domain adaptation.

Why LLMs need fine-tuning

With their ability to process and understand language at scale, LLMs are driving innovation and efficiency across industries. From assisting in vaccine development and disease prevention in healthcare to supporting marketing teams with campaign ideation, banks with detecting fraud, and legal professionals with reviewing and drafting documents, these AI systems are making their presence felt across sectors. 

But if they are to work seamlessly in specific domains, LLMs need fine-tuning for domain adaptation. This process involves two steps: supervised fine-tuning and reinforcement learning from human feedback (RLHF).

Interestingly, both these steps involve human intervention. Extensive human-in-the-loop (HITL) services are required to carry out fine-tuning with high-quality data and to provide appropriate feedback on the suitability of the responses from the fine-tuned LLM. Only human evaluators can help make sure the LLMs provide appropriate and optimal responses to queries. Human input is essential to mitigate the risk of hallucinations, preventing AI models from generating incorrect or misleading information.

Understanding supervised fine-tuning

Consider the case of an ecommerce platform that is looking to incorporate a conversational AI chatbot into its apps. For this chatbot to work, it will need access to information about all products being sold on the platform, the company’s policies on returns and cancellations, guidelines for people who are unable to log in, and a whole lot more. All of this data will help the model to specialise, enabling it to handle user questions with accuracy and improve customer interactions.

However, this is only the first step. A standard LLM needs to be fine-tuned with a specific question-answering dataset as well. Also known as demonstration data, the question-answering dataset includes sample end-to-end conversations between a customer and a customer support agent.
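To make this more concrete, here is a minimal sketch of what a single record in such a demonstration dataset might look like, stored as JSON Lines. The field names (`messages`, `role`, `content`, `intent`) follow a common chat-style convention and are assumptions for illustration, not a prescribed schema.

```python
import json

# One illustrative demonstration record: a short end-to-end exchange between a
# customer and a support agent. The structure is an assumption, following a
# common chat-style format; real schemas vary by project.
record = {
    "messages": [
        {"role": "customer", "content": "My order arrived damaged. Can I return it?"},
        {"role": "agent", "content": "I'm sorry to hear that. Items can be returned within "
                                     "30 days of delivery. I can start the return for you now."},
        {"role": "customer", "content": "Yes, please start the return."},
        {"role": "agent", "content": "Done. You'll receive a prepaid shipping label by email shortly."},
    ],
    "intent": "returns_and_refunds",
}

# Demonstration data is typically stored one record per line (JSON Lines).
with open("demonstration_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```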

There are different ways of collecting such customer support information for the ecommerce chatbot. Outlined below are three such approaches:

  1. If the platform has voice support, customer support conversations are accessed for different scenarios and used to build a question-answering dataset. 
  2. The question-answering dataset is created manually.
  3. A synthetically created question-answering dataset is used.

The goal here is to create “golden datasets”—well-defined and reliable sources of information that help LLMs carry out precise analytics and decision-making. Keeping this in mind, the first option above may seem like a good source for a golden dataset, but that isn’t always the case. Quite often, voice conversations are not considered for this step because removing sensitive personal information and cleaning up the data can be a massive undertaking.

The better option is to get this done manually, which also allows teams to integrate some product-specific details into the conversations. This is a labour-intensive HITL process, where a team is given guidelines along with the specific intents for which the conversations need to be created.

This entirely manual approach usually results in a golden dataset, but the process can be rather expensive. That is why most enterprises opt for an existing synthetic dataset instead—it is more cost-effective. However, as such a dataset is created synthetically, it may contain sensitive information and carry the risk of incorrect handling. So, the dataset needs to be validated through the HITL process using a pre-defined rubric. A rubric is a scoring guide that defines and evaluates specific elements and expectations for an instruction.
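As a rough illustration, such a rubric can be expressed as a set of named criteria that a human reviewer marks for each synthetic record, with the marks rolled up into a score. The criteria, weights, and passing threshold below are hypothetical examples, not a standard.

```python
# A minimal sketch of rubric-driven validation of synthetic Q&A records.
# The criteria, weights, and passing threshold are illustrative assumptions.
RUBRIC = {
    "factually_consistent_with_policy": 2,   # weight of each criterion
    "no_sensitive_personal_information": 3,
    "resolves_the_stated_intent": 2,
    "tone_matches_brand_guidelines": 1,
}

def score_record(reviewer_marks: dict[str, bool]) -> float:
    """Convert a reviewer's pass/fail marks into a weighted score in [0, 1]."""
    total = sum(RUBRIC.values())
    earned = sum(w for name, w in RUBRIC.items() if reviewer_marks.get(name, False))
    return earned / total

# Example: a reviewer flags missing PII scrubbing on one synthetic record.
marks = {
    "factually_consistent_with_policy": True,
    "no_sensitive_personal_information": False,
    "resolves_the_stated_intent": True,
    "tone_matches_brand_guidelines": True,
}
passes = score_record(marks) >= 0.8  # records below the threshold go back for rework
print(score_record(marks), passes)   # 0.625 False
```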

How RLHF makes LLMs more “human” 

At their core, LLMs are designed to predict the next word or phrase. For instance, if you prompt a model with “The child tossed the ball into the air…”, it might complete the thought with “and caught it”. By using the supervised fine-tuning process, we have enabled the LLM to respond to questions specific to a domain. But when the training dataset is relatively small, there is the risk that the LLM may hallucinate and respond incorrectly. 

This is where RLHF steps in, transforming an LLM from a sophisticated word predictor into a tool that is capable of understanding and aligning with human intent in that particular space. RLHF uses a reward system powered by human feedback to teach the model which responses are the most relevant and helpful. So instead of relying on only demonstration data, RLHF incorporates direct feedback from human evaluators. 
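In practice, this human feedback is often collected as pairwise preferences: evaluators see two candidate responses to the same prompt and mark which one is better, and those judgements are later used to train a reward model. The sketch below covers only this data-collection step, with illustrative field names rather than a fixed RLHF schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PreferencePair:
    """One human judgement: which of two candidate responses is better for a prompt.
    Field names are illustrative, not a fixed RLHF schema."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", as marked by the human evaluator

pair = PreferencePair(
    prompt="Can I cancel an order after it has shipped?",
    response_a="No.",
    response_b="Once an order has shipped it can no longer be cancelled, "
               "but you can refuse delivery or request a return once it arrives.",
    preferred="b",  # the fuller, policy-grounded answer is ranked higher
)

# Preference pairs are accumulated and later used to train a reward model,
# which in turn guides the LLM during the reinforcement learning step.
with open("preference_pairs.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(asdict(pair)) + "\n")
```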

The LLM gets smarter with each interaction—learning, adapting, and improving based on real-world inputs. This iterative RLHF process helps the LLM get better at producing contextually appropriate answers that closely mimic human thought processes.

Quality processes in HITL

LLMs learn from the data they’re trained on. To ensure they perform well, they need to be provided with a golden dataset as the demonstration data. When demonstration data is created with clear and precise information, the LLM’s performance improves.

Clear and well-defined rubrics are essential wherever a team is involved in validating either the synthetic data for supervised fine-tuning or the output of the model for RLHF. As this is an evolving practice, HITL teams need to work closely with the customer to develop well-defined and unambiguous rubrics and guidelines for the operations team. 

Ensuring accuracy and reliability holds the key here, and quality measures are essential throughout the annotation process. What would this entail? Regular checks as a part of the daily deliverables top the list. Also needed are planned quality assurance checks to identify and correct errors, ensuring there is a golden dataset for the LLMs. One common approach is double-blind validation to resolve bias and subjectivity in the rubrics.
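One way to put double-blind validation into practice is to have two annotators label the same records independently and then measure how often they agree; low agreement usually points to an ambiguous rubric rather than a careless annotator. The sketch below uses simple percentage agreement and Cohen's kappa as illustrative metrics, assuming pass/fail labels.

```python
from collections import Counter

def percent_agreement(a: list[str], b: list[str]) -> float:
    """Fraction of items on which two blinded annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail labels from two independent reviewers of the same records.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
reviewer_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(percent_agreement(reviewer_1, reviewer_2))  # ~0.83
print(cohens_kappa(reviewer_1, reviewer_2))       # ~0.63
```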

Fine-tuned LLMs will redefine the future

Fine-tuned LLMs will shape the future of AI, both now and in the years ahead. But HITL data generation and validation for fine-tuning is highly specialised work, and businesses will need support from HITL experts like NextWealth to upgrade their LLMs. NextWealth can help enterprises develop golden datasets for supervised fine-tuning and rubric-driven human evaluation with the aim of ultimately improving the user experience.

After all, as the latest AI models bring transformative applications into every sphere of life and work, fine-tuning processes like domain adaptation and RLHF will hold the key to new possibilities and innovation.

