the ai playbook part 3 - agents and rag and fine tuning, oh my!

sunk thought’s - the ai playbook

“the ai playbook” is a mini-book about integrating ai within your software engineering organization. This is part 3, on how to use Agents, RAG, and Fine Tuning to extend your local/private ai toolkit. I can’t be sure how long there will be between parts, as this isn’t completely written yet. But rather than save it all up until the whole thing is finished, I’d rather not hoard it, since timeliness is of the essence in this space (things are changing rapidly).

How to Use Agents, RAG, and Fine Tuning to Enhance Your Personal AI Toolkit

Three of the most powerful tools for extending your local Large Language Model (LLM) today are Agents, Retrieval-Augmented Generation (RAG), and Fine Tuning. My hope is that further understanding the basics of how these work and when to use them will help anyone grasp the potential applications in their personal AI toolkit (like the one we created in Part 2 of this series.)

In addition to diving into the basics of Agents, RAG, and Fine Tuning, we’ll also build upon the AI toolkit we created in part 2 as I walk you through a pair of solutions for creating a custom local LLM implementation that attempts to mimic your personal writing style. We'll cover gathering your data and then show you how to use that data two ways: RAG (easy) using Open WebUI’s built-in tools, and LoRA retraining (advanced) of a given LLM using Apple’s MLX on Macs (with links to other options for Linux/Windows folks.)

By the end, you’ll have a better understanding of what these tools are all about, how to use them, where to learn more about them, and have a more personalized LLM experience running locally on your computer!

Agents: The Backbone of AI Systems

Agents are autonomous entities that perceive their environment through sensors and act upon it using actuators. They’re software programs designed to perform specific tasks without human intervention. Implementing an agent usually involves several key components:

  1. Sensors: Devices or software modules that collect data from the environment (e.g., cameras, microphones, or APIs).

  2. Actuators: Components that enable the agent to interact with the environment (e.g., motors, speakers, or user interfaces).

  3. Knowledge Base: A repository of information that the agent uses to make decisions.

  4. Decision-Making Algorithm: The core logic that processes sensor data and determines the agent's actions. (A minimal sketch of these pieces working together follows this list.)
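To make those four components concrete, here’s a minimal sketch of an agent loop in Python. It assumes you have Ollama running locally (as set up in part 2) with a small model pulled; the sense and act functions are placeholders I made up, so swap in whatever inputs and outputs your agent actually needs.

```python
# A minimal agent loop: sense -> decide (via a local LLM) -> act.
# Assumes a local Ollama server on its default port and a pulled model.
import requests

KNOWLEDGE_BASE = []  # grows as the agent observes its environment


def sense() -> str:
    """Sensor: collect raw data from the environment (stubbed out here)."""
    return "Example observation: a new article about local LLMs was published."


def decide(observation: str) -> str:
    """Decision-making: ask a local model what to do with the observation."""
    KNOWLEDGE_BASE.append(observation)
    prompt = (
        "You are a monitoring agent. Given this observation, reply with a "
        f"one-sentence summary worth saving:\n{observation}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"]


def act(summary: str) -> None:
    """Actuator: write the decision back out to the environment."""
    with open("agent_log.txt", "a") as f:
        f.write(summary + "\n")


if __name__ == "__main__":
    act(decide(sense()))
```

Real agents run that loop continuously and use richer sensors and actuators, but the shape stays the same.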

Valuable Use Cases for Agents

There are as many possible uses for AI Agents as you can imagine, but the big categories of agents are Chatbots, Aggregators, and Realtime Monitors.

Chatbots:

  • Virtual assistants like Siri and Alexa are examples of agents that interact with users, answer questions, and perform tasks. They use natural language processing (NLP) to understand user inputs and generate responses.

  • Chat agents can gather customer data from various sources to provide personalized support and recommendations.

Aggregators:

  • Collect news articles from multiple sources to create a personalized news feed or perform sentiment analysis on current events.

  • Search, scrape, and format newly published specialty knowledge into an archive for use in RAG.

Realtime Monitors:

  • Agents can collect patient data from wearable devices and medical records to monitor health conditions.

  • Devices like thermostats and security systems can operate as agents, adjusting settings based on user preferences and environmental data. They use machine learning to learn from user behavior and optimize usage.

Learn More About Agents

Hugging Face, creator of smolagents, has a brand-new course I recommend: the Hugging Face Agents Course.

Additionally, LangChain is an incredibly popular way to create agents. There’s a free course here on the basics.

Setup for our Example

We’re going to run an experiment across the next two sections on RAG and Fine Tuning, aimed at getting a custom LLM experience in our private personal AI toolkit that writes content more like we ourselves would.

Gathering Writing Samples

First, we need to collect as many writing samples as possible and put them together somewhere on your computer:

  1. Create a folder (e.g., writing_samples) to store PDFs, articles, emails, notebooks, and/or blog posts. Anything you’ve got, bring it on.

  2. Organize the files in this folder using subfolders for each type of writing so you can refine usage later (ex: writing emails, tweets, etc.)

  3. Don’t worry about formatting for Fine Tuning or RAG yet; we will address this later. (A small helper for taking stock of what you’ve gathered follows this list.)
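If you want a quick sanity check on how much material you’ve collected, here’s a tiny, optional helper. It assumes the folder is named writing_samples like the example above; point it wherever you actually put things.

```python
# Quick inventory of a writing_samples folder: counts files and rough word
# totals per subfolder so you know how much raw material you actually have.
from pathlib import Path

root = Path("writing_samples")
for subfolder in sorted(p for p in root.iterdir() if p.is_dir()):
    files = [f for f in subfolder.rglob("*") if f.is_file()]
    words = 0
    for f in files:
        if f.suffix.lower() in {".txt", ".md"}:
            words += len(f.read_text(errors="ignore").split())
    print(f"{subfolder.name}: {len(files)} files, ~{words} words of plain text")
```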

Now that we have as many samples of our writing as possible, we can move on… In the sections that follow, we’ll take this data and create a customized LLM instance that mimics our own voice, so we can see which implementation we like better in which cases.

(Added 2/20/25) Using the learnings of the next section, I created a “Custom Model” in Open WebUI called “Sir David Attenborough” to reply to me as though it was the famed nature documentary narrator… I scraped YouTube videos (this is personal use only, don’t worry) and stored the transcripts in a folder for my LLM’s RAG knowledge base (in hopes of better nature analogies and vocab.) I also gave it an elaborate system prompt about Sir David and how it should write. I forgot I had done this, and came back the next morning to blindly ask it a question …then the reply came and milk shot out of my nose!

The world we live in is weird these days, eh?

Retrieval-Augmented Generation (RAG): Enhancing AI Responses

Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based methods with generative models to produce more accurate and contextually relevant responses. It retrieves relevant information from a knowledge base before generating a response, ensuring higher quality outputs. The key components of RAG include:

  1. Retriever: A model or algorithm that searches the knowledge base for relevant information based on the input query.

  2. Generator: A generative model (e.g., a language model) that produces responses using the retrieved information.

  3. Knowledge Base: A structured repository of data, such as documents, databases, or APIs. (A back-of-the-napkin sketch of all three follows this list.)
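Here’s what those three pieces look like stripped down to a few lines of Python. This is a sketch, not what Open WebUI does internally: it assumes the sentence-transformers package for the retriever and a local Ollama model for the generator, with a two-document knowledge base I made up.

```python
# Back-of-the-napkin RAG: embed a tiny knowledge base, retrieve the closest
# passage to the question, and hand both to a local model for generation.
# Assumes `pip install sentence-transformers requests` and a local Ollama server.
import requests
from sentence_transformers import SentenceTransformer, util

documents = [  # the "knowledge base" (normally thousands of chunked documents)
    "Our bereavement leave policy allows up to five paid days off.",
    "Laptops are requisitioned through the IT portal, not expensed.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, convert_to_tensor=True)


def answer(question: str) -> str:
    # Retriever: rank documents by cosine similarity to the question.
    q_vector = embedder.encode(question, convert_to_tensor=True)
    best = util.cos_sim(q_vector, doc_vectors)[0].argmax().item()
    context = documents[best]
    # Generator: let the LLM answer using the retrieved context.
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"]


print(answer("How do I get a new laptop?"))
```

Everything in the Open WebUI example below is this same pattern, just with proper document chunking and a GUI on top.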

Valuable Use Cases for RAG

  • Medical Diagnostics: RAG can retrieve medical records and research papers to assist doctors in diagnosing diseases accurately, providing contextually relevant information.

  • Research: Enhance accuracy and citation generation for all kinds of research. Ex: Lawyers can use RAG to retrieve relevant case law and legal precedents, enhancing the accuracy of their arguments and decisions.

  • Knowledge Base: Rapid discovery of information at large organizations, helping employees find answers scattered across internal systems (ex: “I need a new laptop, how do I requisition or expense one?” or “What is the bereavement leave policy?” or “What were total sales for productX in Q4?”)

  • Data Freshness: Keep your LLM up to date with the latest information you deem valuable that appeared after its training cutoff, allowing it to maintain relevance on important subject matter.

  • Content Creation: Writers and journalists can use RAG to generate articles by retrieving relevant information from a vast database of sources, ensuring high-quality and contextually accurate content.

Example: Using Open WebUI’s RAG Tools With Our Data

In the previous part of this playbook we installed Ollama and Open WebUI for interacting with local LLMs. For personal use, Open WebUI has some pretty powerful RAG features built in, which we’ll use here to try and get our assistant to write in our personal style.

Adding Your Files to a Knowledge Base

Now that we have all our data gathered up and separated into folders, let’s go to our Open WebUI instance in our web browser and click on “Workspace” and then select the “Knowledge” tab. Click the plus symbol to create a new knowledge base and simply upload your entire base directory.

Luckily, they do a lot of the heavy lifting of parsing PDFs and so forth for us, so no need to clean this up too much in this example.

Updating the Documents Embedding Model

The goal of an embedding model is to take the text that we provide and convert it to a numerical representation. The cool thing is you can find embedding models that have been trained on various domains. So, if you most often write about a specialty field like data science, healthcare, finance, or aerospace, there’s likely an open-source embedding model you can use to further enhance your results.
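If “numerical representation” feels abstract, here’s a small sketch of what an embedding model actually produces. It uses the sentence-transformers package with a common general-purpose model; that’s just an illustration on my part, since Open WebUI ships with its own default and lets you swap in a domain-specific one.

```python
# What an embedding model actually does: turn text into a vector of numbers.
# Assumes `pip install sentence-transformers`; all-MiniLM-L6-v2 is a common
# general-purpose choice, swap in a domain-specific model if your field has one.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([
    "Quarterly revenue grew 12% year over year.",
    "The patient presented with elevated blood pressure.",
])
print(vectors.shape)   # (2, 384): two sentences, 384 numbers each
print(vectors[0][:5])  # first few dimensions of the first sentence
```

Texts about similar subjects end up with similar vectors, which is exactly what the retriever relies on to find relevant passages.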

System Prompts

Another thing I recommend is using System Prompts to help the LLM mimic our style. You can give it instructions on tonality, formality, and any other instructions on how to be more like you.

Go create a new chat. Click the “Controls” button on the upper right (it looks like a circle with sliders inside it.) Within the system prompt we can include something like the following.

Instructions: 

You are writing drafts for Wil Everts, a lifelong technologist with more than 25 years of software engineering and leadership for innovative technology companies in Cloud, Enterprise, and Web3 applications. 

Be intellectually curious and bring a sense of humor to your writing.

Don't be too formal. Write conversationally in an inclusive manner that seeks to instruct readers as simply as possible. Try to use lists as little as possible, instead write paragraphs.

Avoid using rote words or phrases or repeatedly saying things in the same or similar ways. Vary your language just as one would in a conversation.

Use fun metaphors and parables to teach complex problems in order to make technical subjects more relatable to non-technical readers.

Provide thorough responses to more complex and open-ended questions or to anything where a long response is requested, but concise responses to simpler questions and tasks.

Then, without invoking your knowledge base at all, test how much better you can make the writing style just by updating the system prompt. Once you’ve gotten as far as you can with that we’re ready to put it all together.

Putting It All Together in a “Custom Model”

Let’s put this all together by creating a “Custom Model.” Go to “Workspace,” then “Custom Models” in Open WebUI. Create a new model. Select your desired LLM, enter your system prompt from before, and maybe update the context length to the full model capability (Ollama defaults to only 2k tokens) and the temperature while you’re at it.

Next, under “Knowledge” select your Open WebUI knowledge base and finish your model.

Now, go back to create a new chat with your new custom model and enjoy!
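For the curious: the knobs the Custom Model screen exposes (system prompt, context length, temperature) also exist as plain parameters on Ollama’s REST API. This sketch isn’t how Open WebUI stores your custom model; it’s just a way to poke at the same settings directly and see what they do.

```python
# The same knobs as the Custom Model screen, expressed as a raw Ollama API call.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",  # whichever model your custom model is based on
        "system": "You are writing drafts for Wil Everts...",  # your system prompt
        "prompt": "Draft a short intro paragraph about RAG.",
        "options": {
            "num_ctx": 8192,       # raise from Ollama's 2k-token default
            "temperature": 0.8,
        },
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```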

I’m sure you can already imagine all sorts of other uses for this, but here are a couple more:

  • Use RAG to help you find the most qualified candidates for a given job’s requirements from a folder full of resumes

  • Use RAG to identify legal precedents from a knowledge base of standing case summaries

Pros of Using RAG for This Use Case

Contextual Relevance: RAG ensures that the generated text is contextually relevant to your personal writing style.

Ease of Implementation: Open WebUI makes RAG so easy that anyone can do it!

Learn More About RAG

Learn more about using the RAG features in Open WebUI for use with your toolkit. There’s also a repository of additional tools you can install to extend Open WebUI’s capabilities.

For corporate use cases, Ragie is a cool service that can pull data from various cloud services like Notion, Google Docs, etc.

Fine Tuning Large Language Models

Retraining involves updating an AI model with new data to improve its performance and adapt to changing conditions. This process is crucial for maintaining the accuracy and relevance of AI systems over time. Key components of fine tuning include:

  1. New Data: Fresh information collected from various sources, such as user interactions, sensor readings, or external databases.

  2. Model Architecture: The structure of the AI model, which may need adjustments to accommodate new data.

  3. Training Algorithm: The method used to update the model's parameters based on the new data (e.g., gradient descent, stochastic gradient descent).

Valuable Use Cases for Fine Tuning

  • Natural Language Processing (NLP) and Multilingual Models: Fine-tune a multilingual LLM to improve performance in low-resource languages by adapting it to language-specific datasets.

  • Personalized Content Creation: Fine-tune a pre-trained LLM to generate content in your personal writing style by adapting it to your specific dataset of blog posts, articles, and social media updates.

  • Domain-Specific Models (Law, Healthcare, etc): Adapt a general-purpose LLM to a specific domain, such as healthcare or finance, by fine-tuning it on domain-specific data.

  • Continuous Learning: Continuously update the model with new data without forgetting previously learned information, making it suitable for applications that require ongoing adaptation.

LoRA Fine Tuning with MLX

We are going to stick with our theme of doing things easily, locally, and on a budget. So, to fine tune our model we’re going to use Low-Rank Adaptation (LoRA.) LoRA allows you to adapt a pre-trained model to new tasks or domains without retraining the entire model from scratch. This approach is particularly useful when dealing with LLMs that have billions of parameters, as it significantly reduces the computational resources and time required for fine-tuning.

Most importantly for our use case: by updating only the low-rank matrices, LoRA requires significantly less computational power compared to full fine-tuning and the training process is much faster due to the smaller size of the low-rank matrices. Storing and updating low-rank matrices also require less memory, making it feasible to fine-tune large models on hardware with limited resources.

In other words, rather than renting massive racks of GPUs to completely retrain a model, MLX and LoRA let us fine tune it for free on our laptop instead.

How LoRA Retraining Works

In LoRA we start with a pre-trained LLM that has already learned general language patterns and structures and introduce low-rank matrices into the model's layers. During this process, the low-rank matrices are updated to capture the specific patterns and structures of the new data, while the original weights provide the general language understanding. Finally, during inference, the adapted model uses both the original weights and the updated low-rank matrices to generate responses. This combination allows the model to leverage its pre-trained knowledge while also incorporating the new information — or as Matt Williams put it, “It’s like training a chef that’s already good at cooking to cook a specific cuisine.”
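To see just how small those low-rank matrices are, here’s a toy illustration in Python. This is not MLX’s actual implementation, just arithmetic showing why training two skinny matrices is so much cheaper than updating a full weight matrix.

```python
# Toy illustration of why LoRA is cheap (not MLX's actual implementation).
# Instead of updating a full weight matrix W, we learn two skinny matrices
# A and B whose product is the update: W_adapted = W + B @ A.
import numpy as np

d, r = 4096, 8                      # hidden size vs. LoRA rank
W = np.random.randn(d, d)           # frozen pre-trained weights (~16.8M values)
A = np.random.randn(r, d) * 0.01    # trainable: r x d, small random init
B = np.zeros((d, r))                # trainable: d x r, starts at zero

W_adapted = W + B @ A               # what the adapted model effectively uses

full = W.size
lora = A.size + B.size
print(f"Full fine-tune updates {full:,} values; LoRA trains {lora:,} "
      f"({lora / full:.2%} as many).")
```

With a rank of 8 on a 4096-wide layer, that works out to well under one percent of the original parameters, which is why this fits on a laptop.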

Example: Processing and Fine Tuning Our Model

Since this post is getting very long and this concept is by far the most complicated, I’m going to leave you in the capable hands of Matt Williams for the final step. He was a member of the original Ollama team and is super knowledgeable! He also has fantastic basic and advanced playlists for using Ollama you might want to check out for your expanding personal toolkit’s benefit.

In this video Matt very patiently walks you through the basics and the pitfalls in the process of preparing your data and retraining the model.

I’m not going to lie to you. This took me a while to get down, particularly preparing my data for training… Thankfully Matt does address this in the video, but it still took me a bit.

I also found that, unsurprisingly, I didn’t quite have enough data with my blog posts. I hunted down all my “Day One” archives going back a few years, and the last 100 emails I’ve sent. I also used a script that my AI assistant wrote to pull down my past tweets and facebook posts to add to my fairly small dataset, and included some additional industry writings that have influenced my work philosophy.

Then I converted any PDFs in that collection using another script my AI assistant wrote for me, and followed Matt’s directions from there for preparing my data and fine tuning with MLX.
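For a rough sense of what my data prep scripts looked like, here’s a sketch that converts PDFs and text files into JSONL training files. The exact format mlx-lm expects can change between versions (mine wanted one {"text": ...} record per example), so treat this as a starting point and defer to Matt’s video and the mlx-lm docs; the pypdf dependency and the file paths are my own choices, not part of his process.

```python
# Rough shape of the data prep step: convert writing samples into train/valid
# JSONL files for MLX LoRA fine tuning. Check the mlx-lm docs for the exact
# record format your version expects. Assumes `pip install pypdf`.
import json
from pathlib import Path
from pypdf import PdfReader

samples_dir = Path("writing_samples")
records = []

for path in samples_dir.rglob("*"):
    if path.suffix.lower() == ".pdf":
        text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    elif path.suffix.lower() in {".txt", ".md"}:
        text = path.read_text(errors="ignore")
    else:
        continue
    text = text.strip()
    if len(text.split()) > 50:            # skip fragments too short to teach style
        records.append({"text": text})

split = int(len(records) * 0.9)           # simple 90/10 train/validation split
Path("data").mkdir(exist_ok=True)
for name, chunk in [("train.jsonl", records[:split]), ("valid.jsonl", records[split:])]:
    with open(Path("data") / name, "w") as f:
        f.write("\n".join(json.dumps(r) for r in chunk) + "\n")

# Training itself is then one command (again, flags may differ by version):
#   python -m mlx_lm.lora --model mistralai/Mistral-7B-Instruct-v0.3 --train --data data
```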

Pros of Using Fine Tuning for This Use Case

Personalization: Fine Tuning allows the model to capture your personal writing style better than RAG can.

Adaptability: The model can adapt to new data and improve over time.

Learn More About Fine Tuning

MLX is the best choice for fine tuning on Apple silicon Macs.

Unsloth is the big thing for MLX-like fine tuning for Windows and Linux folks right now. Matt also has a video on using Unsloth for fine tuning.

When to Fine Tune and When to Use RAG

When it comes to enhancing your personal AI toolkit, you have two super powerful options available for no cost: Retrieval-Augmented Generation (RAG) and Low-Rank Adaptation (LoRA) retraining. Each approach has its own strengths and weaknesses, making them better suited for different tasks. Let's dive into the trade-offs between these methods.

Retrieval-Augmented Generation (RAG)

Strengths

  1. Contextual Relevance: RAG ensures that the generated text is contextually relevant to your personal writing style by retrieving specific information from a knowledge base before generating a response.

  2. Efficiency: RAG allows you to generate responses quickly without retraining the entire model, which can be labor and hardware-intensive.

  3. Flexibility: RAG can be easily adapted to different tasks by updating the knowledge base with relevant information.

  4. Ease of Use: For a non-technical person, just using the Open WebUI GUI makes RAG a pretty appealing choice if fine tuning feels intimidating.

Weaknesses

  1. Dependency on Knowledge Base: The quality of RAG-generated responses heavily depends on the comprehensiveness and accuracy of the knowledge base.

  2. Limited Personalization: While RAG can mimic your writing style to some extent, it may not capture the nuances as effectively as retraining.

Best Suited For

  • Content Creation

  • Customer Support

  • Research Assistance

Low-Rank Adaptation (LoRA) Retraining/Fine-Tuning

Strengths

  1. Personalization: LoRA retraining allows the model to better capture your personal writing style by adapting it to your specific dataset of blog posts, articles, and social media updates.

  2. Adaptability: The model can adapt to new data and improve over time, making it suitable for applications that require ongoing adaptation.

  3. Efficiency: LoRA requires significantly less computational power compared to full fine-tuning, making it feasible to retrain large models on hardware with limited resources.

Weaknesses

  1. Complexity: Retraining a model can be more complex and time-consuming than setting up a RAG pipeline.

  2. Resource Intensive: While LoRA is more efficient than full fine-tuning, it still requires more computational resources than RAG.

Best Suited For

  • Personalized Content Creation

  • Domain-Specific Models

  • Continuous Learning

Specific Task Comparisons

Task: Generating Personalized Blog Posts

RAG:

  • Quickly generates blog posts by retrieving relevant information from your knowledge base.

  • Ensures contextual relevance but may lack the personal touch of retrained models.

LoRA Retraining:

  • Captures the nuances of your writing style more effectively.

  • Requires more time and resources to set up but provides a more personalized output.

Task: Enhancing Customer Support Chatbots

RAG:

  • Efficiently retrieves specific customer data or past interactions to provide personalized support.

  • Quick to implement and update with new information.

LoRA Retraining:

  • Can adapt the chatbot's responses to better match your brand's voice and tone.

  • More resource-intensive but offers a higher level of personalization.

Task: Assisting in Legal Research

RAG:

  • Retrieves relevant case law and legal precedents quickly, enhancing the accuracy of arguments and decisions.

  • Ideal for tasks that require rapid access to specific information.

LoRA Retraining:

  • Can be fine-tuned to understand legal terminology and structures more deeply.

  • Better suited for tasks that require a comprehensive understanding of legal concepts.

Both RAG and fine-tuning have their own advantages and disadvantages. RAG is ideal for tasks that require quick access to specific information and contextual relevance, while LoRA retraining is better suited for tasks that demand a high level of personalization and adaptability.

We’ve covered a lot. Hopefully by now you have a better feel for Agents, RAG, and Retraining a Large Language Model.

How did the experiments work out for you?

For me, retraining was actually pretty easy after I got my data formatted, and it did a better job of steering its tone than RAG did. But I did still create a custom model in Open WebUI with my special system prompt from the RAG section to help take it to the next level.


Thanks for reading! In part 4 we’re going to get into how to think about implementing AI organizationally so that you can avoid common pitfalls that have prevailed to date.
