Retrieval-augmented generation (RAG)

Use case

Suppose you have some text documents (PDF, blog, Notion pages, etc.) and want to ask questions related to the contents of those documents.

LLMs, given their proficiency in understanding text, are a great tool for this.

In this walkthrough we'll go over how to build an application for question answering over documents using LLMs.

Two very related use cases which we cover elsewhere are:

intro.png

Overview

The pipeline for converting raw unstructured data into a QA chain looks like this:

  1. Loading: First we need to load our data. Use the LangChain integration hub to browse the full set of loaders.
  2. Splitting: Text splitters break Documents into splits of a specified size.
  3. Storage: Storage (often a vectorstore) houses the splits and often embeds them.
  4. Retrieval: The app retrieves splits from storage (often those whose embeddings are similar to that of the input question).
  5. Generation: An LLM produces an answer using a prompt that includes the question and the retrieved data.

flow.jpeg
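
The five stages can be pictured with a dependency-free toy before bringing in any library. This is a minimal sketch, not how LangChain implements them: the word-count "embedding", the sentence splitter, and the sample text are all assumptions made for illustration, and the last stage only builds the prompt instead of calling a model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# 1. Load: here the "document" is just an in-memory string.
raw_text = "Cats sleep most of the day. Dogs enjoy long walks. Parrots can mimic speech."

# 2. Split: one split per sentence.
splits = [s.strip() for s in raw_text.split(".") if s.strip()]

# 3. Store: keep each split next to its embedding.
store = [(split, embed(split)) for split in splits]

# 4. Retrieve: pick the split most similar to the question.
question = "What do dogs enjoy?"
q_vec = embed(question)
best_split = max(store, key=lambda item: cosine(q_vec, item[1]))[0]

# 5. Generate: build the prompt an LLM would receive (no model call here).
prompt = f"Context: {best_split}\nQuestion: {question}\nAnswer:"
print(prompt)
```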

Quickstart

Suppose we want a QA app over this blog post.

We can create this in a few lines of code.

First set environment variables and install packages:

pip install langchain openai chromadb langchainhub

# Set env var OPENAI_API_KEY or load from a .env file
# import dotenv
# dotenv.load_dotenv()

# Load documents
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")

# Split documents
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(loader.load())

# Embed and store splits
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Prompt
# https://smith.langchain.com/hub/rlm/rag-prompt
from langchain import hub

rag_prompt = hub.pull("rlm/rag-prompt")

# LLM
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# RAG chain
from langchain.schema.runnable import RunnablePassthrough

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | rag_prompt | llm
rag_chain.invoke("What is Task Decomposition?")
    AIMessage(content='Task decomposition is the process of breaking down a task into smaller subgoals or steps. It can be done using simple prompting, task-specific instructions, or human inputs.')

Here is the LangSmith trace for this chain.

Below we will explain each step in more detail.

Step 1. Load

Specify a DocumentLoader to load in your unstructured data as Documents.

A Document is an object that holds text (page_content) and a metadata dict.

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()
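
To make the shape of a loaded Document concrete, here is a minimal sketch using a dataclass. `ToyDocument` is a hypothetical stand-in, not LangChain's class; only the two field names, `page_content` and `metadata`, come from the text above, and the sample text is invented.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a loaded Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


doc = ToyDocument(
    page_content="Example text of a loaded page.",
    metadata={"source": "https://lilianweng.github.io/posts/2023-06-23-agent/"},
)
# Text and metadata are accessed as attributes.
print(len(doc.page_content), doc.metadata["source"])
```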

Go deeper

  • Browse the > 160 data loader integrations here.
  • See further documentation on loaders here.

Step 2. Split

Split the Document into chunks for embedding and vector storage.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
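
The roles of chunk_size and chunk_overlap can be seen in a fixed-window splitter. This is a simplified sketch under the assumption that sizes are measured in characters; `split_text` is a hypothetical helper, and the real RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and lines before falling back to raw characters, which this toy does not do.

```python
def split_text(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Each window repeats the last chunk_overlap characters of the previous one.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not made only of overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

With chunk_overlap=0, as in the walkthrough, consecutive chunks share no characters.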

Go deeper

  • DocumentSplitters are just one type of the more generic DocumentTransformers.
  • See further documentation on transformers here.
  • Context-aware splitters keep the location ("context") of each split in the original Document.

Step 3. Store

To look up our document splits later, we first need to store them somewhere searchable.

The most common way to do this is to embed the contents of each document split.

We store the embedding and splits in a vectorstore.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
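
Conceptually, a vectorstore keeps each split next to its embedding vector. The sketch below illustrates only that idea: `toy_embed` (a word-hashing trick) and `ToyVectorStore` are hypothetical names, not Chroma or OpenAI APIs, and a real embedding model produces far more meaningful vectors.

```python
import hashlib
import re

DIM = 16  # arbitrary toy dimension


def toy_embed(text):
    """Hash each word into one of DIM buckets and count hits (not a real embedding model)."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z]+", text.lower()):
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    return vec


class ToyVectorStore:
    """Holds texts and their vectors in two parallel lists."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    @classmethod
    def from_texts(cls, texts):
        store = cls()
        for text in texts:
            store.texts.append(text)
            store.vectors.append(toy_embed(text))
        return store


store = ToyVectorStore.from_texts(["task decomposition", "agent memory"])
print(len(store.vectors), len(store.vectors[0]))
```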

Go deeper

  • Browse the > 40 vectorstore integrations here.
  • See further documentation on vectorstores here.
  • Browse the > 30 text embedding integrations here.
  • See further documentation on embedding models here.

Here are Steps 1-3:

lc.png

Step 4. Retrieve

Retrieve relevant splits for any question using similarity search.

This is simply "top K" retrieval where we select documents based on embedding similarity to the query.

question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
len(docs)
    4
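
"Top K" selection itself is a few lines of code. The sketch below assumes cosine similarity as the metric and uses hand-written two-dimensional vectors; a given vectorstore may use a different distance, and `top_k` is a hypothetical helper. The default k=4 simply mirrors the four results returned above.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return indices of the k vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return [idx for _, idx in heapq.nlargest(k, scored)]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k([1.0, 0.0], doc_vecs, k=2))  # [0, 2]
```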

Go deeper

Vectorstores are commonly used for retrieval, but they are not the only option. For example, SVMs (see thread here) can also be used.

LangChain has many retrievers including, but not limited to, vectorstores.

All retrievers implement a common method get_relevant_documents() (and its asynchronous variant aget_relevant_documents()).

from langchain.retrievers import SVMRetriever

svm_retriever = SVMRetriever.from_documents(all_splits, OpenAIEmbeddings())
docs_svm = svm_retriever.get_relevant_documents(question)
len(docs_svm)
    4
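
The benefit of a shared method is that calling code does not care which retriever it receives. A rough sketch of that idea follows; `Retriever`, `KeywordRetriever`, and `answer_sources` are hypothetical names written for this illustration, and only the method name get_relevant_documents() is taken from the text. Plain strings stand in for Documents.

```python
from typing import List, Protocol


class Retriever(Protocol):
    """Anything with a get_relevant_documents() method counts as a retriever here."""

    def get_relevant_documents(self, query: str) -> List[str]: ...


class KeywordRetriever:
    """Hypothetical retriever: ranks texts by how many query words they contain."""

    def __init__(self, texts: List[str], k: int = 2):
        self.texts = texts
        self.k = k

    def get_relevant_documents(self, query: str) -> List[str]:
        words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda t: len(words & set(t.lower().split())),
            reverse=True,
        )
        return ranked[: self.k]


def answer_sources(retriever: Retriever, query: str) -> List[str]:
    # Works with any object exposing get_relevant_documents().
    return retriever.get_relevant_documents(query)


texts = ["task decomposition with prompts", "long term memory", "tool use by agents"]
print(answer_sources(KeywordRetriever(texts, k=1), "task decomposition"))
```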

One common way to improve on vector similarity search is to generate several variants of the input question, which MultiQueryRetriever does with an LLM:

import logging
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=ChatOpenAI(temperature=0)
)
unique_docs = retriever_from_llm.get_relevant_documents(query=question)
len(unique_docs)
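
The name unique_docs hints at what happens after the query variants are generated: each variant is searched separately and the results are merged without duplicates. The sketch below illustrates that merge step only, under stated assumptions: the variants are hard-coded instead of written by an LLM, `keyword_search` is a hypothetical word-overlap stand-in for the base retriever, and the sample texts are invented.

```python
def keyword_search(query, texts, k=2):
    """Stand-in base retriever: rank texts by shared words with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(t.lower().split())), t) for t in texts]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = [t for score, t in ranked if score > 0]
    return hits[:k]


def multi_query_search(query_variants, texts, k=2):
    """Run every variant through the base retriever and keep the unique union, in first-seen order."""
    unique = []
    for variant in query_variants:
        for hit in keyword_search(variant, texts, k):
            if hit not in unique:
                unique.append(hit)
    return unique


texts = [
    "chain of thought prompting splits a task into steps",
    "tree of thoughts explores several reasoning paths",
    "vector stores hold embeddings",
]
# In the real retriever an LLM writes these variants; here they are hard-coded.
variants = ["how to split a task into steps", "ways of exploring reasoning paths"]
toy_unique_docs = multi_query_search(variants, texts)
print(len(toy_unique_docs))  # 2
```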

In addition, a useful concept for improving retrieval is decoupling the documents from the embedded search key.

For example, we can embed a document summary or a question that is likely to lead to the document being retrieved.

See details here on the multi-vector retriever for this purpose.

mv.png
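
The decoupling idea can be sketched without any library: search runs over short keys (summaries or likely questions), and each key points back to the full document that is actually returned. Everything below is a hypothetical illustration, including the ids, the sample texts, and the word-overlap matching, and it is not the multi-vector retriever's implementation.

```python
import re


def words(text):
    """Lowercase word set used for naive matching."""
    return set(re.findall(r"[a-z]+", text.lower()))


# Full documents, keyed by a hypothetical id.
documents = {
    "doc-1": "A long passage describing planning, subgoals, and reflection in agents.",
    "doc-2": "A long passage describing short-term and long-term memory in agents.",
}
# Search keys: short summaries (or likely questions) that point back to a document id.
search_keys = [
    ("how agents break tasks into subgoals", "doc-1"),
    ("how agents store and recall memory", "doc-2"),
]


def retrieve_parent(query):
    """Match the query against the search keys, then return the full parent document."""
    q = words(query)
    _, doc_id = max(search_keys, key=lambda pair: len(q & words(pair[0])))
    return documents[doc_id]


print(retrieve_parent("How do agents recall memory?"))
```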

Step 5. Generate

Distill the retrieved documents into an answer using an LLM/Chat model (e.g., gpt-3.5-turbo).

We use the Runnable protocol to define the chain.

The Runnable protocol pipes components together in a transparent way.

We used a prompt for RAG that is checked into the LangChain prompt hub (here).

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

from langchain.schema.runnable import RunnablePassthrough

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | rag_prompt | llm

rag_chain.invoke("What is Task Decomposition?")
    AIMessage(content='Task decomposition is the process of breaking down a task into smaller subgoals or steps. It can be done using simple prompting, task-specific instructions, or human inputs.')
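
To see how the `|` operator in the chain above can wire components together, here is a toy version built on Python's `__or__` hook. This is a sketch of the general idea rather than LangChain's implementation: `Step`, `parallel`, and the fake retriever, prompt, and LLM are all hypothetical, and `parallel` only mimics the way the leading dict feeds the same input to several components.

```python
class Step:
    """Toy runnable: wraps a function and supports `|` composition."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # self | other: feed this step's output into the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**steps):
    """Run several steps on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in steps.items()})


fake_retriever = Step(lambda question: ["Decomposition splits a task into steps."])
passthrough = Step(lambda value: value)
fake_prompt = Step(lambda d: f"Context: {' '.join(d['context'])}\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt: prompt.upper())  # placeholder for a model call

chain = parallel(context=fake_retriever, question=passthrough) | fake_prompt | fake_llm
print(chain.invoke("What is task decomposition?"))
```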

Go deeper

Choosing LLMs

  • Browse the > 90 LLM and chat model integrations here.
  • See further documentation on LLMs and chat models here.
  • See a guide on local LLMs here.

Customizing the prompt

As shown above, we can load prompts (e.g., this RAG prompt) from the prompt hub.

The prompt can also be easily customized, as shown below.

from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
rag_prompt_custom = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()} | rag_prompt_custom | llm
)

rag_chain.invoke("What is Task Decomposition?")
    AIMessage(content='Task decomposition is the process of breaking down a complicated task into smaller, more manageable subtasks or steps. It can be done using prompts, task-specific instructions, or human inputs. Thanks for asking!')
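
It can help to see what the filled-in prompt looks like before it reaches the model. The sketch below approximates the substitution with Python's built-in str.format, on the assumption that the {context} and {question} placeholders behave like ordinary format fields; `fill_prompt`, the shortened template, and the sample chunks are hypothetical.

```python
toy_template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def fill_prompt(template, context_chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    return template.format(context="\n\n".join(context_chunks), question=question)


filled = fill_prompt(
    toy_template,
    ["Chunk about chain of thought.", "Chunk about tree of thoughts."],
    "What is Task Decomposition?",
)
print(filled)
```

Printing the filled prompt this way is a quick check that the retrieved context lands where the template expects it.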

We can use LangSmith to see the trace.