Building Production-Ready RAG Applications with LangChain and Next.js
By Wasif Ali
Introduction
Retrieval-Augmented Generation (RAG) has emerged as the standard architecture for building domain-specific LLM applications. Instead of fine-tuning large models, RAG retrieves relevant information from a knowledge base and injects it into the LLM's context window.
This guide explores how to build a production-grade RAG pipeline using LangChain, vector databases, and Next.js App Router.
- 1. The RAG Architecture
- 2. Setting up the Next.js API Route
- 3. Data Ingestion: Chunking Strategies
- 4. Query Transformations
- Conclusion
1. The RAG Architecture
A standard RAG architecture consists of two main phases:
- Ingestion Phase: Documents are loaded, chunked, embedded using an embedding model (like text-embedding-3-small), and indexed in a Vector Database (e.g., Pinecone, Qdrant).
- Retrieval & Generation Phase: A user query is embedded, relevant chunks are retrieved via semantic search, and an LLM generates an answer based on the retrieved context.
2. Setting up the Next.js API Route
Let's create a Next.js Server Action or API Route to handle the LangChain pipeline. We'll use the official @langchain/openai and @langchain/core packages.
Install Dependencies
```bash
npm install @langchain/openai @langchain/core @langchain/pinecone langchain @pinecone-database/pinecone
```
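The route below reads its credentials from environment variables. Assuming you're using OpenAI and Pinecone, your .env.local would look something like this (the variable names match what the code references; the values are placeholders):

```bash
# .env.local — placeholder values, replace with your own keys
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=pc-...
PINECONE_INDEX=my-rag-index
```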
Creating the Retrieval Chain (app/api/chat/route.ts)
```ts
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const currentMessageContent = messages[messages.length - 1].content;

  // 1. Initialize the vector store (the Pinecone client reads PINECONE_API_KEY from the environment)
  const pinecone = new Pinecone();
  const pineconeIndex = pinecone.Index(process.env.PINECONE_INDEX!);
  const vectorStore = await PineconeStore.fromExistingIndex(
    new OpenAIEmbeddings(),
    { pineconeIndex }
  );

  // 2. Set up the LLM
  const model = new ChatOpenAI({
    modelName: "gpt-4-turbo-preview",
    temperature: 0,
  });

  // 3. Create the chains
  const prompt = ChatPromptTemplate.fromTemplate(`
Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}
`);

  const documentChain = await createStuffDocumentsChain({
    llm: model,
    prompt,
  });

  const retrievalChain = await createRetrievalChain({
    combineDocsChain: documentChain,
    retriever: vectorStore.asRetriever({ k: 4 }), // Retrieve the top 4 chunks
  });

  // 4. Invoke the chain and return the generated answer
  const response = await retrievalChain.invoke({
    input: currentMessageContent,
  });

  return Response.json({ text: response.answer });
}
```
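On the client, calling this route is a plain fetch. The askQuestion helper below is a hypothetical glue sketch, not part of LangChain, and the message shape is an assumption you should adapt to your chat UI:

```ts
// Minimal client-side call to the route above (message shape is illustrative)
async function askQuestion(question: string): Promise<string> {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [{ role: "user", content: question }],
    }),
  });
  const { text } = await res.json();
  return text;
}
```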
3. Data Ingestion: Chunking Strategies
The quality of a RAG system heavily depends on how data is chunked during ingestion.
A standard approach in LangChain utilizes the RecursiveCharacterTextSplitter:
```ts
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
// PDFLoader depends on the pdf-parse package (npm install pdf-parse)
import { PDFLoader } from "langchain/document_loaders/fs/pdf";

const loader = new PDFLoader("data/knowledge_base.pdf");
const docs = await loader.load();

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200, // Overlap maintains context across chunk boundaries
});

const splitDocs = await textSplitter.splitDocuments(docs);
```
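The route in section 2 reads from an existing index, so ingestion needs to write to that same index first. A minimal sketch using PineconeStore.fromDocuments, which embeds each chunk and upserts it in one call (assumes the same environment variables as the route):

```ts
import { OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone();
const pineconeIndex = pinecone.Index(process.env.PINECONE_INDEX!);

// Embed the split documents and upsert them into the index in one step
await PineconeStore.fromDocuments(splitDocs, new OpenAIEmbeddings(), {
  pineconeIndex,
});
```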
Advanced Chunking
For production systems, consider Semantic Chunking. Instead of splitting blindly by character count, semantic chunkers analyze the embeddings of sentences to split documents at logical boundaries (like paragraphs or topic changes).
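The heuristic is simple to sketch by hand: embed consecutive sentences and start a new chunk wherever the cosine similarity between neighbors drops below a threshold. The sketch below is an illustration of the idea, not a library API; the regex sentence splitter and the 0.8 threshold are assumptions to tune for your corpus:

```ts
import { OpenAIEmbeddings } from "@langchain/openai";

const cosine = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

// Split text at points where adjacent sentences are semantically dissimilar
async function semanticChunk(text: string, threshold = 0.8): Promise<string[]> {
  const sentences = text.split(/(?<=[.!?])\s+/); // naive sentence splitter
  const embeddings = new OpenAIEmbeddings();
  const vectors = await embeddings.embedDocuments(sentences);

  const chunks: string[] = [];
  let current: string[] = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    if (cosine(vectors[i - 1], vectors[i]) < threshold) {
      chunks.push(current.join(" ")); // similarity dropped: likely topic boundary
      current = [];
    }
    current.push(sentences[i]);
  }
  chunks.push(current.join(" "));
  return chunks;
}
```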
4. Query Transformations
Often, a user's raw query isn't optimized for vector search. Using Query Transformations can drastically improve retrieval recall.
- Multi-Query Retrieval: Using an LLM to generate multiple variations of the query and combining the retrieved results.
- HyDE (Hypothetical Document Embeddings): Using an LLM to generate a hypothetical answer to the query, then embedding that answer instead of the raw query to search the vector database (sketched below).
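HyDE is straightforward to prototype with the pieces already in place. A minimal sketch, assuming the PineconeStore from the route in section 2; the prompt wording and the k of 4 are illustrative choices:

```ts
import { ChatOpenAI } from "@langchain/openai";
import type { PineconeStore } from "@langchain/pinecone";

// HyDE: answer the question hypothetically, then search with that answer's embedding
async function hydeSearch(vectorStore: PineconeStore, query: string) {
  const model = new ChatOpenAI({ modelName: "gpt-4-turbo-preview", temperature: 0 });

  const hypothetical = await model.invoke(
    `Write a short, plausible passage that answers the question below. ` +
      `Do not hedge; just write the passage.\n\nQuestion: ${query}`
  );

  // Search the index with the hypothetical answer instead of the raw query
  return vectorStore.similaritySearch(hypothetical.content as string, 4);
}
```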
Conclusion
Building RAG pipelines in Next.js with LangChain provides a robust framework for enterprise AI applications. By focusing on chunking strategies and advanced retrieval mechanisms, you can significantly reduce hallucinations and build trustworthy AI workflows.
Need Help?
Looking to integrate advanced RAG systems or AI copilots into your infrastructure? Let NeutronLabs architect intelligent workflows for your teams.