Exercise 02: Retrieval-Augmented Generation (RAG)
In this exercise, we will implement a RAG system that augments the language model with document retrieval.
The embedding model is also deployed using llama.cpp and exposes a slightly different API on port 8081.
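As a starting point, here is a minimal sketch of how the embedding endpoint could be called. It assumes the server exposes the OpenAI-compatible `/v1/embeddings` route on port 8081 and returns the usual `data[0].embedding` shape; if your llama.cpp deployment uses a different path or payload, adjust accordingly.

```python
# Minimal sketch: fetch an embedding from the llama.cpp embedding server.
# Assumption: the server exposes the OpenAI-compatible /v1/embeddings route
# on port 8081. Adjust the URL and payload if your deployment differs.
import requests

def embed(text: str) -> list[float]:
    response = requests.post(
        "http://localhost:8081/v1/embeddings",
        json={"input": text},
    )
    response.raise_for_status()
    return response.json()["data"][0]["embedding"]
```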
A RAG system is implemented as follows (a sketch of the full pipeline appears after the list):
- Calculate embeddings for the documents in the knowledge base.
- Store the document embeddings in a vector database (for simplicity, we will use an in-memory vector store).
- Calculate the embedding of the user query.
- Retrieve the most similar documents from the knowledge base by comparing the query embedding against the stored document embeddings, using a metric such as cosine similarity.
- Pass the retrieved documents as context to the language model and generate a response.
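The following is a minimal, hedged sketch of this pipeline, not the exercise's reference solution. It reuses the `embed()` helper from the previous snippet, uses plain Python lists instead of a real vector database, and assumes the chat model is served on port 8080 with an OpenAI-compatible `/v1/chat/completions` route; the helper names (`InMemoryVectorStore`, `cosine_similarity`, `answer`) are illustrative.

```python
# Sketch of the RAG pipeline described above. Depends on the embed() helper
# from the previous snippet. Assumption: the chat model is served on port
# 8080 with an OpenAI-compatible /v1/chat/completions route.
import requests

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

class InMemoryVectorStore:
    """Stores (document, embedding) pairs and retrieves by cosine similarity."""

    def __init__(self) -> None:
        self.docs: list[str] = []
        self.embeddings: list[list[float]] = []

    def add(self, document: str) -> None:
        # Embed each document once at insertion time.
        self.docs.append(document)
        self.embeddings.append(embed(document))

    def search(self, query: str, k: int = 3) -> list[str]:
        # Score every stored document against the query embedding
        # and return the k most similar ones.
        q = embed(query)
        scored = [(cosine_similarity(q, e), d)
                  for e, d in zip(self.embeddings, self.docs)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:k]]

def answer(store: InMemoryVectorStore, query: str) -> str:
    # Retrieve relevant documents and pass them to the chat model as context.
    context = "\n".join(store.search(query))
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "system",
                 "content": f"Answer using this context:\n{context}"},
                {"role": "user", "content": query},
            ]
        },
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```

Cosine similarity is a common choice here because it compares the direction of embedding vectors rather than their magnitude; for a larger knowledge base you would typically switch to numpy for the math and a proper vector database for storage.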
Here are some example documents that you can add to the database and then ask questions about:
- The secret code to access the project is 'quantum_leap_42'.
- Alice is the lead engineer for the new 'Orion' feature.
- The project deadline has been moved to next Friday.
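For instance, under the same assumptions as the sketch above, you could load these example documents and query them like this:

```python
# Hypothetical usage of the sketch above with the example documents.
store = InMemoryVectorStore()
for fact in [
    "The secret code to access the project is 'quantum_leap_42'.",
    "Alice is the lead engineer for the new 'Orion' feature.",
    "The project deadline has been moved to next Friday.",
]:
    store.add(fact)

print(answer(store, "Who leads the Orion feature?"))
```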
For this exercise, solve TODO 2 to implement the document retrieval logic.