Exercise 02: Retrieval-Augmented Generation (RAG)

In this section, we will implement a RAG system that combines the language model with a document retrieval system.

The embeddings model is also deployed using llama.cpp and exposes a slightly different API on port 8081.
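As a rough illustration of talking to the embeddings server, the sketch below builds a request and parses the reply. Only port 8081 comes from this exercise. The host (localhost), the OpenAI-style /v1/embeddings route, the {"input": ...} request body, and the data[0].embedding response shape are assumptions, and the helper names are hypothetical; check them against the server you are actually running.

```python
import json
import urllib.request

# Assumption: host and route are guesses; only port 8081 is given in the exercise.
EMBEDDINGS_URL = "http://localhost:8081/v1/embeddings"


def build_embedding_request(text: str, url: str = EMBEDDINGS_URL) -> urllib.request.Request:
    """Build a POST request that carries the text to embed as a JSON body."""
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding_response(payload: dict) -> list[float]:
    """Extract the first embedding vector, assuming an OpenAI-style response body."""
    return [float(x) for x in payload["data"][0]["embedding"]]


def embed(text: str) -> list[float]:
    """Send the text to the embeddings server and return its vector."""
    with urllib.request.urlopen(build_embedding_request(text), timeout=30) as resp:
        return parse_embedding_response(json.load(resp))
```

Keeping request building and response parsing separate from the network call makes it easy to adjust either one if the deployed API turns out to use a different route or payload.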

A RAG system is implemented as follows:

1. Calculate embeddings for the documents in the knowledge base.
2. Store the document embeddings in a vector database (for simplicity, we will use an in-memory vector store).
3. Calculate the embedding of the user query.
4. Retrieve the most similar documents from the knowledge base by comparing the query embedding with the stored document embeddings, using a metric such as cosine similarity.
5. Pass the retrieved documents as context to the language model and generate a response.
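The steps above can be sketched with a small in-memory store. This is a minimal illustration, not the exercise solution: the names InMemoryVectorStore, cosine_similarity, and build_prompt are hypothetical, the prompt wording is an assumption, and embed_fn stands for any function that maps a string to a vector (for example, one that calls the embeddings server on port 8081).

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Return the cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical store: keeps (document, embedding) pairs in a Python list."""

    def __init__(self, embed_fn: Callable[[str], Vector]) -> None:
        self._embed = embed_fn
        self._items: List[Tuple[str, Vector]] = []

    def add(self, document: str) -> None:
        # Embed the document once and keep the vector next to the text.
        self._items.append((document, self._embed(document)))

    def search(self, query: str, k: int = 2) -> List[str]:
        # Embed the query, then rank every stored document by similarity to it.
        query_vec = self._embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vec, item[1]),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str, context_docs: List[str]) -> str:
    """Assemble a prompt that places the retrieved documents before the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

A linear scan over all stored vectors is fine for a handful of documents. A dedicated vector database becomes worthwhile only when the knowledge base grows large enough that scoring every document per query is too slow.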

Here are some examples that you can add to the database and then ask questions about:

For this exercise, solve TODO 2 to implement the document retrieval logic.