Retrieval-Augmented Generation (RAG) Using Large Language Models (LLMs) for Vector-Based Document Database Integration

RAG extends the capabilities of LLMs, building on the following core ideas and observations:

1) LLMs have limitations, such as (cf. Gao et al. 2023, p. 1)

  • hallucination
  • outdated knowledge
  • opaque reasoning processes
  • biases and toxicity

2) RAG is a promising solution to these challenges by (cf. Gao et al. 2023, p. 1)

  • incorporating external knowledge bases,
  • enhancing the models’ accuracy and credibility, particularly for knowledge-intensive tasks,
  • allowing continuous updates and the integration of domain-specific information, leveraging the synergy between LLMs’ intrinsic knowledge and external databases.

3) Workflow of RAG:

RAG involves (cf. Gao et al. 2023, p. 1)

  • querying external data sources,
  • integrating retrieved information into prompts,
  • and generating informed responses.
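
The three workflow steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline described by Gao et al.: the token-overlap scorer stands in for a real retriever, and `generate` is a hypothetical placeholder for any callable that wraps an LLM. All function names are assumptions made for this example.

```python
from collections import Counter
from typing import Callable, List


def overlap_score(query: str, document: str) -> int:
    """Count shared lowercase tokens (a toy stand-in for a real retriever)."""
    q_tokens = Counter(query.lower().split())
    d_tokens = Counter(document.lower().split())
    return sum((q_tokens & d_tokens).values())


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Step 1: query the external data source and keep the k best matches."""
    ranked = sorted(corpus, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Step 2: integrate the retrieved passages into the prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def rag_answer(query: str, corpus: List[str],
               generate: Callable[[str], str], k: int = 2) -> str:
    """Step 3: let the generator (hypothetical LLM wrapper) answer from the prompt."""
    return generate(build_prompt(query, retrieve(query, corpus, k)))
```

Keeping `generate` as a parameter means the retrieval and prompt-building steps can be tried out with a dummy function before any model is connected.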

4) Technological Enhancements in RAG Components

These include (cf. Gao et al. 2023, p. 1)

  • data indexing,
  • retrieval precision,
  • generation quality,
  • augmentation methods,
  • optimizing embedding models for relevance, and
  • innovative post-retrieval strategies.

5) RAG Research Paradigms

Gao et al. (2023) distinguish three paradigms:

  • Naive RAG
  • Advanced RAG
  • Modular RAG

Modular RAG introduces new modules and patterns, such as

  • search,
  • memory,
  • fusion,
  • routing,
  • predict, and
  • task adapter modules,

illustrating the flexibility and adaptability of RAG to various applications (cf. Gao et al. 2023, p. 1).

6) Evaluation and Future Directions

This category addresses (cf. Gao et al. 2023, p. 1)

  1. metrics and benchmarks for assessing RAG models, along with current challenges,
  2. the expansion into multi-modal settings,
  3. the development of the RAG infrastructure and ecosystem, and
  4. the need to enhance semantic representations and to align both the semantic spaces of queries and documents and the retriever’s output with the LLM.

7) Enhancing Semantic Representations

The accuracy of semantic representations in RAG is pivotal, involving

  1. optimal chunking of documents for efficient retrieval,
  2. fine-tuning of embedding models to capture domain-specific knowledge precisely,
  3. techniques such as sliding windows for layered retrieval, and
  4. abstract embedding, which prioritizes top retrievals based on summaries or abstracts (cf. Gao et al. 2023, p. 10).
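
To make the chunking and sliding-window ideas concrete, here is a minimal sketch that splits a document into overlapping word windows. The function name, the word-based unit, and the default `window` and `stride` values are assumptions chosen for illustration; they are not recommendations taken from Gao et al.

```python
from typing import List


def sliding_window_chunks(text: str, window: int = 50, stride: int = 25) -> List[str]:
    """Split text into overlapping chunks of whitespace-separated words.

    Consecutive chunks share (window - stride) words when stride < window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    words = text.split()
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

A smaller stride gives more overlap between neighboring chunks, which lowers the risk of cutting a relevant passage in half at the cost of a larger index.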

8) Aligning Queries and Documents

These methods encompass (cf. Gao et al. 2023, p. 10)

  • aligning the semantic spaces of queries and documents through query rewriting and embedding transformation,
  • generating pseudo-documents, and
  • employing adapters after query encoding to optimize the representation of query embeddings, which improves retrieval relevance and addresses the challenge of aligning structured and unstructured data.

9) Aligning Retriever and LLM

To improve the final output of RAG models, it is crucial to fine-tune retrievers according to LLM preferences. This involves using feedback signals from the LLM to refine the retrieval model and employing adapters to aid alignment (cf. Gao et al. 2023, p. 10).

Model Formulation

First, let’s outline the foundational concepts of RAG, including the key elements of the RAG framework (cf. Li et al. 2022, p. 1ff.):

  • Source of retrieval
  • Metrics for retrieval
  • Methods for integrating retrieved information

The basis for many text generation activities is the transformation of an input sequence x into an output sequence y, symbolized as

y = f(x)

This transformation might represent various tasks such as converting a series of dialogue interactions into a fitting reply, translating text from a source to a target language, among others. Recent advancements have introduced the concept of equipping models with the ability to draw upon an external memory using information retrieval techniques, enriching the information available during the generation process (cf. Li et al. 2022, p. 1ff.).

The RAG model extends this concept to

y = f(x, z),

where

z = {(xr, yr)}

represents a collection of relevant instances retrieved either from the training dataset or from external sources.

The principle here is that the generation of y could be enhanced if xr (or yr) bears similarity (or relevance) to the input x. It’s important to note that xr can be null when leveraging unsupervised retrieval sources (cf. Li et al. 2022, p. 1ff.).

Generally, sources for retrieval memory can be the training dataset, external datasets formatted similarly to the training dataset, or vast unsupervised corpora (cf. Li et al. 2022, §2.2).

There are also diverse metrics to assess text relevance, which fall into sparse-vector retrieval, dense-vector retrieval, and training-based retrieval (cf. Li et al. 2022, §2.3).

Additionally, the strategy for integrating retrieval memory into the generation model matters; Li et al. (2022, §2.4) discuss several prevalent techniques.
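
The formulation y = f(x, z) can be illustrated with a small sketch that builds z from a memory of (xr, yr) pairs and hands it to a generator by simple concatenation. Everything here is an assumption made for illustration: the Jaccard token overlap is only a lexical stand-in for sparse-vector metrics such as TF-IDF or BM25, concatenation is just one possible integration strategy, and the bracketed `[input]`/`[output]` markers are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (x_r, y_r): a stored input and its known output


def jaccard(a: str, b: str) -> float:
    """Token-set overlap, a simple lexical relevance score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def retrieve_memory(x: str, memory: List[Pair], k: int = 1) -> List[Pair]:
    """Build z: the k pairs whose x_r is most similar to the input x."""
    return sorted(memory, key=lambda pair: jaccard(x, pair[0]), reverse=True)[:k]


def augment_input(x: str, z: List[Pair]) -> str:
    """Integrate z by concatenation so that a generator can compute y = f(x, z)."""
    examples = " ".join(f"[input] {xr} [output] {yr}" for xr, yr in z)
    return f"{examples} [input] {x} [output]".strip()
```

With an empty z the augmented input reduces to the plain input, which corresponds to the unaugmented case y = f(x).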

Mathematical Model of RAG Interfacing with Vector-Based Models

Given the context of RAG (Retrieval-Augmented Generation) interfacing with vector-based models, let’s delineate a detailed mathematical model that explicates how RAG can extend and enhance the capabilities of such models for text generation tasks:

  1. Vector Representation of Documents and Questions
    • First, every document Dk in the dataset, including both the oracle document D* and the distractor documents Di, is represented as a high-dimensional vector using a vector-based model. Similarly, a question Q is transformed into a vector representation:
    • V(Dk) = vec(Dk), where vec(·) is the function that converts text to vector space.
    • V(Q) = vec(Q)
  2. Retrieval Process
    • The retrieval process involves calculating the similarity between V(Q) and each V(Dk) to identify the most relevant documents.
    • The similarity can be measured using metrics such as cosine similarity:
      • Sim(V(Q), V(Dk)) = (V(Q) · V(Dk)) / (||V(Q)|| ||V(Dk)||)
  3. Fine-Tuning with Selected Documents
    • Using the selected documents, the model is fine-tuned to generate the answer A* from the provided question Q and documents Dk. The fine-tuning process involves optimizing a loss function L that measures the discrepancy between the generated answer Á and the true answer A*:
      • The model generates an answer Á based on Q and the retrieved documents:
        • Á = Model(Q, Dk)
      • The loss function L(Á, A*) is minimized during training.
  4. Extending the Model
    • To extend the RAG model, we incorporate an iterative refinement process in which the model updates its understanding and selection of relevant documents based on intermediate generation results:
    • After an initial generation, the model revises Q or augments its context with additional information based on Á.
    • The retrieval and generation process may be repeated, refining the selection of Dk and the generation of Á until a stopping criterion is met.
  5. Evaluation in Vector Space
    • Post-generation, the quality of Á can be evaluated by projecting both Á and A* into the vector space and measuring their similarity, which offers a quantifiable metric for assessing how contextually and factually accurate the generated responses are.
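
Steps 1, 2, and 5 of the model above can be sketched as follows. This is an illustration under stated assumptions: `vec` uses plain term counts as a toy replacement for a learned embedding model, and the function names are hypothetical. The fine-tuning and iterative refinement steps are left out because they depend on the concrete model.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def vec(text: str) -> Vector:
    """Toy vec(.): term counts. A real system would use a learned embedding model."""
    return dict(Counter(text.lower().split()))


def cosine(u: Vector, v: Vector) -> float:
    """Sim(u, v) = (u . v) / (||u|| ||v||); returns 0.0 if either norm is zero."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(question: str, documents: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Step 2: rank every D_k by Sim(V(Q), V(D_k)) and keep the k best."""
    v_q = vec(question)
    scored = [(cosine(v_q, vec(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def answer_similarity(generated: str, reference: str) -> float:
    """Step 5: compare the generated answer with A* in the same vector space."""
    return cosine(vec(generated), vec(reference))
```

Because `cosine` normalizes by both vector lengths, a long document is not favored over a short one merely for containing more words.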

Retrieval Augmented Fine Tuning (RAFT)

RAFT proposes a specialized strategy for preparing fine-tuning data, aimed at customizing models for domain-specific open-book scenarios, paralleling the in-domain Retrieval-Augmented Generation (RAG) approach. At the heart of RAFT is the structured composition of training data, encompassing a question Q, a set of documents Dk, and a Chain-of-Thought (CoT) style answer A* derived from one or more ‘oracle’ documents D*, alongside ‘distractor’ documents Di that lack pertinent answer information. Note that the ‘oracle’ may consist of multiple documents, as in HotpotQA (cf. Zhang et al. 2024, p. 3).

The data preparation method within RAFT specifies that for P% of the questions qi in the dataset, the oracle document d*i is included together with k – 1 distractor documents. For the remaining (1 – P)% of the questions, no oracle document is included, only k distractor documents. The model is then trained with standard supervised training (SFT) to formulate answers based on the questions and documents provided. The core design principle of RAFT is to enhance the model’s RAG performance within the trained document set, focusing on in-domain efficiency. Excluding oracle documents from part of the training data incentivizes the model to memorize answers rather than infer them from context (cf. Zhang et al. 2024, p. 3).

The RAFT training data is mathematically outlined as follows (cf. Zhang et al. 2024, p. 3):

  • For P% of the data: Q + D* + D2 + … + Dk -> A*
  • For (1 – P)% of the data: Q + D1 + D2 + … + Dk -> A*
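
The two cases above can be sketched as a small data-preparation helper. This is an illustration, not the reference implementation by Zhang et al.: the function and field names are hypothetical, `p` is given as a fraction between 0 and 1, the oracle is drawn per example with probability `p` (which matches the P% split only in expectation), and shuffling the document order is an assumption of this sketch.

```python
import random
from typing import Dict, List, Optional


def build_raft_example(
    question: str,
    oracle_doc: str,
    distractor_pool: List[str],
    cot_answer: str,
    k: int,
    p: float,
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Assemble one training example whose context holds k documents.

    With probability p the context is the oracle document plus k - 1
    distractors; otherwise it is k distractors and no oracle document.
    """
    rng = rng or random.Random()
    include_oracle = rng.random() < p
    n_distractors = k - 1 if include_oracle else k
    if n_distractors > len(distractor_pool):
        raise ValueError("not enough distractor documents in the pool")
    docs = rng.sample(distractor_pool, n_distractors)
    if include_oracle:
        docs.append(oracle_doc)
    rng.shuffle(docs)  # assumption: the position of the oracle is randomized
    return {
        "question": question,
        "documents": docs,
        "answer": cot_answer,  # the CoT-style answer A* is the target in both cases
        "has_oracle": include_oracle,
    }
```

Passing a seeded `random.Random` instance makes the generated training set reproducible.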

During testing, the model is supplied with the question Q and the top-k documents retrieved by the RAG process, which makes RAFT independent of the retriever used. This structured approach is intended to prepare the model to generate precise, domain-specific answers in open-book settings.

Retrieval Sources

Retrieval-augmented models derive external memory from various sources, in particular the training corpus, external datasets, and unsupervised data (cf. Li et al. 2022, p. 1ff.).

Training Corpus

Numerous investigations have utilized the training corpus as a source for external memory. The primary goal of these endeavors is to encapsulate knowledge not just within the parameters of the model but in a format that is both explicit and retrievable, thereby enabling the model to access this information again during inference (cf. Li et al. 2022, p. 1ff.).

External Data

Researchers suggest retrieving pertinent examples from datasets external to the training corpus. The pool of data for retrieval in these cases differs from the training corpus and can introduce extra insights not found within the training data. This approach is notably advantageous for tasks that require domain adaptation or the incorporation of new knowledge. Instances of this method include utilizing datasets specific to a domain as external memory to facilitate rapid domain adaptation in machine translation tasks (cf. Li et al. 2022, p. 1ff.).

Unsupervised Data

A significant drawback of the aforementioned sources is their reliance on supervised datasets, that is, datasets consisting of aligned input-output pairs. To address this, Cai et al. (2021) introduced a strategy for machine translation involving a cross-lingual retriever that can directly pull target sentences from an unsupervised corpus, specifically a monolingual corpus in the target language (cf. Li et al. 2022, p. 1ff.).


Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., … & Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

Li, H., Su, Y., Cai, D., Wang, Y., & Liu, L. (2022). A survey on retrieval-augmented text generation. arXiv preprint arXiv:2202.01110.

Zhang, T., Patil, S. G., Jain, N., Shen, S., Zaharia, M., Stoica, I., & Gonzalez, J. E. (2024). RAFT: Adapting Language Model to Domain Specific RAG. arXiv preprint arXiv:2403.10131.

Zhou, P., Pujara, J., Ren, X., Chen, X., Cheng, H. T., Le, Q. V., … & Zheng, H. S. (2024). Self-discover: Large language models self-compose reasoning structures. arXiv preprint arXiv:2402.03620.