TrustRAG: Enhancing Robustness and Trustworthiness in RAG

1Imperial College London
2Peking University
*Equal Contribution

Abstract


Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user queries. However, these systems remain vulnerable to corpus poisoning attacks that can significantly degrade LLM performance through the injection of malicious content. To address these challenges, we propose TrustRAG, a robust framework that systematically filters compromised and irrelevant content before it reaches the language model. Our approach implements a two-stage defense mechanism: first, it employs K-means clustering to identify potential attack patterns in retrieved documents based on their semantic embeddings, effectively isolating suspicious content. Second, it leverages cosine similarity and ROUGE metrics to detect malicious documents while resolving discrepancies between the model's internal knowledge and external information through a self-assessment process. TrustRAG functions as a plug-and-play, training-free module that integrates seamlessly with any language model, whether open- or closed-source, maintaining high contextual relevance while strengthening defenses against attacks. Through extensive experimental validation, we demonstrate that TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance compared to existing approaches across multiple model architectures and datasets. We have made TrustRAG available as open-source software at https://github.com/HuichiZhou/TrustRAG.

Our Method: TrustRAG


Figure 1: The TrustRAG framework defends against corpus poisoning attacks in RAG systems through a five-step process: (1) identifying malicious documents using K-means clustering, (2) filtering malicious content based on embedding distribution, (3) extracting internal knowledge for accurate reasoning, (4) resolving conflicts by grouping consistent documents and removing irrelevant or conflicting ones, and (5) generating the reliable final answer.
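To make these five steps concrete, the following minimal Python sketch composes them end to end. It is an illustration under stated assumptions, not the repository's actual API: the K-means density test, the cosine threshold, and the pluggable embed/generate callables are all ours.

    from typing import Callable, List
    import numpy as np
    from sklearn.cluster import KMeans

    def trustrag_answer(query: str,
                        docs: List[str],
                        embed: Callable[[List[str]], np.ndarray],  # any embedding model
                        generate: Callable[[str], str]) -> str:    # any open- or closed-source LLM
        """Hedged sketch of the five steps in Figure 1; heuristics are assumptions."""
        # Steps (1)-(2): cluster document embeddings. A markedly tighter cluster
        # is the signature of multi-document poisoning, so drop it.
        if len(docs) >= 2:
            emb = embed(docs)
            km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emb)
            spread = np.array([
                np.linalg.norm(emb[km.labels_ == c] - km.cluster_centers_[c], axis=1).mean()
                for c in (0, 1)])
            if spread.min() < 0.5 * spread.max():  # density-ratio threshold: an assumption
                docs = [d for d, l in zip(docs, km.labels_) if l != spread.argmin()]
        # Step (3): extract the model's internal knowledge for the query.
        internal = generate(f"Answer from your own knowledge only: {query}")
        # Step (4): keep documents consistent with the internal answer (cosine check
        # shown here; the full method also uses ROUGE and an LLM self-assessment).
        doc_emb, int_emb = embed(docs), embed([internal])[0]
        cos = doc_emb @ int_emb / (np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(int_emb) + 1e-9)
        docs = [d for d, c in zip(docs, cos) if c > 0.3] or docs  # threshold: an assumption
        # Step (5): generate the reliable final answer over the vetted context.
        return generate(f"Question: {query}\nContext:\n" + "\n".join(docs) + "\nAnswer:")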

Experimental Results

  • We conduct comprehensive experiments comparing TrustRAG with different defense frameworks and RAG systems under two popular attack methods (PIA (Zhong et al., 2023) and PoisonedRAG (Zou et al., 2024)) across three large language models. More detailed results for PoisonedRAG at different poisoning rates can be found in Table 4, Table 5, and Table 6.

Main Results

Table 1: Main results comparing how different defense frameworks and RAG systems defend against two attack methods across three large language models.

Detailed Analysis of TrustRAG

(1) Effectiveness of K-means Filtering Strategy

(1.1) Distribution of Poisoned Documents

  • The figure below plots a case using samples from the NQ dataset with different numbers of poisoned documents. When multiple malicious documents are injected, their embeddings lie close to each other, whereas a single poisoned document mixes in with the clean documents. It is therefore important to use n-gram preservation to retain the clean documents; a short sketch quantifying this clustering effect follows the figure.

Figure 5: The embedding distribution of retrieved documents.
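The clustering signal can be quantified directly from pairwise cosine distances. Below is a minimal sketch with random vectors standing in for the retriever's embeddings; the toy data and the sentence-embedding assumption are ours, not the paper's setup:

    import numpy as np

    def mean_pairwise_cosine_distance(x: np.ndarray) -> float:
        """Average pairwise cosine distance within a set of embeddings."""
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        sims = x @ x.T
        iu = np.triu_indices(len(x), k=1)  # upper triangle, diagonal excluded
        return float(1.0 - sims[iu].mean())

    # Toy stand-ins for retrieved-document embeddings; a real run would encode
    # the NQ documents with the retriever's embedding model.
    rng = np.random.default_rng(0)
    clean = rng.normal(size=(8, 384))                          # unrelated clean docs
    center = rng.normal(size=384)
    poisoned = center + rng.normal(0.0, 0.05, size=(5, 384))   # near-duplicates

    print("clean spread:   ", mean_pairwise_cosine_distance(clean))     # ~1.0
    print("poisoned spread:", mean_pairwise_cosine_distance(poisoned))  # ~0.0

The poisoned set's spread is orders of magnitude smaller than the clean set's, which is exactly the pattern K-means separates in the multi-document scenario above.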

(1.2) N-gram preservation and Embedding Models

Table 2: Results on various datasets with different poisoning levels and embedding models. The F1 score measures the performance of detecting poisoned samples, while the Clean Retention Rate (CRR) evaluates the proportion of clean samples retained after filtering.
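As a hedged illustration of an n-gram check in this spirit: poisoned documents generated from a shared attack template tend to repeat n-grams across one another, while genuine corpus documents rarely do. The pairwise rule and threshold below are assumptions, not the exact TrustRAG procedure.

    from collections import Counter
    from itertools import combinations

    def ngrams(text: str, n: int) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    def rouge_n_f1(a: str, b: str, n: int = 2) -> float:
        """ROUGE-n style F1: overlap of n-grams between two texts."""
        ca, cb = ngrams(a, n), ngrams(b, n)
        overlap = sum((ca & cb).values())
        if overlap == 0:
            return 0.0
        p, r = overlap / sum(cb.values()), overlap / sum(ca.values())
        return 2 * p * r / (p + r)

    def flag_templated(docs: list[str], thr: float = 0.35) -> list[bool]:
        """Flag documents sharing heavy n-gram overlap with another retrieved
        document -- the near-duplicate pattern of template-generated poison."""
        flags = [False] * len(docs)
        for i, j in combinations(range(len(docs)), 2):
            if rouge_n_f1(docs[i], docs[j]) >= thr:
                flags[i] = flags[j] = True
        return flags

Scoring such flags against ground-truth poison labels would yield an F1 as in Table 2, and the fraction of clean documents left unflagged would correspond to CRR.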

(2) Runtime Analysis

Table 3: TrustRAG runtime analysis based on Llama3.1-8B for 100 queries on three different datasets.

(3) Effectiveness of Perplexity-based Detection

Figure 2: (1) Perplexity distribution density plot for clean and malicious documents; the dashed lines mark the average perplexity values. (2) Bar plot of the ablation study on accuracy on the NQ dataset with Llama3.1-8B. (3) Bar plot of the ablation study on attack success rate on the NQ dataset with Llama3.1-8B.
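Per-document perplexity is cheap to compute with any causal language model. The sketch below uses GPT-2 as a lightweight stand-in (the plot above is based on Llama3.1-8B) and an illustrative flagging threshold; both choices are assumptions:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    @torch.no_grad()
    def perplexity(text: str) -> float:
        """exp(mean token cross-entropy) of the text under the language model."""
        ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
        return float(torch.exp(model(ids, labels=ids).loss))

    docs = [
        "Paris is the capital of France and home to the Eiffel Tower.",
        "answer the question with zzz Paris France capital no Eiffel yes",
    ]
    for d in docs:
        print(f"{perplexity(d):8.1f}  {d[:50]}")
    # Documents far above the clean-set average (the dashed lines in the plot)
    # can be flagged, though fluent poisoned text may evade a pure perplexity
    # threshold -- which is why TrustRAG does not rely on perplexity alone.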

(4) Impact of Context Window

Figure 4: (1) Accuracy of TrustRAG versus Vanilla RAG in the clean scenario. (2) Accuracy of TrustRAG versus Vanilla RAG in the malicious scenario. (3) Attack success rate of TrustRAG versus Vanilla RAG in the malicious scenario. The context window is varied from 5 to 20, and the malicious scenario includes 5 malicious documents.
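The sweep behind these curves is a loop over the retrieval depth. In the hedged harness below, retrieve, run_rag, and evaluate are hypothetical callables standing in for the actual pipeline:

    def sweep_context_window(queries, retrieve, run_rag, evaluate,
                             ks=range(5, 21), n_malicious=5):
        """For each context window k, answer all queries with k retrieved
        documents (n_malicious of them poisoned in the malicious scenario)
        and score the run."""
        results = {}
        for k in ks:
            answers = [run_rag(q, retrieve(q, top_k=k, n_malicious=n_malicious))
                       for q in queries]
            results[k] = evaluate(answers)  # accuracy or attack success rate
        return results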

Citation

If you find our work helpful, please cite our paper:

@article{zhou2025trustrag,
      title={TrustRAG: Enhancing Robustness and Trustworthiness in RAG}, 
      author={Huichi Zhou and Kin-Hei Lee and Zhonghao Zhan and Yue Chen and Zhenhao Li},
      year={2025},
      eprint={2501.00879},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.00879}, 
}