Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

Shan Ning1,3, Longtian Qiu1, Xuming He1,2
ShanghaiTech University, Shanghai, China1,
Shanghai Engineering Research Center of Intelligent Vision and Imaging2,
Lingang Laboratory, Shanghai, China3
ICLR 2026

Abstract

Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. We propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Experiments on Encyclopedic VQA and InfoSeek demonstrate state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek.

Introduction

Knowledge-Based Visual Question Answering (KB-VQA) is a challenging multimodal task that requires answering questions about an image by integrating external knowledge. A widely adopted approach is the Retrieval-Augmented Generation (RAG) framework: a retriever first fetches relevant knowledge passages from a large-scale knowledge base (e.g., Wikipedia), and a generator then produces an answer conditioned on this context.

However, retrieval noise is inherent, and the knowledge base typically consists of structured, encyclopedic information. The model must therefore not only reason over noisy, imperfect external evidence but also comprehend retrieved content presented in a structured, encyclopedic form largely unseen during pretraining. These characteristics position KB-VQA as a challenging downstream task for pretrained MLLMs, one that demands robust reasoning ability and effective domain transfer.

Prior work has pursued two main directions: (1) improving retrieval quality, though retrieval remains inherently noisy; (2) enhancing reasoning to handle imperfect retrieval. Early efforts primarily relied on supervised fine-tuning (SFT), which may have limited reasoning robustness. More recent RL methods such as GRPO and DAPO have shown promise in general RAG settings, but their effectiveness in tasks requiring both multimodal reasoning and cross-domain adaptation, such as KB-VQA, remains largely unexplored.

Motivation: The Sparse Reward Problem

We apply the popular RL algorithm DAPO to KB-VQA and observe that over 80% of samples exhibit zero advantages during training, with overall training accuracy around only 10%. This indicates a severe sparse reward problem exacerbated by the distributional gap between pretraining and KB-VQA. Experiments with ground-truth retrieval confirm that retrieval noise is a significant contributing factor to the sparse reward and ineffective training.
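To see why uniform rewards stall learning, consider the group-normalized advantage used by GRPO/DAPO-style methods: when every rollout in a group receives the same reward, the advantages are all zero and the group contributes no gradient. A minimal sketch, assuming binary per-rollout rewards (the function name is ours):

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO/DAPO-style group-normalized advantages for one prompt's rollouts."""
    r = np.asarray(rewards, dtype=float)
    if r.std() < eps:
        # All rollouts scored identically: zero advantage, no gradient signal.
        return np.zeros_like(r)
    return (r - r.mean()) / (r.std() + eps)

# A group where every rollout fails is effectively skipped:
print(group_advantages([0, 0, 0, 0]))  # all zeros
print(group_advantages([1, 0, 0, 0]))  # non-zero advantages
```

With training accuracy around 10%, most groups fall into the all-zero case, which is exactly the sparse-advantage behavior reported above.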

Method

We propose Wiki-R1, a data-generation-based curriculum RL framework that constructs a sequence of training distributions adaptively aligned with the model's evolving capability. Unlike conventional curriculum learning, we generate training data with controllable difficulty rather than selecting from a fixed dataset. The framework consists of two tightly coupled components:

Figure: Wiki-R1 pipeline. Left: controllable curriculum data generation bridges the pretraining and KB-VQA distributions. Right: curriculum sampling with observation propagation selects informative samples with non-zero advantages.

① Controllable Curriculum Data Generation

We manipulate the retriever to generate training samples with controllable difficulty via a discrete gap level g ∈ {0, 1, …, G}:

  • Easiest (g = 0): only the ground-truth snippet (k = 1, γ = 1), closest to the pretraining distribution.
  • Intermediate (1 < g < G): k = g candidates with the ground truth included, introducing increasing noise.
  • Hardest (g = G): no guaranteed ground truth (γ = 0), fully aligned with the inference-time distribution.
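The gap levels above can be sketched as follows. This is an illustrative assumption of how retrieval contexts might be assembled per level; the function name, the distractor pool, and uniform distractor sampling are ours, not the paper's exact implementation:

```python
import random

def build_context(gt_snippet, retrieved, g, G):
    """Assemble a retrieval context at curriculum gap level g (hypothetical sketch).

    g = 0: ground-truth snippet only; 0 < g < G: g candidates with the ground
    truth guaranteed; g = G: raw retriever output, ground truth not guaranteed.
    """
    if g == 0:
        return [gt_snippet]
    if g < G:
        # Mix the ground truth with g - 1 retrieved distractors, shuffled.
        ctx = [gt_snippet] + random.sample(retrieved, g - 1)
        random.shuffle(ctx)
        return ctx
    # Inference-time distribution: top-g retrieved snippets as-is.
    return retrieved[:g]
```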

Gap-Level Schedule: A sliding window of recent w samples tracks training accuracy. When the moving average exceeds threshold Ο„, we promote g β†’ g+1, ensuring gradual exposure to harder distributions.
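A minimal sketch of this schedule, assuming a fixed-size window of w recent outcomes and promotion when the moving average exceeds τ (the class name and the choice to reset the window after promotion are our assumptions):

```python
from collections import deque

class GapScheduler:
    """Promote gap level g -> g + 1 when windowed accuracy exceeds tau (sketch)."""

    def __init__(self, G, window=256, tau=0.5):
        self.g, self.G, self.tau = 0, G, tau
        self.acc = deque(maxlen=window)  # sliding window of recent outcomes

    def update(self, correct):
        self.acc.append(float(correct))
        full = len(self.acc) == self.acc.maxlen
        if full and sum(self.acc) / len(self.acc) > self.tau and self.g < self.G:
            self.g += 1        # expose the model to a harder distribution
            self.acc.clear()   # start tracking accuracy at the new level
        return self.g
```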

② Curriculum Sampling with Observation Propagation

Generated samples may not always match the intended difficulty. We therefore introduce a curriculum sampling strategy that prioritizes samples likely to yield non-zero advantages: samples with training accuracy near 0.5 provide the strongest gradient signal.
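One way to realize this preference is a softmax over closeness to 0.5, so sampling probability peaks at accuracy 0.5 and decays toward 0 and 1. The scoring function and temperature below are illustrative assumptions:

```python
import numpy as np

def sampling_weights(acc, temp=0.1):
    """Sampling distribution favoring samples with estimated accuracy near 0.5."""
    acc = np.asarray(acc, dtype=float)
    score = -np.abs(acc - 0.5)        # highest at 0.5, lowest at 0 or 1
    w = np.exp(score / temp)          # softmax with temperature
    return w / w.sum()
```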

Observation Propagation: Observed rewards are extremely sparse. We leverage the insight that VQA sample correlations relate to their associated KB articles. We construct a label propagation graph with edge weights reflecting KB article similarity, then propagate observed accuracies to unobserved samples:

A_new = α · K · A_pred + (1 − α) · A

where K is the normalized KB-article similarity kernel, A holds the sparse observed accuracies, A_pred is the current estimate, and α is a smoothing factor.

This ensures effective curriculum sampling even under sparse observations.
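A sketch of the propagation update, assuming K is the row-normalized article-similarity matrix, unobserved entries start at zero, and the update is iterated to convergence (the iteration count and normalization scheme are our assumptions):

```python
import numpy as np

def propagate_accuracy(S, a_obs, alpha=0.8, iters=50):
    """Iterate A_new = alpha * K * A_pred + (1 - alpha) * A over a KB-article
    similarity graph, anchoring estimates to sparse observations (sketch)."""
    a0 = np.nan_to_num(a_obs, nan=0.0)    # A: observed accuracies, NaN = unobserved
    K = S / S.sum(axis=1, keepdims=True)  # row-normalized similarity kernel
    a = a0.copy()
    for _ in range(iters):
        a = alpha * K @ a + (1 - alpha) * a0  # propagate, then pull toward observations
    return a
```

Samples whose KB articles are similar to an observed sample inherit a non-trivial accuracy estimate, while disconnected samples stay at the default.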

Main Results

We evaluate Wiki-R1 on two standard KB-VQA benchmarks: Encyclopedic-VQA (EVQA) and InfoSeek. Unlike prior methods (e.g., ReflectiVA) whose performance is highly sensitive to the retrieval mode, Wiki-R1 consistently achieves strong performance across both benchmarks using a single unified retrieval system.

Table 1: Performance comparison on EVQA and InfoSeek. V. = visual, T. = textual retrieval. Con. = Contriever, Col. = ColBERT v2.

Method          | Retrieval Model | Type  | EVQA Single | EVQA All | InfoSeek Un-Q | InfoSeek Un-E | InfoSeek All | Avg.

Zero-shot MLLMs
BLIP-2          | -               | -     | 12.6 | 12.4 | 12.7 | 12.3 | 12.5 | 12.5
InstructBLIP    | -               | -     | 11.9 | 12.0 |  8.9 |  7.4 |  8.1 | 10.1
LLaVA-1.5 7B    | -               | -     | 16.0 | 16.9 |  8.3 |  8.9 |  7.8 | 12.4
Qwen-2.5-VL 3B  | -               | -     | 18.6 | 18.8 | 26.3 | 16.1 | 19.6 | 19.2
Qwen-2.5-VL 7B  | -               | -     | 26.6 | 26.3 | 25.3 | 17.2 | 19.9 | 23.1
GPT-4V          | -               | -     | 26.9 | 28.1 | 15.0 | 14.3 | 14.6 | 21.4

Retrieval-Augmented Generation
DPR V+T         | CLIP ViT-B/32   | V.+T. | 29.1 | -    | -    | -    | 12.4 | -
RORA-VLM        | CLIP+Google     | V.+T. | -    | 20.3 | 25.1 | 27.3 | -    | -
Wiki-LLaVA      | CLIP+Con.       | T.    | 18.3 | 19.6 | 28.6 | 25.7 | 27.1 | 23.4
EchoSight       | EVA-CLIP-8B     | T.    | 22.4 | 21.7 | 30.0 | 30.7 | 30.4 | 26.1
EchoSight       | EVA-CLIP-8B     | V.    | 26.4 | 24.9 | 18.0 | 19.8 | 18.8 | 21.9
ReflectiVA      | CLIP ViT-L/14   | T.    | 24.9 | 26.7 | 34.5 | 32.9 | 33.7 | 30.2
ReflectiVA      | EVA-CLIP-8B     | T.    | 28.0 | 29.2 | 40.4 | 39.8 | 40.1 | 34.7
ReflectiVA      | EVA-CLIP-8B     | V.    | 35.5 | 35.5 | 28.6 | 28.1 | 28.3 | 31.9
Wiki-R1 3B      | EVA-CLIP+Col.   | V.+T. | 40.4 | 35.9 | 46.0 | 40.3 | 42.2 | 39.1
Wiki-R1 7B      | EVA-CLIP+Col.   | V.+T. | 41.0 | 37.1 | 47.8 | 42.3 | 44.1 | 40.6

Generalization to ViQuAE (Zero-shot)

Wiki-R1 substantially outperforms all MLLM baselines and even surpasses the RC semi-oracle configuration on ViQuAE, demonstrating strong cross-dataset generalization.

Type | Model       | F1    | EM
RC   | Zero-shot   | 20.96 | 18.06
RC   | Few-shot    | 25.43 | 22.07
RC   | Semi-oracle | 44.10 | 40.32
RC   | Full-oracle | 63.17 | 57.55
MLLM | LLaVA-v1.5  | 15.1  | 26.6
MLLM | Wiki-LLaVA  | 12.7  | 21.8
MLLM | ReflectiVA  | 23.2  | 38.1
Ours | Wiki-R1 3B  | 53.8  | 48.6
Ours | Wiki-R1 7B  | 55.6  | 50.3

Oracle Entity Setting

Under the oracle setting where the ground-truth Wikipedia entity is provided, Wiki-R1 shows strong ability to effectively leverage correct retrieval results.

Method     | LLM          | EVQA | InfoSeek

KB Article
LLaVA-v1.5 | Vicuna-7B    | 42.9 | 13.8
LLaVA-v1.5 | LLaMA-3.1-8B | 54.1 | 18.8

KB Passage
Wiki-LLaVA | LLaMA-3.1-8B | 46.8 | 50.9
ReflectiVA | LLaMA-3.1-8B | 75.2 | 57.6
Wiki-R1    | Qwen-2.5-3B  | 68.5 | 65.3
Wiki-R1    | Qwen-2.5-7B  | 69.2 | 68.2

Ablation Study

We progressively add each component to validate its contribution. Naive SFT yields limited improvements, while DAPO achieves substantial gains. Our data curriculum further enhances DAPO, especially on the noisier EVQA. Applying curriculum sampling alone degrades performance due to observation sparsity; this highlights the necessity of our observation propagation module, which enables sampling to function as intended.

Ablation study on Qwen-2.5-VL 3B. ✓ = enabled, ✗ = disabled.

Method    | Data Cur. | Samp. Cur. | Obs. Prop. | EVQA Single | EVQA All | InfoSeek Un-Q | InfoSeek Un-E | InfoSeek All
Zero-shot | -         | -          | -          | 18.6 | 18.8 | 26.3 | 16.1 | 19.6
SFT       | -         | -          | -          | 21.6 | 25.1 | 38.7 | 24.9 | 29.5
SFT       | ✓         | -          | -          | -    | 34.4 | -    | -    | 32.1
DAPO      | ✗         | ✗          | ✗          | 35.9 | 31.4 | 44.9 | 39.8 | 41.5
DAPO      | ✓         | ✗          | ✗          | 39.4 | 34.5 | 46.9 | 41.1 | 43.0
DAPO      | ✓         | ✓          | ✗          | 36.4 | 32.1 | 45.2 | 37.3 | 40.0
Wiki-R1   | ✓         | ✓          | ✓          | 40.4 | 35.9 | 46.0 | 40.3 | 42.2

Training Dynamics & Efficiency

Efficiency of Observation Propagation

Observation propagation significantly decreases skipped trajectories (those with zero reward signal) during training, improving RL optimization efficiency and overall effectiveness.

Training Dynamics Visualization

DAPO shows early rapid improvement but degrades on EVQA due to overfitting on easier InfoSeek data. Wiki-R1 with curriculum training achieves stable improvements on both benchmarks; best performance emerges at the highest curriculum difficulty level.

Training Efficiency

Wiki-R1 requires substantially fewer training samples (40k total vs. 900k–5.4M for baselines) while achieving superior performance. Training takes only 36 A100 GPU-hours for the 3B model and 48 for the 7B model.

Training cost comparison. Time in A100 GPU-hours.

Method       | FT Retrieval | FT Generation | #Samples (EVQA) | #Samples (InfoSeek) | Time
Wiki-LLaVA   | ✗            | ✓             | 916k            | 903k                | ~75
EchoSight    | ✓            | ✗             | 916k            | 903k                | 40
ReflectiVA   | ✗            | ✓             | 2.9M            | 2.5M                | ~1,688
Wiki-R1 (3B) | ✗            | ✓             | 20k             | 20k                 | 36
Wiki-R1 (7B) | ✗            | ✓             | 20k             | 20k                 | 48

Hyperparameter Sensitivity & Robustness

We evaluate the sensitivity of two key hyperparameters: the curriculum gap threshold Ο„ and the observation-propagation smoothing factor Ξ±. The model converges to similar final accuracy within the explored intervals, confirming that Wiki-R1 is robust to hyperparameter variations. We also verify experimental reliability with three independent runs, showing consistent final performance.

BibTeX

@inproceedings{ningwiki,
  title={Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum},
  author={Ning, Shan and Qiu, Longtian and He, Xuming},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}