Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

Shan Ning1,3, Longtian Qiu1, Xuming He1,2
ShanghaiTech University, Shanghai, China1,
Shanghai Engineering Research Center of Intelligent Vision and Imaging2,
Lingang Laboratory, Shanghai, China3
ICLR 2026

Abstract

Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. We propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model’s evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Experiments on Encyclopedic VQA and InfoSeek demonstrate state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek.

Motivation

RL for KB-VQA faces severe sparse rewards due to noisy retrieval and the encyclopedic structure of the knowledge base. Wiki-R1 addresses this by creating a learnable curriculum over data difficulty and sampling informative examples that yield effective gradients.

Method

Wiki-R1 pipeline

Wiki-R1 pipeline. Left: controllable curriculum data generation bridges pretraining and KB-VQA distributions. Right: curriculum sampling with observation propagation selects informative samples with non-zero advantages.

Wiki-R1 manipulates the retriever to generate training data with controllable difficulty, then schedules samples using observed rewards and label propagation across related knowledge-base articles. This joint design stabilizes RL optimization under noisy retrieval.

Experiments

On Encyclopedic VQA and InfoSeek, Wiki-R1 reaches 37.1% and 44.1% accuracy, respectively, and achieves 47.8% on InfoSeek’s Unseen-Question split. It also generalizes well to ViQuAE in zero-shot transfer.

Visualization

BibTeX

@inproceedings{ningwiki,
  title={Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum},
  author={Ning, Shan and Qiu, Longtian and He, Xuming},
  booktitle={The Fourteenth International Conference on Learning Representations}
}