DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations
Abstract
Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance. To address this issue, we propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process. DA-DPO consists of two main components: (1) Difficulty Estimation leverages pre-trained vision-language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution-aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty-Aware Training reweights preference pairs based on their estimated difficulty, down-weighting easy samples while emphasizing harder ones to alleviate overfitting. This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine-tuning stages. Extensive experiments demonstrate that DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient.
Motivation
We observe that vanilla DPO methods trained on existing pairwise preference data often lead to a noticeable degradation in general multimodal capabilities. We attribute this limitation to an imbalance between easy and hard samples in the training data, as illustrated above.
Multimodal Preference Optimization Analysis
Through empirical analysis, we demonstrate that models tend to overfit to simpler training samples while progressively reducing their effective learning from harder instances. This phenomenon is particularly pronounced in pairwise training paradigms such as DPO, and the resulting overfitting ultimately compromises model performance in diverse real-world scenarios. We substantiate these findings with quantitative evidence drawn from training dynamics and reward-trend analyses.
Method
Data Difficulty Estimation
The key challenge in evaluating the difficulty of preference data is the lack of explicit supervision. To address this, we propose a lightweight, training-free strategy that leverages pre-trained contrastive and generative VLMs to estimate sample difficulty from complementary perspectives.
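As a concrete illustration, the sketch below shows one possible way to derive two training-free difficulty scores: a contrastive score from an off-the-shelf CLIP checkpoint and a generative score from summed token log-probabilities of the chosen and rejected responses. The model name (openai/clip-vit-base-patch32), the sigmoid margin-to-difficulty mapping, and the assumption that generative log-probabilities are precomputed are illustrative choices, not the exact scorers used in DA-DPO.

import torch
from transformers import CLIPModel, CLIPProcessor

@torch.no_grad()
def contrastive_difficulty(image, chosen, rejected,
                           model_name="openai/clip-vit-base-patch32"):
    # Contrastive view: a pair is hard when CLIP assigns similar image-text
    # scores to the chosen and rejected responses.
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[chosen, rejected], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    sims = model(**inputs).logits_per_image.squeeze(0)  # [chosen_sim, rejected_sim]
    margin = sims[0] - sims[1]
    return torch.sigmoid(-margin).item()  # small margin -> high difficulty

@torch.no_grad()
def generative_difficulty(logp_chosen, logp_rejected):
    # Generative view: logp_* are summed token log-probabilities of each
    # response given the image and prompt (assumed precomputed with a
    # generative VLM); again, a small margin means a hard pair.
    margin = torch.tensor(float(logp_chosen - logp_rejected))
    return torch.sigmoid(-margin).item()

In both views, a small or negative margin between the chosen and rejected responses indicates that the scorer finds the pair hard to distinguish, which is mapped to a higher difficulty score.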
Distribution-aware Voting Fusion
We evaluate each pairwise DPO sample from these two perspectives, resulting in two difficulty scores, and propose a data-driven voting strategy that adaptively combines them based on the preference classification results of the two scorers.
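The snippet below is a minimal sketch of such a fusion rule, under the assumption that each scorer also yields a binary preference classification (whether it ranked the chosen response above the rejected one). The rank-based normalization and the agreement-based voting rule are hypothetical design choices used only to make the idea concrete; the actual DA-DPO fusion may differ.

import numpy as np

def fuse_difficulty(contrastive_scores, generative_scores,
                    contrastive_correct, generative_correct):
    # Hypothetical distribution-aware voting fusion.
    # *_scores: per-pair difficulty values; *_correct: whether each scorer
    # ranked the chosen response above the rejected one.
    c = np.asarray(contrastive_scores, dtype=float)
    g = np.asarray(generative_scores, dtype=float)
    c_ok = np.asarray(contrastive_correct, dtype=bool)
    g_ok = np.asarray(generative_correct, dtype=bool)

    # Normalize each scorer by its empirical rank so the two score
    # distributions become comparable before voting.
    denom = max(len(c) - 1, 1)
    c_rank = c.argsort().argsort() / denom
    g_rank = g.argsort().argsort() / denom

    fused = np.empty_like(c_rank)
    both = c_ok & g_ok
    fused[both] = 0.5 * (c_rank[both] + g_rank[both])  # agreement: average the ranks
    fused[c_ok & ~g_ok] = c_rank[c_ok & ~g_ok]         # trust the scorer that classified correctly
    fused[~c_ok & g_ok] = g_rank[~c_ok & g_ok]
    fused[~c_ok & ~g_ok] = 1.0                          # both misrank the pair: treat as maximally hard
    return fused

Normalizing by empirical ranks keeps the two difficulty distributions on a comparable scale before voting, so neither scorer dominates the fused score.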
Difficulty-aware Training
After estimating the difficulty of the preference pairs, we obtain a robust score for each pairwise DPO sample. We then perform difficulty-aware DPO training by adaptively calibrating β, which adjusts the weight of each training sample and reduces overfitting on easy samples compared to standard DPO training.
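A minimal sketch of this idea is given below, assuming the fused difficulty score lies in [0, 1] and that easier pairs receive a smaller per-sample β (and thus a smaller loss contribution). The linear calibration beta_base * (0.5 + difficulty) is an illustrative assumption rather than the exact schedule used in DA-DPO.

import torch
import torch.nn.functional as F

def difficulty_aware_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                              ref_chosen_logps, ref_rejected_logps,
                              difficulty, beta_base=0.1):
    # Standard DPO log-ratio margin between the policy and the frozen reference.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    margins = pi_logratios - ref_logratios

    # Per-pair beta: easy pairs (low difficulty) get a smaller beta and hence
    # a flatter loss; hard pairs get a larger one.  This mapping is assumed
    # for illustration only.
    beta = beta_base * (0.5 + difficulty)

    return -F.logsigmoid(beta * margins).mean()

Because β scales the log-ratio margin inside the log-sigmoid, shrinking it on easy pairs flattens their loss and gradient, which produces the intended down-weighting effect.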
Experiments
To comprehensively evaluate the impact of preference optimization on MLLMs, we select two types of benchmarks. Hallucination benchmarks measure how well the model avoids factual errors, which is the primary goal of multimodal preference alignment. Comprehensive benchmarks assess general multimodal capabilities, ensuring that improvements in hallucination do not come at the cost of overall performance.
Visualization
BibTeX
@article{qiu2026dpo,
title={DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations},
author={Qiu, Longtian and Ning, Shan and Zhang, Chuyu and Sun, Jiaxuan and He, Xuming},
journal={arXiv preprint arXiv:2601.00623},
year={2026}
}