DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations

¹ShanghaiTech University, Shanghai, China
²Shanghai Engineering Research Center of Intelligent Vision and Imaging
³Lingang Laboratory, Shanghai, China
TMLR 2025

Abstract

Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance. To address this issue, we propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process. DA-DPO consists of two main components: (1) Difficulty Estimation leverages pre-trained vision--language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution-aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty-Aware Training reweights preference pairs based on their estimated difficulty, down-weighting easy samples while emphasizing harder ones to alleviate overfitting. This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine-tuning stages. Extensive experiments demonstrate that DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient.

Motivation


We observe that vanilla DPO methods trained on existing pairwise preference data often lead to noticeable degradation in general multimodal capabilities. We attribute this limitation to an imbalance between easy and hard samples in the training data, as illustrated above.

Multimodal Preference Optimization Analysis


Through empirical analysis, we demonstrate that models tend to overfit to simpler training samples while progressively learning less from harder instances. This phenomenon is particularly pronounced in pairwise training paradigms such as DPO, and the resulting overfitting ultimately compromises model performance in diverse real-world scenarios. We substantiate these findings with quantitative evidence drawn from training dynamics and reward-trend analyses.

Method

Data Difficulty Estimation


The key challenge in evaluating the difficulty of preference data is the lack of explicit supervision. To address this, we propose a lightweight, training-free strategy that leverages pre-trained contrastive and generative VLMs to estimate sample difficulty from complementary perspectives.
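For illustration, the sketch below scores a single preference pair with two off-the-shelf models: a CLIP model provides a contrastive image-text similarity margin between the chosen and rejected responses, and a small captioning model stands in for the generative VLM by comparing mean per-token log-likelihoods. The specific checkpoints, helper names, and sign convention (margin > 0 means the scorer prefers the chosen response; a small or negative margin marks a hard pair) are illustrative assumptions, not the exact models used in our pipeline.

# Hedged sketch: training-free difficulty scoring of one preference pair with
# two off-the-shelf VLMs. Checkpoints and helper names are illustrative only.
import torch
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor,
                          BlipForConditionalGeneration, BlipProcessor)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").eval()
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

@torch.no_grad()
def contrastive_margin(image: Image.Image, chosen: str, rejected: str) -> float:
    # CLIP image-text similarity gap between the two responses (text truncated
    # to CLIP's 77-token limit); a small gap means the pair is hard to tell apart.
    inputs = clip_proc(text=[chosen, rejected], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    sims = clip(**inputs).logits_per_image.squeeze(0)  # shape [2]
    return (sims[0] - sims[1]).item()

@torch.no_grad()
def generative_margin(image: Image.Image, chosen: str, rejected: str) -> float:
    # Gap in the captioner's mean per-token log-likelihood of each response.
    def logprob(text: str) -> float:
        inputs = blip_proc(images=image, text=text, return_tensors="pt")
        out = blip(**inputs, labels=inputs["input_ids"])
        return -out.loss.item()
    return logprob(chosen) - logprob(rejected)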

Distribution-aware Voting Fusion


We evaluate each pairwise DPO sample from both perspectives, obtaining two difficulty scores, and propose a data-driven voting strategy that adaptively combines them based on the preference classification results of the two scorers.
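The fusion below is a minimal sketch of this idea under explicit assumptions: each difficulty score is standardized against its own empirical distribution so the two scales are comparable, and pairs on which the two scorers disagree about which response is preferred are voted toward higher difficulty. The exact voting rule and the `disagree_boost` hyperparameter are illustrative, not the precise rule described in the paper.

# Hedged sketch of a distribution-aware fusion of the two difficulty scores.
import numpy as np

def fuse_difficulty(contrastive_margins: np.ndarray,
                    generative_margins: np.ndarray,
                    disagree_boost: float = 1.0) -> np.ndarray:
    """Margins > 0 mean the scorer prefers the chosen response (easy direction)."""
    def zscore(x):
        return (x - x.mean()) / (x.std() + 1e-8)

    # Standardize against each scorer's own distribution, then negate so that
    # a smaller margin (harder pair) maps to a larger difficulty value.
    d_con = -zscore(contrastive_margins)
    d_gen = -zscore(generative_margins)
    difficulty = 0.5 * (d_con + d_gen)

    # Voting on the preference classification: if the two scorers disagree on
    # which response is better, the pair is ambiguous, so bump its difficulty.
    disagree = np.sign(contrastive_margins) != np.sign(generative_margins)
    difficulty = difficulty + disagree_boost * disagree.astype(np.float32)

    # Squash to (0, 1) so the score can be used as a per-sample weight later.
    return 1.0 / (1.0 + np.exp(-difficulty))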

Difficulty-aware Training


After estimating the difficulty of each preference pair, we obtain a robust score for the pairwise DPO data and perform difficulty-aware DPO training by adaptively calibrating β. This allows us to adjust the weight of each training sample, reducing overfitting on easy samples compared to standard DPO training.
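A minimal sketch of this step is shown below, assuming the fused difficulty score lies in (0, 1) and that β is interpolated linearly between a larger value for easy pairs (keeping them close to the reference model) and a smaller value for hard pairs (letting them drive larger updates). The function signature and the linear schedule are illustrative assumptions; only the idea of calibrating β per sample follows the description above.

# Hedged sketch of difficulty-aware DPO with a per-sample beta. The log-probs
# are summed over response tokens by the usual DPO data pipeline; `difficulty`
# is the fused score from the previous step. The difficulty-to-beta mapping is
# an illustrative linear interpolation, not the paper's exact schedule.
import torch
import torch.nn.functional as F

def da_dpo_loss(policy_chosen_logps: torch.Tensor,   # [B]
                policy_rejected_logps: torch.Tensor, # [B]
                ref_chosen_logps: torch.Tensor,      # [B]
                ref_rejected_logps: torch.Tensor,    # [B]
                difficulty: torch.Tensor,            # [B], in (0, 1)
                beta_easy: float = 0.5,
                beta_hard: float = 0.1) -> torch.Tensor:
    # Implicit reward margin of the policy relative to the reference model.
    logits = (policy_chosen_logps - ref_chosen_logps) \
           - (policy_rejected_logps - ref_rejected_logps)

    # Calibrate beta per sample: easy pairs (low difficulty) get a larger beta,
    # i.e. a stronger pull toward the reference, while hard pairs get a smaller
    # beta and are allowed to move the policy further.
    beta = beta_easy + (beta_hard - beta_easy) * difficulty

    # Standard DPO objective with the per-sample beta.
    return -F.logsigmoid(beta * logits).mean()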

Experiments

To comprehensively evaluate the impact of preference optimization on MLLMs, we select two types of benchmarks. Hallucination benchmarks measure the model’s ability to reduce factual errors, which is the primary goal of multimodal preference alignment. Comprehensive benchmarks assess general multimodal capabilities, ensuring that improvements in hallucination do not come at the cost of overall performance.

Visualization


BibTeX

@article{qiu2026dpo,
  title={DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations},
  author={Qiu, Longtian and Ning, Shan and Zhang, Chuyu and Sun, Jiaxuan and He, Xuming},
  journal={arXiv preprint arXiv:2601.00623},
  year={2026}
}