WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

Shan Ning1,3, Longtian Qiu1, Jiaxuan Sun1, Xuming He1,2
ShanghaiTech University, Shanghai, China1,
Shanghai Engineering Research Center of Intelligent Vision and Imaging2,
Lingang Laboratory, Shanghai, China3
CVPR 2026 Highlight

Abstract

Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100× compared with the leading generative model, AutoVER.

Introduction

Open-domain Visual Entity Recognition (VER) aims to identify specific named entities appearing in an image, where the entity space is drawn from encyclopedic knowledge sources such as Wikipedia. VER serves as a critical component in various real-world applications, including information-seeking Visual Question Answering (VQA), animal species recognition, and news content understanding. Despite rapid advances in multimodal large language models (MLLMs), recent studies reveal that VER remains highly challenging: it demands reasoning over fine-grained encyclopedic knowledge and recognizing entities across an extremely large and long-tailed category space, often encompassing millions of candidates.

Recent works on open-domain VER suggest that generative paradigms, which translate query images into text and then perform text-based entity matching against encyclopedic sources, currently outperform contrastive approaches. However, generative methods suffer from several critical limitations:

⏱️ High Inference Latency

Autoregressive decoding requires sequential token generation, causing significant computational overhead compared to parallelizable contrastive encoders.

πŸ” Limited Generalization

Generative VER models often fail to recognize entities not observed during VER training, limiting their open-domain applicability.

πŸ’° High Computational Cost

They typically rely on massive architectures (e.g., AutoVER 13B) and large-scale paired datasets (e.g., REW-47M image–text pairs).

In this work, we revisit the contrastive paradigm for VER and argue that it remains a powerful yet underexplored alternative. Our key insight is that LLM embeddings can encode rich encyclopedic semantics when provided with textual descriptions. By guiding these representations with fine-grained visual cues, we can extract discriminative, entity-level embeddings through lightweight contrastive training. WikiCLIP thus combines the generalization ability of LLM-based representations with the efficiency and scalability of contrastive learning.

Motivation

Generative VER pipelines are accurate but costly. When deployed as intermediate modules within larger pipelines, they can cause slow inference, reduced adaptability, and cumulative error propagation in downstream tasks. WikiCLIP revisits contrastive retrieval with knowledge-rich entity embeddings and vision-guided filtering to achieve a strong efficiency–accuracy tradeoff: 14.49 ms per query versus 1569 ms for AutoVER 13B, a nearly 100× speedup.

Method

WikiCLIP pipeline

The Overall Pipeline of WikiCLIP. Given an entity's Wikipedia document, we use CLIP to extract patch-level features from the entity image and an LLM to obtain embeddings of its encyclopedic text description. The Vision-Guided Knowledge Adaptor selects informative text tokens guided by visual features to produce a knowledge-aware entity representation. Hard negative synthesis generates challenging negatives by swapping entity text descriptions.

πŸ”¬ Vision-Guided Knowledge Adaptor (VGKA)

WikiCLIP employs a dual-encoder architecture consisting of a frozen CLIP image encoder for query images and a trainable entity encoder incorporating the VGKA module.

The VGKA uses CLIP to extract patch-level visual features P_e ∈ ℝ^(N_p Γ— D) from entity images, serving as visual guidance. An LLM encodes the entity's textual description into token-level embeddings, which are linearly projected to match the visual feature dimension, yielding T_t ∈ ℝ^(N_t Γ— D).

A multi-head cross-attention operation then selects discriminative text information guided by visual features, Vβ€² = FA(P_e, T_t, T_t), with the visual patches P_e as queries and the text tokens T_t as keys and values, followed by mean pooling to obtain the final entity embedding v ∈ ℝ^D. This allows the model to focus on entity-relevant semantics within lengthy texts while suppressing irrelevant information.
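The attend-then-pool step can be illustrated with a minimal single-head, projection-free sketch (function and variable names are ours; the actual module uses multi-head attention with learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vgka_pool(P_e, T_t):
    """Single-head, projection-free sketch of the VGKA.

    P_e: (N_p, D) patch-level visual features (queries).
    T_t: (N_t, D) projected LLM text-token embeddings (keys and values).
    Returns one entity embedding v of shape (D,).
    """
    D = P_e.shape[1]
    # Visual patches attend over text tokens, selecting entity-relevant text.
    attn = softmax(P_e @ T_t.T / np.sqrt(D), axis=-1)  # (N_p, N_t)
    V = attn @ T_t                                     # attended text, (N_p, D)
    return V.mean(axis=0)                              # mean pool -> (D,)

rng = np.random.default_rng(0)
P_e = rng.standard_normal((49, 64))    # e.g. a 7x7 CLIP patch grid, D = 64
T_t = rng.standard_normal((256, 64))   # 256 text tokens after projection
v = vgka_pool(P_e, T_t)
print(v.shape)  # (64,)
```

Because the text tokens act as keys and values while the image supplies the queries, tokens irrelevant to the depicted entity receive low attention and contribute little to the pooled embedding.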

⚑ Hard Negative Synthesis

To improve fine-grained entity discrimination, we introduce a hard negative synthesis strategy that creates visually similar yet semantically mismatched negatives.

Step 1 – Visual Clustering: We leverage CLIP visual features to construct mini-batches where query images are visually similar, grouping entities that share visual appearance.

Step 2 – Text Swapping: For each sample in the visually clustered batch, we generate N_sync synthetic entities composed of the original entity image paired with randomly selected textual descriptions from other entities in the mini-batch.

These synthetic hard negatives selectively replace easy negatives (those with low cosine similarity to the query), forcing the model to capture fine-grained textual distinctions that define entity identity. Only when both steps are combined does the model learn effective fine-grained discrimination.
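The two steps can be sketched as a toy NumPy routine (names and the simple seed-based grouping are our assumptions; the actual batching and negative-replacement criteria may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_visual_batch(pool_feats, seed_idx, batch_size):
    """Step 1 sketch: gather visually similar queries into one mini-batch
    by cosine similarity to a seed image (a simplification of clustering)."""
    f = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
    sims = f @ f[seed_idx]
    return np.argsort(-sims)[:batch_size]  # seed is its own nearest neighbor

def swap_texts(batch_idx, n_sync, rng):
    """Step 2 sketch: pair each image with N_sync other entities' texts,
    yielding visually plausible but semantically mismatched negatives."""
    negatives = []
    for i in batch_idx:
        others = [j for j in batch_idx if j != i]
        picks = rng.choice(others, size=min(n_sync, len(others)), replace=False)
        negatives += [(int(i), int(j)) for j in picks]  # (image, wrong text)
    return negatives

pool = rng.standard_normal((32, 16))   # CLIP features of 32 queries
batch = build_visual_batch(pool, seed_idx=0, batch_size=4)
negs = swap_texts(batch, n_sync=2, rng=rng)
print(len(negs))  # 4 images x 2 swapped texts = 8
```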

Efficient Inference: All entity embeddings in the knowledge base can be precomputed and stored offline. At inference time, recognition requires only a single forward pass through the CLIP image encoder and a FAISS similarity search, whereas generative approaches rely on expensive autoregressive decoding. This yields an inference latency of just 14.49 ms per query.
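The offline/online split might look like the following sketch, which substitutes exact NumPy inner-product search for a FAISS index (e.g., IndexFlatIP); database size and dimensions are illustrative:

```python
import numpy as np

# Offline: precompute and L2-normalize all entity embeddings once.
rng = np.random.default_rng(0)
entity_db = rng.standard_normal((10_000, 64)).astype(np.float32)
entity_db /= np.linalg.norm(entity_db, axis=1, keepdims=True)

def recognize(query_feat, db, k=5):
    """Online: one encoder forward pass yields query_feat; recognition is
    then a single maximum-inner-product search over the precomputed table."""
    q = query_feat / np.linalg.norm(query_feat)
    scores = db @ q                      # cosine similarity to every entity
    topk = np.argsort(-scores)[:k]       # indices of the k best candidates
    return topk, scores[topk]

# A slightly perturbed copy of entity 42 should retrieve entity 42 first.
query = entity_db[42] + 0.01 * rng.standard_normal(64).astype(np.float32)
idx, scores = recognize(query, entity_db)
print(idx[0])  # 42
```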

Main Results on OVEN

WikiCLIP achieves 31.6 HM on OVEN, nearly 3× the previous contrastive SOTA (CLIP2CLIP, 11.5), and surpasses GiT-Large trained on REW-47M with only 1/5 of the tunable parameters.

| Category | Method | Extra Dataset | Latency (ms) | TFLOPS | Unseen | Seen | HM |
|---|---|---|---|---|---|---|---|
| Zero Shot | GPT5-nano | – | – | – | 13.0 | 23.7 | 16.8 |
| Zero Shot | GPT4V | – | – | – | 19.3 | 29.8 | 23.4 |
| Generative | PaLI-3B | – | – | – | 6.6 | 21.6 | 10.1 |
| Generative | PaLI-17B | – | – | – | 12.4 | 30.6 | 17.6 |
| Generative | GiT-Large* | WebLI-100M | 83.95 | 3.06 | 4.2 | 13.7 | 6.5 |
| Generative | GER-ALD* | Entity-WebLI | 83.95 | 3.06 | 17.7 | 31.5 | 22.7 |
| Generative | GiT-Large* | Entity-WebLI | 83.95 | 3.06 | 16.4 | 25.9 | 20.1 |
| Generative | GiT-Large* | REW-47M | 83.95 | 3.06 | 25.1 | 36.0 | 29.6 |
| Generative | AutoVER 7B | – | 993 | 19.47 | 21.7 | 61.5 | 32.1 |
| Generative | AutoVER 13B | – | 1569 | 24.74 | 24.5 | 63.6 | 35.6 |
| Contrastive | CLIP ViT-L/14 | – | 11.69 | 0.07 | 5.4 | 5.3 | 5.4 |
| Contrastive | CLIPFusion | – | 15.93 | 0.08 | 4.8 | 33.6 | 8.4 |
| Contrastive | CLIP2CLIP | – | 13.84 | 0.08 | 10.5 | 12.6 | 11.5 |
| Contrastive | WikiCLIP-S | – | 14.49 | 1.93 | 27.0 | 36.8 | 31.1 |
| Contrastive | WikiCLIP-L | – | 14.49 | 1.93 | 28.5 | 35.5 | 31.6 |

Table 1. Comparison with State-of-the-Art on the OVEN Entity Set. * denotes test-split results. Latency measured on A100.

Generalization on E-VQA & INFOSEEK

WikiCLIP achieves SOTA on INFOSEEK without fine-tuning on its training set, and competitive results on E-VQA compared to Echosight (which is explicitly fine-tuned).

INFOSEEK

| Method | FT | Unseen | Seen | Overall |
|---|---|---|---|---|
| DPR | In-house | – | – | 29.6 |
| CLIP I2T* | – | – | – | 32.0 |
| CLIP I2I* | – | 45.6 | 46.5 | 45.9 |
| Echosight | E-VQA | – | – | 53.2 |
| WikiCLIP-S | OVEN | 58.5 | 69.3 | 61.2 |
| WikiCLIP-L | OVEN | 60.3 | 69.6 | 62.7 |

E-VQA

| Method | FT | Unseen | Seen | Overall |
|---|---|---|---|---|
| CLIP I2T* | – | – | – | 3.3 |
| CLIP I2I* | – | 14.6 | 10.6 | 13.3 |
| Echosight | E-VQA | – | – | 36.5 |
| Google Lens | – | – | – | 47.4 |
| WikiCLIP-S | OVEN | 27.7 | 39.9 | 30.7 |
| WikiCLIP-L | OVEN | 30.7 | 35.6 | 31.9 |

Efficiency Comparison

  • Only 0.08B tunable parameters β€” no gradients through frozen LLM/CLIP.
  • WikiCLIP-L: 23 h on 8Γ—A100 vs. AutoVER 13B: 247 h.
  • Trained on 1.9M samples, outperforms GiT-Large trained on 47M samples.
  • Inference: 14.49 ms vs. 1569 ms (AutoVER 13B), 108Γ— faster.
| Method | Params | Train Time | Latency |
|---|---|---|---|
| AutoVER 13B | 13B | 247 h | 1569 ms |
| GiT-Large (REW) | 0.4B | – | 83.95 ms |
| WikiCLIP-S | 0.08B | 19 h | 14.49 ms |
| WikiCLIP-L | 0.08B | 23 h | 14.49 ms |

Ablation Study

Entity Representation & Training Strategy

| Image | Text | Cluster | Synth | Unseen | Seen | Overall |
|---|---|---|---|---|---|---|
| βœ“ | | | | 39.5 | 60.4 | 44.8 |
| | βœ“ | | | 47.9 | 59.1 | 50.8 |
| βœ“ | βœ“ | | | 56.8 | 68.0 | 59.7 |
| βœ“ | βœ“ | βœ“ | | 56.8 | 68.2 | 59.7 |
| βœ“ | βœ“ | | βœ“ | 57.0 | 64.6 | 58.9 |
| βœ“ | βœ“ | βœ“ | βœ“ | 58.5 | 69.3 | 61.2 |

The first three rows ablate the entity representation (image, text, or both); the last three ablate the training strategy (visual clustering and hard negative synthesis).

Choice of Encoders

| Text Encoder | Visual Encoder | Unseen | Seen | Overall |
|---|---|---|---|---|
| EVA-CLIP 8B | CLIP ViT-L | 26.5 | 54.6 | 33.6 |
| LLaMA 3.2 1B | CLIP ViT-L | 39.8 | 46.9 | 41.6 |
| EVA-CLIP 8B | EVA-CLIP 8B | 56.3 | 62.4 | 58.1 |
| LLaMA 3.2 1B | EVA-CLIP 8B | 58.5 | 69.3 | 61.2 |

LLM text encoders outperform CLIP text encoders due to richer world knowledge and longer context support.

Analysis & Discussion

πŸ“ Wiki Text Length
Text length

Performance peaks at 256 tokens. Excessive text introduces noise β€” not all Wikipedia text benefits recognition.

πŸ“Š LLM Scale Effect
LLM scale

Scaling LLMs improves unseen accuracy, but gains between 3B and 8B are marginal.

🎯 Seen Category Ratio
Seen ratio

WikiCLIP achieves 56% unseen acc with only 700 seen entities (10%), vs. 58% with full 7,943.

Visualization

Vision-Guided Knowledge Selection

We visualize the attention map of each text token guided by patch-level vision signals. The top-32 text segments with highest attention are highlighted, showing that the VGKA successfully detects discriminative entity features.
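Assuming row-normalized patch-to-token attention weights, one plausible way to produce this highlighting is to rank text tokens by the total attention they receive across all patches (the selection rule here is our sketch, not necessarily the paper's exact procedure):

```python
import numpy as np

def top_attended_tokens(attn, k=32):
    """Rank text tokens by total attention received from all visual patches.

    attn: (N_p, N_t) patch-to-token attention weights, rows summing to 1.
    Returns the indices of the k most-attended text tokens.
    """
    token_mass = attn.sum(axis=0)        # total attention per text token
    return np.argsort(-token_mass)[:k]   # top-k token indices

rng = np.random.default_rng(0)
attn = rng.random((49, 256))
attn /= attn.sum(axis=1, keepdims=True)  # normalize rows like a softmax output
top = top_attended_tokens(attn, k=32)
print(top.shape)  # (32,)
```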

Top-K Retrieval Results

Qualitative top-5 retrieval results show WikiCLIP successfully resolves visually ambiguous cases by leveraging textual descriptions for precise entity recognition.

Top-K retrieval results

Top-5 Retrieval Visualization. WikiCLIP retrieves the correct entity even among visually similar candidates.

Hard Negative Visualization

We visualize entity representations with and without hard negative synthesis using t-SNE. Hard negative synthesis yields sparser, more discriminative representations, as confirmed by higher Silhouette Scores.

Cluster visualization
Score distribution

Hard Negative Visualization. (Left) t-SNE of entity representations. (Right) Silhouette score comparison.
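For reference, the Silhouette Score used in this comparison can be computed as follows (a plain NumPy sketch; scikit-learn's silhouette_score is the standard implementation):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette score: (b - a) / max(a, b) averaged over samples,
    where a is the mean intra-cluster distance and b the mean distance
    to the nearest other cluster. Higher means better-separated clusters."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise dists
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        a = D[i][same].mean()
        b = min(D[i][labels == c].mean()
                for c in set(labels.tolist()) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated synthetic clusters score close to 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
s = silhouette(X, labels)
print(s > 0.9)  # True
```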

Error Case Analysis

We identify three main failure types: (1) Wrong but Relevant: the predicted entity is semantically related but incorrect; (2) Unrelated Ground Truth: the GT entity is not directly present in the image; (3) Granularity Mismatch: the prediction is at the wrong level of specificity.

Error case analysis

Error Case Visualization. Three main types of prediction errors on OVEN.

BibTeX

@article{ning2026wikiclip,
  title={WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition},
  author={Ning, Shan and Qiu, Longtian and Sun, Jiaxuan and He, Xuming},
  journal={arXiv preprint arXiv:2603.09921},
  year={2026}
}