Learning Human Visual Attention on 3D Surfaces
through Geometry-Queried Semantic Priors

Conference Name 2025
Soham Pahari 1, Sandeep Chand Kumain 1

1 University of Petroleum and Energy Studies

Abstract

Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or on learning-based approaches that lack semantic awareness, and therefore cannot explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention lets geometric features query semantic content, so bottom-up distinctiveness guides top-down retrieval. We further extend the framework to temporal scanpath generation through reinforcement learning, introducing the first formulation that respects 3D mesh topology with inhibition-of-return dynamics. Evaluation on the SAL3D, NUS3D, and 3DVA datasets demonstrates substantial improvements over prior methods (e.g., 28% higher CC on SAL3D), validating that cognitively motivated architectures can effectively model human visual attention on three-dimensional surfaces.

Method Overview

SemGeo-AttentionNet Pipeline

Our dual-stream architecture combines bottom-up geometric processing via point cloud transformers with top-down semantic recognition through diffusion-based priors. The cross-attention mechanism allows geometric features to query semantic content, enabling precise prediction of human visual attention on 3D surfaces.
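The geometry-queries-semantics step can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's exact architecture: geometric tokens emit the queries while semantic tokens supply keys and values, and the randomly initialized projections stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometry_queried_attention(geo_feats, sem_feats, d_k=8, seed=0):
    """Cross-attention where geometric tokens emit the queries and
    semantic tokens supply the keys/values (bottom-up queries top-down)."""
    rng = np.random.default_rng(seed)
    d_g, d_s = geo_feats.shape[1], sem_feats.shape[1]
    # Random projections stand in for learned weight matrices.
    W_q = rng.standard_normal((d_g, d_k)) / np.sqrt(d_g)
    W_k = rng.standard_normal((d_s, d_k)) / np.sqrt(d_s)
    W_v = rng.standard_normal((d_s, d_k)) / np.sqrt(d_s)
    Q = geo_feats @ W_q                      # (N_geo, d_k)
    K = sem_feats @ W_k                      # (N_sem, d_k)
    V = sem_feats @ W_v                      # (N_sem, d_k)
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (N_geo, N_sem)
    return attn @ V, attn                    # fused features, attention map
```

The asymmetry is the design point: each geometric token attends over all semantic tokens, so the output keeps one fused feature per surface point while drawing on the full semantic context.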

Qualitative Results

3D Saliency Visualization on SAL3D

Comparison of Ground Truth saliency maps with our predictions

(Figure: side-by-side Ground Truth vs. Ours saliency maps for the Cat, Gorilla, Horse, and Octopus meshes.)

Visual Comparisons on NUS3D Dataset

Comparison of saliency predictions across different methods

Quantitative Results

SAL3D Dataset

28% improvement in CC over previous state-of-the-art

Method               | CC ↑   | KL-Div ↓ | MSE ↓
Song et al.          | 0.1249 | 0.7034   | 0.3220
Nousias et al.       | 0.0570 | 1.9618   | 0.0759
SAL3D model          | 0.6616 | 0.3051   | 0.0204
Mesh Mamba           | 0.6140 | 0.3067   | –
Ours (SemGeo-Attn)   | 0.8492 | 0.1638   | 0.0114
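For reference, the CC and KL-Div metrics reported above can be computed per saliency map as follows. This is a minimal NumPy sketch assuming non-negative saliency values; the function names are ours, not the benchmark's.

```python
import numpy as np

def pearson_cc(pred, gt, eps=1e-8):
    """Linear correlation coefficient (CC) between two saliency maps."""
    p = (pred.ravel() - pred.mean()) / (pred.std() + eps)
    g = (gt.ravel() - gt.mean()) / (gt.std() + eps)
    return float((p * g).mean())

def kl_divergence(pred, gt, eps=1e-8):
    """KL(gt || pred) after normalizing each non-negative map to a
    probability distribution over surface points."""
    p = pred.ravel() / (pred.sum() + eps)
    g = gt.ravel() / (gt.sum() + eps)
    return float(np.sum(g * np.log(g / (p + eps) + eps)))
```

Identical maps give CC ≈ 1 and KL ≈ 0, matching the direction of the arrows in the table (higher CC is better, lower KL-Div is better).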

NUS3D-Saliency Dataset

129% improvement in LCC over MIMO-GAN

Method               | LCC ↑  | AUC ↑
DSM                  | 0.222  | 0.726
MIMO-GAN-A1          | 0.290  | 0.781
MIMO-GAN-A2          | 0.057  | 0.584
MIMO-GAN-A3          | 0.259  | 0.753
MIMO-GAN             | 0.267  | 0.761
Ours (SemGeo-Attn)   | 0.609  | 0.935

3DVA Dataset

49% improvement in Mean LCC over MIMO-GAN-CRF

Method               | Mean LCC ↑ | Std. Dev. ↓
Multi-Scale Gaussian | 0.131      | 0.265
Diffusion Wavelets   | 0.088      | 0.222
Spectral Processing  | 0.078      | 0.253
Point Clustering     | 0.132      | 0.300
Salient Regions      | 0.215      | 0.245
Hilbert-CNN          | 0.113      | 0.267
RPCA                 | 0.199      | 0.251
CfS-CNN              | 0.226      | 0.243
MIMO-GAN-CRF         | 0.510      | 0.108
Ours (SemGeo-Attn)   | 0.762      | 0.093

Scanpath Generation

Our RL-based scanpath generator achieves NSS scores of 2.05 on SAL3D and 1.60 on NUS3D, with a MultiMatch score of 0.51, indicating that the generated temporal fixation sequences closely match human viewing behavior. The policy successfully balances saliency-driven attention with spatial exploration and inhibition of return.

NSS (SAL3D): 2.05   |   NSS (NUS3D): 1.60   |   MultiMatch: 0.51
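The inhibition-of-return dynamic can be sketched as a greedy selection loop over mesh vertices. This is an illustrative simplification of the RL policy, not the trained model: a fixated vertex is fully suppressed and its 1-ring neighbours are attenuated, which pushes later fixations toward unexplored regions.

```python
import numpy as np

def generate_scanpath(saliency, adjacency, n_fix=3, ior_decay=0.1):
    """Greedy scanpath over mesh vertices with inhibition of return.

    saliency:  per-vertex saliency values
    adjacency: dict mapping each vertex to its 1-ring mesh neighbours
    """
    s = np.asarray(saliency, dtype=float).copy()
    path = []
    for _ in range(n_fix):
        v = int(np.argmax(s))    # fixate the most salient remaining vertex
        path.append(v)
        s[v] = 0.0               # inhibit the fixated vertex itself
        for u in adjacency[v]:   # attenuate its mesh neighbours
            s[u] *= ior_decay
    return path
```

For example, on a 4-vertex path graph the second fixation skips the suppressed neighbourhood of the first:

```python
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
generate_scanpath([0.1, 0.9, 0.5, 0.2], adjacency)  # → [1, 3, 0]
```

Because the neighbourhoods come from mesh adjacency rather than a 2D grid, the suppression respects surface topology, which is the property the RL formulation is built around.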

BibTeX

@article{semgeoattn2025,
    author    = {Soham Pahari and Sandeep Chand Kumain},
    title     = {Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors},
    journal   = {arXiv preprint arXiv:XXXX.XXXXX},
    year      = {2025},
}