Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or on learning-based approaches that lack semantic awareness, and so cannot explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, pairing point cloud transformers for geometric processing with diffusion-based semantic priors obtained from geometry-conditioned multi-view rendering. A cross-attention mechanism lets geometric features query semantic content, so bottom-up distinctiveness guides top-down retrieval. We further extend the framework to temporal scanpath generation through reinforcement learning, introducing the first formulation that respects 3D mesh topology while modeling inhibition-of-return dynamics. Evaluation on the SAL3D, NUS3D, and 3DVA datasets demonstrates substantial improvements over prior methods, showing that cognitively motivated architectures can effectively model human visual attention on three-dimensional surfaces.
Our dual-stream architecture combines bottom-up geometric processing via point cloud transformers with top-down semantic recognition through diffusion-based priors. The cross-attention mechanism allows geometric features to query semantic content, enabling precise prediction of human visual attention on 3D surfaces.
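The asymmetric fusion direction can be sketched in a few lines: geometric tokens act as queries and attend over semantic tokens from the rendered views. This is a minimal single-head illustration with toy shapes and randomly initialized projections, not the paper's actual layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometry_queries_semantics(geo_feats, sem_feats, d_k=64, seed=0):
    """Single-head cross-attention where geometric tokens (queries)
    attend over semantic tokens (keys/values)."""
    rng = np.random.default_rng(seed)
    d_geo, d_sem = geo_feats.shape[1], sem_feats.shape[1]
    # Random projections stand in for learned weight matrices.
    W_q = rng.standard_normal((d_geo, d_k)) / np.sqrt(d_geo)
    W_k = rng.standard_normal((d_sem, d_k)) / np.sqrt(d_sem)
    W_v = rng.standard_normal((d_sem, d_k)) / np.sqrt(d_sem)
    Q, K, V = geo_feats @ W_q, sem_feats @ W_k, sem_feats @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_points, n_views)
    return attn @ V                         # semantic content per point

# Toy shapes: 1024 surface points, 8 rendered-view semantic tokens.
fused = geometry_queries_semantics(
    np.random.randn(1024, 32), np.random.randn(8, 128))
print(fused.shape)  # (1024, 64)
```

Because the geometric stream supplies the queries, each surface point retrieves only the semantic evidence relevant to its local geometry; the semantic stream never overwrites geometry directly.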
Figure: Comparison of ground-truth saliency maps with our predictions on the Cat, Gorilla, Horse, and Octopus models.
Figure: Comparison of saliency predictions across different methods.
28% relative improvement in CC over the previous state-of-the-art (SAL3D model)
| Method | CC ↑ | KL-Div ↓ | MSE ↓ |
|---|---|---|---|
| Song et al. | 0.1249 | 0.7034 | 0.3220 |
| Nousias et al. | 0.0570 | 1.9618 | 0.0759 |
| SAL3D model | 0.6616 | 0.3051 | 0.0204 |
| Mesh Mamba | 0.6140 | 0.3067 | - |
| Ours (SemGeo-Attn) | 0.8492 | 0.1638 | 0.0114 |
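For reference, the two main metrics in the table above follow their standard definitions: CC is the Pearson correlation between predicted and ground-truth saliency maps, and KL-Div treats both maps as probability distributions. The sketch below uses those textbook formulas; it is not the exact evaluation code behind these numbers.

```python
import numpy as np

def cc(pred, gt):
    """Pearson correlation coefficient between two saliency maps."""
    p, g = pred.ravel(), gt.ravel()
    p = (p - p.mean()) / (p.std() + 1e-8)
    g = (g - g.mean()) / (g.std() + 1e-8)
    return float((p * g).mean())

def kl_div(pred, gt, eps=1e-8):
    """KL(gt || pred) with both maps normalized to distributions."""
    p = pred.ravel() / (pred.sum() + eps)
    g = gt.ravel() / (gt.sum() + eps)
    return float((g * np.log(g / (p + eps) + eps)).sum())

gt = np.random.rand(100)
assert abs(cc(gt, gt) - 1.0) < 1e-6  # identical maps give CC of 1
assert kl_div(gt, gt) < 1e-3         # and near-zero KL divergence
```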
129% relative improvement in LCC over MIMO-GAN
| Method | LCC ↑ | AUC ↑ |
|---|---|---|
| DSM | 0.222 | 0.726 |
| MIMO-GAN-A1 | 0.290 | 0.781 |
| MIMO-GAN-A2 | 0.057 | 0.584 |
| MIMO-GAN-A3 | 0.259 | 0.753 |
| MIMO-GAN | 0.267 | 0.761 |
| Ours (SemGeo-Attn) | 0.609 | 0.935 |
49% relative improvement in Mean LCC over MIMO-GAN-CRF
| Method | Mean LCC ↑ | Std. Dev. ↓ |
|---|---|---|
| Multi-Scale Gaussian | 0.131 | 0.265 |
| Diffusion Wavelets | 0.088 | 0.222 |
| Spectral Processing | 0.078 | 0.253 |
| Point Clustering | 0.132 | 0.300 |
| Salient Regions | 0.215 | 0.245 |
| Hilbert-CNN | 0.113 | 0.267 |
| RPCA | 0.199 | 0.251 |
| CfS-CNN | 0.226 | 0.243 |
| MIMO-GAN-CRF | 0.510 | 0.108 |
| Ours (SemGeo-Attn) | 0.762 | 0.093 |
Our RL-based scanpath generator achieves NSS of 2.05 and MultiMatch score of 0.51 on NUS3D, indicating that temporal fixation sequences closely match human viewing behavior. The policy successfully balances saliency-driven attention with spatial exploration and inhibition of return.
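The balance between saliency-driven selection and inhibition of return can be illustrated with a simple stochastic sampler: fixations are drawn in proportion to per-vertex saliency, and saliency near each visited vertex is damped so the path explores rather than revisits. This is only an illustrative sketch; the paper's generator is a learned RL policy, and plain Euclidean distance here stands in for the mesh-topology-aware distances it uses.

```python
import numpy as np

def generate_scanpath(vertices, saliency, n_fix=5, radius=0.2,
                      decay=0.5, seed=0):
    """Sample a fixation sequence from per-vertex saliency with
    inhibition of return: after each fixation, saliency within
    `radius` of the chosen vertex is damped by `decay`."""
    rng = np.random.default_rng(seed)
    s = saliency.astype(float).copy()
    path = []
    for _ in range(n_fix):
        p = np.clip(s, 0, None)
        p = p / p.sum()
        idx = rng.choice(len(s), p=p)  # saliency-driven pick
        path.append(int(idx))
        d = np.linalg.norm(vertices - vertices[idx], axis=1)
        s[d < radius] *= decay         # inhibition of return
    return path

verts = np.random.rand(500, 3)
sal = np.random.rand(500)
path = generate_scanpath(verts, sal)
print(path)  # five vertex indices; nearby revisits are suppressed
```

Raising `decay` toward 1 weakens inhibition of return (more revisiting), while a larger `radius` forces longer saccades between fixations.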
@article{semgeoattn2025,
  author  = {Author One and Author Two and Author Three and Author Four},
  title   = {Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025},
}