arXiv:2209.04093

Learning Audio-Visual embedding for Person Verification in the Wild

Published on Oct 26, 2022

AI-generated summary

A novel audio-visual person verification approach combines weight-enhanced attentive statistics pooling with cycle consistency and gated attention mechanisms to achieve state-of-the-art performance on the VoxCeleb datasets.

Abstract

It has already been observed that audio-visual embeddings are more robust than uni-modal embeddings for person verification. Here, we propose a novel audio-visual strategy that treats aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling to face verification for the first time. Second, observing a strong correlation between the modalities during pooling, we propose joint attentive pooling, which uses cycle consistency to learn implicit inter-frame weights. Finally, the modalities are fused with a gated attention mechanism to obtain a robust audio-visual embedding. All proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.18%, 0.27%, and 0.49% EER on the three official VoxCeleb1 trial lists, which are, to our knowledge, the best published results for person verification.
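
The abstract does not spell out the weight-enhanced formulation, so the sketch below shows only the base technique it builds on: attentive statistics pooling (Okabe et al., 2018), where frame-level features are weighted by learned attention scores and the weighted mean and standard deviation are concatenated into an utterance- or track-level embedding. All names here (AttentiveStatisticsPooling, bottleneck_dim) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn


class AttentiveStatisticsPooling(nn.Module):
    """Baseline attentive statistics pooling: per-frame attention
    weights, then weighted mean and standard deviation."""

    def __init__(self, feat_dim: int, bottleneck_dim: int = 128):
        super().__init__()
        # Small bottleneck network that scores each frame per channel.
        self.attention = nn.Sequential(
            nn.Conv1d(feat_dim, bottleneck_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck_dim, feat_dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, num_frames)
        alpha = torch.softmax(self.attention(x), dim=2)  # weights over frames
        mean = torch.sum(alpha * x, dim=2)
        var = torch.sum(alpha * x * x, dim=2) - mean * mean
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=1)  # (batch, 2 * feat_dim)
```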
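Likewise, here is a minimal sketch of gated fusion of the two modality embeddings, assuming the gate is a sigmoid-activated projection of the concatenated audio and visual embeddings that interpolates between them; the paper's exact gating mechanism may differ, and GatedFusion is a hypothetical name.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Hypothetical gated attention fusion: a sigmoid gate computed
    from both modalities mixes the audio and visual embeddings."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.Sigmoid(),
        )

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, embed_dim)
        g = self.gate(torch.cat([audio, visual], dim=-1))
        return g * audio + (1.0 - g) * visual
```

A gate in (0, 1) lets the network lean on the cleaner modality per sample, e.g. favoring the face embedding when the audio is noisy.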
