face_multitask_v2

A single multi-task convolutional model for facial behavior analysis, used by py-feat's Detectorv2. From one face crop it jointly predicts action units, categorical emotion, valence/arousal, eye gaze, a 478-point face mesh, 6-DoF head pose, and 52 MediaPipe/ARKit blendshapes.

Backbone: ConvNeXt-V2 Tiny (FCMAE + IN-22k/IN-1k pretrained)
Heads: AU graph (AFG/FGG/SC) + unified-feature emotion/V-A and gaze heads + landmark, pose, and blendshape regression heads
Params: ~30M · Input: 224×224 RGB (from a 256×256 face crop)
File: face_multitask_v2.safetensors (safetensors; ModelV2Config JSON in the file metadata)

Outputs

Task	Output	Notes
Action Units	20 probabilities [0,1]	AU01,02,04,05,06,07,09,10,11,12,14,15,17,20,23,24,25,26,28,43
Emotion	7-class softmax	Neutral, Happy, Sad, Surprise, Fear, Disgust, Anger
Valence / Arousal	2 × [−1,1]	tanh
Gaze	(yaw, pitch) radians	head-centric; yaw+ = right, pitch+ = up
Face mesh	478 × (x,y,z)	MediaPipe topology, chip-pixel coords (z = relative depth)
Head pose	(yaw, pitch, roll, tx, ty, tz)	radians / pixels
68 landmarks	derived	dlib-68 subset sampled from the 478 mesh
Blendshapes	52 coefficients [0,1]	MediaPipe/ARKit standard names (browInnerUp, jawOpen, mouthSmileLeft, …)

Benchmarks (held-out, file-verified — v2.5 deployed checkpoint)

Task	Dataset	Metric	Score
AU	DISFA+ (12-AU, Cheong protocol)	macro-F1	0.693
AU	DISFA+ (8-AU subset)	macro-F1	0.740
Emotion	RAF-DB official test (7-cls)	acc / macro-F1	0.910 / 0.885
Emotion	AffectNet val (7-cls, drop Contempt)	acc / macro-F1	0.616 / 0.612
Valence/Arousal	Aff-Wild2 official validation	CCC (V / A)	0.852 / 0.799
Gaze	MPIIGaze (leave-subject-out)	mean angular err	7.05°
Gaze	Gaze360 (held-out split)	mean angular err	12.89°

Notes: Gaze numbers are now leave-subject-out held-out (honest generalization); Numbers are from the deployed checkpoint (v25c_release_ep14), weight-verified against the published .safetensors.

Usage

from feat import Detectorv2
detector = Detectorv2(device="cuda")
fex = detector.detect("image.jpg")   # returns a py-feat Fex

The model expects a face crop produced by RetinaFace + py-feat's extract_face_from_bbox_torch(frame, bbox, face_size=256, expand_bbox=1.2), then center-cropped to 224 and ImageNet-normalized. Detectorv2 handles this.

License

Research / non-commercial use only. Trained on datasets (AffectNet, DISFA+, RAF-DB, Aff-Wild2, BP4D, etc.) whose licenses restrict use to academic research. The ConvNeXt-V2 backbone is MIT-licensed. Confirm each constituent dataset's terms before any non-research use.

Downloads last month: -; Downloads are not tracked for this model. How to track