Neural Math Rock: Dual-Backbone Acoustic and Affective Analysis Pipeline
This repository hosts a deep learning framework for the acoustic, structural, and affective analysis of Math Rock and Midwest Emo music. It combines a raw-waveform speech transformer (WavLM) with a contrastive audio-language pre-trained encoder (CLAP) and feeds the fused representation to four classification heads that describe a track's emotion, vibe, intensity, and tempo.
Project Objectives
The framework targets the signature complexities of math rock, such as highly technical guitar tapping, syncopated drum patterns, non-standard time signatures, and sudden dynamic shifts:
- Multi-Task Classification: Simultaneous prediction of Emotion (28 multi-label classes), Aesthetic Vibe (4 classes), Dynamic Intensity (3 classes), and Tempo (3 classes).
- Acoustics-to-Affect Mapping: Native modeling of raw audio signals to capture the subtle interactions between complex musical textures and implicit vocal deliveries.
- Multimodal Integration Ready: Architecture optimized for late-fusion coupling with cross-lingual semantic text models (e.g., XLM-RoBERTa) for hybrid audio-lyrical downstream analysis.
Technical Architecture
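Rather than running a 2D CNN over log-mel spectrograms, the model fine-tunes two pretrained transformer backbones with no frozen layers and merges their outputs through per-backbone projection layers and a late-fusion network. WavLM consumes the raw waveform directly, while the CLAP audio encoder consumes log-mel features computed from the waveform by its feature extractor.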
                                 +------------------+
                                 |  Raw Audio Wave  |
                                 +--------+---------+
                                          |
                    +---------------------+---------------------+
                    |                                           |
                    v                                           v
           +-----------------+                         +-----------------+
           |   WavLM-Base    |                         |   CLAP Audio    |
           | (Speech/Vocal)  |                         | (Acoustic/Sfx)  |
           +--------+--------+                         +--------+--------+
                    |                                           |
                    v                                           v
           +-----------------+                         +-----------------+
           |  Linear + GELU  |                         |  Linear + GELU  |
           |  (768 -> 512)   |                         |  (768 -> 512)   |
           +--------+--------+                         +--------+--------+
                    |                                           |
                    +---------------------+---------------------+
                                          |
                                          v
                                 +-----------------+
                                 | Feature Concat  |
                                 |   (1024-dim)    |
                                 +--------+--------+
                                          |
                                          v
                                 +-----------------+
                                 | Linear 1024->512|
                                 | LayerNorm + Tanh|
                                 |  Dropout (0.3)  |
                                 +--------+--------+
                                          |
         +---------------------+----------+----------+---------------------+
         |                     |                     |                     |
         v                     v                     v                     v
+-----------------+   +-----------------+   +-----------------+   +-----------------+
|  Emotion Head   |   |    Vibe Head    |   | Intensity Head  |   |   Tempo Head    |
|  (28 Classes)   |   |   (4 Classes)   |   |   (3 Classes)   |   |   (3 Classes)   |
+-----------------+   +-----------------+   +-----------------+   +-----------------+
1. Acoustic Stream (Dual-Backbone Encoders)
- Speech & Vocal Intonation Backbone: `microsoft/wavlm-base` processes raw waveforms to capture temporal context, pitch contours, and vocal delivery dynamics.
- General Audio & Timbre Backbone: `laion/clap-htsat-unfused` (the unfused HTSAT audio encoder from CLAP) captures global acoustic textures, instrumental signatures, and overall arrangement timbre. Its input is the log-mel feature tensor produced by the CLAP feature extractor rather than the raw waveform itself.
- Downstream Embeddings Layer: Features from both backbones are mapped via independent linear layers (`wlm_proj`, `clp_proj`), each followed by a GELU activation, to a 512-dimensional subspace before concatenation.
2. Multi-Task Fusion & Head Layers
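- Fusion Network: The concatenated embeddings (1024-dim) pass through a `Linear(1024, 512)` $\rightarrow$ `LayerNorm` $\rightarrow$ `Tanh` $\rightarrow$ `Dropout(0.3)` pipeline. Normalization followed by a bounded activation keeps the fused features in a fixed range before they reach the heads.
- Classification Heads: The emotion head is a two-layer network (512 $\rightarrow$ 256 $\rightarrow$ 28, with GELU and dropout 0.2) that emits independent multi-label logits. The vibe, intensity, and tempo heads are single linear layers that emit multi-class logits.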
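The tensor shapes in this stream can be checked without downloading either backbone. The sketch below is illustrative only: it substitutes random 768-dimensional tensors for the pooled WavLM and CLAP features (an assumption made for the demo) and mirrors the layer sizes used by `AudioMathRockModel` in the usage section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch = 2

# Stand-ins for pooled backbone outputs (random values, for shape checking only).
wlm_feats = torch.randn(batch, 768)  # WavLM last_hidden_state averaged over time
clp_feats = torch.randn(batch, 768)  # CLAP audio pooler_output

# Independent projections, same sizes as wlm_proj / clp_proj in the model class.
wlm_proj = nn.Linear(768, 512)
clp_proj = nn.Linear(768, 512)

# Fusion block; eval() switches dropout off so the output is deterministic.
fusion = nn.Sequential(
    nn.Linear(1024, 512),
    nn.LayerNorm(512),
    nn.Tanh(),
    nn.Dropout(0.3),
).eval()

concat = torch.cat([F.gelu(wlm_proj(wlm_feats)), F.gelu(clp_proj(clp_feats))], dim=-1)
fused = fusion(concat)

print(concat.shape)  # (batch, 1024)
print(fused.shape)   # (batch, 512)
```

Because `Tanh` is the last non-linearity, every fused feature lies in the range [-1, 1] at inference time.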
Taxonomy Specifications
Affective & Structural Targets
- Emotion (28 Classes, Independent Multi-Label): `admiration`, `amusement`, `anger`, `annoyance`, `approval`, `caring`, `confusion`, `curiosity`, `desire`, `disappointment`, `disapproval`, `disgust`, `embarrassment`, `excitement`, `fear`, `gratitude`, `grief`, `joy`, `love`, `nervousness`, `optimism`, `pride`, `realization`, `relief`, `remorse`, `sadness`, `surprise`, `neutral`.
- Vibe (4 Classes): `aggressive`, `atmospheric`, `melancholic`, `technical`.
- Intensity Level (3 Classes): `low`, `medium`, `high`.
- Tempo Classification (3 Classes): `slow`, `moderate`, `fast`.
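A minimal sketch of how raw head outputs could be turned back into these label names. Two things here are assumptions rather than documented behavior: that each head's output index follows the order in which the labels are listed above, and that 0.5 is a sensible multi-label threshold. The helper names `decode_multilabel` and `decode_single` are hypothetical.

```python
import math

# Label lists copied from the taxonomy; index order is ASSUMED to match the heads.
EMOTIONS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]
VIBES = ["aggressive", "atmospheric", "melancholic", "technical"]
INTENSITIES = ["low", "medium", "high"]
TEMPOS = ["slow", "moderate", "fast"]


def decode_multilabel(logits, labels, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return [name for name, p in zip(labels, probs) if p >= threshold]


def decode_single(logits, labels):
    """Return the label with the largest logit (same winner as softmax argmax)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best]


# Example with hand-written logits for a single clip.
emotion_logits = [-3.0] * len(EMOTIONS)
emotion_logits[EMOTIONS.index("joy")] = 2.0
emotion_logits[EMOTIONS.index("sadness")] = 0.5
print(decode_multilabel(emotion_logits, EMOTIONS))
print(decode_single([0.1, 2.3, -1.0, 0.4], VIBES))
```

The emotion head is decoded label by label because several emotions can be active at once, while the other three heads each pick exactly one class.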
Hyperparameters & Training Configuration
The framework was trained for 20 epochs with the following multi-task configuration:
- Optimization: `AdamW` with discriminative learning rates ($3\times10^{-5}$ for the base transformers, $1\times10^{-4}$ for the projection blocks and multi-task heads) and a weight decay coefficient of $0.01$.
- Loss Framework: Binary Cross-Entropy (BCE) focal loss ($\gamma = 2.0$, label smoothing = $0.1$) applied independently to each of the 28 emotion labels to counter heavy label imbalance, combined with weighted Cross-Entropy for the categorical tasks (vibe, intensity, tempo).
- Regularization: Gradient norm clipping at a maximum of $1.0$; mixed-precision training enabled via `torch.amp`.
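The configuration above could be wired up roughly as follows. This is a sketch under stated assumptions, not the repository's training code: the focal-loss formulation and the way label smoothing is applied to binary targets are one common choice, the helper names `focal_bce_with_logits` and `build_optimizer` are hypothetical, and two small linear layers stand in for the real backbones and heads so the snippet runs on CPU. Mixed precision is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def focal_bce_with_logits(logits, targets, gamma=2.0, smoothing=0.1):
    """Binary focal loss on logits with smoothed targets (assumed formulation)."""
    # Smooth binary targets: 1 -> 1 - smoothing/2, 0 -> smoothing/2.
    smooth = targets * (1.0 - smoothing) + 0.5 * smoothing
    bce = F.binary_cross_entropy_with_logits(logits, smooth, reduction="none")
    # p_t is the probability assigned to the true (unsmoothed) label.
    probs = torch.sigmoid(logits)
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    # Down-weight labels that are already classified confidently.
    return ((1.0 - p_t) ** gamma * bce).mean()


def build_optimizer(backbones, heads, backbone_lr=3e-5, head_lr=1e-4, weight_decay=0.01):
    """AdamW with one parameter group per learning rate."""
    groups = [
        {"params": [p for m in backbones for p in m.parameters()], "lr": backbone_lr},
        {"params": [p for m in heads for p in m.parameters()], "lr": head_lr},
    ]
    return torch.optim.AdamW(groups, weight_decay=weight_decay)


torch.manual_seed(0)
backbone = nn.Linear(8, 8)  # hypothetical stand-in for WavLM / CLAP
head = nn.Linear(8, 28)     # hypothetical stand-in for the emotion head
optimizer = build_optimizer([backbone], [head])

x = torch.randn(4, 8)
y = torch.randint(0, 2, (4, 28)).float()

optimizer.zero_grad()
loss = focal_bce_with_logits(head(backbone(x)), y)
loss.backward()
# Clip the global gradient norm to 1.0 before the update.
params = list(backbone.parameters()) + list(head.parameters())
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
```

With $\gamma = 2.0$, a label predicted with probability 0.99 contributes roughly $10^{-4}$ times its plain BCE value, so rare, hard labels dominate the gradient.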
How to Use
Because this model uses a custom dual-transformer architecture, the model class must be declared locally before the checkpoint's state dictionary can be loaded.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import WavLMModel, ClapAudioModel


class AudioMathRockModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Initialize pretrained dual backbone transformers
        self.wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base")
        self.clap = ClapAudioModel.from_pretrained("laion/clap-htsat-unfused")

        # Downstream embedding projection layers
        self.wlm_proj = nn.Linear(768, 512)
        self.clp_proj = nn.Linear(768, 512)

        # Non-linear fusion pipeline
        self.fusion = nn.Sequential(
            nn.Linear(1024, 512),
            nn.LayerNorm(512),
            nn.Tanh(),
            nn.Dropout(0.3),
        )

        # Multi-task classification networks
        self.emo_head = nn.Sequential(
            nn.Linear(512, 256), nn.GELU(), nn.Dropout(0.2),
            nn.Linear(256, 28),  # 28 independent multi-label emotions
        )
        self.vibe_head = nn.Linear(512, 4)  # 4 acoustic vibes
        self.int_head = nn.Linear(512, 3)   # 3 intensity classes
        self.tmp_head = nn.Linear(512, 3)   # 3 tempo classes

    def forward(self, wavlm_values: torch.Tensor, clap_values: torch.Tensor) -> tuple:
        # Extract temporal mean features from WavLM and pooled features from CLAP
        wlm_feats = self.wavlm(wavlm_values).last_hidden_state.mean(dim=1)
        clp_feats = self.clap(clap_values).pooler_output

        # Project features into balanced dimension space
        wlm_p = F.gelu(self.wlm_proj(wlm_feats))
        clp_p = F.gelu(self.clp_proj(clp_feats))

        # Perform feature concatenation and multi-head prediction
        fused = self.fusion(torch.cat([wlm_p, clp_p], dim=-1))
        return self.emo_head(fused), self.vibe_head(fused), self.int_head(fused), self.tmp_head(fused)


# Initialize and load model checkpoints
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AudioMathRockModel().to(device)
checkpoint = torch.load("model.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
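Once the checkpoint is loaded, the four logit tensors still need to be converted into predictions. The helper below is one possible way to do that, not part of the released code: `predict` and its 0.5 threshold are assumptions, and `StandInModel` is a hypothetical placeholder that only mimics the output contract of `AudioMathRockModel` so the snippet runs without downloading weights.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def predict(model, wavlm_values, clap_values, threshold=0.5):
    """Run one forward pass and convert the four logit tensors to predictions."""
    model.eval()
    emo, vibe, inten, tempo = model(wavlm_values, clap_values)
    return {
        "emotion": torch.sigmoid(emo) >= threshold,  # (batch, 28) boolean mask
        "vibe": vibe.argmax(dim=-1),                 # (batch,) class indices
        "intensity": inten.argmax(dim=-1),
        "tempo": tempo.argmax(dim=-1),
    }


class StandInModel(nn.Module):
    """Hypothetical placeholder returning logits shaped like the real model's."""

    def __init__(self) -> None:
        super().__init__()
        self.emo = nn.Linear(16, 28)
        self.vibe = nn.Linear(16, 4)
        self.inten = nn.Linear(16, 3)
        self.tempo = nn.Linear(16, 3)

    def forward(self, wavlm_values, clap_values):
        x = wavlm_values + clap_values  # toy mixing of the two inputs
        return self.emo(x), self.vibe(x), self.inten(x), self.tempo(x)


torch.manual_seed(0)
out = predict(StandInModel(), torch.randn(2, 16), torch.randn(2, 16))
print({name: tuple(t.shape) for name, t in out.items()})
```

With the real model, `wavlm_values` would be a batch of 16 kHz waveforms prepared by the WavLM feature extractor, and `clap_values` the log-mel features returned by the CLAP feature extractor at the sampling rate its configuration specifies.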