Neural Math Rock: Dual-Backbone Acoustic and Affective Analysis Pipeline

This repository hosts a deep learning framework for the acoustic, structural, and affective analysis of Math Rock and Midwest Emo music. It combines a raw-waveform transformer (WavLM) with a contrastive audio-language pre-trained encoder (CLAP) to extract representations for decoding intricate musical arrangements and complex emotional states.

Project Objectives

The framework is engineered to handle the signature complexities of math rock, such as highly technical guitar tapping, syncopated drum patterns, non-standard time signatures, and sudden dynamic shifts. Its objectives are:

Multi-Task Classification: Simultaneous prediction of Emotion (28 multi-label classes), Aesthetic Vibe (4 classes), Dynamic Intensity (3 classes), and Tempo (3 classes).
Acoustics-to-Affect Mapping: Native modeling of raw audio signals to capture the subtle interactions between complex musical textures and implicit vocal deliveries.
Multimodal Integration Ready: Architecture optimized for late-fusion coupling with cross-lingual semantic text models (e.g., XLM-RoBERTa) for hybrid audio-lyrical downstream analysis.

Technical Architecture

The model replaces legacy 2D-CNN log-mel spectrogram classifiers with a fully unfrozen dual-transformer backbone. Starting from the raw audio waveform, each backbone output passes through its own projection layer before a late-fusion network combines them and feeds the task heads.

                      +------------------+
                      |  Raw Audio Wave  |
                      +--------+---------+
                               |
         +---------------------+---------------------+
         |                                           |
         v                                           v
+-----------------+                         +-----------------+
|   WavLM-Base    |                         |    CLAP Audio   |
| (Speech/Vocal)  |                         |  (Acoustic/Sfx) |
+--------+--------+                         +--------+--------+
         |                                           |
         v                                           v
+-----------------+                         +-----------------+
| Linear Project  |                         | Linear Project  |
|   (768 -> 512)  |                         |   (768 -> 512)  |
+--------+--------+                         +--------+--------+
         |                                           |
         +---------------------+---------------------+
                               |
                               v
                      +-----------------+
                      |  Feature Concat |
                      |    (1024-dim)   |
                      +--------+--------+
                               |
                               v
                      +-----------------+
                      | Linear 1024->512|
                      | LayerNorm + Tanh|
                      +--------+--------+
                               |
         +---------------------+---------------------+---------------------+
         |                     |                     |                     |
         v                     v                     v                     v
+-----------------+   +-----------------+   +-----------------+   +-----------------+
|  Emotion Head   |   |    Vibe Head    |   |  Intensity Head |   |   Tempo Head    |
|  (28 Classes)   |   |   (4 Classes)   |   |   (3 Classes)   |   |   (3 Classes)   |
+-----------------+   +-----------------+   +-----------------+   +-----------------+

1. Acoustic Stream (Dual-Backbone Encoders)

Speech & Vocal Intonation Backbone: microsoft/wavlm-base processes raw waveforms to capture structural temporal context, pitch contours, and vocal delivery dynamics.
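General Audio & Timbre Backbone: laion/clap-htsat-unfused (the HTSAT audio encoder without feature fusion) captures global acoustic textures, instrumental signatures, and overall arrangement timbre. In the Hugging Face implementation, the CLAP feature extractor converts the waveform to log-mel input features before they reach this encoder.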
Downstream Embeddings Layer: The 768-dimensional features from each backbone are mapped by independent linear layers (wlm_proj, clp_proj), each followed by a GELU activation, to a 512-dimensional space before concatenation.

2. Multi-Task Fusion & Head Layers

Fusion Network: The concatenated embeddings (1024-dim) pass through a Linear(1024 $\rightarrow$ 512) $\rightarrow$ LayerNorm $\rightarrow$ Tanh $\rightarrow$ Dropout(0.3) pipeline, which bounds the fused features and keeps intermediate activations from exploding.
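Classification Heads: Fully connected networks map the fused 512-dimensional representation to each target. The emotion head is a two-layer MLP (512 $\rightarrow$ 256 $\rightarrow$ 28, with GELU and Dropout(0.2)) producing independent multi-label logits; the vibe, intensity, and tempo heads are single linear layers producing multi-class logits.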

Taxonomy Specifications

Affective & Structural Targets

Emotion (28 Classes, Independent Multi-Label): admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, neutral.
Vibe (4 Classes): aggressive, atmospheric, melancholic, technical.
Intensity Level (3 Classes): low, medium, high.
Tempo Classification (3 Classes): slow, moderate, fast.
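
The taxonomy above can be turned into a small decoding helper. The following is a minimal sketch, not the author's code: it assumes that class indices follow the order in which the labels are listed here, that emotions are decoded with a sigmoid and a 0.5 threshold, and that the other three targets are decoded by arg-max. The names EMOTIONS, VIBES, INTENSITIES, TEMPOS, and decode_predictions are hypothetical.

```python
import torch

# Label order is ASSUMED to follow the taxonomy listing; confirm it against
# the label maps used during training before trusting the decoded names.
EMOTIONS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]
VIBES = ["aggressive", "atmospheric", "melancholic", "technical"]
INTENSITIES = ["low", "medium", "high"]
TEMPOS = ["slow", "moderate", "fast"]


def decode_predictions(emo_logits, vibe_logits, int_logits, tmp_logits, threshold=0.5):
    """Turn the four logit tensors for ONE clip (batch size 1) into label names."""
    # Multi-label head: independent sigmoid per emotion, keep those at or above
    # the threshold (0.5 is an assumed default, not a tuned value).
    emo_probs = torch.sigmoid(emo_logits).reshape(-1)
    active = torch.nonzero(emo_probs >= threshold).flatten().tolist()
    return {
        "emotion": [EMOTIONS[i] for i in active],
        # Single-label heads: the arg-max class wins.
        "vibe": VIBES[vibe_logits.argmax(dim=-1).item()],
        "intensity": INTENSITIES[int_logits.argmax(dim=-1).item()],
        "tempo": TEMPOS[tmp_logits.argmax(dim=-1).item()],
    }
```

Because the emotion targets are independent, the decoded emotion list may be empty or contain several labels at once; a per-class threshold tuned on validation data may work better than a single global value.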

Hyperparameters & Training Configuration

The framework was trained for 20 epochs as a single multi-task model with the following configuration:

Optimization: AdamW with discriminative learning rates ($3\times10^{-5}$ for base transformers, $1\times10^{-4}$ for projection blocks and multi-task heads) and a weight decay coefficient of $0.01$.
Loss Framework: Independent Binary Cross-Entropy (BCE) Focal Loss ($\gamma = 2.0$, label smoothing = $0.1$) applied to the multi-label emotion head to counter its heavy label imbalance, coupled with weighted Cross-Entropy for the categorical heads (vibe, intensity, tempo).
Regularization: Gradient norm clipping capped at a maximum threshold of $1.0$; mixed-precision training enabled via torch.amp.
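
The configuration above can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the training script used for this checkpoint: the label-smoothing scheme (pulling targets toward 0.5), the equal weighting of the four task losses, the batch dictionary keys, and the helper names focal_bce_with_logits, build_optimizer, and train_step are all hypothetical. The backbone prefixes "wavlm." and "clap." match the attribute names of the model class in the How to Use section. Mixed precision is left out to keep the sketch short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def focal_bce_with_logits(logits, targets, gamma=2.0, smoothing=0.1):
    """Binary focal loss with label smoothing for independent multi-label targets."""
    # ASSUMED smoothing scheme: pull hard 0/1 targets toward 0.5.
    smooth = targets * (1.0 - smoothing) + 0.5 * smoothing
    bce = F.binary_cross_entropy_with_logits(logits, smooth, reduction="none")
    # p_t is the probability the model assigns to the true (unsmoothed) label.
    probs = torch.sigmoid(logits)
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    # Down-weight easy examples by (1 - p_t)^gamma.
    return ((1.0 - p_t) ** gamma * bce).mean()


def build_optimizer(model, backbone_prefixes=("wavlm.", "clap.")):
    """AdamW with 3e-5 for backbone parameters and 1e-4 for everything else."""
    backbone, heads = [], []
    for name, param in model.named_parameters():
        (backbone if name.startswith(backbone_prefixes) else heads).append(param)
    return torch.optim.AdamW(
        [{"params": backbone, "lr": 3e-5}, {"params": heads, "lr": 1e-4}],
        weight_decay=0.01,
    )


def train_step(model, optimizer, batch, class_weights=None, max_norm=1.0):
    """One full-precision step; `batch` is a hypothetical dict of input and label tensors."""
    class_weights = class_weights or {}
    emo, vibe, inten, tempo = model(batch["wavlm_values"], batch["clap_values"])
    # ASSUMPTION: the four task losses are summed with equal weight.
    loss = (
        focal_bce_with_logits(emo, batch["emotion"])
        + F.cross_entropy(vibe, batch["vibe"], weight=class_weights.get("vibe"))
        + F.cross_entropy(inten, batch["intensity"], weight=class_weights.get("intensity"))
        + F.cross_entropy(tempo, batch["tempo"], weight=class_weights.get("tempo"))
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Cap the global gradient norm at max_norm (1.0 in the configuration above).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item()
```

With gamma set to 0 and smoothing set to 0, focal_bce_with_logits reduces to plain BCE-with-logits, which is a quick sanity check when adapting it.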

How to Use

Because this model uses a custom dual-transformer architecture, the model class must be declared locally before the state-dict checkpoint is loaded.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import WavLMModel, ClapAudioModel

class AudioMathRockModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Initialize pretrained dual backbone transformers
        self.wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base")
        self.clap  = ClapAudioModel.from_pretrained("laion/clap-htsat-unfused")

        # Downstream embedding projection layers
        self.wlm_proj = nn.Linear(768, 512)
        self.clp_proj = nn.Linear(768, 512)

        # Non-linear fusion pipeline
        self.fusion = nn.Sequential(
            nn.Linear(1024, 512),
            nn.LayerNorm(512),
            nn.Tanh(),
            nn.Dropout(0.3),
        )

        # Multi-task classification networks
        self.emo_head = nn.Sequential(
            nn.Linear(512, 256), nn.GELU(), nn.Dropout(0.2),
            nn.Linear(256, 28), # 28 independent multi-label emotions
        )
        self.vibe_head = nn.Linear(512, 4)  # 4 acoustic vibes
        self.int_head  = nn.Linear(512, 3)  # 3 intensity classes
        self.tmp_head  = nn.Linear(512, 3)  # 3 tempo classes

    def forward(self, wavlm_values: torch.Tensor, clap_values: torch.Tensor) -> tuple:
        # Extract temporal mean features from WavLM and pooled features from CLAP
        wlm_feats = self.wavlm(wavlm_values).last_hidden_state.mean(dim=1)
        clp_feats = self.clap(clap_values).pooler_output

        # Project features into balanced dimension space
        wlm_p = F.gelu(self.wlm_proj(wlm_feats))
        clp_p = F.gelu(self.clp_proj(clp_feats))

        # Perform feature concatenation and multi-head prediction
        fused = self.fusion(torch.cat([wlm_p, clp_p], dim=-1))
        return self.emo_head(fused), self.vibe_head(fused), self.int_head(fused), self.tmp_head(fused)

# Initialize and load model checkpoints
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AudioMathRockModel().to(device)

checkpoint = torch.load("model.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
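
The forward pass expects two pre-processed tensors, wavlm_values and clap_values, which the snippet above does not construct. One plausible way to build them is with the stock Hugging Face feature extractors for each backbone; the sketch below assumes that is how the checkpoint was trained, which this card does not confirm. The helper names load_extractors and prepare_inputs are hypothetical, and the caller is assumed to supply the same mono clip resampled to 16 kHz (for WavLM) and 48 kHz (the CLAP extractor's default rate).

```python
import numpy as np
import torch

WAVLM_SR = 16_000  # WavLM checkpoints expect 16 kHz audio
CLAP_SR = 48_000   # default sampling rate of the CLAP feature extractor


def load_extractors():
    """Load the stock preprocessors for both backbones (fetches configs on first use)."""
    from transformers import AutoFeatureExtractor

    wavlm_fe = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")
    clap_fe = AutoFeatureExtractor.from_pretrained("laion/clap-htsat-unfused")
    return wavlm_fe, clap_fe


def prepare_inputs(wave_16k, wave_48k, wavlm_fe, clap_fe):
    """Build (wavlm_values, clap_values) from one mono clip given at both sampling rates.

    ASSUMPTION: training used these same extractors with default settings.
    """
    wavlm_values = wavlm_fe(
        wave_16k, sampling_rate=WAVLM_SR, return_tensors="pt"
    )["input_values"]
    clap_values = clap_fe(
        wave_48k, sampling_rate=CLAP_SR, return_tensors="pt"
    )["input_features"]
    return wavlm_values, clap_values


# Example usage (requires network access and the `model` / `device` defined above):
# wavlm_fe, clap_fe = load_extractors()
# wavlm_values, clap_values = prepare_inputs(wave_16k, wave_48k, wavlm_fe, clap_fe)
# with torch.no_grad():
#     emo, vibe, inten, tempo = model(wavlm_values.to(device), clap_values.to(device))
```

The four returned tensors are raw logits: apply a sigmoid to the emotion logits and a softmax or arg-max to the other three.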

