Introduction

This repository provides the weight files required for computing sample-level SQSD scores based on Qwen3-8B, as used in the paper "From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning".

Two types of weights are needed to compute SQSD:

  • Parameter shift direction weights (Direction): Encode safety-relevant directions in the model's parameter space, used to measure how individual fine-tuning samples affect model safety.
  • Model initialization weights (initial-state): Serve as the starting point for SQSD computation. Note: These weights are only required when computing Danger-Projection. For details, please refer to Section 4.3 Parameter Initialization of the paper.

Links

Directory Structure

./
├── Direction/               # Parameter shift direction weights
│   ├── Ageis_Danger/        # Danger direction weights
│   ├── Beaver-Danger/       # Danger direction weights
│   └── PKURLHF-10K_Safety/  # Safety direction weights
├── initial-state/           # Model initialization weights
│   └── dolly_ckpt_5850/     # Initial weights (initialized via Danger-Projection)
└── README.md
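
After downloading the repository, it can help to confirm that the layout matches the tree above before running anything. The following is a minimal sketch using only the Python standard library. The helper name `missing_weight_dirs` and its `need_initial_state` flag are hypothetical and not part of this repository; the folder names are copied from the tree above.

```python
from pathlib import Path

# Folder names copied from the directory tree above.
EXPECTED_DIRS = [
    "Direction/Ageis_Danger",
    "Direction/Beaver-Danger",
    "Direction/PKURLHF-10K_Safety",
    "initial-state/dolly_ckpt_5850",
]


def missing_weight_dirs(repo_root, need_initial_state=True):
    """Return the expected sub-folders that are absent under repo_root.

    Hypothetical helper: set need_initial_state=False when the
    initial-state weights are not needed (they are only required
    for Danger-Projection).
    """
    root = Path(repo_root)
    expected = [
        d for d in EXPECTED_DIRS
        if need_initial_state or not d.startswith("initial-state/")
    ]
    return [d for d in expected if not (root / d).is_dir()]
```

Calling `missing_weight_dirs(".")` from the repository root returns an empty list when every folder is present.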

Direction Folder

The Direction folder contains three sets of direction weights, each extracted from a different dataset, encoding either a safety or danger direction in parameter space:

Name                 Type     Description
Ageis_Danger         Danger   Danger direction weights extracted from the Aegis dataset
Beaver-Danger        Danger   Danger direction weights extracted from the BeaverTails dataset
PKURLHF-10K_Safety   Safety   Safety direction weights extracted from the PKU-RLHF dataset

These direction weights encode safety-relevant parameter shift directions and are a core dependency for computing SQSD scores.

initial-state Folder

The weights in initial-state (dolly_ckpt_5850) represent the model initialization state derived via the Danger-Projection method: the parameter point obtained by projecting the base model weights along the danger direction. This state serves as the reference starting point for subsequent SQSD computation.

⚠️ The paper defines two initialization strategies depending on the projection direction (see Section 4.3 Parameter Initialization):

  • Danger direction (drift-enhanced sensitivity): θ_initial = θ_t, initialized from a fine-tuning checkpoint that exhibits high directional sensitivity. The weights provided here (dolly_ckpt_5850) serve this purpose.
  • Safety direction (linear-path sensitivity): θ_initial = θ_0 + α*V_safety, initialized by interpolating from the base model along the safety direction vector. No additional checkpoint is required; only the base model weights and the safety direction weights from the Direction folder are needed.
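
The safety-direction initialization θ_initial = θ_0 + α*V_safety can be illustrated with a small sketch. It assumes the base weights and the direction weights are both available as mappings from parameter name to a flat list of floats; the function name `safety_initial_state` and this data layout are assumptions made for illustration, not the repository's actual loading code or file format. In practice the same arithmetic would be applied to the model's tensors.

```python
def safety_initial_state(theta_0, v_safety, alpha):
    """Return theta_0 + alpha * v_safety, parameter by parameter.

    theta_0 and v_safety are assumed to map parameter names to flat
    lists of floats with matching names and lengths.
    """
    if theta_0.keys() != v_safety.keys():
        raise ValueError("base weights and direction must share parameter names")
    theta_initial = {}
    for name, base in theta_0.items():
        direction = v_safety[name]
        if len(base) != len(direction):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Move each weight alpha units along the safety direction.
        theta_initial[name] = [w + alpha * d for w, d in zip(base, direction)]
    return theta_initial


# Toy example with one two-element parameter.
base = {"layer.weight": [1.0, 2.0]}
direction = {"layer.weight": [0.5, -1.0]}
print(safety_initial_state(base, direction, alpha=0.5))
# {'layer.weight': [1.25, 1.5]}
```

The value of α is not specified here; see Section 4.3 of the paper for how it is chosen.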