Introduction
This repository provides the weight files required for computing sample-level SQSD scores based on Qwen3-8B, as used in the paper "From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning".
Two types of weights are needed to compute SQSD:
- Parameter shift direction weights (Direction): Encode safety-relevant directions in the model's parameter space, used to measure how individual fine-tuning samples affect model safety.
- Model initialization weights (initial-state): Serve as the starting point for SQSD computation. Note: These weights are only required when computing Danger-Projection. For details, please refer to Section 4.3 Parameter Initialization of the paper.
Links
- Paper: https://arxiv.org/abs/2605.04572
- GitHub: https://github.com/Jason-wx/SQSD
Directory Structure
./
├── Direction/ # Parameter shift direction weights
│ ├── Ageis_Danger/ # Danger direction weights
│ ├── Beaver-Danger/ # Danger direction weights
│ └── PKURLHF-10K_Safety/ # Safety direction weights
├── initial-state/ # Model initialization weights
│ └── dolly_ckpt_5850/ # Initial weights (initialized via Danger-Projection)
└── README.md
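Since not every folder is needed for every computation, it can help to fetch only the relevant parts of the layout above. The following is a minimal sketch, not part of the SQSD codebase: the helper `required_patterns` and the constants are hypothetical names, the folder names are copied from the tree, and the Danger/Safety labels follow the comments in the tree.

```python
# Folder names are copied from the directory tree above; the helper itself is
# hypothetical and only illustrates which folders belong together.
DIRECTION_TYPES = {
    "Ageis_Danger": "Danger",
    "Beaver-Danger": "Danger",
    "PKURLHF-10K_Safety": "Safety",
}
INITIAL_STATE = "initial-state/dolly_ckpt_5850"


def required_patterns(direction_name):
    """Return glob patterns covering the folders one direction needs."""
    if direction_name not in DIRECTION_TYPES:
        raise KeyError(f"unknown direction folder: {direction_name}")
    patterns = [f"Direction/{direction_name}/*"]
    # The initial-state checkpoint is only required for Danger-Projection.
    if DIRECTION_TYPES[direction_name] == "Danger":
        patterns.append(f"{INITIAL_STATE}/*")
    return patterns


print(required_patterns("Beaver-Danger"))
print(required_patterns("PKURLHF-10K_Safety"))
```

One possible use, assuming the `huggingface_hub` package is installed, is to pass the returned list as the `allow_patterns` argument of `huggingface_hub.snapshot_download`, together with this repository's id.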
Direction Folder
The Direction folder contains three sets of direction weights, each extracted from a different dataset, encoding either a safety or danger direction in parameter space:
| Name | Type | Description |
|---|---|---|
| Ageis_Danger | Danger | Danger direction weights extracted from the Aegis dataset |
| Beaver-Danger | Danger | Danger direction weights extracted from the BeaverTails dataset |
| PKURLHF-10K_Safety | Safety | Safety direction weights extracted from the PKU-RLHF dataset |
These direction weights encode safety-relevant parameter shift directions and are a core dependency for computing SQSD scores.
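To make the idea of a direction in parameter space concrete, the sketch below computes the scalar projection of a parameter update onto a direction vector. This is a generic illustration only and is not the SQSD formula from the paper; the function name `projection_onto_direction` and the toy vectors are assumptions made for the example, and plain Python lists stand in for real weight tensors.

```python
import math


def projection_onto_direction(delta, direction):
    """Scalar projection of a parameter update onto a direction vector."""
    if len(delta) != len(direction):
        raise ValueError("update and direction must have the same length")
    norm = math.sqrt(sum(v * v for v in direction))
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    dot = sum(d * v for d, v in zip(delta, direction))
    return dot / norm


# Toy 2-D vectors, not real direction weights.
danger_direction = [3.0, 4.0]
update_along = [3.0, 4.0]        # moves exactly along the direction
update_orthogonal = [-4.0, 3.0]  # moves at a right angle to it

print(projection_onto_direction(update_along, danger_direction))       # 5.0
print(projection_onto_direction(update_orthogonal, danger_direction))  # 0.0
```

An update aligned with the direction yields a large projection, while an orthogonal update yields zero, which is the intuition behind using a direction to gauge how far an update moves the parameters toward it.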
initial-state Folder
The weights in initial-state (dolly_ckpt_5850) represent the model initialization state derived via the Danger-Projection method, that is, the parameter point obtained by projecting the base model weights along the danger direction. This serves as the reference starting point for subsequent SQSD computation.
⚠️ The paper defines two initialization strategies depending on the projection direction (see Section 4.3 Parameter Initialization):
- Danger direction (drift-enhanced sensitivity): θ_initial = θ_t, initialized from a fine-tuning checkpoint that exhibits high directional sensitivity. The weights provided here (dolly_ckpt_5850) serve this purpose.
- Safety direction (linear-path sensitivity): θ_initial = θ_0 + α*V_safety, initialized by interpolating from the base model along the safety direction vector. No additional checkpoint is required; only the base model weights and the safety direction weights from the Direction folder are needed.
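The safety-direction initialization θ_initial = θ_0 + α*V_safety can be sketched as a per-parameter update. This is a minimal illustration under stated assumptions: weights are represented as dictionaries mapping parameter names to flat lists of floats (real checkpoints would hold tensors), and the function name `linear_path_init`, the parameter names, and the value of α are hypothetical rather than taken from the paper or the SQSD code.

```python
def linear_path_init(theta_0, v_safety, alpha):
    """Return theta_0 + alpha * v_safety, parameter by parameter."""
    if theta_0.keys() != v_safety.keys():
        raise ValueError("base and direction weights must share parameter names")
    theta_initial = {}
    for name, base in theta_0.items():
        direction = v_safety[name]
        if len(base) != len(direction):
            raise ValueError(f"shape mismatch for parameter {name}")
        # Move each base weight along the safety direction by a step of alpha.
        theta_initial[name] = [b + alpha * d for b, d in zip(base, direction)]
    return theta_initial


# Toy values, not real Qwen3-8B weights.
theta_0 = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
v_safety = {"layer.weight": [0.5, -1.0], "layer.bias": [2.0]}

print(linear_path_init(theta_0, v_safety, alpha=0.5))
# {'layer.weight': [1.25, 1.5], 'layer.bias': [1.5]}
```

For the danger direction no such arithmetic is involved: θ_initial = θ_t means the checkpoint in initial-state is loaded as is.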