KnowForge-0.6B
Qwen3-0.6B fine-tuned with LoRA on the KnowForge dataset — a synthetic benchmark for compositional rule-following and structured reasoning over fabricated rule systems.
The model learns to apply rules it has never seen before to novel entity configurations, without relying on world knowledge.
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("qox/knowforge-0.6B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "qox/knowforge-0.6B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
messages = [
    {
        "role": "system",
        "content": (
            "You are given rules for a fictional system that does NOT exist in the real world. "
            "Reason STRICTLY from the rules provided. Do NOT use any outside knowledge. "
            "Show your reasoning inside <think>...</think> tags before giving your final answer."
        ),
    },
    {
        "role": "user",
        "content": (
            "ZELPH RELATIONS:\n"
            "  stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5\n\n"
            "Facts:\n"
            "  energy(gamma) = 3\n"
            "  energy(delta) = 12\n\n"
            "Question: Is delta stronger than gamma?"
        ),
    },
]
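# return_dict=True returns input_ids and attention_mask, which generate() takes as keyword arguments.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))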
Or use the bundled inference.py from the command line:
pip install -r requirements.txt
python inference.py "ZELPH RELATIONS: stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5. Facts: energy(gamma) = 3, energy(delta) = 12. Question: Is delta stronger than gamma?"
It can also be imported from Python:
from inference import ask
result = ask("ZELPH RELATIONS: ...")
print(result["answer"])  # "yes"
print(result["reasoning"])  # chain-of-thought inside <think>
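The internals of inference.py are not shown here. As a rough sketch of how raw model output could be split into an answer and its reasoning, assuming the model emits at most one `<think>...</think>` block followed by the final answer (the helper name `split_reasoning` is hypothetical and not part of the bundled script):

```python
import re

# Hypothetical helper, not part of the bundled inference.py.
# Assumes at most one <think>...</think> block, followed by the final answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> dict:
    """Split raw model output into reasoning and final answer."""
    match = THINK_RE.search(text)
    if match is None:
        # No reasoning block found: treat the whole output as the answer.
        return {"reasoning": "", "answer": text.strip()}
    return {
        "reasoning": match.group(1).strip(),
        "answer": text[match.end():].strip(),
    }


sample = "<think>12 > 3 * 1.5 = 4.5, so the rule holds.</think>\nyes"
parsed = split_reasoning(sample)
print(parsed["answer"])     # yes
print(parsed["reasoning"])  # 12 > 3 * 1.5 = 4.5, so the rule holds.
```

Falling back to the full text when no block is found keeps the helper usable on prompts where the model skips its reasoning.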
Task Description
KnowForge presents the model with a fabricated rule system (e.g. "ZELPH RULES", "FRAE SPACE") and asks it to apply those rules to novel facts. The model must reason purely from the stated rules — no world knowledge applies.
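To make the setup concrete, the Quick Start prompt can be checked by hand: the rule is a predicate over the declared facts, so a few lines of Python give the reference answer the model is expected to reach. This is an illustrative sketch only; the names `energy` and `stronger` mirror the prompt and are not taken from any KnowForge code.

```python
# Facts from the Quick Start prompt.
energy = {"gamma": 3, "delta": 12}


def stronger(a: str, b: str) -> bool:
    """stronger(A,B) is TRUE when energy(A) > energy(B) * 1.5."""
    return energy[a] > energy[b] * 1.5


# 12 > 3 * 1.5 = 4.5, so delta is stronger than gamma.
print(stronger("delta", "gamma"))  # True
# 3 > 12 * 1.5 = 18 is false, so the reverse does not hold.
print(stronger("gamma", "delta"))  # False
```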
Three transform types are covered:
1. linear_to_cyclic
Modular arithmetic in cyclic domains (clocks, calendars, wrap-around sequences).
"A clock shows 10. Add 5 hours. What time is it?" → 3
2. relation_to_graph
Transitive relation queries over a directed graph of entities.
"A is taller than B. B is taller than C. Is A taller than C?" → yes
3. relation_property_check
Structural property checks on declared relation systems (transitivity, symmetry, etc.).
"Rule: X beats Y means Y does not beat X. Does this hold for all pairs?" → conditional
Each question may require multi-step reasoning, written as chain-of-thought inside <think>...</think> before the final answer.
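Each transform type corresponds to a small, well-known computation. The sketch below gives one plausible reference solver per toy example. It assumes a 12-hour clock for the first example and reads the third as an asymmetry check; all function names are hypothetical, and this is not the dataset's generator or executor.

```python
from collections import deque


def cyclic_add(start: int, delta: int, modulus: int = 12) -> int:
    """Add on a 1..modulus dial (assumed 12-hour clock), wrapping around."""
    return (start + delta - 1) % modulus + 1


def reachable(edges: set, src: str, dst: str) -> bool:
    """Breadth-first search: can dst be reached from src via directed edges?"""
    queue, seen = deque([src]), {src}
    while queue:
        node = queue.popleft()
        for a, b in edges:
            if a != node:
                continue
            if b == dst:
                return True
            if b not in seen:
                seen.add(b)
                queue.append(b)
    return False


def is_asymmetric(edges: set) -> bool:
    """True when no declared pair (x, y) also appears reversed as (y, x)."""
    return all((b, a) not in edges for a, b in edges)


# 1. linear_to_cyclic: 10 + 5 on a 12-hour dial.
print(cyclic_add(10, 5))  # 3

# 2. relation_to_graph: "taller than" as directed edges.
taller = {("A", "B"), ("B", "C")}
print(reachable(taller, "A", "C"))  # True

# 3. relation_property_check: asymmetry of a declared "beats" relation.
print(is_asymmetric({("X", "Y"), ("Y", "Z")}))  # True
print(is_asymmetric({("X", "Y"), ("Y", "X")}))  # False
```

The last two calls show that the property holds for some declared pair sets and fails for others, which is one way to read the "conditional" answer in the third example.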
Performance
Results from the Phase 1d.1 evaluation on the held-out test set (1,118 examples) and the adversarial set:
| Metric | Score |
|---|---|
| final_correct (test) | 64.31% |
| final_correct (adversarial) | 66.67% |
| executor_success (test) | 94.81% |
| transform_acc (test) | 99.64% |
| slot_sem_f1 (test) | 0.648 |
Comparison against a TF-IDF baseline:
- TF-IDF final_correct: 15.21% (test), 10.34% (adversarial)
- This model: +49.1 pp on test, +56.3 pp on adversarial
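The percentage-point gaps follow directly from the reported scores. A quick check of the arithmetic, using only the figures above:

```python
# final_correct scores (%) as reported above.
model = {"test": 64.31, "adversarial": 66.67}
tfidf = {"test": 15.21, "adversarial": 10.34}

# Percentage-point difference, rounded to one decimal place.
gaps = {split: round(model[split] - tfidf[split], 1) for split in model}
print(gaps)  # {'test': 49.1, 'adversarial': 56.3}
```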
Base Model
Qwen3-0.6B (Apache 2.0) — fine-tuned with LoRA on the KnowForge synthetic dataset. The LoRA adapter was merged into the base weights before publishing; this is a self-contained model.
Limitations
- Synthetic data only. Trained entirely on procedurally generated rule systems. Behaviour on real-world reasoning tasks (MMLU, GSM8K, etc.) is not evaluated.
- English and Vietnamese. Dataset contains both; performance may vary by language.
- Short rule systems. Designed for rule sets that fit in a single context window. Very long or deeply nested rule systems may degrade accuracy.
- CPU is slow. Model is 0.6B parameters at float16. Inference on CPU is feasible but slow (~5–30 s/query depending on hardware). Use a GPU for interactive use.
- Chain-of-thought required. The model was trained to emit <think>...</think> before answering. Prompts that suppress reasoning may reduce accuracy.
- No world knowledge grounding. The model will follow stated rules even when they conflict with reality. This is by design.