KnowForge-0.6B

Qwen3-0.6B fine-tuned with LoRA on the KnowForge dataset — a synthetic benchmark for compositional rule-following and structured reasoning over fabricated rule systems.

The model learns to apply rules it has never seen before to novel entity configurations, without relying on world knowledge.


Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("qox/knowforge-0.6B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "qox/knowforge-0.6B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": (
            "You are given rules for a fictional system that does NOT exist in the real world. "
            "Reason STRICTLY from the rules provided. Do NOT use any outside knowledge. "
            "Show your reasoning inside <think>...</think> tags before giving your final answer."
        ),
    },
    {
        "role": "user",
        "content": (
            "ZELPH RELATIONS:\n"
            "  stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5\n\n"
            "Facts:\n"
            "  energy(gamma) = 3\n"
            "  energy(delta) = 12\n\n"
            "Question: Is delta stronger than gamma?"
        ),
    },
]

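# Tokenize the prompt once; return_dict=True yields input_ids and attention_mask.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))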

Or use the bundled inference.py:

pip install -r requirements.txt
python inference.py "ZELPH RELATIONS: stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5. Facts: energy(gamma) = 3, energy(delta) = 12. Question: Is delta stronger than gamma?"

Or call it from Python:

from inference import ask
result = ask("ZELPH RELATIONS: ...")
print(result["answer"])     # "yes"
print(result["reasoning"])  # chain-of-thought inside <think>
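
The split between reasoning and final answer can be illustrated with a small parser. This is a minimal sketch and not the bundled inference.py implementation: the function name split_reasoning, the sample string, and the fallback behaviour when no <think> block is present are all assumptions made for illustration.

```python
import re

# Hypothetical helper, not taken from inference.py.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> dict:
    """Separate the <think>...</think> block from the final answer."""
    match = THINK_RE.search(text)
    if match is None:
        # Assumption: with no reasoning block, the whole output is the answer.
        return {"reasoning": "", "answer": text.strip()}
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return {"reasoning": reasoning, "answer": answer}

# Made-up model output used only to exercise the parser.
sample = "<think>12 > 3 * 1.5 = 4.5, so the condition holds.</think>\nyes"
parsed = split_reasoning(sample)
print(parsed["answer"])
print(parsed["reasoning"])
```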

Task Description

KnowForge presents the model with a fabricated rule system (e.g. "ZELPH RULES", "FRAE SPACE") and asks it to apply those rules to novel facts. The model must reason purely from the stated rules — no world knowledge applies.
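
To make "reasoning purely from the stated rules" concrete, the Quick Start prompt can be evaluated by hand. The sketch below encodes that one rule and its two facts directly; the function and variable names are illustrative and are not part of the dataset or the model.

```python
# Facts copied from the Quick Start prompt.
energy = {"gamma": 3, "delta": 12}

def stronger(a: str, b: str, factor: float = 1.5) -> bool:
    """stronger(A,B) is TRUE when energy(A) > energy(B) * 1.5."""
    return energy[a] > energy[b] * factor

print(stronger("delta", "gamma"))  # 12 > 4.5 -> True
print(stronger("gamma", "delta"))  # 3 > 18.0 -> False
```

The expected answer to "Is delta stronger than gamma?" is therefore yes, which matches the answer shown in the inference.py usage.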

Three transform types are covered:

1. linear_to_cyclic

Modular arithmetic in cyclic domains (clocks, calendars, wrap-around sequences).
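
The wrap-around arithmetic can be sketched in a few lines. The example assumes a dial numbered 1 to 12, as on a 12-hour clock; the function name add_on_dial is hypothetical, and the dataset's fabricated domains may use other sizes or numbering.

```python
def add_on_dial(start: int, step: int, size: int = 12) -> int:
    """Add `step` to a position on a dial numbered 1..size, wrapping around."""
    # Shift to 0-based, wrap with modulo, shift back to 1-based.
    return (start - 1 + step) % size + 1

print(add_on_dial(10, 5))   # 3
print(add_on_dial(12, 1))   # 1
print(add_on_dial(3, -5))   # 10
```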

"A clock shows 10. Add 5 hours. What time is it?" → 3

2. relation_to_graph

Transitive relation queries over a directed graph of entities.
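
One way to compute the reference answer for such a query is a reachability search over the stated facts. This is a sketch under the assumption that the relation is transitive and that each fact is a directed edge; the function name holds_transitively is illustrative.

```python
from collections import defaultdict

def holds_transitively(facts, src, dst):
    """Return True if dst is reachable from src by chaining stated facts."""
    graph = defaultdict(list)
    for a, b in facts:
        graph[a].append(b)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt == dst:
                return True
            if nxt not in seen:  # avoid revisiting nodes if facts form a cycle
                seen.add(nxt)
                stack.append(nxt)
    return False

facts = [("A", "B"), ("B", "C")]  # "A is taller than B", "B is taller than C"
print(holds_transitively(facts, "A", "C"))  # True
print(holds_transitively(facts, "C", "A"))  # False
```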

"A is taller than B. B is taller than C. Is A taller than C?" → yes

3. relation_property_check

Structural property checks on declared relation systems (transitivity, symmetry, etc.).
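
For a concrete, finite relation, these properties can be checked by brute force. The sketch below uses made-up entity names and covers only asymmetry, symmetry, and transitivity; it does not reproduce how the dataset assigns labels such as "conditional".

```python
def is_symmetric(pairs):
    """Every (a, b) has a matching (b, a)."""
    return all((b, a) in pairs for a, b in pairs)

def is_asymmetric(pairs):
    """No (a, b) has a matching (b, a)."""
    return all((b, a) not in pairs for a, b in pairs)

def is_transitive(pairs):
    """Whenever (a, b) and (b, d) hold, (a, d) holds too."""
    return all((a, d) in pairs
               for a, b in pairs
               for c, d in pairs
               if b == c)

# Fabricated "beats" relation forming a three-way cycle.
beats = {("zel", "frae"), ("frae", "mox"), ("mox", "zel")}
print(is_asymmetric(beats))  # True
print(is_symmetric(beats))   # False
print(is_transitive(beats))  # False
```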

"Rule: X beats Y means Y does not beat X. Does this hold for all pairs?" → conditional

A question may require multi-step reasoning. The model writes its chain-of-thought inside <think>...</think> and then gives the final answer.


Performance

Results from the Phase 1d.1 evaluation on the held-out test set (1,118 examples) and the adversarial set:

Metric                        Score
final_correct (test)          64.31%
final_correct (adversarial)   66.67%
executor_success (test)       94.81%
transform_acc (test)          99.64%
slot_sem_f1 (test)            0.648

Comparison against TF-IDF baseline:

  • TF-IDF final_correct: 15.21% (test), 10.34% (adversarial)
  • This model: +49.1 pp on test, +56.3 pp on adversarial
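
The percentage-point gaps follow directly from the final_correct scores reported above. A short check, with the variable names chosen here for illustration:

```python
# final_correct scores (in percent) as reported in this section.
scores = {
    "test": {"model": 64.31, "tfidf": 15.21},
    "adversarial": {"model": 66.67, "tfidf": 10.34},
}

# Gap in percentage points, rounded to one decimal place.
gaps = {
    split: round(s["model"] - s["tfidf"], 1)
    for split, s in scores.items()
}

for split, gap in gaps.items():
    print(f"{split}: +{gap} pp")
# test: +49.1 pp
# adversarial: +56.3 pp
```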

Base Model

Qwen3-0.6B (Apache 2.0) — fine-tuned with LoRA on the KnowForge synthetic dataset. The LoRA adapter was merged into the base weights before publishing; this is a self-contained model.


Limitations

  • Synthetic data only. Trained entirely on procedurally generated rule systems. Behaviour on real-world reasoning tasks (MMLU, GSM8K, etc.) is not evaluated.
  • English and Vietnamese. Dataset contains both; performance may vary by language.
  • Short rule systems. Designed for rule sets that fit in a single context window. Very long or deeply nested rule systems may degrade accuracy.
  • CPU is slow. The model has 0.6B parameters in float16. CPU inference is feasible but slow (~5–30 s/query depending on hardware); use a GPU for interactive work.
  • Chain-of-thought required. The model was trained to emit <think>...</think> before answering. Prompts that suppress reasoning may reduce accuracy.
  • No world knowledge grounding. The model will follow stated rules even when they conflict with reality. This is by design.