WIBA Claim Topic Extraction (Llama-3-8B, pre-quantized 4-bit)

Topic extraction model: given an argumentative sentence or passage, it generates the topic being argued (a short phrase naming the person, place, thing, entity, or idea at issue), or No Topic if the text is not an argument. The topic may be explicit in the text or implicit and inferred from context.

This is Stage 2 of the WIBA (What Is Being Argued?) argument mining pipeline:

| Stage | Task | Model | Type |
|---|---|---|---|
| 1. Detect | Is this text an argument? | armaniii/llama-3-8b-argument-detection | LoRA adapter (sequence classification, 2 labels) |
| 2. Extract | What topic is being argued? | this repo | Fine-tuned causal LM (pre-quantized 4-bit) |
| 3. Stance | What position does it take on the topic? | armaniii/llama-stance-classification | LoRA adapter (sequence classification, 3 labels) |

📄 Paper: WIBA: What Is Being Argued? A Comprehensive Approach to Argument Mining
💻 Code: github.com/Armaniii/WIBA
🌐 Platform: wiba.dev

What this repo contains (full model, stored 4-bit quantized)

This repo is a complete, self-contained fine-tuned model — no base download, no adapter. But unlike a normal fp16 checkpoint, the weights are stored pre-quantized with bitsandbytes NF4 (the format the WIBA platform serves in production):

| File | Purpose |
|---|---|
| `model-0000*-of-00002.safetensors` + index | ~6 GB total. Linear-layer weights as packed 4-bit (uint8) with `absmax`/`quant_map` quantization metadata; embeddings and `lm_head` in float16 |
| `config.json` | Model config including the `quantization_config` (bnb NF4, blocksize 64, compute dtype fp16) that tells transformers how to load the 4-bit weights |
| `generation_config.json` | Default generation settings |
| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` | Llama-3 tokenizer |

Practical consequences:

bitsandbytes is a hard requirement: the checkpoint cannot be loaded without it.
Do not remove or override `quantization_config` to get fp16. The stored weights themselves are packed 4-bit, so this repo contains no full-precision copy. To obtain higher-precision weights, load in 4-bit first and then call `model.dequantize()` (see below).
Only ~6 GB of VRAM is needed for the weights, so the model fits on small GPUs.
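
As a rough cross-check of the ~6 GB figure, the sketch below estimates the stored size from the published Llama-3-8B dimensions. It assumes one float32 `absmax` scale per 64-weight block (no double quantization), which this card does not state explicitly, and it ignores the tiny norm weights and `quant_map` tables. The helper names are illustrative only.

```python
# Rough size estimate for a Llama-3-8B checkpoint stored as bitsandbytes NF4.
# Assumption: one float32 absmax per 64-weight block, no double quantization.
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 128256, 4096, 14336, 32
KV_DIM = 1024    # 8 key/value heads x 128 head dim (grouped-query attention)
BLOCKSIZE = 64   # blocksize from quantization_config


def linear_params_per_layer() -> int:
    """Weights in the seven linear projections of one decoder layer."""
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q, o and k, v
    mlp = 3 * HIDDEN * INTERMEDIATE                        # gate, up, down
    return attention + mlp


def nf4_checkpoint_bytes() -> int:
    linear = LAYERS * linear_params_per_layer()
    packed = linear // 2                   # two 4-bit values per uint8 byte
    absmax = (linear // BLOCKSIZE) * 4     # one float32 scale per block
    fp16_tables = 2 * VOCAB * HIDDEN * 2   # embeddings + lm_head, 2 bytes each
    return packed + absmax + fp16_tables


size_gb = nf4_checkpoint_bytes() / 1e9
print(f"estimated NF4 checkpoint size: {size_gb:.2f} GB")
```

Under these assumptions the estimate lands close to the ~6 GB listed for the safetensors shards. About a third of that is the float16 embedding and `lm_head` tables, which are not quantized.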

Before you start

No gated access needed. Unlike the detect and stance stages, this repo is fully self-contained (no Meta base model to download), so there is no license gate and no account or token is required. The first run downloads ~6 GB with progress bars; the files are cached afterward in `~/.cache/huggingface`.

Hardware requirements — pick your setup

| Setup | What you need | Speed |
|---|---|---|
| GPU (recommended) | NVIDIA GPU with ≥8 GB free VRAM, `pip install bitsandbytes` | fast; this is the wiba.dev production configuration |
| CPU only | ~25 GB free RAM, no GPU; loads 4-bit then dequantizes (see below) | ~1–2 min per text on 16 cores |
⚠️ Do not run `generate()` directly on the 4-bit model on a CPU: bitsandbytes' CPU 4-bit kernels are single-threaded, and a single sentence takes over an hour (measured). Use the dequantize recipe below instead.
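
The hardware guidance above can be written as a small decision helper. This is only a sketch: `choose_setup` and its return labels are hypothetical names, and the thresholds are simply the 8 GB VRAM and 25 GB RAM figures from this card.

```python
def choose_setup(has_cuda: bool, free_vram_gb: float, free_ram_gb: float) -> str:
    """Pick a loading recipe from the hardware table (hypothetical helper)."""
    if has_cuda and free_vram_gb >= 8:
        return "gpu-4bit"          # load the 4-bit checkpoint directly on the GPU
    if free_ram_gb >= 25:
        return "cpu-dequantized"   # load 4-bit, then dequantize to bfloat16
    return "insufficient"          # 4-bit generate() on CPU is impractically slow


print(choose_setup(has_cuda=True, free_vram_gb=12, free_ram_gb=16))
print(choose_setup(has_cuda=False, free_vram_gb=0, free_ram_gb=32))
```

A machine with a small GPU but plenty of RAM falls through to the CPU recipe rather than failing.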

Quickstart — GPU

pip install torch transformers accelerate bitsandbytes

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

REPO = "armaniii/WIBA-Extract-V1"

tokenizer = AutoTokenizer.from_pretrained(REPO)
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

# quantization_config ships in config.json — transformers loads the 4-bit
# weights automatically (~6 GB VRAM)
model = AutoModelForCausalLM.from_pretrained(REPO, device_map="auto", low_cpu_mem_usage=True)
model.eval()

Quickstart — CPU (no GPU)

bitsandbytes is still required (the checkpoint is stored 4-bit), but after loading, dequantize to bfloat16 so generation runs on all CPU cores (verified: ~25 GB RAM peak, then ~1–2 min per text on 16 cores):

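# Reuses torch, AutoModelForCausalLM, tokenizer, and REPO from the GPU quickstart above
model = AutoModelForCausalLM.from_pretrained(REPO, device_map="cpu", low_cpu_mem_usage=True)
model = model.dequantize().to(torch.bfloat16)  # unpack the 4-bit weights, then cast to bfloat16
model.eval()
torch.set_num_threads(16)   # match your core count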

Prompt format (must match training)

The model expects the Llama-3 chat header format with the WIBA topic-extraction system prompt. Generation is capped at a few new tokens because topics are short:

SYSTEM_PROMPT = """You are a helpful assistant that is specialized in a single task. If the sentence provided is an argument, decide what the topic being argued is using the rules and steps below.
Rules: 
1. An argument is a sentence that must contain a claim AND AT LEAST ONE premise(i.e evidence) supporting that assertion or claim. 
2. A claim is the position being taken in the argument.
3. A premise is a statement that provides evidence to support the claim.
4. In order for a sentence to be an argument it must contain a claim AND at least one premise.
5. If the sentence does not contain a claim AND does not provide any premises to support the claim, then it is a non-argument.
6. If the sentence provided is an argument, then there must be a single topic being argued that is regarding a person, place, thing, entity, or abstract idea. The topic being argued may be explicitly stated OR it may be implicit and must be inferred from the context of the argument.  
7. If the sentence provided is a non-argument, then there is no topic being argued. 

Steps:
1. Decide if the sentence provided is an argument or non-argument using the Rules provided.
2. If the sentence is an argument, output only the topic being argued and your task is finished.
3. If the sentence is a non-argument, only output: No Topic and your task is finished.
4. If the sentence provided is a non-argument, then there is no topic being argued and you should only output: No Topic
5. Let us think through the problem step by step carefully following all the rules outlined."""

def extract_topic(text: str) -> str:
    prompt = (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        + SYSTEM_PROMPT
        + "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        + text
        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    enc = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**enc, max_new_tokens=8, pad_token_id=128009)
    return tokenizer.decode(out[0, enc.input_ids.shape[1]:], skip_special_tokens=True).strip()

print(extract_topic("We must act on climate change because temperatures are rising."))
# -> climate change
print(extract_topic("The weather is nice today."))
# -> No Topic
print(extract_topic("Abortion should remain legal because bodily autonomy is a fundamental right."))
# -> abortion

(Outputs above are actual verified predictions, not illustrations.)

The original implementation uses the equivalent `pipeline("text-generation", ..., max_new_tokens=8, pad_token_id=128009)` and takes the text after the final `assistant<|end_header_id|>\n\n` marker. The function above does the same thing with `generate`.
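
The marker-based post-processing can be sketched with plain string handling, without loading the model. `build_prompt` and `parse_topic` are hypothetical helper names, and the generated text below is a made-up stand-in for what a text-generation pipeline returns (prompt plus continuation).

```python
ASSISTANT_MARKER = "assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>"


def build_prompt(system_prompt: str, text: str) -> str:
    """Assemble the Llama-3 header format shown in this card."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        + system_prompt
        + EOT + "<|start_header_id|>user<|end_header_id|>\n\n"
        + text
        + EOT + "<|start_header_id|>" + ASSISTANT_MARKER
    )


def parse_topic(generated_text: str) -> str:
    """Keep only what follows the final assistant marker, up to the end-of-turn token."""
    tail = generated_text.rsplit(ASSISTANT_MARKER, 1)[-1]
    return tail.split(EOT, 1)[0].strip()


prompt = build_prompt("(system prompt goes here)", "We must act on climate change.")
fake_generation = prompt + "climate change" + EOT   # stand-in for real model output
print(parse_topic(fake_generation))                 # -> climate change
```

Splitting on the last marker keeps the parser robust if the input text itself happens to contain the marker string.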

Output

An argumentative input → a short topic phrase (e.g. Climate change, Gun control)
A non-argument input → the literal string No Topic
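
Downstream code usually wants to treat `No Topic` as "nothing extracted" rather than as a topic string. A minimal sketch, assuming a hypothetical `normalize_topic` helper and a case-insensitive match on the sentinel:

```python
from typing import Optional

NO_TOPIC = "No Topic"  # literal sentinel the model emits for non-arguments


def normalize_topic(raw: str) -> Optional[str]:
    """Return a cleaned topic phrase, or None for empty output or the sentinel."""
    cleaned = " ".join(raw.split())   # collapse stray whitespace and newlines
    if not cleaned or cleaned.casefold() == NO_TOPIC.casefold():
        return None
    return cleaned


print(normalize_topic("  No Topic\n"))   # -> None
print(normalize_topic("gun  control"))   # -> gun control
```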

Batch processing many texts (with a progress bar)

Model downloads show progress bars automatically; generation doesn't, so wrap your loop in tqdm (installed with transformers) exactly as the original WIBA serving code does:

from tqdm import tqdm

texts = ["...", "..."]  # your data
topics = [extract_topic(t) for t in tqdm(texts)]
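
For true batching (this card notes that the WIBA implementation uses batch_size=2 with left-padding), shorter prompts are padded on the left so that every sequence ends at the position where generation starts. With `tokenizer.padding_side = "left"` the tokenizer handles this itself; the sketch below only illustrates the idea on made-up token IDs, and `chunks` and `left_pad` are hypothetical helpers.

```python
from typing import Iterator, List, Sequence, Tuple


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def left_pad(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad ID lists on the left and build the matching attention mask."""
    width = max(len(ids) for ids in batch)
    input_ids = [[pad_id] * (width - len(ids)) + ids for ids in batch]
    attention_mask = [[0] * (width - len(ids)) + [1] * len(ids) for ids in batch]
    return input_ids, attention_mask


print(list(chunks(["a", "b", "c"], 2)))     # -> [['a', 'b'], ['c']]
ids, mask = left_pad([[11, 12, 13], [21]], pad_id=0)
print(ids)                                  # -> [[11, 12, 13], [0, 0, 21]]
print(mask)                                 # -> [[1, 1, 1], [0, 0, 1]]
```

Right-padding would instead place pad tokens between the prompt and the generated continuation, which is why decoder-only models are batched with left-padding.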

Getting full-precision weights

The repo stores no fp16 copy, but you can dequantize after loading (needs enough memory for the fp16 model, ~16 GB — this is the same call the CPU quickstart uses):

model = AutoModelForCausalLM.from_pretrained(REPO, device_map="auto")
model = model.dequantize()          # bnb 4-bit -> floating point
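
The ~16 GB figure follows from the parameter count. A back-of-the-envelope sketch using the published Llama-3-8B dimensions (the function names are illustrative, and activations and the load-time peak are not included):

```python
# Parameter count for the Llama-3-8B architecture, then size at 2 bytes/param.
VOCAB, HIDDEN, INTERMEDIATE, LAYERS, KV_DIM = 128256, 4096, 14336, 32, 1024


def total_params() -> int:
    per_layer = (
        2 * HIDDEN * HIDDEN            # q_proj, o_proj
        + 2 * HIDDEN * KV_DIM          # k_proj, v_proj
        + 3 * HIDDEN * INTERMEDIATE    # gate_proj, up_proj, down_proj
        + 2 * HIDDEN                   # two RMSNorm weight vectors
    )
    # embeddings + lm_head, all decoder layers, final norm
    return 2 * VOCAB * HIDDEN + LAYERS * per_layer + HIDDEN


def dense_bytes(bytes_per_param: int = 2) -> int:
    """Memory for the weights alone in fp16/bf16 (2 bytes per parameter)."""
    return total_params() * bytes_per_param


print(f"{total_params():,} parameters")
print(f"{dense_bytes() / 1e9:.2f} GB in fp16/bf16")
```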

Tested configurations

| Stack | Versions | Status |
|---|---|---|
| Modern (2026) | torch 2.5.1, transformers 5.12.0, accelerate 1.14.0, bitsandbytes 0.49.2 | ✅ verified (4-bit load, generation, and `dequantize()` path) |

Notes:

Without bitsandbytes installed, `from_pretrained` raises immediately (the checkpoint is pre-quantized).
Attempting to load with the `quantization_config` removed fails with shape errors (checkpoint `torch.Size([8388608, 1])` vs. model `torch.Size([4096, 4096])`), which confirms that the stored weights are packed 4-bit.
CPU-only machines: the 4-bit load works (~6 GB RAM; bitsandbytes ships a CPU backend), but 4-bit inference on CPU is single-threaded and impractically slow. For CPU inference, load in 4-bit, then call `model.dequantize()` and cast to `torch.bfloat16`. For real use, a CUDA GPU (~6 GB VRAM) is the practical choice.
`use_fast=False` (which the original 2024 serving code passed) is silently ignored on transformers 5.x because slow tokenizers were removed; the default fast tokenizer is correct.
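
The shapes in that error message are what 4-bit packing predicts: two 4-bit values share one uint8 byte, stored as a single column. A small sketch of the arithmetic, where `packed_4bit_shape` is a hypothetical helper and the column layout is inferred from the error shape quoted above:

```python
from typing import Tuple


def packed_4bit_shape(out_features: int, in_features: int) -> Tuple[int, int]:
    """Shape of a weight matrix after packing two 4-bit values per byte."""
    n_values = out_features * in_features
    return ((n_values + 1) // 2, 1)   # round up if the count is odd


print(packed_4bit_shape(4096, 4096))  # -> (8388608, 1)
```

A 4096 x 4096 projection has 16,777,216 weights, so its packed form has 8,388,608 rows, which matches the checkpoint shape in the error.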

How it's used in the WIBA implementation

In the WIBA serving code, this model backs the /api/extract endpoint at wiba.dev. Texts that Stage 1 classified as Argument are passed here to name the topic; the (text, topic) pair is then passed to Stage 3 (stance classification) to determine whether the argument is in favor of or against that topic. For batch processing the implementation streams prompts through the pipeline with batch_size=2 and left-padding.
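
The three-stage flow can be sketched with the stages injected as plain callables. Everything here is illustrative: `run_wiba`, the stub lambdas, and the `"Favor"` label are made up for the example, and skipping Stage 3 when Stage 2 returns `No Topic` is an assumption, not something this card specifies.

```python
from typing import Callable, Optional, Tuple

Result = Tuple[bool, Optional[str], Optional[str]]  # (is_argument, topic, stance)


def run_wiba(
    text: str,
    detect: Callable[[str], bool],        # Stage 1: is this an argument?
    extract: Callable[[str], str],        # Stage 2: topic phrase or "No Topic"
    stance: Callable[[str, str], str],    # Stage 3: stance toward the topic
) -> Result:
    if not detect(text):
        return False, None, None          # non-arguments never reach Stage 2
    topic = extract(text)
    if topic == "No Topic":
        return True, None, None           # assumption: nothing to classify
    return True, topic, stance(text, topic)


# Stub stages standing in for the three models
result = run_wiba(
    "We must act on climate change because temperatures are rising.",
    detect=lambda t: "because" in t,
    extract=lambda t: "climate change",
    stance=lambda t, topic: "Favor",
)
print(result)  # -> (True, 'climate change', 'Favor')
```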

Citation

@article{irani2024wiba,
  title={WIBA: What Is Being Argued? A Comprehensive Approach to Argument Mining},
  author={Irani, Arman and Park, Ju Yeon and Esterling, Kevin and Faloutsos, Michalis},
  journal={arXiv preprint arXiv:2405.00828},
  year={2024}
}

Notes

Fine-tuned from meta-llama/Meta-Llama-3-8B (Llama 3 license applies). The weights here are already fine-tuned; the base model is not required.
Internal fine-tune lineage: llama_cte_v3.

Paper for armaniii/WIBA-Extract-V1: WIBA: What Is Being Argued? A Comprehensive Approach to Argument Mining (arXiv 2405.00828, published May 1, 2024)