FlashRT GEMM and FP8 Quant Epilogues
Fused GEMM and FP8 quantization epilogue kernels from FlashRT.
The main current surface is the post-GEMM FP8 quantization epilogue slice: BF16 input plus optional BF16 bias, GELU(tanh), and FP8 e4m3 quantized output, plus a per-channel BF16 scaling and FP8 quantization primitive.
The package also exposes BF16 GEMM wrappers using cuBLASLt fused bias and GELU epilogues. These wrappers are shape-sensitive and should be evaluated for the target workload before promotion.
Planned Features
- FP8 GEMM with fused bias and activation epilogues.
- NVFP4 GEMM with fused bias and activation epilogues.
- Quantized output epilogues for low-latency inference pipelines.
- Generic APIs for Transformer, VLA, and diffusion model linear blocks.
Current API
bf16_gemm_bias(a, b, bias, out=None)bf16_gemm_bias_gelu(a, b, bias, out=None)bias_gelu_quantize_fp8_static_bf16(input, bias, scale, out=None)channel_scale_quantize_fp8_static_bf16(input, channel_scale, scale, out=None)gelu_quantize_fp8_static_bf16(input, scale, out=None)
Performance Notes
FP8 quantize epilogue helpers are the strongest current surface across the
local shape suite. BF16 GEMM epilogue wrappers are shape-sensitive and should be
evaluated against torch.addmm/gelu(torch.addmm) for target shapes before
promotion.
The v1 performance message for this package should center on the FP8 quantization helpers. GEMM epilogue numbers should be reported per shape, not as a broad claim.
Hardware
CUDA GPUs with BF16 tensor core support are expected for the GEMM path. FP8
output helpers additionally require PyTorch and hardware/runtime support for
torch.float8_e4m3fn.
Validation
Validated HF builder targets currently include
torch211-cxx11-cu128-x86_64-linux and
torch211-cxx11-cu126-x86_64-linux, and
torch211-cxx11-cu130-x86_64-linux. Builder ABI checks passed for all three
variants. Host-side correctness smoke passed on an RTX 5090 with PyTorch
2.9.1+cu128. See VALIDATION.md for the full record and remaining gaps.
See examples/fp8_quant_epilogue_block.py for a minimal HF-style module using
the FP8 quantization epilogue helpers.
- Downloads last month
- 2
- OS
- linux
- Arch
- x86_64





