Diffusers documentation
MochiTransformer3DModel
MochiTransformer3DModel is a Diffusion Transformer model for 3D video-like data, introduced in Mochi-1 Preview by Genmo.
The model can be loaded with the following code snippet.

```python
import torch
from diffusers import MochiTransformer3DModel

transformer = MochiTransformer3DModel.from_pretrained(
    "genmo/mochi-1-preview", subfolder="transformer", torch_dtype=torch.float16
).to("cuda")
```
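Once loaded, the transformer can be reused inside the full text-to-video pipeline. A minimal sketch, assuming the MochiPipeline API from the same library; the prompt and frame count are illustrative:

```python
import torch
from diffusers import MochiPipeline, MochiTransformer3DModel

transformer = MochiTransformer3DModel.from_pretrained(
    "genmo/mochi-1-preview", subfolder="transformer", torch_dtype=torch.float16
)

# Pass the transformer explicitly so the pipeline reuses it instead of
# loading its own copy from the checkpoint.
pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview", transformer=transformer, torch_dtype=torch.float16
).to("cuda")

video = pipe("A close-up of a chameleon changing colors", num_frames=19).frames[0]
```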
MochiTransformer3DModel

class diffusers.MochiTransformer3DModel

( patch_size: int = 2, num_attention_heads: int = 24, attention_head_dim: int = 128, num_layers: int = 48, pooled_projection_dim: int = 1536, in_channels: int = 12, out_channels: int | None = None, qk_norm: str = 'rms_norm', text_embed_dim: int = 4096, time_embed_dim: int = 256, activation_fn: str = 'swiglu', max_sequence_length: int = 256 )
Parameters
- patch_size (
int, defaults to2) — The size of the patches to use in the patch embedding layer. - num_attention_heads (
int, defaults to24) — The number of heads to use for multi-head attention. - attention_head_dim (
int, defaults to128) — The number of channels in each head. - num_layers (
int, defaults to48) — The number of layers of Transformer blocks to use. - in_channels (
int, defaults to12) — The number of channels in the input. - out_channels (
int, optional, defaults toNone) — The number of channels in the output. - qk_norm (
str, defaults to"rms_norm") — The normalization layer to use. - text_embed_dim (
int, defaults to4096) — Input dimension of text embeddings from the text encoder. - time_embed_dim (
int, defaults to256) — Output dimension of timestep embeddings. - activation_fn (
str, defaults to"swiglu") — Activation function to use in feed-forward. - max_sequence_length (
int, defaults to256) — The maximum sequence length of text embeddings supported.
A Transformer model for video-like data introduced in Mochi.
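To make the configuration above concrete, here is a minimal sketch that builds a small, randomly initialized variant. Reducing num_layers is an illustrative choice to keep the example light; the remaining values are the documented defaults.

```python
from diffusers import MochiTransformer3DModel

# Randomly initialized weights; num_layers is reduced from the default 48
# purely to keep this example small and fast to construct.
model = MochiTransformer3DModel(
    patch_size=2,
    num_attention_heads=24,
    attention_head_dim=128,
    num_layers=2,
    in_channels=12,
    qk_norm="rms_norm",
    text_embed_dim=4096,
    time_embed_dim=256,
    activation_fn="swiglu",
    max_sequence_length=256,
)
print(sum(p.numel() for p in model.parameters()))  # parameter count of this variant
```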
Transformer2DModelOutput
class diffusers.models.modeling_outputs.Transformer2DModelOutput
( sample: torch.Tensor )

Parameters

- sample (`torch.Tensor` of shape `(batch_size, num_channels, height, width)` or `(batch_size, num_vector_embeds - 1, num_latent_pixels)` if Transformer2DModel is discrete) — The hidden states output conditioned on the `encoder_hidden_states` input. If discrete, returns probability distributions for the unnoised latent pixels.
The output of Transformer2DModel.
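For a sense of how this output appears in practice, below is a minimal forward-pass sketch for MochiTransformer3DModel. It assumes the forward method accepts hidden_states, encoder_hidden_states, timestep, and encoder_attention_mask; the tensor shapes and the tiny configuration are illustrative only.

```python
import torch
from diffusers import MochiTransformer3DModel

# Tiny, randomly initialized configuration so the example runs quickly.
model = MochiTransformer3DModel(num_attention_heads=2, attention_head_dim=32, num_layers=1)

hidden_states = torch.randn(1, 12, 2, 16, 16)      # (batch, in_channels, frames, height, width)
encoder_hidden_states = torch.randn(1, 256, 4096)  # (batch, seq_len, text_embed_dim)
encoder_attention_mask = torch.ones(1, 256, dtype=torch.bool)
timestep = torch.tensor([500])

output = model(
    hidden_states=hidden_states,
    encoder_hidden_states=encoder_hidden_states,
    timestep=timestep,
    encoder_attention_mask=encoder_attention_mask,
)
print(output.sample.shape)  # predicted latent, same spatial layout as the input
```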