How to Use DeepSeek Janus-Pro: A Guide to Multimodal Large Language Model (LLM) Supporting Both Text and Image

Chris Yan
4 min read · Feb 10, 2025


DeepSeek’s Janus-Pro is an advanced multimodal large language model (LLM) that handles both text and images: it can answer questions about images and generate images from text prompts, making it a good fit for AI applications that work with diverse data inputs. This guide provides a comprehensive walkthrough for setting up Janus-Pro locally, running it, and leveraging its main capabilities.

Understanding Janus-Pro

Janus-Pro decouples visual encoding into separate pathways while maintaining a unified transformer architecture. This design alleviates conflicts between visual understanding and generation, enhancing the model’s flexibility and performance.


Model Variants and Sequence Lengths

DeepSeek offers several Janus models with varying sizes and capabilities. Below are the current variants:

Model            Sequence Length   Download
Janus-1.3B       4096              🤗 Hugging Face
JanusFlow-1.3B   4096              🤗 Hugging Face
Janus-Pro-1B     4096              🤗 Hugging Face
Janus-Pro-7B     4096              🤗 Hugging Face
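All four variants share the same loading pattern; only the Hugging Face model ID changes. Here is a minimal sketch, assuming the repository's VLChatProcessor (used in the examples below) works the same way across variants:

from transformers import AutoModelForCausalLM
from janus.models import VLChatProcessor

# Pick a smaller variant, e.g. Janus-Pro-1B, for machines with limited GPU memory
model_path = "deepseek-ai/Janus-Pro-1B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)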

Setting Up Janus-Pro Locally

To run Janus-Pro on your local machine, follow these steps:

Setting Up the Environment

Ensure you have Python version 3.8 or later installed.
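You can confirm which version is installed from a terminal:

python3 --version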

1. Create a Virtual Environment

It is recommended to use a virtual environment to isolate your project dependencies. Run the following commands to create and activate a virtual environment:

python3 -m venv janus_env
source janus_env/bin/activate  # On Windows, use `janus_env\Scripts\activate`

2. Clone the Janus Repository

Open your terminal and clone the Janus repository:

git clone https://github.com/deepseek-ai/Janus.git
cd Janus
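Then install the package and its dependencies from inside the cloned repository (an editable install is the usual pattern for this repo and makes the janus module importable):

pip install -e .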

3. Simple Inference Example for Multimodal Understanding

The following example demonstrates how to use Janus-Pro for image and text-based conversations:

import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# Specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# Example inputs: replace with your own question and image path
question = "Describe this image."
image = "./images/example.jpg"

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# Load images and prepare the inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# Run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# Run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

Text-to-Image Generation Example

This example demonstrates how to generate images using Janus-Pro:

import os
import PIL.Image
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor

# Specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "<|User|>",
        "content": "A stunning princess from Kabul in red, white traditional clothing, blue eyes, brown hair",
    },
    {"role": "<|Assistant|>", "content": ""},
]

sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation,
    sft_format=vl_chat_processor.sft_format,
    system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag

@torch.inference_mode()
def generate(
    mmgpt: MultiModalityCausalLM,
    vl_chat_processor: VLChatProcessor,
    prompt: str,
    temperature: float = 1,
    parallel_size: int = 16,
    cfg_weight: float = 5,
    image_token_num_per_image: int = 576,
    img_size: int = 384,
    patch_size: int = 16,
):
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids)

    # Duplicate the prompt: even rows are conditional, odd rows unconditional (for classifier-free guidance)
    tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).cuda()
    for i in range(parallel_size * 2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id

    inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)

    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()

    # Autoregressively sample image tokens one position at a time
    for i in range(image_token_num_per_image):
        outputs = mmgpt.language_model.model(
            inputs_embeds=inputs_embeds,
            use_cache=True,
            past_key_values=outputs.past_key_values if i != 0 else None,
        )
        hidden_states = outputs.last_hidden_state

        logits = mmgpt.gen_head(hidden_states[:, -1, :])
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]

        # Classifier-free guidance: blend conditional and unconditional logits
        logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
        probs = torch.softmax(logits / temperature, dim=-1)

        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)

        # Feed the sampled token back in (duplicated for the conditional/unconditional rows)
        next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
        img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
        inputs_embeds = img_embeds.unsqueeze(dim=1)

    # Decode the sampled token grid back into pixels
    dec = mmgpt.gen_vision_model.decode_code(
        generated_tokens.to(dtype=torch.int),
        shape=[parallel_size, 8, img_size // patch_size, img_size // patch_size],
    )
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)

    dec = np.clip((dec + 1) / 2 * 255, 0, 255)

    visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        save_path = os.path.join('generated_samples', "img_{}.jpg".format(i))
        PIL.Image.fromarray(visual_img[i]).save(save_path)

# Generate images
generate(
    vl_gpt,
    vl_chat_processor,
    prompt,
)
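generate() samples parallel_size images per call (16 by default) and writes img_0.jpg through img_15.jpg into generated_samples/. If you run out of GPU memory with the 7B model, reducing the batch size is the simplest adjustment:

# Generate fewer images per call to lower GPU memory usage
generate(
    vl_gpt,
    vl_chat_processor,
    prompt,
    parallel_size=4,
)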

Best Practices for Deployment

1. Model Optimization:
  • Use mixed precision (e.g., torch.bfloat16) to reduce memory usage while maintaining numerical stability.
  • Consider inference-specific frameworks such as TensorRT for better performance.

2. Security and Compliance:
  • Secure endpoints and data flows to prevent unauthorized access.
  • Implement data sanitization to protect sensitive information.

3. Data Augmentation:
  • Enhance image data with transformations such as cropping, flipping, and scaling.
  • Augment text data with paraphrasing and synonym replacement.

4. Monitoring and Logging:
  • Track and log inference requests and responses for better observability; a minimal logging sketch follows this list.
  • Monitor system performance metrics such as memory usage, latency, and throughput.
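As a concrete starting point for the logging point above, here is a minimal sketch using Python's standard logging module. The wrapper name and the logged fields are illustrative choices, not part of Janus:

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("janus_inference")

def logged_call(fn, *args, **kwargs):
    """Illustrative wrapper: time a model call and log basic request metadata."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency = time.perf_counter() - start
    logger.info("call=%s latency=%.2fs kwargs=%s", fn.__name__, latency, sorted(kwargs))
    return result

# Example: wrap the text-to-image call defined earlier
# logged_call(generate, vl_gpt, vl_chat_processor, prompt)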

Conclusion

DeepSeek’s Janus-Pro provides a robust platform for multimodal AI applications, from text processing to image generation. By following this guide and adopting best practices, you can unlock the full potential of this powerful model for your AI projects.
