How to Use DeepSeek Janus-Pro: A Guide to the Multimodal Large Language Model (LLM) That Supports Both Text and Images
DeepSeek’s Janus-Pro is an advanced multimodal large language model (LLM) that can both understand and generate text and images, making it suitable for complex AI applications that require diverse data inputs. This guide provides a comprehensive walkthrough for setting up, running, and leveraging Janus-Pro’s powerful features.
Understanding Janus-Pro
Janus-Pro decouples visual encoding into separate pathways for multimodal understanding and image generation, while a single unified transformer processes both. This design alleviates the conflict between visual understanding and generation, improving the model’s flexibility and performance.
Model Variants and Sequence Lengths
DeepSeek offers several Janus models with varying sizes and capabilities. Below are the current variants:
Model, Sequence Length, Download
Janus-1.3B, 4096, 🤗 Hugging Face
JanusFlow-1.3B, 4096, 🤗 Hugging Face
Janus-Pro-1B, 4096, 🤗 Hugging Face
Janus-Pro-7B, 4096, 🤗 Hugging Face
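The model weights are hosted on Hugging Face and are pulled automatically the first time from_pretrained runs in the examples below. If you prefer to pre-download a variant (for example, on a machine with a slow or metered connection), the huggingface_hub client can fetch the whole repository; the repo ID below assumes the Janus-Pro-7B variant:
from huggingface_hub import snapshot_download
# Download all files of the Janus-Pro-7B repository into the local Hugging Face cache
snapshot_download(repo_id="deepseek-ai/Janus-Pro-7B")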
Setting Up Janus-Pro Locally
To run Janus-Pro on your local machine, follow these steps:
Setting Up the Environment
Ensure you have Python version 3.8 or later installed.
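You can confirm which interpreter and version are active with:
python3 --version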
1. Create a Virtual Environment
It is recommended to use a virtual environment to isolate your project dependencies. Run the following commands to create and activate a virtual environment:
python3 -m venv janus_env
source janus_env/bin/activate  # On Windows, use `janus_env\Scripts\activate`
2. Clone the Janus Repository
Open your terminal and clone the Janus repository:
git clone https://github.com/deepseek-ai/Janus.git
cd Janus
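Then, from inside the repository, install the package and its dependencies. An editable install is the usual pattern for a research repo like this one (assuming the repository ships a standard Python package definition):
pip install -e .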
3. Simple Inference Example for Multimodal Understanding
The following example demonstrates how to use Janus-Pro for image and text-based conversations:
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
# Specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
# Define the input image and the question to ask about it
# (replace these placeholders with your own image path and prompt)
image = "./your_image.jpg"
question = "Describe this image in detail."

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]
# Load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)
# Run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# Run the model to get the response
outputs = vl_gpt.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=prepare_inputs.attention_mask,
pad_token_id=tokenizer.eos_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=512,
do_sample=False,
use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
Text-to-Image Generation Example
This example demonstrates how to generate images using Janus-Pro:
import os
import PIL.Image
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
# Specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
    {
        "role": "<|User|>",
        "content": "A stunning princess from Kabul in red, white traditional clothing, blue eyes, brown hair",
    },
    {"role": "<|Assistant|>", "content": ""},
]
sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
conversations=conversation,
sft_format=vl_chat_processor.sft_format,
system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag
@torch.inference_mode()
def generate(
mmgpt: MultiModalityCausalLM,
vl_chat_processor: VLChatProcessor,
prompt: str,
temperature: float = 1,
parallel_size: int = 16,
cfg_weight: float = 5,
image_token_num_per_image: int = 576,
img_size: int = 384,
patch_size: int = 16,
):
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids)

    # Build an interleaved batch: even rows hold the conditional (prompted) sequence,
    # odd rows hold the unconditional sequence used for classifier-free guidance.
    tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).cuda()
    for i in range(parallel_size * 2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id

    inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)
    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()

    # Autoregressively sample image tokens one at a time
    for i in range(image_token_num_per_image):
        outputs = mmgpt.language_model.model(
            inputs_embeds=inputs_embeds,
            use_cache=True,
            past_key_values=outputs.past_key_values if i != 0 else None,
        )
        hidden_states = outputs.last_hidden_state

        # Classifier-free guidance: combine conditional and unconditional logits
        logits = mmgpt.gen_head(hidden_states[:, -1, :])
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]
        logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)

        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)

        # Duplicate each sampled token for the conditional/unconditional rows and embed it
        next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
        img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
        inputs_embeds = img_embeds.unsqueeze(dim=1)

    # Decode the sampled token grid back into pixel space
    dec = mmgpt.gen_vision_model.decode_code(
        generated_tokens.to(dtype=torch.int),
        shape=[parallel_size, 8, img_size // patch_size, img_size // patch_size],
    )
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
    dec = np.clip((dec + 1) / 2 * 255, 0, 255)

    visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    # Save each generated image to disk
    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        save_path = os.path.join('generated_samples', "img_{}.jpg".format(i))
        PIL.Image.fromarray(visual_img[i]).save(save_path)
# Generate images
generate(
vl_gpt,
vl_chat_processor,
prompt,
)
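The call above uses the function's defaults: 16 images sampled in parallel at 384x384 resolution with a classifier-free guidance weight of 5 (the batch holds parallel_size * 2 rows because every sample keeps a conditional and an unconditional copy for guidance). If you want fewer images or stronger prompt adherence, the keyword arguments can be overridden; the values below are illustrative choices, not recommendations from the Janus authors:
# Illustrative override of the default sampling settings
generate(
    vl_gpt,
    vl_chat_processor,
    prompt,
    parallel_size=4,  # generate 4 images instead of 16 to save time and memory
    cfg_weight=7.0,   # a higher guidance weight pushes outputs closer to the prompt
)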
Best Practices for Deployment
1. Model Optimization:
- Utilize mixed precision (e.g., torch.bfloat16) to reduce memory usage while maintaining numerical stability.
- Consider using inference-specific frameworks such as TensorRT for better performance.
2. Security and Compliance:
- Secure endpoints and data flow to prevent unauthorized access.
- Implement data sanitization to protect sensitive information.
3. Data Augmentation:
- Enhance image data with transformations like cropping, flipping, and scaling.
- Augment text data by introducing paraphrasing and synonym replacements.
4. Monitoring and Logging:
- Track and log inference requests and responses for better observability.
- Monitor system performance metrics such as memory usage, latency, and throughput; a minimal logging sketch follows this list.
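As an illustration of the monitoring point above, here is a minimal sketch (our own example, not part of the Janus repository) that wraps any generation call with latency and peak-GPU-memory logging using only the Python standard library and PyTorch's built-in counters:
import logging
import time
import torch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("janus-inference")

def timed_generate(generate_fn, *args, **kwargs):
    """Run a generation callable and log its latency and peak GPU memory."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    result = generate_fn(*args, **kwargs)
    latency = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0
    logger.info("latency=%.2fs peak_gpu_mem=%.2fGB", latency, peak_gb)
    return result

# Example usage with the text-to-image function defined earlier:
# timed_generate(generate, vl_gpt, vl_chat_processor, prompt)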
Conclusion
DeepSeek’s Janus-Pro provides a robust platform for multimodal AI applications, from text processing to image generation. By following this guide and adopting best practices, you can unlock the full potential of this powerful model for your AI projects.