A collection of experimental projects for deploying and benchmarking ML/AI workloads on Modal — a serverless GPU computing platform.
test-modal/
├── llm/ # Deploy & benchmark LLMs (vLLM, SGLang)
├── rag/ # Vision RAG with ColPali + Qdrant
├── tts/ # Text-to-Speech with TensorRT-LLM (F5-TTS)
├── agentic_code/ # AI agent (LangGraph) + Figma integration
├── modal-mcp-tools/ # MCP server for Claude/AI tools
├── diffusion/ # Image generation (Flux)
├── test-grpc/ # gRPC & FastAPI server
├── triton/ # NVIDIA Triton inference
└── profiling/ # PyTorch profiling
Deploy and benchmark large language models on Modal with multi-GPU support.
- GPT-OSS-120B — vLLM, 1-4 A100-80GB GPUs, OpenAI-compatible API
- Qwen3-235B — AWQ quantization, 4 GPUs
- Benchmark — Compare throughput, latency, TTFT, TPOT between vLLM and SGLang
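The benchmark metrics listed above can be derived from per-token arrival timestamps. The following is a minimal sketch under the common definitions (TTFT = time until the first output token arrives; TPOT = mean gap between subsequent tokens), not the code in `llm/vllm_benchmark.py`. `RequestTrace` and the function names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical structure)."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def latency(trace: RequestTrace) -> float:
    """End-to-end latency: last token arrival minus send time."""
    return trace.token_times[-1] - trace.sent_at


def throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second over the wall-clock span of all requests."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)
```

Computing the same four numbers from traces recorded against each engine gives a like-for-like comparison, provided both runs use the same prompts and concurrency.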
modal deploy llm/vllm_llm_gpt_oss_120b.py
modal run llm/vllm_benchmark.py

Multi-modal RAG: upload PDFs, generate vision embeddings (ColPali), perform semantic search via Qdrant, and answer queries with OpenAI/Gemini.
- FastAPI + Gradio web UI
- Streaming responses (SSE)
- DeepResearch agent
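ColPali-style models produce one embedding per image patch and per query token, and pages are typically ranked with late-interaction (MaxSim) scoring. The sketch below shows that scoring in plain Python. It is an illustration of the general technique only: whether this project computes the score client-side or delegates it to Qdrant is not stated here, and `maxsim_score` and `rank_pages` are hypothetical names.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query token take the best-matching
    page patch, then sum those maxima over all query tokens."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: List[Vector], pages: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Return (page_id, score) pairs sorted from most to least relevant."""
    scored = [(pid, maxsim_score(query_vecs, vecs)) for pid, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In practice the vectors would come from the embedding model and the search would run inside the vector database; the toy version only makes the scoring rule concrete.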
modal deploy rag/main.py

Required secrets: openai, googlecloud-secret, qdrant-secret
F5-TTS with TensorRT-LLM + Triton Inference Server on L4 GPU.
modal run tts/trtllm_f5_tts.py # Build & test
modal deploy tts/trtllm_f5_tts.py # Deploy
python tts/test_client_http.py # Test

LangGraph-based agent that analyzes Figma designs and generates Python code. Uses Modal Sandbox for safe code execution.
modal run agentic_code/agent.py --question "Your question"

MCP server providing a flux_txt2img tool (text-to-image) for Claude and other AI assistants.
python modal-mcp-tools/main.py

- Python 3.10+
- Modal CLI
pip install modal
modal token new

| Category | Technologies |
|---|---|
| Platform | Modal, CUDA 12.4+ |
| Inference | vLLM, SGLang, TensorRT-LLM, Triton |
| Models | GPT-OSS-120B, Qwen3-235B, F5-TTS, Flux, ColPali |
| Frameworks | LangGraph, FastAPI, Gradio, FastMCP |
| Vector DB | Qdrant |