MBRL World Models Planning
Production-grade Model-Based Reinforcement Learning system with learned world models (VAE + RSSM) and planning (MCTS, CEM) for sample-efficient decision making in continuous control environments.
I am still working on this project!
The Idea:
- Imagine a robot watching 100 videos of a pendulum swinging
- Instead of trying the pendulum itself 1 million times, it learns a "mental model" of how the pendulum moves
- Then it uses this mental model to plan what to do (like humans thinking ahead)
What It Does
- Takes pixel images from a camera
- Learns a compressed world model (learns how the world works)
- Plans ahead without real interactions (imagines future in its head)
- Executes the best action in the real world
World Model Architecture
INPUT: Camera image (64×64 pixels)
↓
[1] VAE: Image Compressor
└─ Compresses 12,288 input values (64×64×3) → 32 numbers (like JPEG, but smarter)
↓
[2] RSSM: Physics Learner
├─ Learns "if you move left, what happens next?"
└─ Predicts the future 20 steps ahead
↓
[3] Reward Predictor: Value Estimator
└─ Predicts "how good is this state?"
↓
[4] Planners: Decision Makers (CEM + MCTS)
├─ Imagine 1000 action sequences
└─ Pick the best one
↓
OUTPUT: Best action to take in real world
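The data flow through stages [1] to [3] can be sketched with untrained stand-ins. This is only an illustration of the shapes involved, not the project's code: the weights are random, the hidden size and action size are assumptions, and the transition is deterministic, whereas an actual RSSM also samples a stochastic latent at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

IMAGE_SHAPE = (64, 64, 3)   # 64 * 64 * 3 = 12,288 input values
LATENT_DIM = 32             # size of the compressed representation
HIDDEN_DIM = 64             # assumption: size of the recurrent state
ACTION_DIM = 1              # assumption: a single torque for the pendulum
HORIZON = 20                # imagined steps, as in the diagram

# Random, untrained weights standing in for learned parameters.
W_enc = rng.normal(0, 0.01, (int(np.prod(IMAGE_SHAPE)), LATENT_DIM))
W_h = rng.normal(0, 0.1, (HIDDEN_DIM + LATENT_DIM + ACTION_DIM, HIDDEN_DIM))
W_z = rng.normal(0, 0.1, (HIDDEN_DIM, LATENT_DIM))
w_r = rng.normal(0, 0.1, LATENT_DIM + HIDDEN_DIM)


def encode(image):
    """[1] Stand-in for the VAE encoder mean: image -> 32 numbers."""
    return image.reshape(-1) @ W_enc


def rssm_step(h, z, action):
    """[2] One imagined transition: update the recurrent state, predict the next latent."""
    h_next = np.tanh(np.concatenate([h, z, action]) @ W_h)
    z_next = h_next @ W_z
    return h_next, z_next


def predict_reward(h, z):
    """[3] Scalar reward estimate from the model state."""
    return float(np.concatenate([z, h]) @ w_r)


def imagine(image, actions):
    """Roll the model forward for len(actions) steps without touching the environment."""
    h, z = np.zeros(HIDDEN_DIM), encode(image)
    total = 0.0
    for a in actions:
        h, z = rssm_step(h, z, a)
        total += predict_reward(h, z)
    return total


image = rng.random(IMAGE_SHAPE)
actions = rng.uniform(-1, 1, (HORIZON, ACTION_DIM))
latent = encode(image)
imagined_return = imagine(image, actions)
print(latent.shape, imagined_return)
```

A planner in stage [4] would call something like `imagine` once per candidate action sequence and keep the sequence with the highest imagined return.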
How It Works
Cycle 1:
1. Collect 64 real interactions (random exploration)
2. Train VAE on images → learn 32D representation
3. Train RSSM on sequences → learn dynamics
4. Train Reward predictor
5. Plan 1000 imagined rollouts using trained models
6. Execute best-planned action in real world
Cycle 2, 3, 4...:
Same process, but:
- More data collected (256 steps total)
- Better world model (more accurate)
- Better planning (less uncertain)
- Higher reward (converges over cycles)
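The cycle structure can be sketched as a plain loop. The function names `collect`, `train` and `plan_and_act` are hypothetical placeholders, and the number of real steps per cycle is left as a parameter because the text does not say how the total is split across cycles.

```python
def run_cycles(num_cycles, steps_per_cycle, collect, train, plan_and_act):
    """Alternate real data collection, model training, and planning."""
    buffer = []   # all real transitions gathered so far
    log = []      # (cycle, real steps so far, reward) per cycle
    for cycle in range(1, num_cycles + 1):
        buffer.extend(collect(steps_per_cycle))  # step 1: real interactions
        model = train(buffer)                    # steps 2-4: VAE, RSSM, reward predictor
        reward = plan_and_act(model)             # steps 5-6: plan in imagination, act for real
        log.append((cycle, len(buffer), reward))
    return log


# Hypothetical stubs so the sketch runs; a real system would plug in
# environment rollouts, model training, and a planner here.
def collect(n):
    return [("obs", "action", 0.0)] * n


def train(buffer):
    return {"num_transitions": len(buffer)}


def plan_and_act(model):
    return float(model["num_transitions"])  # placeholder value, not a real reward


log = run_cycles(num_cycles=3, steps_per_cycle=64, collect=collect,
                 train=train, plan_and_act=plan_and_act)
for cycle, real_steps, reward in log:
    print(cycle, real_steps, reward)
```

The point of the sketch is that the replay buffer only grows, so each cycle retrains the world model on everything seen so far before planning again.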
Result After 3 Cycles:
- ✅ Pendulum agent learns to balance
- ✅ Used only 256 real steps
- ✅ Imagined 6400+ steps (free)
- ✅ Competitive with PPO (which uses 5000 steps)
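Taking the figures above at face value, the sample-efficiency ratios work out as follows. This is arithmetic on the reported numbers only, not a new measurement.

```python
real_steps = 256        # real environment steps reported above
imagined_steps = 6400   # lower bound on imagined steps reported above
ppo_steps = 5000        # PPO step count quoted above

ppo_ratio = ppo_steps / real_steps            # how many times more real steps PPO uses
imagined_ratio = imagined_steps / real_steps  # imagined steps per real step (at least)

print(ppo_ratio)       # 19.53125
print(imagined_ratio)  # 25.0
```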
Planning in Latent Space
Given a trained world model, the agent plans entirely inside the learned latent space:
- MCTS (Monte Carlo Tree Search): Builds a search tree of imagined futures, selecting actions that maximize expected cumulative reward.
- CEM (Cross-Entropy Method): Iteratively refines action sequences by sampling, evaluating in imagination, and fitting a new distribution to the best candidates.
Both planners operate without any further interaction with the real environment during planning.
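A minimal CEM sketch is shown below. It plans against a toy one-dimensional model instead of the learned RSSM, and the elite fraction, iteration count and action bounds are assumptions chosen for illustration. The population of 1000 and horizon of 20 follow the numbers quoted earlier.

```python
import numpy as np


def cem_plan(dynamics, reward, state, horizon=20, action_dim=1,
             population=1000, elite_frac=0.1, iterations=5, seed=0):
    """Return (first action, full mean plan) found by the cross-entropy method."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    n_elite = max(1, int(population * elite_frac))
    for _ in range(iterations):
        # Sample candidate action sequences, clipped to the assumed bounds [-1, 1].
        noise = rng.standard_normal((population, horizon, action_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        # Evaluate every candidate in imagination using the model only.
        states = np.repeat(state[None], population, axis=0)
        returns = np.zeros(population)
        for t in range(horizon):
            states = dynamics(states, actions[:, t])
            returns += reward(states)
        # Refit the sampling distribution to the best candidates.
        elite = actions[np.argsort(returns)[-n_elite:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean[0], mean


# Toy stand-ins for the learned model: the state drifts by 0.1 * action,
# and the reward is highest when the state is at zero.
def toy_dynamics(states, actions):
    return states + 0.1 * actions


def toy_reward(states):
    return -np.sum(states ** 2, axis=1)


start = np.array([1.0])
first_action, plan = cem_plan(toy_dynamics, toy_reward, start)

# Replay the mean plan through the toy model to see where it ends up.
final = start[None]
for a in plan:
    final = toy_dynamics(final, a[None])
print(first_action, final)
```

In a receding-horizon setup only `first_action` would be executed in the real environment, and planning would restart from the next observation.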
Training Pipeline
Real Env → Collect trajectories → Encode with VAE
↓
Train RSSM on sequences
↓
Train Reward Model
↓
Plan with MCTS / CEM in latent space
↓
Execute best action → Repeat
Output Images

VAE Reconstruction

RSSM Latent Comparison

Reward Comparison