MBRL World Models Planning

Production-grade Model-Based Reinforcement Learning system with learned world models (VAE + RSSM) and planning (MCTS, CEM) for sample-efficient decision making in continuous control environments.

MBRL World Models Planning

๐Ÿ—๏ธI am still working on this project!

The Idea:

  • Imagine a robot watching 100 videos of a pendulum swinging
  • Instead of trying the pendulum 1 million times, it learns a "mental model"
  • Then it uses this mental model to plan what to do (like humans thinking ahead)

What It Does?

  • Takes pixel images from a camera
  • Learns a compressed world model (learns how the world works)
  • Plans ahead without real interactions (imagines future in its head)
  • Executes the best action in the real world

World Model Architecture

INPUT: Camera image (64ร—64 pixels)
  โ†“
[1] VAE: Image Compressor 
  โ†’ Compress 12,288 pixels โ†’ 32 numbers (like JPEG, but smarter)
  โ†“
[2] RSSM: Physics Learner
  โ†’ Learns "if you move left, what happens next?"
  โ†’ Predicts future 20 steps ahead
  โ†“
[3] Reward Predictor: Value Estimator
  โ†’ Predicts "how good is this state?"
  โ†“
[4] Planners: Decision Makers (CEM + MCTS)
  โ†’ Imagine 1000 action sequences
  โ†’ Pick the best one
  โ†“
OUTPUT: Best action to take in real world

How It Works

Cycle 1:

1. Collect 64 real interactions (random exploration)
2. Train VAE on images โ†’ learn 32D representation
3. Train RSSM on sequences โ†’ learn dynamics
4. Train Reward predictor
5. Plan 1000 imagined rollouts using trained models
6. Execute best-planned action in real world

Cycle 2, 3, 4...:

Same process, but:
- More data collected (256 steps total)
- Better world model (more accurate)
- Better planning (less uncertain)
- Better reward (converges)

Result After 3 Cycles:

  • โœ“ Pendulum agent learns to balance
  • โœ“ Used only 256 real steps
  • โœ“ Imagined 6400+ steps (free)
  • โœ“ Competitive with PPO (uses 5000 steps)

Planning in Latent Space

Given a trained world model, the agent plans entirely inside the learned latent space:

  • MCTS (Monte Carlo Tree Search): Builds a search tree of imagined futures, selecting actions that maximize expected cumulative reward.
  • CEM (Cross-Entropy Method): Iteratively refines action sequences by sampling, evaluating in imagination, and fitting a new distribution to the best candidates.

Both planners operate without any further interaction with the real environment during planning.

Training Pipeline

Real Env โ†’ Collect trajectories โ†’ Encode with VAE
                                        โ†“
                              Train RSSM on sequences
                                        โ†“
                              Train Reward Model
                                        โ†“
                    Plan with MCTS / CEM in latent space
                                        โ†“
                        Execute best action โ†’ Repeat

Output Images

VAE Reconstruction
VAE Reconstruction
RSSM Latent Comparison
RSSM Latent Comparison
Reward Comparison
Reward Comparison