Name: 180: Reinforcement Learning
Description: Programming Throwdown · Episode · 2025

Intro topic: Grills

News/Links:

Book of the Show

Patrick:
- The Player of Games (Iain M. Banks)
  - https://a.co/d/1ZpUhGl (non-affiliate)
Jason:
- Basic Roleplaying: Universal Game Engine
  - https://amzn.to/3ES4p5i

Tool of the Show

Topic: Reinforcement Learning

Three types of machine learning
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
Online vs Offline RL
Optimization algorithms
- Value optimization
  - SARSA
  - Q-Learning
- Policy optimization
  - Policy Gradients
  - Actor-Critic
  - Proximal Policy Optimization
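
A minimal sketch of the two value-optimization updates listed above, in tabular form. The Q-table, state names, action names, and the ALPHA/GAMMA values are made up for illustration. The only difference between the two is the bootstrap target: Q-Learning uses the best next action, SARSA uses the next action actually taken.

```python
import copy

ALPHA = 0.1   # learning rate (illustrative value)
GAMMA = 0.99  # discount factor (illustrative value)

def q_learning_update(q, s, a, r, s_next):
    # Off-policy: bootstrap from the best action available in the next state.
    target = r + GAMMA * max(q[s_next].values())
    q[s][a] += ALPHA * (target - q[s][a])

def sarsa_update(q, s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action the agent actually takes next.
    target = r + GAMMA * q[s_next][a_next]
    q[s][a] += ALPHA * (target - q[s][a])

# Hypothetical two-state Q-table: q[state][action] -> estimated return.
q_table = {
    "start": {"left": 0.0, "right": 0.0},
    "next": {"left": 1.0, "right": 2.0},
}

q_ql = copy.deepcopy(q_table)
q_learning_update(q_ql, "start", "right", 0.0, "next")

q_sa = copy.deepcopy(q_table)
sarsa_update(q_sa, "start", "right", 0.0, "next", "left")

# Q-Learning moves the estimate further because it assumes the best next action.
print(q_ql["start"]["right"], q_sa["start"]["right"])
```
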
Value vs Policy Optimization
- Value optimization is more intuitive (value loss)
- Policy optimization is less intuitive at first (policy gradients)
- Converting values to policies in deep learning is difficult
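
To make the policy-gradient side a bit more concrete, here is a rough REINFORCE sketch on a two-armed bandit. The reward values, learning rate, step count, and seed are all assumptions chosen for the example. Note that nothing here predicts a value: the update directly raises the log-probability of actions in proportion to the reward they earned.

```python
import math
import random

def softmax(prefs):
    # Numerically stable softmax over action preferences.
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_reinforce(rewards, steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)  # policy parameters, one per action
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        for k in range(len(prefs)):
            # Gradient of log pi(action) with respect to prefs[k] for a softmax policy.
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            prefs[k] += lr * reward * grad_log_prob
    return softmax(prefs)

# Hypothetical bandit: arm 0 always pays 0.0, arm 1 always pays 1.0.
final_probs = train_reinforce(rewards=[0.0, 1.0])
print(final_probs)
```
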
Imitation Learning
- Supervised policy learning
- Often used to bootstrap reinforcement learning
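
One very small way to picture supervised policy learning is behavior cloning over a table: treat each expert (state, action) pair as a labeled example and pick the most common action per state. The demonstrations and state/action names below are invented for illustration, and a real system would usually fit a classifier rather than count.

```python
from collections import Counter, defaultdict

def behavior_clone(demos):
    # Count how often the expert chose each action in each state.
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    # The cloned policy picks the expert's most frequent action per state.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Hypothetical expert demonstrations as (state, action) pairs.
demos = [
    ("low_fuel", "refuel"),
    ("low_fuel", "refuel"),
    ("low_fuel", "drive"),
    ("full_tank", "drive"),
]

cloned_policy = behavior_clone(demos)
print(cloned_policy)
```

A policy like this could then serve as the starting point for reinforcement learning instead of a random one.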
Policy Evaluation
- Propensity scoring versus model-based evaluation
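
A sketch of the two evaluation styles on logged data, assuming a context-free setting to keep it short. The logs, propensities, and target policy are hypothetical. Propensity scoring reweights logged rewards by how much more (or less) likely the new policy is to take the logged action; the model-based version first fits a reward model (here just a per-action mean) and then averages it under the new policy.

```python
from collections import defaultdict

def ips_estimate(logs, target_policy):
    # Inverse propensity scoring: reweight each logged reward by
    # target probability / logging probability of the logged action.
    total = 0.0
    for action, reward, propensity in logs:
        total += target_policy[action] / propensity * reward
    return total / len(logs)

def model_based_estimate(logs, target_policy):
    # Fit a trivial reward model (mean logged reward per action),
    # then take its expectation under the target policy.
    sums, counts = defaultdict(float), defaultdict(int)
    for action, reward, _ in logs:
        sums[action] += reward
        counts[action] += 1
    reward_model = {a: sums[a] / counts[a] for a in sums}
    return sum(p * reward_model[a] for a, p in target_policy.items())

# Hypothetical logs: (action, reward, probability the logging policy chose it).
logs = [("a", 1.0, 0.5), ("b", 0.0, 0.5), ("a", 1.0, 0.5), ("b", 1.0, 0.5)]
target_policy = {"a": 0.8, "b": 0.2}

ips_value = ips_estimate(logs, target_policy)
model_value = model_based_estimate(logs, target_policy)
print(ips_value, model_value)
```

The two agree on this toy data; in practice propensity scoring tends to be noisier, while the model-based estimate inherits any bias in the reward model.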
Challenges in training RL models
- Two optimization loops
  - Collecting feedback vs updating the model
- Difficult optimization target
  - Policy evaluation
RLHF & GRPO
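
The part of GRPO that is easiest to show in a few lines is the group-relative advantage: sample several responses to the same prompt, score each one, and normalize the scores within the group instead of training a separate value model. This is only a sketch of that step. The reward values are made up, using the population standard deviation is an assumption, and the clipped policy update and KL penalty are left out.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward against the other samples for the same prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled responses to one prompt.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Responses that beat the group average get a positive advantage and are reinforced; those below it are pushed down.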
