Background & Motivation
- RNN Limitations: Sequential processing prevents full parallelization—even with attention tweaks—making them inefficient on modern hardware.
- Breakthrough: "Attention Is All You Need" replaced recurrence with self-attention, unlocking massive parallelism and scalability.
Core Architecture
- Layer Stack: Consists of alternating self-attention and feed-forward (MLP) layers, each wrapped in residual connections and layer normalization.
- Positional Encodings: Since self-attention is permutation invariant, add sinusoidal or learned positional embeddings to inject sequence order.
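For concreteness, here is a minimal NumPy sketch of the sinusoidal variant (variable names and shapes are illustrative, not taken from the episode):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed (seq_len, d_model) matrix of sine/cosine position encodings."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(d_model)[None, :]             # (1, d_model)
    # Each dimension pair oscillates at its own wavelength, from ~2*pi up to ~10000*2*pi.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions use cosine
    return pe

# Added to the token embeddings so the otherwise order-blind attention sees position:
# embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```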
Self-Attention Mechanism
- Q, K, V Explained:
  - Query (Q): The representation of the token seeking contextual info.
  - Key (K): The representation of tokens being compared against.
  - Value (V): The information to be aggregated based on the attention scores.
- Multi-Head Attention: Splits Q, K, V into multiple "heads" to capture diverse relationships and nuances across different subspaces.
- Dot-Product & Scaling: Computes the similarity between Q and K, scales it by the square root of the key dimension so large dot products don't saturate the softmax, then applies softmax to weight V accordingly (see the sketch below).
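A compact NumPy sketch of scaled dot-product attention plus a multi-head wrapper; the random projection matrices stand in for learned weights and all names are illustrative, not a reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)              # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)   # query-key similarity, scaled by sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)            # disallowed positions get ~zero weight
    weights = softmax(scores, axis=-1)                    # attention distribution over keys
    return weights @ V                                    # weighted sum of value vectors

def multi_head_attention(x, num_heads, rng=np.random.default_rng(0)):
    """Split d_model into num_heads subspaces, attend in each, then concatenate and re-project."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Random projections stand in for learned weight matrices in this sketch.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)
    heads = scaled_dot_product_attention(split(Q), split(K), split(V))
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo
```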
Masking
- Causal Masking: In autoregressive models, prevents a token from "seeing" future tokens, ensuring proper generation.
- Padding Masks: Ignore padded (non-informative) parts of sequences to maintain meaningful attention distributions.
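A small sketch of how both masks can be built and combined with the attention function above; the pad_id=0 convention is an assumption for illustration:

```python
import numpy as np

def causal_mask(seq_len):
    """True where attention is allowed: each token may attend only to itself and earlier tokens."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def padding_mask(token_ids, pad_id=0):
    """True for real tokens, False for padding; broadcasts against (seq_q, seq_k) score matrices."""
    return (token_ids != pad_id)[None, :]                 # shape (1, seq_k)

# Combined with the attention sketch above: a query sees a key only if both masks allow it.
# mask = causal_mask(len(token_ids)) & padding_mask(token_ids)
# scaled_dot_product_attention(Q, K, V, mask=mask)
```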
Feed-Forward Networks (MLPs)
- Transformation & Storage: Post-attention MLPs apply non-linear transformations; many argue they're where the "facts" or learned knowledge really get stored.
- Depth & Expressivity: Their layered nature deepens the model's capacity to represent complex patterns.
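A position-wise feed-forward block in NumPy; the 4x expansion and ReLU are common choices but illustrative here:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise MLP applied identically to every token.

    x: (seq_len, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model); d_ff is often 4 * d_model.
    """
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU non-linearity (newer models often use GELU)
    return hidden @ W2 + b2                 # project back down to d_model
```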
Residual Connections & Normalization
- Residual Links: Crucial for gradient flow in deep architectures, preventing vanishing/exploding gradients.
- Layer Normalization: Stabilizes training by normalizing across features, enhancing convergence.
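A sketch of layer normalization and a pre-norm residual wrapper; the original paper used the post-norm ordering, so this is only meant to show the pattern:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's feature vector to zero mean and unit variance, then rescale."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_block(x, sublayer, gamma, beta):
    """Pre-norm residual wrapper: x + sublayer(layer_norm(x)).

    The identity path keeps gradients flowing through deep stacks; the original
    paper used the post-norm ordering layer_norm(x + sublayer(x)) instead.
    """
    return x + sublayer(layer_norm(x, gamma, beta))
```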
Scalability & Efficiency Considerations
- Parallelization Advantage: Entire architecture is designed to exploit modern parallel hardware, a huge win over RNNs.
- Complexity Trade-offs: Self-attention's cost grows quadratically with sequence length, a challenge that has spurred innovations like sparse or linearized attention (rough numbers in the sketch below).
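A quick back-of-the-envelope loop makes the quadratic growth concrete, counting just the score matrix in float32; the numbers are rough and purely illustrative:

```python
# Score matrix alone, per head and per layer, stored in float32 (4 bytes per entry).
for seq_len in (1_000, 10_000, 100_000):
    entries = seq_len ** 2
    print(f"{seq_len:>7} tokens -> {entries:,} scores -> {entries * 4 / 1e9:.3f} GB")
# 10x the context length means 100x the scores, which is what motivates
# sparse, linearized, and other sub-quadratic attention variants.
```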
Training Paradigms & Emergent Properties
- Pretraining & Fine-Tuning: Massive self-supervised pretraining on diverse data, followed by task-specific fine-tuning, is the norm.
- Emergent Behavior: With scale comes abilities like in-context learning and few-shot adaptation, aspects that are still being unpacked.
Interpretability & Knowledge Distribution
- Distributed Representation: "Facts" aren't stored in a single layer but are embedded throughout both attention heads and MLP layers.
- Debate on Attention: While some see attention weights as interpretable, a growing view is that real "knowledge" is diffused across the network's parameters.