Brought to You By:
• Statsig – The unified platform for flags, analytics, experiments, and more.
• Sonar – The makers of SonarQube, the industry standard for automated code review.
• WorkOS – Everything you need to make your app enterprise ready.
—
Amazon S3 is one of the largest distributed systems ever built, storing and serving data for a significant portion of the internet. Behind its simple interfaces lies an enormous amount of engineering work, careful tradeoffs, and long-term thinking.
In this episode, I sit down with Mai-Lan Tomsen Bukovec, VP of Data and Analytics at AWS, who has been running Amazon S3 for more than a decade. Mai-Lan shares how S3 operates at extreme scale, what it takes to design for durability and availability across millions of servers, and why building for failure is a core principle.
We also go deep into how AWS approaches correctness using formal methods, how storage tiers and limits shape system design, and why simplicity remains one of the hardest and most important goals at S3’s scale.
—
Timestamps
(00:00) Intro
(01:03) S3’s scale
(03:58) How S3 started
(07:25) Parquet, Iceberg, and S3 Tables
(09:46) S3 for developers
(13:37) Why AWS keeps S3 prices low
(17:10) AWS pricing tiers
(19:38) Availability and durability
(26:21) The cost of S3’s consistency
(31:22) Automated reasoning and proof of correctness
(35:14) Durability at AWS scale
(39:58) Correlated failure and crash consistency
(43:22) Failure allowances
(46:04) Two opposing principles in S3 design
(49:09) S3’s evolution
(52:21) S3 Vectors
(1:01:16) The 50 TB limit on AWS
(1:07:54) The simplicity principle
(1:10:10) Types of engineers working on S3
(1:14:15) Closing recommendations
—
The Pragmatic Engineer deepdives relevant for this episode:
• Inside Amazon’s engineering culture
• How AWS deals with a major outage
• A Day in the Life of a Senior Manager at Amazon
• What is a Principal Engineer at Amazon? – with Steve Huynh
• Working at Amazon as a software engineer – with Dave Anderson
Amazon papers recommended by Mai-Lan:
• Using lightweight formal methods to validate a key-value storage node in Amazon S3
• Formally verified cloud-scale authorization
• Analyzing metastable failures
—
Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@pragmaticengineer.com.