Nov 2024
47m 6s

Code Generation & Synthetic Data With Lo...

Neil Leiser
About this episode

Our guest today is Loubna Ben Allal, Machine Learning Engineer at Hugging Face 🤗.

In our conversation, Loubna first explains how she built two impressive code generation models: StarCoder and StarCoder2. We dig into the importance of data when training large models and what can be done on the data side to improve LLM performance.

We then dive into synthetic data generation and discuss its pros and cons. Loubna explains how she built Cosmopedia, a fully synthetic dataset generated using Mixtral 8x7B.

Loubna also shares career mistakes, advice, and her take on the future of developers and code generation.

If you enjoyed the episode, please leave a 5-star review and subscribe to the AI Stories YouTube channel.

Cosmopedia Dataset: https://huggingface.co/blog/cosmopedia

StarCoder blog post: https://huggingface.co/blog/starcoder

Follow Loubna on LinkedIn: https://www.linkedin.com/in/loubna-ben-allal-238690152/

Follow Neil on LinkedIn: https://www.linkedin.com/in/leiserneil/  

---

(00:00) - Intro

(02:00) - How Loubna Got Into Data & AI

(03:57) - Internship at Hugging Face

(06:21) - Building A Code Generation Model: StarCoder

(12:14) - Data Filtering Techniques for LLMs

(18:44) - Training StarCoder

(21:35) - Will GenAI Replace Developers? 

(25:44) - Synthetic Data Generation & Building Cosmopedia

(35:44) - Evaluating a 1B Params Model Trained on Synthetic Data

(43:43) - Challenges Faced & Career Advice

