Sep 2021
1h 4m

Massively Parallel Data Processing In Py...

Tobias Macey
About this episode

Summary

Python has become the de facto language for working with data. That has brought with it a number of challenges around the speed and scalability of working with large volumes of information. There have been many projects and strategies for overcoming these challenges, each with its own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any re-work.
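
To make the "no re-work" idea concrete, here is a minimal sketch assuming Bodo's @bodo.jit decorator applied to ordinary Pandas code; the file path and column names are hypothetical placeholders, not examples from the episode.

```python
# Minimal sketch: annotate standard Pandas code with Bodo's JIT decorator so it
# can be compiled and run in parallel across processes (assumed usage; paths and
# columns are illustrative).
import bodo
import pandas as pd

@bodo.jit
def daily_totals(path):
    df = pd.read_parquet(path)                    # read is distributed across workers
    totals = df.groupby("date")["amount"].sum()   # aggregation runs on partitioned data
    return totals

if __name__ == "__main__":
    print(daily_totals("sales.parquet"))          # hypothetical input file
```

The point of the sketch is that the function body is plain Pandas; the decorator, rather than a rewrite, is what introduces the parallelism.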

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering teams or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
  • Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 – Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
  • Your host is Tobias Macey and today I’m interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing Python code for massively parallel data processing and analytics.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Bodo is and the story behind it?
  • What are the techniques/technologies that teams might use to optimize or scale out their data processing workflows?
  • Why have you focused your efforts on the Python language and toolchain?
    • Do you see any potential for expanding into other language communities?
    • What are the shortcomings of projects such as Dask and Ray for scaling out Python data projects? (a brief sketch of that scale-out style appears after this list)
  • Many people are familiar with the principles of HPC architectures, but can you share an overview of the current state of the art for HPC?
    • What are the tradeoffs of HPC vs scale-out distributed systems?
  • Can you describe the technical implementation of the Bodo platform?
    • What are the aspects of the Python language and package ecosystem that have complicated the work of building an optimizing compiler?
      • How do you handle compiled extensions? (e.g. C/C++/Fortran)
    • What are some of the assumptions/expectations that you had when first approaching this project that have been challenged as you progressed through its implementation?
  • How do you handle data distribution for scale out computation?
  • What are some software architecture/programming patterns that act as bottlenecks/optimization cliffs for parallelization?
  • What are some of the educational challenges that you have run into while working with potential and current customers?
  • What are the most interesting, innovative, or unexpected ways that you have seen Bodo used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bodo?
  • When is Bodo the wrong choice?
  • What do you have planned for the future of Bodo?
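
For the Dask and Ray question above, a common point of comparison is the scale-out style those libraries encourage. The sketch below is a hedged illustration using Dask's drop-in dataframe API; the file and column names are hypothetical.

```python
# Illustrative scale-out pattern with Dask: swap Pandas for dask.dataframe and
# trigger distributed execution explicitly with .compute() (names are illustrative).
import dask.dataframe as dd

ddf = dd.read_parquet("transactions.parquet")           # lazily partitioned dataframe
per_customer = ddf.groupby("customer_id")["amount"].sum()
print(per_customer.compute().head())                    # .compute() materializes the result
```

The contrast explored in the interview is between this explicit task-scheduling model and Bodo's approach of compiling unmodified Pandas code for parallel execution.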

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
