Jan 2025
54m 40s

CSVs Will Never Die And OneSchema Is Cou...

Tobias Macey
About this episode
Summary
In this episode of the Data Engineering Podcast, Andrew Luo, CEO of OneSchema, talks about handling CSV data in business operations. Andrew shares his background in data engineering and CRM migration, which led to the creation of OneSchema, a platform designed to automate CSV imports and improve data validation processes. He discusses the challenges of working with CSVs, including inconsistent type representation, lack of schema information, and technical complexities, and explains how OneSchema addresses these issues using multiple CSV parsers and AI for data type inference and validation. Andrew highlights the business case for OneSchema, emphasizing efficiency gains for companies dealing with large volumes of CSV data, and shares plans to expand support for other data formats and integrate AI-driven transformation packs for specific industries.


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. 
  • Your host is Tobias Macey and today I'm interviewing Andrew Luo about how OneSchema addresses the headaches of dealing with CSV data for your business
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Despite the years of evolution and improvement in data storage and interchange formats, CSVs are just as prevalent as ever. What are your opinions/theories on why they are so ubiquitous?
  • What are some of the major sources of CSV data for teams that rely on them for business and analytical processes?
  • The most obvious challenge with CSVs is their lack of type information, but they are notorious for having numerous other problems. What are some of the other major challenges involved with using CSVs for data interchange/ingestion?
  • Can you describe what you are building at OneSchema and the story behind it?
    • What are the core problems that you are solving, and for whom?
  • Can you describe how you have architected your platform to be able to manage the variety, volume, and multi-tenancy of data that you process?
    • How have the design and goals of the product changed since you first started working on it?
  • What are some of the major performance issues that you have encountered while dealing with CSV data at scale?
  • What are some of the most surprising things that you have learned about CSVs in the process of building OneSchema?
  • What are the most interesting, innovative, or unexpected ways that you have seen OneSchema used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on OneSchema?
  • When is OneSchema the wrong choice?
  • What do you have planned for the future of OneSchema?
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA