Jul 2020
21m 10s

What data transformation library should ...

FRANCESCO GADALETA
About this episode

In this episode I speak about the data transformation frameworks available to data scientists who write Python code.
The usual suspect is clearly Pandas, the most widely used library and the de facto standard. However, when data volumes grow and computation is distributed (following a map-reduce paradigm), Pandas no longer performs as expected, and other frameworks come into play.

I explain which frameworks are the best equivalents to Pandas in big data contexts.
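
For listeners unfamiliar with the map-reduce paradigm mentioned above, here is a minimal, single-machine sketch in plain Python. It is not taken from the episode and does not use any of the frameworks discussed; the records and the helper names (`partition`, `map_partial_sum`, `reduce_sums`) are made up for illustration. The idea is that each chunk is aggregated independently (the map step) and the partial results are then merged (the reduce step), which is what lets a distributed framework spread the work across machines.

```python
from collections import Counter
from functools import reduce

# Hypothetical (key, value) records standing in for rows of a large table.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]


def partition(rows, n_parts):
    """Split rows into n_parts chunks (stand-ins for partitions on a cluster)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def map_partial_sum(chunk):
    """Map step: sum values per key within one chunk, independently of the others."""
    partial = Counter()
    for key, value in chunk:
        partial[key] += value
    return partial


def reduce_sums(left, right):
    """Reduce step: merge two partial aggregates into a new one."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds counts key by key
    return merged


# On a real cluster each chunk would be processed by a different worker.
partials = [map_partial_sum(chunk) for chunk in partition(records, 3)]
totals = dict(reduce(reduce_sums, partials, Counter()))

print(totals)
```

A group-by sum works well here because partial sums can be merged in any order; operations that need to see all rows at once are harder to distribute this way.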

Don't forget to join our Discord channel to comment on previous episodes or propose new ones.

 

This episode is supported by Amethix Technologies

Amethix works to create and maximize the impact of the world’s leading corporations, startups, and nonprofits, so they can create a better future for everyone they serve. Amethix is a consulting firm focused on data science, machine learning, and artificial intelligence.

 
