May 2018
17m 46s

MLA 003 Storage: HDF, Pickle, Postgres

OCDevel
About this episode

A practical workflow for loading, cleaning, and storing large datasets for machine learning: from ingesting raw CSV or JSON files with pandas to saving processed datasets and neural network weights in HDF5 for efficient numerical storage. The episode distinguishes among the storage options, explaining when to use HDF5, pickle files, or SQL databases, and highlights how libraries like pandas, TensorFlow, and Keras interact with these formats and why these choices matter for production pipelines.

Data Ingestion and Preprocessing
  • Data Sources and Formats:

    • Datasets commonly originate as CSV (comma-separated values), TSV (tab-separated values), fixed-width files (FWF), JSON from APIs, or directly from databases.
    • Typical applications include structured data (e.g., real estate features) or unstructured data (e.g., natural language corpora for sentiment analysis).
  • Pandas as the Core Ingestion Tool:

    • Pandas provides versatile functions such as read_csv, read_json, and others to load various file formats with robust options for handling edge cases (e.g., file encodings, missing values).
    • After loading, data cleaning is performed using pandas: dropping or imputing missing values, and converting boolean and categorical columns to numeric form (see the sketch after this list).
  • Data Encoding for Machine Learning:

    • All features must be numerical before being supplied to machine learning models like TensorFlow or Keras.
    • Categorical data is one-hot encoded using pandas.get_dummies, converting strings to binary indicator columns.
    • The underlying NumPy array of a DataFrame is accessed via df.values for direct integration with modeling libraries.
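A minimal sketch of this load, clean, and encode path in pandas, assuming a hypothetical housing.csv with price, sqft, has_garage, and city columns:

    import pandas as pd

    # Load a hypothetical CSV; read_csv exposes options for encodings,
    # separators, missing-value markers, and other edge cases
    df = pd.read_csv("housing.csv")

    # Clean: drop rows missing the target, impute a numeric feature
    df = df.dropna(subset=["price"])
    df["sqft"] = df["sqft"].fillna(df["sqft"].median())

    # Encode: booleans become 0/1, categorical strings become one-hot columns
    df["has_garage"] = df["has_garage"].astype(int)
    df = pd.get_dummies(df, columns=["city"])

    # Hand the underlying NumPy arrays to the modeling library
    X = df.drop(columns=["price"]).values
    y = df["price"].values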
Numerical Data Storage Options
  • HDF5 for Storing Processed Arrays:

    • HDF5 (Hierarchical Data Format version 5) enables efficient storage of large multidimensional NumPy arrays.
    • Libraries like h5py and built-in pandas functions (to_hdf) allow seamless saving and retrieval of arrays or DataFrames.
    • TensorFlow and Keras use HDF5 by default to store neural network weights as multi-dimensional arrays for model checkpointing and early stopping, enabling robust recovery and rollback (see the storage sketches after this list).
  • Pickle for Python Objects:

    • Python's pickle protocol serializes arbitrary objects, including machine learning models and arrays, into files for later retrieval.
    • While convenient for quick iterations or heterogeneous data, pickle is less efficient than HDF5 for NumPy arrays, lacks built-in compression, and poses a security risk: unpickling an untrusted file can execute arbitrary code.
  • SQL Databases and Spreadsheets:

    • For mixed or heterogeneous data, or when producing results for sharing and collaboration, relational databases like PostgreSQL or spreadsheet-compatible files such as CSVs are used.
    • Databases serve as the endpoint for production systems, where model outputs—such as generated recommendations or reports—are published for downstream use.
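A sketch of the two serialization paths above, with illustrative filenames; pandas' to_hdf and read_hdf assume the PyTables package is installed:

    import pickle

    import h5py
    import numpy as np
    import pandas as pd

    # HDF5 via pandas: efficient storage for a large numeric DataFrame
    df = pd.DataFrame(np.random.rand(100_000, 10))
    df.to_hdf("features.h5", key="train", mode="w")
    restored = pd.read_hdf("features.h5", key="train")

    # HDF5 via h5py: raw NumPy arrays, with optional compression
    with h5py.File("arrays.h5", "w") as f:
        f.create_dataset("X", data=df.values, compression="gzip")

    # Pickle: handles arbitrary Python objects, but is slower for large
    # arrays and unsafe to load from untrusted sources
    state = {"weights": df.values, "notes": "experiment 42"}
    with open("state.pkl", "wb") as f:
        pickle.dump(state, f)
    with open("state.pkl", "rb") as f:
        state = pickle.load(f)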
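And a sketch of publishing final results to PostgreSQL via pandas and SQLAlchemy; the connection string, table name, and columns are hypothetical, and a driver such as psycopg2 must be installed:

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection string: adjust user, password, host, and database
    engine = create_engine("postgresql://user:password@localhost:5432/mlpipeline")

    # Publish model outputs (e.g., recommendations) for downstream consumers
    results = pd.DataFrame({"user_id": [1, 2], "recommended_item": [42, 7]})
    results.to_sql("recommendations", engine, if_exists="replace", index=False)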
Storage Workflow in Machine Learning Pipelines
  • Typical Process:

    • Data is initially loaded and processed with pandas, then converted to numerical arrays suitable for model training.
    • Intermediate states and model weights are saved using HDF5 during model development and training, ensuring recovery from interruptions and facilitating early stopping (see the checkpointing sketch after this section).
    • Final outputs, especially those requiring sharing or production use, are published to SQL databases or shared as spreadsheet files.
  • Best Practices and Progression:

    • Quick project starts may use pickle for convenient storage during early experimentation.
    • For large-scale, high-performance applications, migration to HDF5 for numerical data and SQL for production-grade results is recommended.
    • Alternative options like Feather and PyTables (an interface on top of HDF5) exist for specialized needs.
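A minimal sketch of HDF5-backed checkpointing with early stopping in Keras; the data and architecture are placeholders, and newer Keras releases may require a .weights.h5 filename suffix:

    import numpy as np
    from tensorflow import keras

    # Placeholder training data and a toy architecture
    X_train = np.random.rand(1000, 10)
    y_train = np.random.rand(1000)
    model = keras.Sequential([
        keras.Input(shape=(10,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # Save the best weights to HDF5 after each improving epoch;
    # stop training when validation loss plateaus
    callbacks = [
        keras.callbacks.ModelCheckpoint("weights.h5", save_best_only=True,
                                        save_weights_only=True),
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    ]
    model.fit(X_train, y_train, validation_split=0.2, epochs=100,
              callbacks=callbacks, verbose=0)

    # Later: rebuild the same architecture and reload the best weights
    model.load_weights("weights.h5")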
Summary
  • HDF5 is optimal for numerical array storage due to its efficiency, built-in compression, and integration with major machine learning frameworks.
  • Pickle accommodates arbitrary Python objects but is suboptimal for numerical data persistence or security.
  • SQL databases and spreadsheets are used for disseminating results, especially when human consumption or application integration is required.
  • The selection of a storage format is determined by data type, pipeline stage, and end-use requirements within machine learning workflows.