Mar 2020
54m 8s

Scaling Data Governance For Global Busin...

Tobias Macey
About this episode

Summary

Data governance is a complex endeavor, and scaling it to meet the needs of a complex or globally distributed organization requires a well-considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. Treating compliance as a graph problem, where each hub in the network has localized control and inherits higher-level controls, reduces overhead and provides greater flexibility. Tim provides useful examples for understanding how to adopt this approach in your own organization, including some technology recommendations for making it maintainable and scalable. If you are struggling to scale data quality controls and governance requirements, then this interview will provide some useful ideas to incorporate into your roadmap.
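As a rough illustration of the idea in the summary, the sketch below models hubs as nodes in a tree, each with its own local controls plus whatever it inherits from the hubs above it. It is not taken from the episode: the `Hub` class, the control names, and the rule that a local value overrides an inherited one are all assumptions made for the example, and a real compliance regime might instead forbid local hubs from weakening a higher-level control.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Hub:
    """A hypothetical governance hub: one node in the graph of hubs."""
    name: str
    parent: Optional["Hub"] = None
    local_controls: Dict[str, str] = field(default_factory=dict)

    def effective_controls(self) -> Dict[str, str]:
        """Merge inherited controls with local ones.

        Assumption for this sketch: a local value overrides an inherited one.
        """
        inherited = self.parent.effective_controls() if self.parent else {}
        return {**inherited, **self.local_controls}


# Illustrative hierarchy: global -> eu -> de (names and controls are made up).
global_hub = Hub("global", local_controls={"pii_masking": "required", "retention_days": "365"})
eu_hub = Hub("eu", parent=global_hub, local_controls={"retention_days": "90", "data_residency": "eu-only"})
de_hub = Hub("de", parent=eu_hub, local_controls={"audit_log": "enabled"})

print(de_hub.effective_controls())
```

In this sketch the `de` hub ends up with the global masking rule, the EU retention and residency rules, and its own audit setting, without any of them being redefined locally.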

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Tim Ward about using an architectural pattern called data hub that allows for scaling data management across global businesses

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of the goals of a data hub architecture?
  • What are the elements of a data hub architecture and how do they contribute to the overall goals?
    • What are some of the patterns or reference architectures that you drew on to develop this approach?
  • What are some signs that an organization should implement a data hub architecture?
  • What is the migration path for an organization that has an existing data platform but needs to scale its governance and localize storage and access?
  • What are the features or attributes of an individual hub that allow for them to be interconnected?
    • What is the interface presented between hubs to allow for accessing information across these localized repositories?
  • What is the process for adding a new hub and making it discoverable across the organization?
  • How is discoverability of data managed within and between hubs?
  • If someone wishes to access information between hubs or across several of them, how do you prevent data proliferation?
    • If data is copied between hubs, how are record updates accounted for to ensure that they are replicated to the hubs that hold a copy of that entity?
    • How are access controls and data masking managed to ensure that various compliance regimes are honored?
    • In addition to compliance issues, another challenge of distributed data repositories is the question of latency. How do you mitigate the performance impacts of querying across multiple hubs?
  • Given that different hubs can have differing rules for quality, cleanliness, or structure of a given record, how do you handle transformations of data as it traverses different hubs?
    • How do you address issues of data loss or corruption within those transformations?
  • How is the topology of a hub infrastructure arranged and how does that impact questions of data loss through multiple zone transformations, latency, etc.?
  • How do you manage tracking and reporting of data lineage within and across hubs?
  • For an organization that is interested in implementing their own instance of a data hub architecture, what are the necessary components of an individual hub?
    • What are some of the considerations and useful technologies that would assist in creating and connecting hubs?
      • Should the hubs be implemented in a homogeneous fashion, or is there room for heterogeneity in their infrastructure as long as they expose the appropriate interface?
  • When is a data hub architecture the wrong approach?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

