
Feb 2022

39m 35s

Scaling Knowledge Management For Technic...

About this episode

Summary

One of the most persistent challenges faced by organizations of all sizes is the recording and distribution of institutional knowledge. In technical teams this is exacerbated by the need to incorporate technical review feedback and manage access to data before publishing. When faced with this problem as an early data scientist at Airbnb, Chetan Sharma helped create the Knowledge Repo project as a solution. In this episode he shares the story behind its creation and growth, how and why it was released as open source, and the features that make it a compelling option for your own team’s knowledge management journey.

Announcements

Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Your host as usual is Tobias Macey and today I’m interviewing Chetan Sharma about Knowledge Repo, an open source framework for managing documentation for technical users.

Interview

Introductions
How did you get introduced to Python?
- EE + CS/AI + Stats degrees
- Airbnb working on ML models
- Knowledge Repo itself
Can you describe what Knowledge Repo is and the story behind it?
- We started seeing interviewees use IPython notebooks, thought they were great
- Wanted to push more people to use notebooks, but they weren’t very shareable or vettable
- Existing notebook hosting services weren’t very good, and weren’t built for people who aren’t data stakeholders. They were especially poor with images and cluttered with annoying cell blocks
- Made a simple post processor to remove cell blocks, push the images to S3, and host on Flask
- Once we were pushing notebooks into a GitHub repo for hosting on a Flask app, so many things became possible
  - Review cycles
  - Shareability / collaboration features
  - Indexing / searching
- Concurrently, great work was happening on developing internal R packages / Python libraries to provide consistent, branded aesthetics
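The "simple post processor" described above (drop the code cells, pull out the images, keep the narrative) could be sketched roughly as follows. This is an illustration only, not Knowledge Repo's actual code: the function name `notebook_to_markdown` and the `upload_image` callback (standing in for a push to S3) are assumptions made for the example. Only the notebook JSON layout (nbformat 4) is standard.

```python
import base64
from typing import Callable


def _text(value) -> str:
    # nbformat stores multi-line strings either as a str or as a list of lines.
    return "".join(value) if isinstance(value, list) else value


def notebook_to_markdown(nb: dict, upload_image: Callable[[str, bytes], str]) -> str:
    """Render a notebook dict (nbformat 4 layout) as markdown without code cells.

    `upload_image` is a hypothetical callback: it receives a file name and the
    raw PNG bytes and returns the URL to embed (for example after an S3 upload).
    """
    parts = []
    image_count = 0
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "markdown":
            # Markdown cells are kept verbatim.
            parts.append(_text(cell.get("source", "")))
        elif cell.get("cell_type") == "code":
            # The code input is dropped; only the rendered outputs are kept.
            for output in cell.get("outputs", []):
                data = output.get("data", {})
                if "image/png" in data:
                    image_count += 1
                    png = base64.b64decode(_text(data["image/png"]))
                    url = upload_image(f"image_{image_count}.png", png)
                    parts.append(f"![image_{image_count}]({url})")
                elif output.get("output_type") == "stream":
                    # Plain text output becomes an indented markdown block.
                    text = _text(output.get("text", "")).rstrip("\n")
                    parts.append("\n".join("    " + line for line in text.split("\n")))
    return "\n\n".join(parts) + "\n"
```

Passing the uploader in as a callback keeps the conversion step testable without any cloud credentials; a real deployment would supply a function that writes to whatever storage the team uses.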
What are some of the approaches that teams typically take for recording and sharing institutional knowledge?
- Copy and paste to Google Docs, slides
- Facebook was using Facebook photo albums
- untrustworthy, not discoverable, divorced from the code
What are the unique requirements that are introduced when attempting to record and distribute learnings related to data such as A/B experiments, analytical methods, data sets, etc.?
- Reproducibility is a big one
- Making sure the learnings are trustworthy (good data? no bugs?)
- Distributing widely, across the org and across time
- Experimentation
  - Experimentation is at the end of a research-design-build-measure cycle; strategic analysis often comes before it
  - Capturing all of the context
Can you describe how the Knowledge Repo project is architected?
- Repositories: a store of posts, most commonly a GitHub repo
- Markdown as original lingua franca, eventually a KR specific “KR post” concept (which is still basically markdown)
- Post processors
  - Convert whatever upstream file to markdown / KR post (Jupyter notebook, R Markdown, markdown were the original ones)
  - Handle images and other large assets, usually pushing them to cloud storage
  - Evolved to handle PDFs, Google Docs, Keynote files
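One way to picture the "convert whatever upstream file to markdown" layer is a small registry that dispatches on file extension. The sketch below is an assumption-laden illustration, not the project's real plugin API: `CONVERTERS`, `converter`, and `to_post` are hypothetical names, and the notebook converter only keeps markdown cells to stay short.

```python
import json
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: file extension -> function turning file text into markdown.
CONVERTERS: Dict[str, Callable[[str], str]] = {}


def converter(*extensions: str):
    """Decorator that registers a converter for one or more file extensions."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        for ext in extensions:
            CONVERTERS[ext.lower()] = func
        return func
    return register


@converter(".md")
def from_markdown(text: str) -> str:
    # Markdown is already the target format, so it passes through unchanged.
    return text


@converter(".ipynb")
def from_notebook(text: str) -> str:
    # Simplified: keep only the markdown cells of an nbformat 4 notebook.
    cells = json.loads(text).get("cells", [])
    chunks = []
    for cell in cells:
        if cell.get("cell_type") == "markdown":
            source = cell.get("source", "")
            chunks.append("".join(source) if isinstance(source, list) else source)
    return "\n\n".join(chunks)


def to_post(path: str, text: str) -> str:
    """Pick a converter by file extension and return markdown for the post."""
    ext = Path(path).suffix.lower()
    if ext not in CONVERTERS:
        raise ValueError(f"no converter registered for {ext!r}")
    return CONVERTERS[ext](text)
```

Under this shape, supporting a new upstream format means registering one more function, which matches the episode's point that new formats were added over time.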
What were the motivating factors for making it available as an open source project?
- It was such a common problem. Even incredibly sophisticated data teams at Uber, Facebook, etc. were begging us to share the system.
What is the workflow for creating, sharing, and discovering information in an installation of Knowledge Repo?
- Create a GitHub repo for hosting strategic analysis
- Use the KR script to create a stub/template for whatever format you’re working in
- Do your work in Jupyter, etc.
- Instead of using git commands (git add) use knowledge commands (knowledge add), which are basically the git commands with postprocessors
- Do typical GitHub workflows
- See the result in the hosted knowledge repo app
What are some of the options available for extending or customizing an installation of Knowledge Repo?
- More postprocessors! Google Docs, presentations, UX research, anything can be done in KR with a simple postprocessor to turn it to markdown/images/PDF
- Tying the system to your internal data tools. For example, an experimentation system like Eppo or whatever you use for marketing campaigns
If you were to start over today, what are some of the ways that you might approach the solution to knowledge management differently?
- Think of it more holistically:
What are the most interesting, innovative, or unexpected ways that you have seen Knowledge Repo used?
- UX research
- Writing up a guide for acquihiring
- Demonstrating capabilities, data framework
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Knowledge Repo?
- Strategic analysis needs to be elevated, this leads to paradigm changes
- Organizational problems are helped by tools like KR, e.g. promotions
- Meeting people’s tools/workflows where they are is powerful
When is Knowledge Repo the wrong choice?

Keep In Touch

Picks

Tobias
- Learning Guitar
Chetan
- Underrated cooking ingredients: chickpea flour, butter fried kimchi (in grilled cheese, nachos)

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
