Join Mark and Allen on Two Voice Devs this week as they delve into a critical discussion about data scraping, large language models (LLMs), and the ethical responsibilities of developers. From the recent controversy surrounding BlueSky data scraping and Hugging Face datasets to the complexities of copyright law and personal privacy in the age of AI, this episode explores the gray areas and tough questions facing developers today. Hear their perspectives on the potential misuse of publicly available data, the challenges of anonymization, and the importance of upholding ethical standards in a rapidly evolving technological landscape. They also share personal anecdotes about navigating privacy policies and the dilemmas of data collection for business versus personal use. Tune in to gain valuable insights and contribute to the conversation about responsible development practices.

[00:00:00] Introduction

[00:01:04] Mark's deep dive into BlueSky's architecture and the data scraping controversy.

[00:02:27] Discussion on BlueSky's data policy and user ownership.

[00:05:32] Copyright implications of using scraped data in LLMs.

[00:06:22] Exploring ethical data sources for LLM training (Wikipedia, Reddit, etc.).

[00:08:31] Real-world examples of potential copyright infringement in image and video generation.

[00:09:34] Hugging Face's guidelines and the removal of the BlueSky dataset.

[00:12:19] The curious case of the "David Mayer" bug in ChatGPT and its implications for data privacy.

[00:14:24] Allen's personal dilemma with Vodo Drive's privacy policy and data collection for model training.

[00:16:50] Balancing business needs with ethical data practices.

[00:17:00] Allen's challenge gathering Gemini release notes and his ethical solution.

[00:19:20] The ethical responsibilities of software engineers, drawing parallels to the Challenger disaster.

[00:21:19] The developer's role in advocating for ethical data usage.

[00:22:21] Call to action: Share your thoughts and perspectives!

#DataScraping #LLMs #AIethics #DeveloperEthics #Privacy #Copyright #BlueSky #HuggingFace #SoftwareEngineering #DataPrivacy #AI #TwoVoiceDevs #Podcast #TechPodcast #WebSockets #DataScience #EthicalAI #ResponsibleAI #TechEthics #Gemini #GoogleAI

Episode 219 - The Ethics of Data Scrapin...