Blog | Peter Baumgartner

Reporting With Quarto

Recently I’ve been using Quarto to generate HTML reports to share with stakeholders and I’d like to share a workflow and configuration that has worked for me. Background: Most of my programming I do in VSCode. A typical workflow involves using an IPython REPL for exploratory coding, organizing commonly used code into a package, and creating CLI functions with typer as an abstraction for common tasks. This approach solves 90% of my problems on data science projects and gives me a replicable workflow that’s easily modified....

Wrapping a Rust Crate in a Python Package

In this blog post I’ll walk through step-by-step how I wrapped the voronoice Rust crate and created a Python package with it. It’s written as a development journal, where I walk through my thought process and document the code and errors we’re getting along the way. My hope is that this makes the process a little more approachable for beginners and adds some transparency to the process. Motivation: Right now I primarily use scipy....

Learning Rust With Advent of Code 2022

This year I decided to participate in Advent of Code (AoC) and use it as an opportunity to learn Rust. Since I was learning a new language, I also decided to try using GitHub Copilot within VSCode. AoC is a series of 25 daily puzzles that typically need to be solved through programming. Each daily puzzle consists of two parts. This year I fully solved 18 of the 25 daily puzzles and solved part one of an additional 2 puzzles, so I submitted 39 of a possible 50 answers....

Bootstrapped Sampling for Annotation: A Multiverse of Madness?

Note: If you haven’t read my prior post on the bootstrap and inter-rater reliability, this post probably won’t make sense. Go read that first. After I began discussing my last post with a few people, they started replicating my analysis. Their replications raised an important issue I hope to address here: estimation is dependent on the initial sample of annotations that we have. What I failed to mention is that in the original example we were annotating data in a specific universe....

"How much data do I need to label?" - The Bootstrapped Inter-Rater Reliability Answer

One of the most frequent questions that arises when doing applied ML projects is “How much data do I need to label?” When I get asked this question, I usually ask a few questions in return: what’s the base rate of the outcome that you’re labeling? Are you experimenting or building a production-ready system? How ambiguous or well-defined is your annotation task? Is all of your data ready to be annotated, or do you need to figure out some preprocessing to get an annotation-ready dataset?...

Reasons to Blog: A Rebuttal to Myself

“Writing has so much to give, so much to teach, so many surprises. That thing you had to force yourself to do—the actual act of writing—turns out to be the best part. It’s like discovering that while you thought you needed the tea ceremony for the caffeine, what you really needed was the tea ceremony. The act of writing turns out to be its own reward.” - Anne Lamott...

What Should Data Scientists Learn?

This week my pal Vicki tweeted this, which I disagreed with: “The funny part is that every single response to this is correct. Should you learn Python? Yeah. Should you learn K8s? Yeah. Should you focus on SQL? Yeah. https://t.co/i9KHigqLXL” (Vicki, @vboykis, April 20, 2022). I responded that beginners shouldn’t learn K8s, and Vicki said they might not have a choice depending on their job. I summed up my primary objection in a reply:...

An Introduction to Just Enough Cython to be Useful

Since starting work at Explosion, I’ve been trying to learn more about Cython. About 16% of spaCy’s codebase is Cython, so I decided to pick up a book and learn from that. I did a few example projects and started thinking: now that types are cool in python, why don’t more people use Cython? In case you’re unfamiliar with Cython, here’s my incremental and oversimplified explanation of what Cython can do:...

Python Virtual Environment & Packaging Workflow

In this blog post I’ll walk you through the workflow I use for managing virtual environments and creating python packages. I have a long list of criteria I’ve used to develop this workflow and have honed it over time since the start of my career as a data scientist. The factors that I’ve determined are critically important for choosing these tools are as follows: They work for me. Yep, that’s it....

My Personal History with NLP or Side-Effects of Good API Design

I’m joining Explosion AI as a Machine Learning Engineer. This is my first career move in 6 years and I thought I’d take some time to reflect on my personal experience in data science and natural language processing. Since I’ve been in data science, I’ve been working in professional services/consulting environments. My last job was working mostly with social scientists and researchers to incorporate machine learning into their research projects. Consulting takes the “jack of all trades, master of none” spirit of data science and cranks it up to 11 by having to work across multiple projects....

Two Logging Options Better than Print Statements

This is a short blog post on two things I’ve found helpful when I’m running python code outside of notebooks and want to log things. Loguru: Here’s an example of how I use loguru. Typically I don’t include all the boilerplate at the top, but to illustrate some of the functionality I’m including it. In this snippet I’m changing the default logger format by adding the time elapsed since program start....

Ways I Use Testing as a Data Scientist

In my work, writing tests serves three purposes: making sure things work, documenting my understanding, and preventing future errors. When I was starting out with testing, I had a hard time understanding what I should be writing tests for. As a beginner, I just assumed my code worked: I was staring right at the output in a notebook and visually inspecting that the output was correct. After gaining some experience writing tests, I realized one of my problems initially was that knowing what to test requires some experience in knowing what can go wrong....

Notes on NLP for Survey Design

I’ve worked on several projects where we’re applying some natural language processing techniques on responses to open-ended survey items. Typically this means putting them into categories, either following a pre-existing coding scheme or by creating a new one with unsupervised learning. Through these experiences, I’ve developed a few principles for working on these types of problems. Put Yourself Out Of Business: There’s not a project I’ve been on where my recommendation wasn’t “If you want better data, turn this open-ended question into a close-ended question that asks about the specific thing you care about....

Incorporating Julia Into Python Programs

Context: I’ve recently been experimenting with porting portions of a simulation codebase from python to Julia. Setting up a productive development environment, using the packages (PyJulia & PyCall) that allow for communicating between python and Julia, and familiarizing myself with Julia enough to use those packages took quite a bit of time and experimentation. Here’s my collection of notes including stumbling blocks, adaptations, and things I took forever to understand to make this process easier for others in the future....

The Fastest Way to Load Data Into Your Django Project using PostgreSQL

tl;dr: Load data up to 77x faster with django-postgres-copy and an in-memory CSV. When starting a new Django project, often my first step is to create a model and bulk load in some existing data. As I’ve learned more about Django and databases, I’ve learned a few ways to speed up the data loading process. In this post I’ll walk through progressively more efficient ways of loading data, and at the end of the post measure the performance of the methods using fake data generated by the wonderful Faker library....
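To make the "in-memory CSV" part of that tl;dr concrete, here is a minimal sketch that uses only the standard library. It is not the post's code: the helper name `rows_to_csv_buffer`, the field names, and the sample records are all hypothetical, and it stops at building the buffer, before the step where django-postgres-copy would load it into PostgreSQL.

```python
import csv
import io


def rows_to_csv_buffer(rows, fieldnames):
    """Serialize a list of dict rows into an in-memory CSV text buffer.

    Hypothetical helper for illustration; nothing is written to disk.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)  # rewind so a consumer reads from the first byte
    return buffer


# Hypothetical records; the post generates its fake data with Faker.
rows = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Grace", "email": "grace@example.com"},
]
buffer = rows_to_csv_buffer(rows, ["name", "email"])
print(buffer.getvalue())
```

The point of the buffer is that a file-like object can be handed to a bulk loader without a round trip through the filesystem. How it is passed to the database depends on the loader being used.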