Blog | Peter Baumgartner

Tableau and Raw Data as Sneaky Technical Debt

We work on a lot of projects where we need to iterate quickly on visualizations generated from a flat file of data. Tableau is a great tool to get started with this, but its ease of use can quickly lead people towards illusions of progress, misunderstandings of the underlying data, or technical debt. Usually this comes up in a conversation like this: Them: I want to build this specific type of chart that represents this information....

Making the Most of spaCy's Rule-Based Matcher

TL;DR The Rule-Based Matcher in spaCy is awesome when you have small datasets, need to explain your algorithm, locate specific language patterns within a document, favor performance and speed, and you’re comfortable with the token attributes needed to write rules. I created a notebook runnable in binder with a worked example on a dataset of product reviews from Amazon that replicates a workflow I successfully used on a recent project....

Applied NLP: Lessons from the Field

These are the main points and resources for my talk, Applied NLP: Lessons from the Field, delivered at spaCy IRL 2019. Summary: Natural Language Processing projects often fail in their conception, in their delivery, or in their impact. To identify good candidate problems for NLP, talk to the client presenting you with a problem and first discuss how you would solve the problem without NLP. To successfully deliver an NLP project, acknowledge that project management is a skill and learn how to do it well: communicate the uncertainty you face and come up with metaphors to explain your work to non-technical stakeholders....

Fine-Tuning GPT-2 Small for Generative Text

Why did the chicken cross the road? Because it had no legs. These are the types of hilarious jokes the GPT-2 small model can generate for you. After reading a few blog posts here and here, and playing around with GPT-2 small myself, I thought I would write up the full process I used to fine-tune and produce generative text. For this example, we’ll use a dataset of jokes pulled from the /r/jokes subreddit to fine-tune the GPT-2 small model to generate new jokes....

Potential Issues with Criminal Justice Data

Summary: I wrote this document based on my experience validating a risk assessment instrument. These were some of the issues, rewritten for generalizability, that we encountered. Accessible, quality data is often a project bottleneck, and I’ve found these helpful to consider before working on a project with criminal justice data. Comments: I wrote these notes in the context of evaluating a risk assessment instrument, so “risk factors” are independent variables and “outcomes” are dependent variables....

Notes on NLP Projects

Summary: These are some notes, combined with my own experience and commentary, derived from Matthew Honnibal’s PyData Berlin 2018 talk: Building new NLP solutions with spaCy and Prodigy. I intended to use these as a reference when starting new NLP projects. In NLP and ML we talk a lot about models and optimization. But this isn't where the battle is really won! I've been trying to explain my thoughts on this lately....

Holy NLP! Understanding Part of Speech Tags, Dependency Parsing, and Named Entity Recognition

Introduction: When we think of data science, we often think of statistical analysis of numbers. But, more and more frequently, organizations generate a lot of unstructured text data that can be quantified and analyzed. A few examples are social network comments, product reviews, emails, and interview transcripts. For analyzing text, data scientists often use Natural Language Processing (NLP). In this blog post we’ll walk through 3 common NLP tasks and look at how they can be used together to analyze text....

Word Embeddings Explainer

What are word embeddings? Imagine if every word had an address you could look up in an address book. Now also imagine if words that shared meaning lived in the same neighborhood. This is a simplified metaphor for word embeddings. For a visual example, here are simplified word embeddings for common 4- and 5-letter English words. I’ve drawn 3 neighborhoods over this embedding to illustrate the semantic groupings. What are they good for?...
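To make the neighborhood metaphor concrete, here is a minimal sketch in plain Python. The 2-D "addresses" in `toy_embeddings` are invented for illustration and do not come from any trained model or from the post; `cosine_similarity` and `nearest_neighbor` are likewise hypothetical helper names. Real embeddings have hundreds of dimensions, but the idea of looking up the closest address is the same.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-D "addresses" for three words (an assumption for illustration).
toy_embeddings = {
    "apple": (0.9, 0.1),
    "pear": (0.8, 0.2),
    "truck": (0.1, 0.9),
}


def nearest_neighbor(word, embeddings):
    """Return the other word whose vector is most similar to `word`."""
    target = embeddings[word]
    candidates = (other for other in embeddings if other != word)
    return max(candidates, key=lambda other: cosine_similarity(target, embeddings[other]))


print(nearest_neighbor("apple", toy_embeddings))  # pear
```

With these toy vectors, "apple" and "pear" point in nearly the same direction, so they end up as neighbors, while "truck" lives elsewhere.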

How to Test IPython Magic

TL;DR If you want to test ipython magics you can do the following: import the global ipython app with from IPython.testing.globalipapp import get_ipython; create an object with the global ipython app with ip = get_ipython(); load your magic with ip.magic('load_ext your_magic_name'); run your magic with ip.run_line_magic('your_magic_function', 'your_magic_arguments'); and (optionally) access the results of your magic with ip.user_ns (the ipython user namespace). An example test using pytest looks like this: import pytest from IPython....

An Exploration in Earth & Word Movers Distance

This post will be an exploration into Earth Mover’s Distance as well as its application to NLP problems through Word Mover’s Distance. To get started, we’ll follow the benign pedagogical path of copying the Wikipedia definition: The earth mover’s distance (EMD) is a measure of the distance between two probability distributions over a region D. In mathematics, this is known as the Wasserstein metric. Informally, if the distributions are interpreted as two different ways of piling up a certain amount of dirt over the region D, the EMD is the minimum cost of turning one pile into the other; where the cost is assumed to be amount of dirt moved times the distance by which it is moved....
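The dirt-moving definition above can be sketched for the simplest case: two piles over the same evenly spaced 1-D bins. In one dimension the minimum cost equals the sum of absolute differences between the running totals of the two piles, so no optimizer is needed. This is a sketch under that 1-D assumption, not the post's own code; the function name `earth_movers_distance_1d` and the example piles are hypothetical.

```python
from itertools import accumulate


def earth_movers_distance_1d(p, q, spacing=1.0):
    """EMD between two piles defined over the same evenly spaced 1-D bins.

    Assumes both piles hold the same total amount of dirt.
    """
    if len(p) != len(q):
        raise ValueError("both piles must use the same bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("both piles must contain the same amount of dirt")
    # Running surplus of dirt that has to be carried past each bin boundary.
    surplus = accumulate(pi - qi for pi, qi in zip(p, q))
    # Each unit of surplus travels one bin width, so cost = amount * distance.
    return spacing * sum(abs(s) for s in surplus)


pile_a = [0.5, 0.5, 0.0, 0.0]
pile_b = [0.0, 0.0, 0.5, 0.5]
print(earth_movers_distance_1d(pile_a, pile_b))  # 2.0
```

Here every half unit of dirt moves two bins to the right, so the cost is 0.5 × 2 + 0.5 × 2 = 2.0. For higher-dimensional regions, or for Word Mover’s Distance, the minimum has to be found by solving a transport problem instead.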

The Impact of Model Output Transformations on ROC

Risk Assessment tools are currently used to assist in decision-making at several points in the criminal justice system. These tools take in some data about an individual and provide a ‘risk score’ that reflects their likelihood of committing a specific behavior in the future. A standard outcome of interest is recidivism, or a person's relapse into criminal behavior, often after the person receives sanctions or undergoes intervention for a previous crime (NIJ)....

Some Tips for Using Jupyter Notebooks with Pelican

Note (2018-04-09): I no longer use Pelican as the engine to build my blog, so you won't be able to see parts of this workflow in this blog's repository. The inspiration for ideas in this post is captured in this notebook from Chris Albon. Switching static site generators is a great way to kill a few hours on the weekend. I was previously using Jekyll because it works seamlessly with Github Pages, but I’m a python person so I figured I’d learn something new and move everything over to Pelican....

PyData Carolinas Recap & Presentation Reflection

I was fortunate enough this year to attend the first PyData Carolinas conference, though my attendance was only made possible with the development and delivery of a tutorial talk. My colleague Rob and I proposed and then delivered a 90-minute tutorial talk on using NetworkX to do Social Network Analysis in Python (repo, video coming soon). The talk was aimed at intermediate users of python with some experience with data and the python language....

Creating Slack Slash Commands with Python and Flask: Part 1

Note (2018-04-09): Slack's API has changed since I wrote this article. I also never wrote a part 2. If you want an up-to-date tutorial, this blog from DigitalOcean is good. Part 1: Setting Up Our Workflow and a Simple Application A few weekends ago my pet project was to set up a drive time slash command in Slack. Searching through our organization’s Slack conversation history, on top of overhearing several conversations, it seems like traffic is both a source of anguish and a favorite topic for small talk in our office....