An end-to-end exploratory data project using R and Python

Image by Author


“Let’s order Thai.”

“Great, what’s your go-to dish?”

“Pad Thai.”

This has bugged me for years and is the genesis for this project.

People need to know they have other choices aside from Pad Thai. Pad Thai is one of 53 individual dishes and stopping there risks missing out on at least 201 shared Thai dishes (source: Wikipedia).

This project is an opportunity to build a data set of Thai dishes by scraping tables off Wikipedia. We will use Python for web scraping and R for visualization. …
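To give a feel for the scraping step, here is a minimal sketch that pulls rows out of an HTML table using only Python's standard library. It is not the project's actual scraper: the `TableParser` class, the `parse_table` helper, the column names and the two sample rows are all illustrative assumptions, and the HTML is an inline string rather than a live Wikipedia page.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample table; a real page would be downloaded first.
SAMPLE_HTML = """
<table>
  <tr><th>English name</th><th>Thai script</th></tr>
  <tr><td>Pad Thai</td><td>ผัดไทย</td></tr>
  <tr><td>Kua Gai</td><td>คั่วไก่</td></tr>
</table>
"""


def parse_table(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


dishes = parse_table(SAMPLE_HTML)
print(dishes)
```

A list of dicts like `dishes` is easy to write out as CSV, which is one simple way to hand the data over to R for plotting.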

Pitfalls to avoid for new data scientists

Image by Andrew Ridley on Unsplash

Build a project portfolio.

Arguably the most pervasive advice in data science.

Listening to the excellent Build a Career in Data Science Podcast, I was surprised to learn few people heed this advice.

A portfolio showcases your interests, skills and abilities to reason about data. It can convince a hiring manager to give you a chance, and it’s also effective for learning and joining a community.

So why do few take this advice?

Some theories:

1. Peak performance happens when you’re stretched beyond your comfort zone, but not too much into the panic zone (see Yerkes-Dodson).

Let’s do better


“Let’s order Thai.”

“Great, what’s your go-to dish?”

“Pad Thai.”

This has bugged me for years.

Pad Thai shouldn’t be your first choice of Thai food.

Like turkey on Thanksgiving, most Pad Thai is overrated. Instead of a bang, it’s a whimper.

There, I said it.

Pad Thai was created in the 1930s to cultivate a sense of nationalism and combat rice shortages by promoting a noodle dish. Through a stroke of marketing genius, “Thai” found its way into the name.

So, what’s the alternative?

What you actually want is Kua Gai (คั่วไก่). Unlike Pad Thai, this stir fry dish…

Using R to visualize disparities in student debt and college attainment

Data suggests student debt bites twice.

First, stalling wealth creation.

Second, if it prevents people from finishing college, this further sets back wealth creation.

Previously, I examined differences in college degree attainment between White, Black and Hispanic Americans.

Image by Author

The Widening Gap [1] led to a hypothesis:

Wealth inequality is positively related to the widening gap in college degree attainment among the three groups.

Data on Families with Student Loan Debt allows us to indirectly support or contradict our hypothesis.

Here are the results.

African American families are shouldering more student debt over the years than Hispanic or White families [2].

In light of recent euphoria, here’s a compelling bear case.

Photo by Michael Dziedzic on Unsplash

I’m bullish Bitcoin and Ethereum.

And any technology to redistribute power, resist censorship and preserve privacy.

In light of the current crypto euphoria, I’d like to entertain the best bear case I’ve heard. Paraphrasing Demetri Kofinas, host of Hidden Forces:

The U.S. dollar’s legitimacy comes from the government’s ability to level force and violence. Men with guns can demand your private keys.

When a gun is pointed at our face, will cryptography save us? I’d add, this is true for any nation state.

Demetri’s point is well taken.

I don’t…

Using R and Python to visualize the relationship between Market Cap and Hourly Cost to Attack

Image by Author


In this post, I use Python and R to access, parse, manipulate, then visualize data showing the strong relationship between Market Capitalization and Cost to Attack among public crypto networks.

The more a network is thought to be worth, the more expensive it is to attack. An important but often overlooked reason to celebrate price gains.


In this post, I query an API endpoint to get JSON data. Then, I use Python to parse it and convert it to a dataframe.
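As a rough sketch of the parsing step, the snippet below turns a JSON payload into a list of flat records. Everything specific here is an assumption for illustration: the `SAMPLE_RESPONSE` string stands in for the API response, and the field names (`name`, `market_cap`, `attack_hourly_cost`) and network names are made up rather than taken from the real endpoint.

```python
import json

# Hypothetical payload; the real API response may be shaped differently.
SAMPLE_RESPONSE = """
[
  {"name": "NetworkA", "market_cap": 1000000, "attack_hourly_cost": 500},
  {"name": "NetworkB", "market_cap": 250000, "attack_hourly_cost": null},
  {"name": "NetworkC", "market_cap": 40000, "attack_hourly_cost": 12.5}
]
"""


def to_records(payload):
    """Parse a JSON array and keep only rows that have both numeric fields."""
    records = []
    for row in json.loads(payload):
        cap = row.get("market_cap")
        cost = row.get("attack_hourly_cost")
        if cap is None or cost is None:
            continue  # skip rows with missing values instead of guessing
        records.append(
            {"name": row["name"], "market_cap": float(cap), "attack_hourly_cost": float(cost)}
        )
    return records


records = to_records(SAMPLE_RESPONSE)
print(records)
```

If pandas is installed, passing a list of dicts like this to `pandas.DataFrame(records)` gives a dataframe with one column per key.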

Rule-based Sentiment Analysis Using Python and R

Image by Author


Why Sentiment Analysis?

NLP is a subfield of linguistics, computer science and artificial intelligence (wiki), and you could spend years studying it.

However, I wanted a quick dive to get an intuition for how NLP works, and we’ll do that via sentiment analysis, categorizing text by its polarity.
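To make "rule-based" concrete, here is a toy sketch of polarity scoring in plain Python. It is an illustration only, not the approach or library used in the post: the tiny `LEXICON`, the `NEGATIONS` set and the rule that a negation flips the next sentiment word are all simplifying assumptions.

```python
import re

# Toy lexicon (an assumption): +1 for positive words, -1 for negative ones.
LEXICON = {"love": 1, "great": 1, "happy": 1, "hate": -1, "terrible": -1, "sad": -1}
NEGATIONS = {"not", "never", "no"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def polarity(text):
    """Sum lexicon scores; a negation flips the next sentiment word it meets."""
    score = 0
    negate = False
    for token in tokenize(text):
        if token in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(token, 0)
        if value:
            score += -value if negate else value
            negate = False
    return score


def classify(text):
    """Map the numeric score onto a polarity label."""
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for post in ["I love this great day", "This is not great", "The sky is blue"]:
    print(post, "->", classify(post))
```

Real rule-based tools use much larger lexicons and more careful rules for negation, intensifiers and punctuation, but the basic shape (tokenize, look up, aggregate) is the same.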

We can’t help but feel motivated to see insights about our own social media posts, so we’ll turn to a well-known platform.

How well does Facebook know us?

To find out, I downloaded 14 years of posts to apply text and sentiment analysis. We’ll use Python to read and parse JSON data from Facebook.

We’ll perform tasks such as tokenization…

Use R to find out which metrics drive people to click on your profile

Image by Author

Overview & Setup

This post uses various R libraries and functions to help you explore your Twitter Analytics data. The first thing to do is download your data. The assumption here is that you’re already a Twitter user and have been using it for at least 6 months.

Once there, you’ll click on the Tweets tab, which should bring you to your Tweet activity with the option to Export data:

Using code to develop a feel for how machine learning optimization works

Photo by Fineas Anton on Unsplash


In this post, we’ll explore Gradient Descent from the ground up starting conceptually, then using code to build up our intuition brick by brick.

While this post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus, for this post I am drawing on external sources, including Aurélien Géron’s Hands-On Machine Learning, to provide context for why and when gradient descent is used.

We’ll also be using external libraries such as numpy, which are generally avoided in Data Science from Scratch, to help highlight concepts.
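As a first taste of the idea, here is a minimal gradient descent sketch in plain Python (no numpy needed at this size). The function f(x) = (x - 3)^2, the starting point, the learning rate and the step count are all arbitrary choices for illustration, not values from either book.

```python
def f(x):
    """A simple bowl-shaped function with its minimum at x = 3."""
    return (x - 3) ** 2


def gradient(x):
    """Derivative of f: d/dx (x - 3)^2 = 2 * (x - 3)."""
    return 2 * (x - 3)


def gradient_descent(start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


minimum = gradient_descent(start=10.0)
print(minimum, f(minimum))
```

Each step shrinks the distance to the minimum by a constant factor here (0.8 with this learning rate), which is why the estimate settles very close to 3. A learning rate that is too large would overshoot instead, which is worth trying by hand.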

While the book introduces gradient…

Exploring the BBC’s Top 100 Influential Women of 2020 with interactive plots

Image by Author


This is a quick walkthrough of using the sunburstR package to create sunburst plots in R. The original document is written in RMarkdown, which is an interactive version of markdown.

The following code can be run in RMarkdown or an R script. For interactive visuals, you’ll want to use RMarkdown.

Load Libraries

The two main libraries are tidyverse (mostly dplyr, so you can just load that if you want) and sunburstR. There are other packages for sunburst plots, including plotly and ggsunburst (of ggplot), but we'll explore sunburstR in this post.


Load Data & Explore

The data is from week 50 of TidyTuesday

Paul Apivat

Data-Informed People Decisions
