Pitfalls to avoid for new data scientists

Image by Andrew Ridley on Unsplash

Build a project portfolio.

Arguably the most pervasive advice in data science.

Listening to the excellent Build a Career in Data Science Podcast, I was surprised to learn few people heed this advice.

A portfolio showcases your interests, skills and abilities to reason about data. It can convince a hiring manager to give you a chance, and it’s also effective for learning and joining a community.

Some theories:

1. Peak performance happens when you’re stretched beyond your comfort zone, but not so far that you land in the panic zone (see Yerkes-Dodson).


Let’s do better

source: https://krua.co/recipe/ก๋วยเตี๋ยวคั่วไก่/

“Let’s order Thai.”

“Great, what’s your go-to dish?”

“Pad Thai.”

This has bugged me for years.

Pad Thai shouldn’t be your first choice of Thai food.

Like turkey on Thanksgiving, most Pad Thai is overrated. Instead of a bang, it’s a whimper.

There, I said it.

Pad Thai was created in the 1930s to cultivate a sense of nationalism and combat rice shortages by promoting a noodle dish. Through a stroke of marketing genius, “Thai” found its way into the name.

So, what’s the alternative?

What you want is Kua Gai (คั่วไก่). Unlike Pad Thai, this stir fry dish…


Using R to visualize disparities in student debt and college attainment

Data suggests student debt bites twice.

First, stalling wealth creation.

Second, if it prevents people from finishing college, this further sets back wealth creation.

Previously, I examined differences in college degree attainment between White, Black and Hispanic Americans.

Image by Author

The Widening Gap [1] led to a hypothesis:

Data on Families with Student Loan Debt allows us to indirectly support or contradict our hypothesis.

Here are the results.

African American families are shouldering more student debt over the years than Hispanic or White families [2].


In light of recent euphoria, here’s a compelling bear case.

Photo by Michael Dziedzic on Unsplash

I’m bullish on Bitcoin and Ethereum.

And any technology to redistribute power, resist censorship and preserve privacy.

In light of the current crypto euphoria, I’d like to entertain the best bear case I’ve heard. Paraphrasing Demetri Kofinas, host of Hidden Forces:

When a gun is pointed at our face, will cryptography save us? I’d add, this is true for any nation state.

Demetri’s point is well taken.

I don’t…


Using R and Python to visualize the relationship between Market Cap and Hourly Cost to Attack

Image by Author

Overview

In this post, I use Python and R to access, parse, manipulate, then visualize data from Crypto51.app to show the strong relationship between Market Capitalization and Cost to Attack among public crypto networks.

The more a network is thought to be worth, the more expensive it is to attack. This is an important but often overlooked reason to celebrate price gains.

Data

In this post, I query an API endpoint set up at Crypto51.app to get JSON data. Then, I use Python to parse the response and convert it to a dataframe.
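As a rough illustration of the parse step, here is a minimal sketch using only Python's standard library. The payload shape, the field names (`coins`, `name`, `market_cap`, `attack_hourly_cost`) and the numbers are all assumptions made up for this example; they are not the actual Crypto51.app response format.

```python
import json

# Placeholder payload: the field names and numbers below are invented for
# illustration and are NOT the real Crypto51.app response format.
SAMPLE_RESPONSE = """
{
  "coins": [
    {"name": "CoinA", "market_cap": 1000000, "attack_hourly_cost": 500},
    {"name": "CoinB", "market_cap": 250000, "attack_hourly_cost": 40},
    {"name": "CoinC", "market_cap": 9000000, "attack_hourly_cost": null}
  ]
}
"""


def parse_coins(raw_json):
    """Turn a JSON string into a list of flat row dictionaries."""
    payload = json.loads(raw_json)
    rows = []
    for coin in payload.get("coins", []):
        cost = coin.get("attack_hourly_cost")
        if cost is None:
            continue  # skip entries that have no attack-cost value
        rows.append({
            "name": coin["name"],
            "market_cap": float(coin["market_cap"]),
            "attack_hourly_cost": float(cost),
        })
    return rows


rows = parse_coins(SAMPLE_RESPONSE)
# Print the rows from largest to smallest market cap.
for row in sorted(rows, key=lambda r: r["market_cap"], reverse=True):
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pandas.DataFrame` to get one column per key.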


Rule-based Sentiment Analysis Using Python and R

Image by Author

Overview

Why Sentiment Analysis?

NLP is a subfield of linguistics, computer science and artificial intelligence (wiki), and you could spend years studying it.

However, I wanted a quick dive to get an intuition for how NLP works, and we’ll do that via sentiment analysis: categorizing texts by their polarity.

We can’t help but feel motivated to see insights about our own social media posts, so we’ll turn to a well-known platform.
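To make "categorizing texts by their polarity" concrete, here is a minimal rule-based sketch in plain Python. The tiny word list and its scores are hypothetical and hand-picked for illustration; a real analysis would lean on an established sentiment lexicon rather than this toy one.

```python
import re

# Hypothetical toy lexicon: word -> polarity score (an assumption for this sketch).
LEXICON = {
    "love": 2, "great": 2, "happy": 2, "good": 1,
    "bad": -1, "sad": -1, "terrible": -2, "hate": -2,
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def polarity_score(text):
    """Sum the lexicon scores of all tokens; unknown words count as 0."""
    return sum(LEXICON.get(token, 0) for token in tokenize(text))


def classify(text):
    """Label a text as positive, negative or neutral based on its score."""
    score = polarity_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for post in ["I love this great day", "What a terrible, sad movie"]:
    print(post, "->", classify(post))
```

The rule here is deliberately naive: it ignores negation ("not good") and intensity, which is exactly the kind of limitation worth noticing when building intuition.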

How well does Facebook know us?

To find out, I downloaded 14 years of posts to apply text and sentiment analysis. We’ll use Python to read and parse JSON data from Facebook.

We’ll perform tasks such as tokenization…


Use R to find out which metrics drive people to click on your profile

Image by Author

Overview & Setup

This post uses various R libraries and functions to help you explore your Twitter Analytics data. The first thing to do is download your data from analytics.twitter.com. The assumption here is that you’re already a Twitter user and have been using it for at least 6 months.

Once there, you’ll click on the Tweets tab, which should bring you to your Tweet activity with the option to Export data:


Using code to develop a feel for how machine learning optimization works

Photo by Fineas Anton on Unsplash

Overview

In this post, we’ll explore Gradient Descent from the ground up, starting conceptually, then using code to build up our intuition brick by brick.

While this post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus, for this post I am drawing on external sources including Aurélien Géron’s Hands-On Machine Learning to provide context for why and when gradient descent is used.

We’ll also be using external libraries such as NumPy, which are generally avoided in Data Science from Scratch, to help highlight concepts.
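Before reaching for any library, the core loop can be sketched in plain Python. This is a minimal illustration, assuming we minimize the sum of squares f(v) = Σ vᵢ², whose gradient is 2vᵢ per coordinate. The function names, starting point, learning rate and step count are choices made for this sketch, not taken from either book.

```python
from typing import Callable, List

Vector = List[float]


def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Move from v along the gradient by step_size (negative goes downhill)."""
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]


def sum_of_squares_gradient(v: Vector) -> Vector:
    """Gradient of f(v) = sum(v_i ** 2), which is 2 * v_i per coordinate."""
    return [2 * v_i for v_i in v]


def minimize(start: Vector,
             grad_fn: Callable[[Vector], Vector],
             learning_rate: float = 0.1,
             n_steps: int = 200) -> Vector:
    """Repeatedly step against the gradient, starting from `start`."""
    v = list(start)
    for _ in range(n_steps):
        v = gradient_step(v, grad_fn(v), -learning_rate)
    return v


# Arbitrary starting point; the minimum of the sum of squares is at the origin.
result = minimize([3.0, -4.0, 5.0], sum_of_squares_gradient)
print(result)
```

With a learning rate of 0.1, each step shrinks every coordinate by a factor of 0.8, so the point slides steadily toward the origin. Trying a learning rate above 1.0 makes the steps overshoot and diverge, which is a quick way to see why step size matters.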

While the book introduces gradient…


Exploring the BBC’s Top 100 Influential Women of 2020 with interactive plots

Image by Author

Overview

This is a quick walkthrough of using the sunburstR package to create sunburst plots in R. The original document is written in RMarkdown, which is an interactive version of markdown.

The following code can be run in RMarkdown or an R script. For interactive visuals, you’ll want to use RMarkdown.

Load Libraries

The two main libraries are tidyverse (mostly dplyr, so you can just load that if you want) and sunburstR. There are other packages for sunburst plots, including plotly and ggsunburst (of ggplot), but we'll explore sunburstR in this post.

library(tidyverse)
library(sunburstR)

Load Data & Explore

The data is from week 50 of TidyTuesday


Building intuition for statistical concepts using code

Cover Photo by Nasonov Aleksandr on Unsplash

Overview

This is a continuation of my progress through Data Science from Scratch by Joel Grus. We’ll use a classic coin-flipping example in this post because it is simple to illustrate with both concept and code. The goal of this post is to connect the dots between several concepts including the Central Limit Theorem, hypothesis testing, p-values and confidence intervals, using Python to build our intuition.
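As a preview of how these pieces fit together, here is a minimal sketch in plain Python. It assumes a null hypothesis of a fair coin, approximates the number of heads with a normal distribution (the Central Limit Theorem at work), and computes a two-sided p-value. The 1,000 flips and 530 heads are illustrative numbers chosen for this sketch, and no continuity correction is applied.

```python
import math


def normal_approximation_to_binomial(n, p):
    """Mean and standard deviation of a Binomial(n, p) count."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return mu, sigma


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Probability that a Normal(mu, sigma) variable is at most x."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2


def two_sided_p_value(x, mu, sigma):
    """Probability of a result at least as far from mu as x, in either direction."""
    if x >= mu:
        return 2 * (1 - normal_cdf(x, mu, sigma))
    return 2 * normal_cdf(x, mu, sigma)


# Null hypothesis: the coin is fair (p = 0.5), flipped 1,000 times.
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)

# Illustrative observation: 530 heads.
p_value = two_sided_p_value(530, mu_0, sigma_0)
print(mu_0, sigma_0, p_value)
```

A small p-value says the observed count would be unusual if the coin were fair; whether it is small enough to reject the null depends on the significance level chosen beforehand.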

Central Limit Theorem

Terms like “null” and “alternative” hypothesis are used quite frequently, so let’s set some context. The “null” is the default position. The “alternative”, alt for short, is something we’re comparing against the default (null).

The…

Paul Apivat Hanvongse

Data-Informed People Decisions
