This post uses various R libraries and functions to help you explore your Twitter Analytics data. The first thing to do is download your data from analytics.twitter.com. The assumption here is that you’re already a Twitter user and have been using the platform for at least six months.
Once there, you’ll click on the Tweets tab, which should bring you to your Tweet activity with the option to Export data.
In this post, we’ll explore Gradient Descent from the ground up, starting conceptually and then using code to build our intuition brick by brick.
While this post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus, for this post I am drawing on external sources, including Aurélien Géron’s Hands-On Machine Learning, to provide context for why and when gradient descent is used.
We’ll also be using external libraries such as numpy, which are generally avoided in Data Science from Scratch, to help highlight concepts.
While the book introduces gradient descent as a standalone topic, I find it more intuitive to reason about it within the context of a regression problem. …
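To make this concrete, here is a minimal sketch of gradient descent fitting a simple linear regression with numpy. The toy data, learning rate, and iteration count are all made up for illustration:

import numpy as np

# toy data: y = 3x + 4 plus noise (made-up values for illustration)
rng = np.random.default_rng(42)
X = 2 * rng.random(100)
y = 4 + 3 * X + rng.standard_normal(100)

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.1

for _ in range(1000):
    y_pred = w * X + b
    # gradients of the mean squared error with respect to w and b
    grad_w = (2 / len(X)) * np.sum((y_pred - y) * X)
    grad_b = (2 / len(X)) * np.sum(y_pred - y)
    # step downhill, opposite the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # should land near 3 and 4

Each iteration nudges w and b in whichever direction reduces the error, which is the entire trick the post unpacks.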
This is a quick walkthrough of using the sunburstR package to create sunburst plots in R. The original document is written in RMarkdown, which extends markdown with executable R code chunks.
The following code can be run in RMarkdown or an R script. For interactive visuals, you’ll want to use RMarkdown.
The two main libraries are tidyverse (mostly dplyr, so you can load just that if you want) and sunburstR. There are other packages for sunburst plots, including plotly and ggsunburst (built on ggplot2), but we’ll explore sunburstR in this post.
library(tidyverse)
library(sunburstR)
The data is from week 50 of TidyTuesday, exploring the BBC’s top 100 influential women of 2020. …
This is a continuation of my progress through Data Science from Scratch by Joel Grus. We’ll use a classic coin-flipping example in this post because it is simple to illustrate in both concept and code. The goal of this post is to connect the dots between several concepts, including the Central Limit Theorem, hypothesis testing, p-values, and confidence intervals, using Python to build our intuition.
Terms like “null” and “alternative” hypothesis are used quite frequently, so let’s set some context. The “null” is the default position. The “alternative” (alt for short) is the position we’re comparing against the default (null).
The classic coin-flipping exercise is to test the fairness of a coin. If a coin is fair, it’ll land on heads 50% of the time (and tails 50% of the time). Let’s translate this into hypothesis testing…
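Here is a minimal sketch of that translation, leaning on the normal approximation that the Central Limit Theorem justifies. The flip count and the observed number of heads are made-up values for illustration:

from math import erf, sqrt

def normal_cdf(x, mu=0, sigma=1):
    # cumulative distribution function of a normal, via the error function
    return (1 + erf((x - mu) / (sqrt(2) * sigma))) / 2

# under the null (fair coin), heads in 1,000 flips is approximately
# Normal(mu = 500, sigma = ~15.8)
n, p = 1000, 0.5
mu = n * p
sigma = sqrt(n * p * (1 - p))

# two-sided p-value for observing 530 heads (a made-up observation)
observed = 530
p_value = 2 * (1 - normal_cdf(observed, mu, sigma))
print(p_value)  # ~0.06: not quite enough evidence to reject fairness at the 5% level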
Using statistics to help users find your product
Itertools is a core set of fast, memory-efficient tools for creating iterators for efficient looping (read the documentation here).
One (of many) uses for itertools is its permutations() function, which returns all possible orderings of the items in a list.
I was working on a project that involved user funnels with different stages, and we were wondering how many different “paths” a user could take, so this was naturally a good fit for permutations.
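Here’s a minimal sketch of that use; the funnel stage names are hypothetical:

from itertools import permutations

# hypothetical funnel stages, made up for illustration
stages = ["signup", "onboarding", "activation", "purchase"]

# every possible ordering of the four stages: 4! = 24 paths
paths = list(permutations(stages))
print(len(paths))  # 24
print(paths[0])    # ('signup', 'onboarding', 'activation', 'purchase')

Note that permutations() treats order as significant, which is exactly what a “path” through a funnel requires.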
There are several posts that could serve as context (as needed) for the concepts discussed in this post, including the earlier posts on conditional probability and Bayes’ Theorem.
In this post, we’ll cover probability distributions. This is a broad topic so we’ll sample a few concepts to get a feel for it. Borrowing from the previous post, we’ll chart our medical diagnostic outcomes.
You’ll recall that each outcome is the combination of whether someone has a disease, P(D), or not, P(not D). Then they’re given a diagnostic test that returns positive, P(P), or negative, P(not P).
These are discrete outcomes, so they can be represented with a probability mass function, as opposed to a probability density function, which represents a continuous distribution. …
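As a preview, a probability mass function over these four discrete outcomes can be as simple as a dictionary mapping each outcome to its probability. The numbers below are made up, mirroring a hypothetical 1% prevalence and an imperfect test:

# hypothetical probabilities for the four discrete outcomes (made-up numbers)
pmf = {
    ("disease", "positive"): 0.009,
    ("disease", "negative"): 0.001,
    ("no disease", "positive"): 0.099,
    ("no disease", "negative"): 0.891,
}

# a valid probability mass function assigns each outcome a probability,
# and those probabilities sum to 1
assert abs(sum(pmf.values()) - 1) < 1e-9
print(pmf[("disease", "positive")])  # 0.009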
Note: this article presents a hypothetical situation and is not intended as medical advice.
Now that we have a basic understanding of Bayes’ Theorem (please refer to these posts on conditional probability and Bayes’ Theorem for context), let’s extend the application to a slightly more complex example. This section was inspired by a tweet from Grant Sanderson (of 3Blue1Brown fame).
This post is a continuation of my coverage of Data Science from Scratch by Joel Grus.
It picks up from the previous post, so be sure to check that out for proper context.
Building on our understanding of conditional probability, we’ll get into Bayes’ Theorem. We’ll spend some time understanding the concept before we implement an example in code.
Previously, we established an understanding of conditional probability by building up from marginal and joint probabilities. We explored the conditional probabilities of two outcomes:
The probability for outcome one is roughly 50% or (1/2).
The probability for outcome two is roughly 33% or (1/3). …
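As a preview of where we’re headed, the theorem itself is a one-liner in code. The numbers below are hypothetical, chosen to mirror a disease-testing scenario:

def bayes(p_b_given_a, p_a, p_b):
    # Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# hypothetical numbers: a condition with 1% prevalence, a test that is
# 90% sensitive, and an overall positive rate of 10.8% (all made up)
p_positive_given_disease = 0.9
p_disease = 0.01
p_positive = 0.108

print(bayes(p_positive_given_disease, p_disease, p_positive))  # ~0.083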
This post covers chapter 6 in my continued coverage of Data Science from Scratch by Joel Grus. We will work our way toward conditional probability by first understanding preceding concepts like marginal and joint probabilities.
At the end, we’ll tie all concepts together through code. For those inclined, you can jump to the code towards the bottom of this post.
The first challenge in this section is distinguishing between two conditional probability statements.
Here’s the setup. We have a family with two (unknown) children, and we make two assumptions. First, each child is equally likely to be a boy or a girl. …
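Because the sample space is so small, we can preview the distinction by enumerating it directly. This is a minimal sketch of the idea, not the book’s implementation:

from itertools import product

# enumerate all equally likely two-child families as (older, younger)
families = list(product(["boy", "girl"], repeat=2))

# P(both girls | the older child is a girl)
older_girl = [f for f in families if f[0] == "girl"]
print(sum(f == ("girl", "girl") for f in older_girl) / len(older_girl))  # 0.5

# P(both girls | at least one child is a girl)
either_girl = [f for f in families if "girl" in f]
print(sum(f == ("girl", "girl") for f in either_girl) / len(either_girl))  # ~0.33

The two conditioning events sound similar but differ, and so do the answers: 1/2 versus 1/3.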
This post covers chapter 5 in my continued coverage of Data Science from Scratch by Joel Grus.
It should be noted upfront that everything covered in this post can be done more expediently and efficiently in libraries like NumPy as well as the statistics module in Python.
The primary value of this book, and by extension this post, is, in my opinion, the emphasis on learning how Python primitives can be used to build tools from the ground up.
Specifically, we’ll examine how features of the Python language, along with functions we built in a previous post on Vectors (see also Matrices), can be used to build tools for describing data and the relationships within it (aka statistics). …
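As a taste of that ground-up approach, here is a minimal sketch of descriptive statistics built from Python primitives alone. It follows the spirit of the chapter, though names and details may differ from the book’s:

from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def de_mean(xs):
    # translate xs so that its mean is zero
    x_bar = mean(xs)
    return [x - x_bar for x in xs]

def variance(xs):
    # average squared deviation from the mean, using n - 1
    deviations = de_mean(xs)
    return sum(d ** 2 for d in deviations) / (len(xs) - 1)

def standard_deviation(xs):
    return sqrt(variance(xs))

print(standard_deviation([1, 2, 3, 4, 5]))  # ~1.58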