Collections and Comprehensions (pt.2)

In the previous post, we began examining a toy data set to see which Python concepts from the crash course we’d find in action.

What stands out is the use of collections and comprehensions. We’ll see this trend continue, as data is given to us as lists of dicts or tuples.

Often, we’re reshaping the data to make it faster and more efficient to iterate through. One tool that comes up quite often is defaultdict, used to initialize empty lists, followed by list comprehensions to iterate through the data.

Indeed, we’re seeing either how the author specifically approaches problems, or how problems are approached in Python in general.

What I’m keeping in mind is that there is more than one way to approach data science problems, and this is one of them.

With that said, let’s pick up where the previous post left off.

We have a sense of the total number of connections and a ranking of the most connected individuals. Now we may want to design a “people you may know” suggester.

As a quick recap, here’s what the friendship dictionary looks like.
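As a minimal sketch of that structure (the ids and pairs below are illustrative, not the post’s exact data), each user id maps to a list of friend ids, built from symmetric friendship pairs:

```python
# Illustrative friendship pairs: (i, j) means users i and j are friends.
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]

# Initialize an empty friends list for each user id...
friendships = {user_id: [] for user_id in range(5)}

# ...then fill it in from the symmetric pairs.
for i, j in friendship_pairs:
    friendships[i].append(j)  # add j as a friend of i
    friendships[j].append(i)  # add i as a friend of j
```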

Again, the first step is to iterate over friends and collect friends’ friends. The following function returns a list comprehension: for each of the user’s friend ids, it grabs the ids of their friends and returns every friend-of-a-friend (foaf) id.

We’ll break it down in code below:
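Here is a sketch of that function (the friendship data is illustrative; the name foaf_ids_bad signals the naive version):

```python
# Illustrative friendship dictionary: user id -> list of friend ids.
friendships = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

def foaf_ids_bad(user):
    """'foaf' is short for 'friend of a friend'."""
    return [foaf_id
            for friend_id in friendships[user["id"]]  # for each of the user's friends,
            for foaf_id in friendships[friend_id]]    # collect each of *their* friends
```

Note that the result includes the user’s own id and repeats ids the user already knows, which is why a cleaner version comes next.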

To count mutual friends while excluding people the user already knows, we’ll use a Counter, which we learned is a dict subclass, inside a function friends_of_friends(user).
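A sketch of that approach, assuming an illustrative friendships dictionary like the one above:

```python
from collections import Counter

# Illustrative friendship dictionary: user id -> list of friend ids.
friendships = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}

def friends_of_friends(user):
    user_id = user["id"]
    return Counter(
        foaf_id
        for friend_id in friendships[user_id]      # for each of my friends,
        for foaf_id in friendships[friend_id]      # find each of *their* friends
        if foaf_id != user_id                      # who aren't me
        and foaf_id not in friendships[user_id]    # and aren't already my friends
    )
```

With this toy data, user 0 shares two friends with user 3, so the Counter reports how many mutual friends each suggestion has.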

In addition to friendship data, we also have interest data: a list of tuples, each containing a user_id and a string representing a specific interest or technology.
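A sketch of what that list of tuples might look like (these pairs are illustrative, not the post’s exact data):

```python
# Each tuple is (user_id, interest).
interests = [
    (0, "Hadoop"), (0, "Big Data"), (0, "Java"),
    (1, "NoSQL"), (1, "MongoDB"),
    (2, "Python"), (2, "scikit-learn"),
    (9, "Hadoop"), (9, "Java"), (9, "Big Data"),
]
```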

The first thing we’ll do is find users with a given interest. This function returns a list comprehension: it unpacks each tuple into user_id (an integer) and user_interest (a string), then checks whether the string matches the input parameter.
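A sketch of that function, over an illustrative interests list:

```python
# Illustrative (user_id, interest) tuples.
interests = [
    (0, "Hadoop"), (0, "Big Data"), (0, "Java"),
    (1, "NoSQL"), (2, "Python"),
    (9, "Hadoop"), (9, "Java"),
]

def data_scientists_who_like(target_interest):
    return [user_id
            for user_id, user_interest in interests  # unpack each (id, interest) tuple
            if user_interest == target_interest]     # keep ids whose interest matches
```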

We may also want to count the number of times a specific interest comes up. Here’s a function for that: a basic for-loop with an if-statement that checks whether user_interest == target_interest.
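Along these lines (the function name count_interest is hypothetical, and the data is illustrative):

```python
# Illustrative (user_id, interest) tuples.
interests = [(0, "Big Data"), (1, "NoSQL"), (9, "Big Data")]

def count_interest(target_interest):
    count = 0
    for user_id, user_interest in interests:
        if user_interest == target_interest:  # the comparison is the truth test
            count += 1
    return count
```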

A concern is having to scan the whole list of interests for every search. The author proposes building an index from interests to users: a defaultdict is imported, then populated with user_ids keyed by interest.
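A sketch of that index (with its reverse, from users to interests, which the later lookups also rely on), over illustrative data:

```python
from collections import defaultdict

# Illustrative (user_id, interest) tuples.
interests = [(0, "Hadoop"), (0, "Java"), (1, "Python"), (9, "Hadoop"), (9, "Java")]

# Keys are interests, values are lists of user_ids with that interest.
user_ids_by_interest = defaultdict(list)
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)

# The reverse index: keys are user_ids, values are lists of interests.
interests_by_user_id = defaultdict(list)
for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)
```

Once built, each lookup is a single dictionary access instead of a pass over the whole list.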

We can also find who has the most interests in common with a given user; it looks like Klein (#9) shares the most interests with Hero (#0). The function returns a Counter built with for-loops and an if-statement.
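A sketch of that function, built on the two indexes from the previous step (the interest data is illustrative but chosen so that user 9 shares the most interests with user 0):

```python
from collections import Counter, defaultdict

# Illustrative (user_id, interest) tuples.
interests = [(0, "Hadoop"), (0, "Big Data"), (0, "Java"),
             (1, "Hadoop"), (1, "NoSQL"),
             (9, "Hadoop"), (9, "Big Data"), (9, "Java")]

# Build both indexes in one pass.
user_ids_by_interest = defaultdict(list)
interests_by_user_id = defaultdict(list)
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)
    interests_by_user_id[user_id].append(interest)

def most_common_interests_with(user):
    return Counter(
        interested_user_id
        for interest in interests_by_user_id[user["id"]]          # each of my interests,
        for interested_user_id in user_ids_by_interest[interest]  # everyone who shares it,
        if interested_user_id != user["id"]                       # excluding me
    )
```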

Finally, we can also find which topics are most popular across the network. Previously, we counted the users interested in one particular topic; now we want to rank the whole list.
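One way to sketch this is to feed every interest string into a Counter and rank the results (data again illustrative):

```python
from collections import Counter

# Illustrative (user_id, interest) tuples.
interests = [(0, "Hadoop"), (0, "Big Data"), (1, "Big Data"),
             (8, "Big Data"), (9, "Hadoop")]

# Count every interest across all users; most_common() then ranks them.
popular_interests = Counter(interest for _, interest in interests)
```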

We’re also given anonymous salary and tenure (number of years of work experience) data; let’s see what we can do with that information. First we’ll find the average salary. Again, we’ll start by creating a defaultdict of lists, then loop through salaries_and_tenures.
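A sketch of that step, with illustrative figures standing in for the post’s data:

```python
from collections import defaultdict

# Illustrative (salary, tenure) pairs.
salaries_and_tenures = [(83000, 8.7), (88000, 8.1), (48000, 0.7), (76000, 6)]

# Keys are tenures, values are lists of salaries for that tenure.
salary_by_tenure = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    salary_by_tenure[tenure].append(salary)

# Keys are tenures, values are the average salary for that tenure.
average_salary_by_tenure = {
    tenure: sum(salaries) / len(salaries)
    for tenure, salaries in salary_by_tenure.items()
}
```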

The problem is that this isn’t terribly informative: each tenure value is unique, so average_salary_by_tenure mostly just reports individual salaries. Our next move is to group similar tenure values together.

First, we’ll create the groupings/categories using control flow (if/elif/else), then we’ll create a defaultdict of lists and loop through salaries_and_tenures to populate the newly created salary_by_tenure_bucket. Finally, we calculate the average per bucket.
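Those three steps can be sketched as follows (bucket labels and figures are illustrative):

```python
from collections import defaultdict

# Illustrative (salary, tenure) pairs.
salaries_and_tenures = [(83000, 8.7), (88000, 8.1), (48000, 0.7), (76000, 6),
                        (69000, 6.5), (76000, 7.5), (60000, 2.5), (83000, 10),
                        (48000, 1.9), (63000, 4.2)]

def tenure_bucket(tenure):
    """Control flow that maps a raw tenure to a coarse category."""
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    else:
        return "more than five"

# Keys are buckets, values are lists of salaries in that bucket.
salary_by_tenure_bucket = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    salary_by_tenure_bucket[tenure_bucket(tenure)].append(salary)

# Finally, the average salary per bucket.
average_salary_by_bucket = {
    bucket: sum(salaries) / len(salaries)
    for bucket, salaries in salary_by_tenure_bucket.items()
}
```

With buckets, each average now summarizes several people instead of one, which is what makes the grouping informative.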

One thing to note is that the “given” data in this hypothetical toy example comes as lists of dictionaries or tuples, which may feel atypical if we're used to working with tabular data in a DataFrame (pandas) or a native data.frame in R.

Again, we are reminded that the higher purpose of this book, Data Science from Scratch (by Joel Grus, 2nd ed.), is to eschew libraries in favor of plain Python and build everything from the ground up.

If your goal is to learn how various algorithms work by building them from scratch, and in the process learn how data problems can be solved with Python and minimal libraries, this is your book.

Joel Grus does make clear that in production environments you would use libraries and frameworks (pandas, scikit-learn, matplotlib, etc.) rather than coded-from-scratch algorithms, and he points out resources for further reading at the end of each chapter.

In the next post, we’ll get into visualizing data.


For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.


