Using Pandas for Data Analysis

Introduction

Over 9,000 developers took part in a survey organized by www.stateofjs.com. The participants were interviewed on a variety of topics with regards to Javascript. One of the most significant questions on the survey revolved around the salary developers were earning at their current positions.

This guide is centered around that one question. We will be using pandas to perform basic data analysis on the survey’s results. Using the data from the survey, we are going to compare the distribution of income among javascript developers, and ultimately create a visual representation of our data using Matplotlib. The intention of this guide is to expand and reinforce concepts learned in part one of this series.


Requirements

– In order to get the most out of this guide, a basic knowledge of python is assumed.

– Thou not required, I recommend reading part one of this guide.

– Anaconda/Jupiter notebooks or idle with Pandas, and Matplotlib installed on your system.

– The CSV that contains the dataset we’ll be using. can be found here.

Getting Started

Make sure to import all the necessary libraries for this project. The top of your code should look as follows:

Import pandas
Import matplotlib
From matplotlib import pyplot as plt

To start answering this question. We must first start by loading the CSV file into memory. On line 1 of our code we are using the .read_csv() method to load the database into memory, and pointing it to a variable named jsDatabase

 [1] jsDatabase = pandas.read_csv("stateofjs.csv")

The .read_csv() method returns a data frame, this allows us to list the columns in our database with the method .columns.values.list()

[2] jsDatabase.columns.values.tolist()

We need to isolate yearly salary from the rest of the columns. This can be done by creating a new variable and pointing it to the ‘Yearly Salary’ column of the data frame.

 salariesDatabase = jsDatabase['Yearly Salary']

The salariesDatabase now holds only the data held by the ‘Yearly Salary’ column. Taking a quick look at our first ten entries using the .head() method, we can see the expected salary ranges.

[3] jsDatabase.head() 
[Output]
0 NaN
1 $30-50k
2 NaN
3 $30-50k
4 $30-50k
5 $30-50k
6 $30-50k
7 $100k-$200k

You might have noticed something off with our output. The world NaN appears a couple of times in our results. NaN stands for not a number, it acts as a filler for missing data. Because we can’t do any meaningful calculations with a NaN value, it is important that we “clean up” our data by removing every entry that’s missing data.

Getting Rid of Useless Data

In order to get rid of all the ‘NaN’s, we need to drop every data row that is missing information. This can be done with the .dropna() method. We are going to assign our salariesDatabase a subset of its own data frame by dropping all the entries that have missing data.

 salariesDatabase = salariesDatabase.dropna()

a quick check for integrity

salariesDatabase.head()

Now that our data is clean, we are ready to start categorizing it. When developers took the survey, they were presented with five options:

Work for Free.
10k-30k.
30k-50k.
50k-100k.
100k-200k

We will represent these five choices by creating five variables.You will see some inconsistency in the way the answers are formatted, for example, “$10-$30k” and “$100k-$200k” wherein one variable the letter K follows both numbers but in the other K only follows the last number We inherited this format from our database so we have to work with it. Keep in mind that the strings must match the expected input or you will get no result.

For example:

This returns a result

ten_to_thirty = len(salariesDatabase[salariesDatabase == '$10-$30k'])

Does not

ten_to_thirty = len(salariesDatabase[salariesDatabase == '$10k-$30k']

So we can calculate and compare values easily. We are interested in the total amount of entries per category, therefore we need to find out the total amount of developers in each category, and store each number in a separate variable. The most efficient thing to do would be to store each value in a list, but for presentations purposes, we will be using five variables.These variables represent the five options. Previously discussed.

workforfree = len(salariesDatabase[salariesDatabase == 'I work for free :('])
ten_to_thirty = len(salariesDatabase[salariesDatabase == '$10-$30k'])
thirtyfifty = len(salariesDatabase[salariesDatabase == '$30-50k'])
fifty_to_onehundred = len(salariesDatabase[salariesDatabase == '$50-$100k'])
one_hundred_two = len(salariesDatabase[salariesDatabase == '$100k-$200k'])

There’s a lot to unpack here We start by slicing our data frame. The line

salariesDatabase[salariesDatabase == “$10-30k”] returns every row that matches the option

“$10-30k”.

Perhaps the best way of searching for the answer is using regular expressions, but that could take a guide all by itself. So in the interest of keeping this guide short (and hopefully interesting), I opted out of using it.

After creating the new slice, we take the total(numeric) amount of entries for this section using Python’s built-in function len(). We’ll then assign the amount to our variable. If we print our variable, it will hold an integer with the total amount of entries for its assigned variable.

[input] print( ten_to_thirty )
[Output] 134

Creating a Python Function

We can create a python function in order to convert our number of entries into a percentage. Before creating the function lets create one more variable that holds the total amount of entries. This is going to allow us to calculate the percentage.

total = workforfree + tenthirty + thirtyfifty + fifty_to_onehundred + one_hundred_two

Our function needs take two parameters, the total amounts of entries as a whole, and a number of entries for the given section.

def ConvertNumToPercent(number, total):
"""Takes In an Integer/float, returns a float"""
percent = (number/total) * 100
return percent

It first finds the ratio of particular responses to total responses. It multiplies this ratio by 100 to convert it to a percentage and returns this result. We can use the returned result outside of the function.

Example:

[input] random = ConvertNumToPercent(10, 20)
print(random)
[Output] 50.00

For formatting purposes, we are going to be creating a second function

def format_num(number, total):
number_str = str(ConvertNumToPercent(number,total)) +"%"
return number_str

the converts our number to a string and concatenates (attaches) the “%” character after the number to clearly indicate that it’s a percentage.

Notice that this function calls on our previous function in order to determine the percentages. It

then converts the number into a string using Python’s built-in method str(), concatenates the ‘%’ character, and finally returns the string as a result.

Displaying Our Data

We are now ready to print our data:

choice1 = format_num(workforfree,total)
choice2= format_num(tenthirty,total)
choice3= format_num(thirtyfifty,total)
choice4=format_num(fifty_to_onehundred,total)
choice5=format_num(one_hundred_two,total)
print(
"Distribution of Income for Java Developers\n\n"
"Free: " + choice1 +
"\n$10-$30k: "+ choice2
"\n$30-$50k: " + choice3
"\n$50-$100k: " + choice4
"\n$100-$200k: " + choice5
)

It is now clear to see the majority of the developers who took this survey earned between $50k-$100K a year.

Sources / Resources

http://stateofjs.com/2016/introduction/

Looking for team training?