This tutorial series explores the IBM HR data set. The data is typically used to demonstrate how various machine learning algorithms perform when applied to HR data.
In this series, I'll use it to demonstrate the awesome power Python can bring to HR data.
Sections
__author__ = "adam"
__version__ = "1.0.0"
__maintainer__ = "adam"
__email__ = "adam@datapluspeople.com"
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# read the data directly from IBM Watson Analytics
# using pandas read excel file into dataframe
url = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
empl_data = pd.read_excel(url)
# alternatively, save the file for repeated use
# we'll reference the saved file in the future portions of this analysis
# empl_data.to_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
# empl_data = pd.read_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
pandas provides a number of built-in methods that let us quickly and easily explore our data. When we read the data, we stored it in a DataFrame. If that's a new term, for now think of it as something like an Excel sheet - but way better.
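To make the DataFrame idea concrete, here's a tiny, throwaway example built from a dictionary. The column names and values below are made up for illustration only and aren't part of the IBM data set.
# a small, hypothetical DataFrame built from a dictionary
# each key becomes a column, much like columns in an Excel sheet
demo = pd.DataFrame({
    'EmployeeName': ['Pat', 'Sam', 'Alex'],
    'Department': ['Sales', 'R&D', 'HR'],
    'YearsAtCompany': [3, 7, 1]
})
demo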
Let's see what we've got...
# view the first 5 rows
empl_data.head()
# how many rows, columns are in the DataFrame?
empl_data.shape
# how many different data points do we have?
empl_data.size
# what are the names of all the columns?
empl_data.columns
# info about the dataframe
empl_data.info()
Python performs its work 'in-memory', meaning your computer holds all of this data in memory. HR data isn't typically 'big data', but it's important to have an understanding of the resource demands of your data analysis.
empl_data.memory_usage()
The memory_usage method shows us the memory used by each column in our new DataFrame. To get the total amount of consumed memory, we can simply add the items together.
memory_used = empl_data.memory_usage().sum()
print(f'The employee DataFrame is using {memory_used} bytes in memory.')
We are using approximately 411 KB of system memory to store the DataFrame. Scroll back up to the output of the .info() method. It actually reported the memory usage as well, though the value there is slightly less. Why? The memory_usage method also includes the memory used by the DataFrame's index.
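If you're curious how much of the total the index itself accounts for, pandas can break that out; the quick check below is entirely optional.
# memory used by the index alone
empl_data.index.memory_usage()
# or exclude the index from the per-column breakdown
empl_data.memory_usage(index=False).sum()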
But there's still more. To get a more accurate summary, we'll pass an optional argument that accounts for the full memory usage of the objects the DataFrame contains.
memory_used = empl_data.memory_usage(deep=True).sum()
print(f'The employee DataFrame is using {memory_used} bytes in memory.')
Nearly 3x the original result. You should have no problem holding this amount, but with much larger data sets and more complex analyses, you may want to keep an eye on memory usage when processing on a local computer.
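If memory ever does become a concern, one common technique is converting repetitive text columns to pandas' 'category' dtype. Here's a rough sketch using the 'Department' column, which holds only a handful of distinct values; it isn't required for this analysis.
# convert a low-cardinality text column to a categorical dtype
dept_as_category = empl_data['Department'].astype('category')
# compare the memory footprint before and after
print(empl_data['Department'].memory_usage(deep=True))
print(dept_as_category.memory_usage(deep=True))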
Enough with the background info, let's see the data!
# statistics about the DataFrame's numerical fields
empl_data.describe()
The describe method allows us to see descriptive statistics for the numerical Series in the DataFrame. Note that non-numeric fields, such as 'Department', are not included in the results.
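If you do want a summary of the text columns as well, describe accepts an include argument; the example below is just an aside.
# summary statistics for the non-numeric (object) columns
empl_data.describe(include='object')
# or numeric and non-numeric together
# empl_data.describe(include='all')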
Very helpful, but I don't want to scroll horizontally. Let's fix that.
# transpose the DataFrame
empl_data.describe().transpose()
Transposing the output of describe allows us to view it without scrolling. This works well for DataFrames with many columns.
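As a small shortcut, the .T attribute does the same thing as calling transpose().
# .T is shorthand for transpose()
empl_data.describe().T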
We've managed to get the data into the DataFrame, but how do we now get data back out of it?
We don't always want the full DataFrame; often we only want items that meet certain criteria.
Let's have a look at how to do just that.
# select all the Training counts from last year
empl_data['TrainingTimesLastYear']
The selection returned a Series containing all of the training data. The '...' inserted by the Jupyter Notebook condenses the output so we don't have to scroll through all 1,470 rows.
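If you'd rather see a specific number of rows instead of the condensed view, head and tail accept a row count; this is purely optional.
# view the first 10 training counts
empl_data['TrainingTimesLastYear'].head(10)
# or the last 10
empl_data['TrainingTimesLastYear'].tail(10)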
We'll now store this Series in a new 'trainings' variable and generate some statistics to help us understand training across the organization.
# store the result
trainings = empl_data['TrainingTimesLastYear']
Because the result is now stored in a variable, it isn't displayed on screen. To view it, we'll have to ask for it explicitly.
# view the training variable contents
trainings
Let's now use this to answer the following questions:
# total number of trainings
trainings.sum()
# total number of employees
trainings.count()
# average number of trainings
trainings.mean()
# maximum number of trainings
trainings.max()
# minimum number of trainings
trainings.min()
# number of employees in each group
trainings.value_counts()
# number of employees not receiving training last year
trainings.value_counts()[0]
# percentage of employees not receiving training
pct_no_training = trainings.value_counts()[0] / trainings.count() * 100
print(f'{pct_no_training:.0f}% of employees did not receive training last year.')
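As an aside, there are a couple of more direct ways to get the same figure; these are alternative sketches, not steps in the walkthrough.
# value_counts with normalize=True returns proportions instead of counts
trainings.value_counts(normalize=True)[0] * 100
# or build a boolean mask and take its mean
(trainings == 0).mean() * 100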
In this section we covered a lot of ground after just a few simple steps: