This tutorial series explores the IBM HR data set. This data is typically used to demonstrate the ability of various machine learning algorithms applied to HR data.
In this series, I'll use it to demonstrate the awesome power Python can bring to HR data.
__author__ = "adam"
__version__ = "1.0.0"
__maintainer__ = "adam"
__email__ = "adam@datapluspeople.com"
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# if continuing on from the previous section, read the data from saved file
# empl_data = pd.read_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
# read the data directly from IBM Watson Analytics
# using pandas read excel file into dataframe
url = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
empl_data = pd.read_excel(url)
# save data for later
# empl_data.to_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
Pandas? We've already looked at pandas. Yes, but we've not explored the plotting capabilities of pandas.
In this section, let's explore Gender.
empl_data['Gender'].value_counts(normalize=True).plot(kind='bar', title='Employee Gender')
~60% of the employees are Male, ~40% Female.
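To read the exact shares rather than estimate them from bar heights, the same `value_counts(normalize=True)` call can be printed instead of plotted. The sketch below uses a tiny made-up frame (`toy`, a hypothetical stand-in for `empl_data`) so that it runs without the download; its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for empl_data; the values are made up.
toy = pd.DataFrame({"Gender": ["Male", "Male", "Male", "Female", "Female"]})

# normalize=True returns proportions that sum to 1 instead of raw counts
shares = toy["Gender"].value_counts(normalize=True)
print(shares.round(2))
```

Swapping `toy` for `empl_data` should give the proportions behind the chart above.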
empl_data.pivot_table(values='HourlyRate', columns='Gender', aggfunc='mean')\
.plot(kind='bar', title='Average Hourly Salary by Gender')
Encouragingly, the overall average hourly pay appears to be nearly identical across genders.
Let's explore further.
# by job level
empl_data.pivot_table(values='HourlyRate',index='JobLevel', columns='Gender', aggfunc='mean')\
.plot(kind='bar', title='Average Hourly Salary by Gender and Job Level')
Again, no significant findings. Let's look at one more...
# by department
empl_data.pivot_table(values='HourlyRate',index='Department', columns='Gender', aggfunc='mean')\
.plot(kind='bar', title='Average Hourly Salary by Gender and Department')
Here, it would initially appear that within Sales there is a significant pay gap for Female employees.
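Before reading too much into a gap like this, it can help to look at how many employees sit behind each bar, since a mean over a small group can swing widely. One possible check, sketched on made-up data (`toy` and its values are hypothetical, not taken from the IBM data set), is to compute the mean, count and standard deviation per group:

```python
import pandas as pd

# Hypothetical toy data; the real empl_data has the same three column names.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "R&D", "R&D", "R&D"],
    "Gender": ["Male", "Male", "Female", "Male", "Female", "Female"],
    "HourlyRate": [70, 66, 60, 65, 64, 66],
})

# One row per (Department, Gender) pair, with the mean next to the group size
by_group = (
    toy.groupby(["Department", "Gender"])["HourlyRate"]
    .agg(["mean", "count", "std"])
)
print(by_group)
```

In the toy data the Sales/Female group holds a single employee, so its standard deviation is undefined (NaN) and its mean says little on its own. The same call on `empl_data` would show whether the Sales groups are large enough to take the gap seriously.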
empl_data.pivot_table(values='PerformanceRating', columns='Gender', aggfunc='mean')\
.plot(kind='bar', title='Average Performance Rating by Gender')
Performance also appears equal between Genders. What about by Job Level?
empl_data.pivot_table(values='PerformanceRating',index='JobLevel', columns='Gender', aggfunc='mean')\
.plot(kind='bar', title='Average Performance Rating by Gender and Job Level')
And now, what about by Department? Will we see the same as with Average Salary?
empl_data.pivot_table(values='PerformanceRating',index='Department', columns='Gender', aggfunc='mean')\
.plot(kind='bar', title='Average Performance Rating by Gender and Department')
Average Performance appears equal - even in Sales; this does not explain the salary difference between genders.
empl_data.columns
empl_data.pivot_table(values='HourlyRate',index='PerformanceRating', columns='Gender', aggfunc='mean')\
.plot.barh(title="Average Hourly Salary by Performance Rating and Gender")
With pandas, we can even strip away some of the transformations and call methods directly on the DataFrame.
empl_data.boxplot(column='HourlyRate', by='Gender', grid=False)
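A boxplot draws the quartiles of each group, and those same numbers can be pulled out as a table with `groupby(...).describe()`. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in for `empl_data`), so the figures are illustrative only.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 4,
    "HourlyRate": [40, 50, 60, 70, 45, 55, 65, 75],
})

# describe() returns count, mean, std, min, quartiles and max per group
stats = toy.groupby("Gender")["HourlyRate"].describe()
print(stats[["25%", "50%", "75%"]])
```

The `50%` column is the median, which is the line drawn inside each box.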
# suppress warning message from matplotlib
import warnings; warnings.simplefilter('ignore')
# more information on warning here: https://github.com/MichaelGrupp/evo/issues/28
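# density=True replaces the older normed argument, which newer matplotlib versions no longer accept
empl_data[empl_data['Gender'] == 'Male']['HourlyRate'].plot.hist(alpha=0.5, density=True, label='Male')
empl_data[empl_data['Gender'] == 'Female']['HourlyRate'].plot.hist(alpha=0.5, density=True, label='Female')
plt.legend()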
Having explored Gender, we learned:
Pandas plotting is powerful; you have the ability to plot directly from your DataFrames.
Here's the secret - pandas doesn't do any plotting itself. As mentioned in the previous section, other packages build on top of matplotlib, and pandas is no exception. In fact, the .plot method is just a wrapper around matplotlib calls.
Still, this can be more effective than calling matplotlib directly. When working with DataFrames, it's easy to transform the data and pass it to matplotlib via the .plot wrapper methods.
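One way to see the wrapper at work is to keep the object that `.plot` returns: it is a regular matplotlib `Axes`, so any matplotlib method can be used on it afterwards. This is a minimal sketch on made-up data (`toy` is hypothetical), not a statement about how pandas is implemented internally.

```python
import pandas as pd
from matplotlib.axes import Axes

# Hypothetical toy data standing in for empl_data
toy = pd.DataFrame({"Gender": ["Male", "Male", "Female"]})

# .plot hands back the matplotlib Axes it drew on
ax = toy["Gender"].value_counts().plot(kind="bar", title="Employee Gender")
print(isinstance(ax, Axes))

# From here on it is plain matplotlib
ax.set_ylabel("Employees")
ax.set_xlabel("")
```

This makes it possible to start with the pandas one-liner and drop down to matplotlib only for the finishing touches.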