This tutorial series explores the IBM HR data set. This data is typically used to demonstrate the ability of various machine learning algorithms applied to HR data.
In this series, I'll use it to demonstrate the awesome power Python can bring to HR data.
__author__ = "adam"
__version__ = "1.0.0"
__maintainer__ = "adam"
__email__ = "adam@datapluspeople.com"
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# if continuing on from the previous section, read the data from saved file
# empl_data = pd.read_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
# read the data directly from IBM Watson Analytics
# using pandas read excel file into dataframe
url = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
empl_data = pd.read_excel(url)
# save data for later
# empl_data.to_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
In this section, we'll continue with visualizations using the seaborn library.
Seaborn aims to use sensible defaults for style and color choices. Like the pandas .plot methods, Seaborn is built on top of Matplotlib, which is where the plotting actually happens. Seaborn makes that plotting easier and more effective.
We'll begin our analysis in this section by looking at Education.
# matplotlib.pyplot
# explicitly view default of matplotlib
plt.style.use('default')
# plot Education count
plt.bar(sorted(empl_data['Education'].unique()), empl_data.groupby('Education')['EmployeeCount'].count())
# seaborn
# explicitly view default of seaborn
sns.set()
# plot Education count
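# pass the column by keyword; newer seaborn versions expect x= rather than a positional Series
sns.countplot(x='Education', data=empl_data)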
Seaborn provides sensible defaults that improve the readability of the visualizations.
empl_data.groupby('Education')['EmployeeCount'].count()
sorted(empl_data['Education'].unique())
Education appears to be encoded as integers. We'll take a guess at what these values represent and store the mapping in a dictionary.
ed_level_desc = {1: 'GED', 2: 'High School Diploma', 3: 'Bachelors Degree', 4: 'Masters Degree', 5: 'PhD'}
ed_ranking = empl_data['Education'].value_counts()
ed_ranking.rename(index=ed_level_desc)
# one more time, as percentages
ed_ranking = round(empl_data['Education'].value_counts(normalize=True)*100,0)
ed_ranking.rename(index=ed_level_desc)
Just 3% of the employees in this dataset have a PhD (assuming a '5' in Education translates to 'PhD').
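Note that value_counts orders its result by frequency, not by education level. A minimal sketch of putting the percentages back into level order, using a small made-up Series (the values in `education` are hypothetical, not taken from the IBM data) and the same guessed mapping:

```python
import pandas as pd

# Hypothetical toy data standing in for empl_data['Education']
education = pd.Series([3, 3, 4, 1, 3, 2, 4, 5, 3, 2], name='Education')

# Same guessed mapping as above
ed_level_desc = {1: 'GED', 2: 'High School Diploma', 3: 'Bachelors Degree',
                 4: 'Masters Degree', 5: 'PhD'}

# value_counts sorts by frequency; sort_index restores level order 1..5
ed_pct = (education.value_counts(normalize=True)
          .sort_index()
          .mul(100)
          .round(0)
          .rename(index=ed_level_desc))
print(ed_pct)
```

Sorting before renaming matters here: once the index holds the text labels, sort_index would order them alphabetically instead of by level.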
# explore education against Job Level
ed_job = empl_data.pivot_table(values='Age',index='Education', columns='JobLevel', aggfunc='count')
ed_job
We've now calculated enough data that it takes us, as humans, time and effort to process this 5x5 grid. Let's visualize it using Seaborn to make it easier to understand.
sns.heatmap(ed_job)
In this colormap, darker colors represent lower values and lighter colors represent higher values. The near-white square contains the highest value. We can improve and customize this further to our liking.
plt.figure(figsize=(9,4))
# add annotations, border, and change colormap
# note: the seaborn heatmap parameter is named linewidths
sns.heatmap(ed_job, annot=True, linewidths=0.4, fmt='d', cmap='YlOrRd')
The greatest number of associates (231) have an education level of 3 and job level of 1. The concentration of education and job level is easily identified as well.
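One caveat with fmt='d' is that it expects integer counts. If some Education and JobLevel combination had no employees, pivot_table would leave that cell as NaN, which is a float, and integer formatting would no longer apply. A small sketch of the difference, using a hypothetical toy frame (`toy` and its values are made up for illustration), with pd.crosstab as one possible alternative:

```python
import pandas as pd

# Hypothetical toy frame; nobody has Education 2 at JobLevel 2
toy = pd.DataFrame({
    'Age':       [30, 41, 25, 52, 38],
    'Education': [1, 1, 2, 3, 3],
    'JobLevel':  [1, 2, 1, 1, 2],
})

# pivot_table leaves the empty combination as NaN (a float)
counts_pivot = toy.pivot_table(values='Age', index='Education',
                               columns='JobLevel', aggfunc='count')

# crosstab counts rows directly and fills empty combinations with 0
counts_xtab = pd.crosstab(toy['Education'], toy['JobLevel'])

print(counts_pivot)
print(counts_xtab)
```

Passing fill_value=0 to pivot_table is another way to avoid the NaN cell; crosstab simply skips the need to pick an arbitrary column such as Age to count.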
# by Job Role
plt.figure(figsize=(9,4))
role_ed_xtab = pd.crosstab(empl_data['JobRole'], empl_data['Education'], normalize='index')
sns.heatmap(role_ed_xtab, annot=True, fmt='0.0%', cmap='YlOrRd')
While we clearly see that level 3 education is the dominant education across all job roles, there are some insights from this figure.
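When reading this figure, it helps to remember what normalize='index' does: each row (job role) is divided by its own total, so the percentages sum to 100% across a row, not down a column. A minimal sketch on hypothetical toy data (the roles and values in `toy` are made up):

```python
import pandas as pd

# Hypothetical toy data: 2 managers, 4 sales executives
toy = pd.DataFrame({
    'JobRole':   ['Manager', 'Manager', 'Sales Executive',
                  'Sales Executive', 'Sales Executive', 'Sales Executive'],
    'Education': [3, 4, 1, 3, 3, 3],
})

# Each row is divided by its own total
role_ed = pd.crosstab(toy['JobRole'], toy['Education'], normalize='index')

# Every row should therefore sum to 1
row_totals = role_ed.sum(axis=1)

print(role_ed)
print(row_totals)
```

This means a large role and a small role are directly comparable in the heatmap, but the figure says nothing about how many employees each role contains.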
Finally, we'll look at what the employees studied, by Job Role.
plt.figure(figsize=(9,4))
# let's try blue
role_field_xtab = pd.crosstab(empl_data['JobRole'], empl_data['EducationField'], normalize='index')
sns.heatmap(role_field_xtab, annot=True, fmt='0.0%', cmap='Blues')
# does education level determine hourly rate?
sns.boxplot(x='Education', y='HourlyRate', data=empl_data)
Possibly, but the effect does not appear significant in this plot. Any difference comes from level 5 having a higher median; the other four levels appear similar.
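The center line of each box is the median, so the visual comparison can be backed up with the numbers themselves. A rough sketch using a hypothetical toy frame (the rates in `toy` are invented and do not reflect the IBM data):

```python
import pandas as pd

# Hypothetical toy data with three education levels
toy = pd.DataFrame({
    'Education':  [1, 1, 1, 2, 2, 2, 5, 5, 5],
    'HourlyRate': [40, 60, 80, 45, 65, 85, 70, 75, 95],
})

# Median hourly rate per education level (the line inside each box)
medians = toy.groupby('Education')['HourlyRate'].median()

# Group sizes, to see how many rows each median rests on
sizes = toy.groupby('Education')['HourlyRate'].count()

print(medians)
print(sizes)
```

Running the same two lines against empl_data would show how many employees sit behind each box, which is worth checking before reading much into a small group.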
# include gender
# default Seaborn styling
sns.set_style('darkgrid')
sns.boxplot(x='Education', y='HourlyRate', data=empl_data, hue='Gender')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
Seaborn's sensible defaults make visualizing data easier. The plots generated so far are aesthetically pleasing, but bar plots and box plots are also available directly from Matplotlib. As with heatmaps, Seaborn can also construct more advanced plots. These are still sent to Matplotlib for plotting, but Seaborn does the heavy lifting for us.
# stripplot
sns.stripplot(x='Education', y='HourlyRate', data=empl_data, jitter=True, hue='Gender', dodge=True)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
# violinplots
sns.violinplot(x='Education', y='HourlyRate', data=empl_data)
# swarmplots
sns.swarmplot(x='Education', y='HourlyRate', data=empl_data)
Finally, pair plots: a great but computationally expensive way to see all of your data at once...
sns.pairplot(empl_data)
That's all of it, all right, and a bit too much to really do anything with. Be careful with this, as it can take quite a long time to process. If you know your data and its shape suits this kind of plot, it can be a great first step.
For now, we'll just pass in the first 10 columns and view the pairplot.
sns.pairplot(empl_data.iloc[:,0:10])
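Part of the cost comes from the grid itself: a pair plot draws one panel for every pair of numeric columns, so the panel count grows with the square of the column count. Instead of slicing by position, one option is to choose the columns deliberately. A sketch of that idea on a hypothetical toy frame (the values in `toy` are made up; dropping constant columns is my own suggestion, not something the original analysis does):

```python
import pandas as pd

# Hypothetical toy frame mixing text, numeric, and constant columns
toy = pd.DataFrame({
    'Age':           [30, 41, 25, 52],
    'Attrition':     ['Yes', 'No', 'No', 'Yes'],
    'DailyRate':     [1102, 279, 1373, 591],
    'EmployeeCount': [1, 1, 1, 1],
    'Education':     [2, 1, 2, 4],
})

# Keep numeric columns only
numeric = toy.select_dtypes(include='number')

# Drop columns with a single unique value; they carry no information in a scatter
informative = numeric.loc[:, numeric.nunique() > 1]

# A pair plot of these columns would contain this many panels
n_panels = informative.shape[1] ** 2

print(list(informative.columns), n_panels)
```

The resulting frame could then be passed to sns.pairplot in place of the positional slice.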