Exploring IBM HR data using Python

Intro

This tutorial series explores the IBM HR data set. This data is typically used to demonstrate the ability of various machine learning algorithms applied to HR data.

In this series, I'll use it to demonstrate the awesome power Python can bring to HR data.

Sections

  • Statistics
  • Matplotlib
  • Pandas
  • Seaborn
  • Plotly
  • Findings
In [1]:
__author__ = "adam"
__version__ = "1.0.0"
__maintainer__ = "adam"
__email__ = "adam@datapluspeople.com"
In [2]:
# imports 
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
In [3]:
# if continuing on from the previous section, read the data from saved file

# empl_data = pd.read_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
In [4]:
# read the data directly from IBM Watson Analytics
# using pandas read excel file into dataframe
url = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
empl_data = pd.read_excel(url)

# save data for later
# empl_data.to_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")

seaborn

In this section, we'll continue with visualizations using the seaborn library.

Seaborn aims to use sensible defaults for style and color choices. Like the pandas .plot methods, Seaborn builds on Matplotlib, which is where the plotting actually happens. Seaborn makes that plotting easier and more effective.
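
Because the plotting itself happens in Matplotlib, a Seaborn axes-level function hands back a regular Matplotlib Axes object that can be adjusted with the usual Matplotlib methods. A minimal sketch, assuming a small made-up `toy` DataFrame (not the IBM data) and the non-interactive Agg backend so that it also runs outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs as a plain script
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# hypothetical toy data, only to illustrate the mechanics
toy = pd.DataFrame({"Education": [1, 2, 2, 3, 3, 3]})

# Seaborn draws the chart and returns the Matplotlib Axes it drew on
ax = sns.countplot(x="Education", data=toy)

# from here on, plain Matplotlib calls customize the figure
ax.set_title("Employees by education level")
ax.set_ylabel("Employees")
```

The title and axis label texts are placeholders; any Matplotlib Axes method could be used at that point.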

We'll begin our analysis in this section by looking at Education.

In [5]:
# matplotlib.pyplot

# explicitly view default of matplotlib
plt.style.use('default') 

# plot Education count
plt.bar(sorted(empl_data['Education'].unique()),empl_data.groupby('Education')['EmployeeCount'].count())
Out[5]:
<BarContainer object of 5 artists>
In [6]:
# seaborn

# explicitly view default of seaborn
sns.set()

# plot Education count
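# pass the column by keyword; recent Seaborn versions treat the first positional argument as `data`
sns.countplot(x='Education', data=empl_data)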
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a36c02e8>

Seaborn provides sensible defaults that improve the readability of the visualizations.

In [7]:
empl_data.groupby('Education')['EmployeeCount'].count()
Out[7]:
Education
1    170
2    282
3    572
4    398
5     48
Name: EmployeeCount, dtype: int64
In [8]:
sorted(empl_data['Education'].unique())
Out[8]:
[1, 2, 3, 4, 5]

Education appears to be encoded as numeric values. We'll take a guess at what these values represent and store the labels in a dictionary.

In [9]:
ed_level_desc = {1: 'GED', 2: 'High School Diploma', 3: 'Bachelors Degree', 4: 'Masters Degree', 5: 'PhD'}
In [10]:
ed_ranking = empl_data['Education'].value_counts()
ed_ranking.rename(index=ed_level_desc)
Out[10]:
Bachelors Degree       572
Masters Degree         398
High School Diploma    282
GED                    170
PhD                     48
Name: Education, dtype: int64
In [11]:
# one more time, as percentages
ed_ranking = round(empl_data['Education'].value_counts(normalize=True)*100,0)
ed_ranking.rename(index=ed_level_desc)
Out[11]:
Bachelors Degree       39.0
Masters Degree         27.0
High School Diploma    19.0
GED                    12.0
PhD                     3.0
Name: Education, dtype: float64

Just 3% of the employees in this dataset have a PhD (assuming a '5' in Education translates to 'PhD').
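
`value_counts()` sorts by frequency, so the table above lists the education levels out of order. Here is a small sketch of the same percentages in level order, rebuilt from the counts shown in Out[7] so that it runs without the spreadsheet. The labels are still the guessed ones from `ed_level_desc`, and `ed_counts` and `ed_pct` are names introduced only for this illustration:

```python
import pandas as pd

# counts per Education code, copied from Out[7]
ed_counts = pd.Series({1: 170, 2: 282, 3: 572, 4: 398, 5: 48}, name="Education")

# guessed labels, same assumption as in the dictionary above
ed_level_desc = {1: 'GED', 2: 'High School Diploma', 3: 'Bachelors Degree',
                 4: 'Masters Degree', 5: 'PhD'}

# share of employees per level in percent, kept in level order
ed_pct = (ed_counts / ed_counts.sum() * 100).round(0).sort_index()
ed_pct = ed_pct.rename(index=ed_level_desc)
print(ed_pct)
```

On the full DataFrame, the equivalent would be `empl_data['Education'].value_counts(normalize=True).sort_index()`.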

In [12]:
# explore education against Job Level
ed_job = empl_data.pivot_table(values='Age',index='Education', columns='JobLevel', aggfunc='count')
ed_job
Out[12]:
JobLevel     1    2   3   4   5
Education
1           89   47  20   8   6
2           94  125  33  17  13
3          231  171  98  44  28
4          121  171  58  28  20
5            8   20   9   9   2

We've now calculated enough data that this 5x5 grid takes time and effort for us humans to process. Let's visualize it with Seaborn to make it easier to understand.

In [13]:
sns.heatmap(ed_job)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a38ba940>

In this colormap, darker colors represent lower values and lighter colors represent higher values. The near-white square contains the highest value. We can improve and customize this further to our liking.

In [14]:
plt.figure(figsize=(9,4))

# add annotations, border, and change colormap
# `linewidths` is the documented heatmap parameter for the cell borders
sns.heatmap(ed_job, annot=True, linewidths=0.4, fmt='d', cmap='YlOrRd')
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3cb5a90>

The greatest number of employees (231) have an education level of 3 and a job level of 1. The annotated grid also makes it easy to see where employees are concentrated: mostly in education levels 2 to 4 and job levels 1 and 2.
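
The largest cell can also be located programmatically rather than by eye. A minimal sketch that rebuilds the pivot table from the values printed in Out[12], so it runs on its own; `cells`, `top_education`, `top_job_level` and `top_count` are names introduced here for illustration:

```python
import pandas as pd

# the Education x JobLevel counts, copied from Out[12]
ed_job = pd.DataFrame(
    [[89, 47, 20, 8, 6],
     [94, 125, 33, 17, 13],
     [231, 171, 98, 44, 28],
     [121, 171, 58, 28, 20],
     [8, 20, 9, 9, 2]],
    index=pd.Index([1, 2, 3, 4, 5], name="Education"),
    columns=pd.Index([1, 2, 3, 4, 5], name="JobLevel"),
)

# unstack() flattens the grid into one Series indexed by (JobLevel, Education)
cells = ed_job.unstack()

# idxmax() returns the index label of the largest value, here a (JobLevel, Education) pair
top_job_level, top_education = cells.idxmax()
top_count = cells.max()
print(top_education, top_job_level, top_count)
```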

In [15]:
# by Job Role
plt.figure(figsize=(9,4))
role_ed_xtab = pd.crosstab(empl_data['JobRole'], empl_data['Education'], normalize='index')
sns.heatmap(role_ed_xtab, annot=True, fmt='0.0%', cmap='YlOrRd')
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3cc8588>

Level 3 is clearly the dominant education level across all job roles, but a few other patterns stand out in this figure.

  • Healthcare Representatives and Sales Executives have the highest frequencies of level 4 education.
  • Sales Representatives skew lowest on the scale, with the highest concentration of level 1 and 0% at level 5.
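
The percentages in this heatmap come from `normalize='index'`, which divides each row by its own total, so every job role sums to 100% regardless of its headcount. A minimal sketch on a made-up eight-row frame (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# hypothetical toy data: four employees per role
toy = pd.DataFrame({
    "JobRole": ["Sales Representative"] * 4 + ["Manager"] * 4,
    "Education": [1, 1, 1, 3, 3, 4, 4, 4],
})

# normalize='index' turns each row of counts into shares of that row's total
row_share = pd.crosstab(toy["JobRole"], toy["Education"], normalize="index")
print(row_share)

# each role's shares add up to 1, i.e. 100%
print(row_share.sum(axis=1))
```

This is why roles of very different sizes can be compared side by side in the heatmap: each row shows a distribution, not raw counts.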

Finally, we'll look at what the employees studied, by Job Role.

In [16]:
plt.figure(figsize=(9,4))
# let's try blue
role_field_xtab = pd.crosstab(empl_data['JobRole'], empl_data['EducationField'], normalize='index')
sns.heatmap(role_field_xtab, annot=True, fmt='0.0%', cmap='Blues')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3e79550>
  • Much like Education level, there is a dominant education field - Life Sciences.
  • Those who studied HR landed in HR.
  • Marketing majors landed in Sales roles.
In [17]:
# does education level determine hourly rate?

sns.boxplot(x='Education', y='HourlyRate', data=empl_data)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3e61c88>

Possibly, but the effect does not look significant in this plot. Only level 5 has a noticeably higher median; the other 4 levels appear similar.
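
The center line of each box is the median, so the same comparison can be read as numbers with a `groupby`. A sketch on made-up rates (hypothetical values, not the IBM data):

```python
import pandas as pd

# hypothetical toy data: three employees in each of three education levels
toy = pd.DataFrame({
    "Education":  [1, 1, 1, 2, 2, 2, 5, 5, 5],
    "HourlyRate": [40, 60, 80, 45, 65, 85, 50, 75, 100],
})

# median hourly rate per education level, the value each box's center line marks
median_rate = toy.groupby("Education")["HourlyRate"].median()
print(median_rate)
```

On the full DataFrame, the equivalent would be `empl_data.groupby('Education')['HourlyRate'].median()`.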

In [18]:
# include gender

# default Seaborn styling
sns.set_style('darkgrid')

sns.boxplot(x='Education', y='HourlyRate', data=empl_data, hue='Gender')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
Out[18]:
<matplotlib.legend.Legend at 0x1a2a3e1fd30>

Seaborn's sensible defaults make visualizing data easier. The plots generated so far are aesthetically pleasing, but bar plots and box plots are also available directly from Matplotlib. Seaborn also constructs more advanced plots, such as the heatmaps above. These are still sent to Matplotlib for plotting, but Seaborn does the heavy lifting for us.

In [19]:
# stripplot
sns.stripplot(x='Education', y='HourlyRate', data=empl_data, jitter=True, hue='Gender', dodge=True)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
Out[19]:
<matplotlib.legend.Legend at 0x1a2a36c01d0>
In [20]:
# violinplots
sns.violinplot(x='Education', y='HourlyRate', data=empl_data)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3575e48>
In [21]:
# swarmplots
sns.swarmplot(x='Education', y='HourlyRate', data=empl_data)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a40407b8>

Finally, pair plots - a great but computationally expensive operation that shows you all of your data...

In [23]:
sns.pairplot(empl_data)
Out[23]:
<seaborn.axisgrid.PairGrid at 0x1a2a40b2f98>

That's all of it alright, a bit too much to really do anything with. Do be careful with this as it can take quite a long time to process. If you know your data and the shape is suited for this, it's a great first step at times.

For now, we'll just pass in the first 10 columns and view the pairplot.

In [31]:
sns.pairplot(empl_data.iloc[:,0:10])
Out[31]:
<seaborn.axisgrid.PairGrid at 0x1a2ec3d16a0>

This is much easier to process. We passed 10 columns, yet pairplot only plotted 6 of them. To understand why, let's have a look at those first 10 columns.

In [36]:
empl_data.iloc[:,0:10].columns
Out[36]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber'],
      dtype='object')
In [39]:
empl_data.iloc[:,0:10].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 10 columns):
Age                 1470 non-null int64
Attrition           1470 non-null object
BusinessTravel      1470 non-null object
DailyRate           1470 non-null int64
Department          1470 non-null object
DistanceFromHome    1470 non-null int64
Education           1470 non-null int64
EducationField      1470 non-null object
EmployeeCount       1470 non-null int64
EmployeeNumber      1470 non-null int64
dtypes: int64(6), object(4)
memory usage: 114.9+ KB

Of the first 10 columns, 6 are int64 and 4 are object data types. By default, pairplot plots only the numeric columns and ignores the rest, hence the 6 variables in our smaller pairplot.
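
To preview which columns a pairplot would pick up before paying for the plot, the numeric columns can be listed with `select_dtypes`. A minimal sketch using a single made-up row that carries the same 10 column names and dtypes as listed above (the cell values are placeholders, not real records):

```python
import pandas as pd

# hypothetical one-row frame mirroring the first 10 columns and their dtypes
toy = pd.DataFrame({
    "Age": [30], "Attrition": ["No"], "BusinessTravel": ["placeholder"],
    "DailyRate": [500], "Department": ["placeholder"], "DistanceFromHome": [5],
    "Education": [3], "EducationField": ["placeholder"],
    "EmployeeCount": [1], "EmployeeNumber": [1],
})

# keep only numeric columns, the ones a default pairplot would draw
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
print(len(numeric_cols))
```

`sns.pairplot` also accepts a `vars` argument to restrict the grid to a chosen list of columns, which is another way to keep the plot small.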

Section Findings

Having explored Education, we learned:

  • The most common education level attained is level 3.
  • The least common education level attained is level 5.
  • Education does not appear to have a meaningful effect on compensation, measured here as hourly rate.
  • Across the education levels, compensation does not appear to vary significantly between genders.

Seaborn Recap

Seaborn builds on top of Matplotlib, applying sensible defaults to the visualizations. Improving the output means developers and analysts can focus more of their effort on extracting insights from their data. Seaborn helps to strip away some of the lower-level details. Of course, the customization is always there should it be necessary. Finally, Seaborn helps us create more advanced plots that would take much more effort directly in Matplotlib - consider for a moment what it would take to create the pairplot directly in Matplotlib. Fortunately, we have Seaborn to worry about that for us.

We've created some great visualizations to help us understand this dataset. These visualizations have all been static. Next, let's explore some packages and resources that allow us to interact with our visualizations. Onward!