This tutorial series explores the IBM HR data set. The data is typically used to demonstrate how various machine learning algorithms perform when applied to HR data.
In this series, I'll use it to demonstrate the awesome power Python can bring to HR data.
Sections
__author__ = "adam"
__version__ = "1.0.0"
__maintainer__ = "adam"
__email__ = "adam@datapluspeople.com"
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# read the data directly from IBM Watson Analytics
# using pandas read excel file into dataframe
url = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
empl_data = pd.read_excel(url)
# alternatively, save the file for repeated use
# we'll reference the saved file in the future portions of this analysis
# empl_data.to_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
# empl_data = pd.read_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
pandas provides a number of built-in methods that let us quickly and easily explore our data. When we read the data, we stored it in a DataFrame. If that's a new term, for now think of it as something like an Excel sheet - but way better.
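To make the DataFrame idea concrete, here's a tiny, throwaway example built from a dictionary. The column names and values below are made up for illustration only and aren't part of the IBM data set.
# a small, hypothetical DataFrame built from a dictionary
# each key becomes a column, much like columns in an Excel sheet
demo = pd.DataFrame({
    'EmployeeName': ['Pat', 'Sam', 'Alex'],
    'Department': ['Sales', 'R&D', 'HR'],
    'YearsAtCompany': [3, 7, 1]
})
demo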
Let's see what we've got...
# view the first 5 rows
empl_data.head()
# how many rows, columns are in the DataFrame?
empl_data.shape
# how many different data points do we have?
empl_data.size
# what are the names of all the columns?
empl_data.columns
# info about the dataframe
empl_data.info()
Python performs its work 'in-memory', meaning your computer holds all of this data in memory. HR data isn't typically 'big data', but it's important to have an understanding of the resource demands of your data analysis.
empl_data.memory_usage()
The memory_usage method shows us the memory used by each column in our new DataFrame. To get the total amount of consumed memory, we can simply add the items together.
memory_used = empl_data.memory_usage().sum()
print(f'The employee DataFrame is using {memory_used} bytes in memory.')
We are using approximately 411 KB of system memory to store the DataFrame. Scroll back up to the output of the .info() method. It actually reported the memory usage as well, though the value there is slightly less. Why? The memory_usage method also includes the memory used by the DataFrame's index.
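If you're curious how much of the total the index itself accounts for, pandas can break that out; the quick check below is entirely optional.
# memory used by the index alone
empl_data.index.memory_usage()
# or exclude the index from the per-column breakdown
empl_data.memory_usage(index=False).sum()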
But there's still more. To get a more accurate summary, we'll pass an optional argument that accounts for the full memory usage of the objects the DataFrame contains.
memory_used = empl_data.memory_usage(deep=True).sum()
print(f'The employee DataFrame is using {memory_used} bytes in memory.')
Nearly 3x the original result. You should have no problem holding this amount, but with much larger data sets and more complex analyses, you may want to keep an eye on memory usage when processing on a local computer.
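If memory ever does become a concern, one common technique is converting repetitive text columns to pandas' 'category' dtype. Here's a rough sketch using the 'Department' column, which holds only a handful of distinct values; it isn't required for this analysis.
# convert a low-cardinality text column to a categorical dtype
dept_as_category = empl_data['Department'].astype('category')
# compare the memory footprint before and after
print(empl_data['Department'].memory_usage(deep=True))
print(dept_as_category.memory_usage(deep=True))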
Enough with the background info, let's see the data!
# statistics about the DataFrame's numerical fields
empl_data.describe()
The describe method allows us to see descriptive statistics for the numerical Series in the DataFrame. Note that non-numeric fields, such as 'Department', are not included in the results.
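If you do want a summary of the text columns as well, describe accepts an include argument; the example below is just an aside.
# summary statistics for the non-numeric (object) columns
empl_data.describe(include='object')
# or numeric and non-numeric together
# empl_data.describe(include='all')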
Very helpful, but I don't want to scroll horizontally. Let's fix that.
# transpose the DataFrame
empl_data.describe().transpose()
Transposing the output of describe allows us to view it without scrolling. This works well for DataFrames with many columns.
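As a small shortcut, the .T attribute does the same thing as calling transpose().
# .T is shorthand for transpose()
empl_data.describe().T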
We've managed to get the data into the DataFrame, but how do we now get data back out of it?
We don't always want the full DataFrame; often we only want items that meet certain criteria.
Let's have a look at how to do just that.
# select all the Training counts from last year
empl_data['TrainingTimesLastYear']
The selection returned a Series containing all of the training data. The '...' inserted by the Jupyter Notebook condenses the output so we don't have to scroll through all 1,470 rows.
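If you'd rather see a specific number of rows instead of the condensed view, head and tail accept a row count; this is purely optional.
# view the first 10 training counts
empl_data['TrainingTimesLastYear'].head(10)
# or the last 10
empl_data['TrainingTimesLastYear'].tail(10)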
We'll now store this Series in a new 'trainings' variable and generate some statistics to help us understand training across the organization.
# store the result
trainings = empl_data['TrainingTimesLastYear']
Because the result is now stored in a variable, it isn't displayed on screen. To view it, we'll have to ask for it explicitly.
# view the training variable contents
trainings
Let's now use this to answer the following questions:
# total number of trainings
trainings.sum()
# total number of employees
trainings.count()
# average number of trainings
trainings.mean()
# maximum number of trainings
trainings.max()
# minimum number of trainings
trainings.min()
# number of employees in each group
trainings.value_counts()
# number of employees not receiving training last year
trainings.value_counts()[0]
# percentage of employees not receiving training
pct_no_training = trainings.value_counts()[0] / trainings.count() * 100
print(f'{pct_no_training:.0f}% of employees did not receive training last year.')
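As an aside, there are a couple of more direct ways to get the same figure; these are alternative sketches, not steps in the walkthrough.
# value_counts with normalize=True returns proportions instead of counts
trainings.value_counts(normalize=True)[0] * 100
# or build a boolean mask and take its mean
(trainings == 0).mean() * 100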
In this section we covered a lot of ground after just a few simple steps: