This post is part of the series Using Python with HR data
Beginner’s Guide to Using Python with HR Data | Exploration Series
Part Zero – The Basics
In this first tutorial series, I’m exploring the IBM HR Attrition and Performance data set. It’s a great data set for demonstrating what machine learning and other data science techniques can do.
I’ll be back with tutorial posts that walk through how to apply more advanced techniques to generate predictive and prescriptive insights from the data. But that’d be jumping ahead. First, the basics: Exploratory Data Analysis, or EDA.
It’s often tempting to jump right in and try to find the most advanced insight possible. When I’m learning something new, my first instinct is to apply it straight away and skip the basics. Eventually I stumble, and it’s always over something I could have avoided by spending a little time really understanding the data I have.
To properly analyze data, you must understand it. Is it complete, or are there missing values? Are there errors, such as values outside normal bounds? And, generally, what information does the data contain? Depending on where a request comes from at work, you may not control the data, so what you get is what you have. It’s often much easier when you’ve pulled your own data, but that’s not always possible, or even smart.
Always begin with an exploration of your data. In this tutorial, I’m digging out my current favorite tool: Python. If you’ve never programmed, if Excel still frightens you a bit, or if you’re firmly in the R camp, read on. This series shows what’s possible while exploring 5 different packages and working through how to interpret and understand data.
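To make those first questions concrete (is the data complete, are any values out of bounds, what does it contain), here is a minimal sketch using pandas. The small table, its column names, and the 18 to 70 age bounds are all made up for illustration; they are not taken from the IBM data set, and the right bounds depend on your own data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an HR extract. Column names and values are
# hypothetical, chosen only to illustrate the checks.
df = pd.DataFrame({
    "Age": [34, 29, np.nan, 51, 240],
    "Department": ["Sales", "HR", "Sales", None, "R&D"],
    "MonthlyIncome": [5200, 4100, 6100, 7300, 3900],
})

# Completeness: count missing values in each column.
missing_counts = df.isna().sum()

# Errors: flag rows whose Age falls outside assumed plausible bounds.
# Missing ages are not flagged here; they show up in missing_counts.
age_out_of_bounds = df[(df["Age"] < 18) | (df["Age"] > 70)]

# Contents: summary statistics for numeric and non-numeric columns.
summary = df.describe(include="all")

print(missing_counts)
print(age_out_of_bounds)
print(summary)
```

Three short checks like these are usually enough to tell you whether the data needs cleaning before any analysis starts.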
Series Outline
0: basic operations & summary statistics
1: matplotlib
2: pandas visualization
3: seaborn
4: plotly
5: series summary
0: basic operations & generating summary statistics