GDPR: Use only the data you need

With the dust on GDPR settled, the tension between finding insights and maintaining data privacy keeps growing. Martin Fowler writes about Datensparsamkeit, which he loosely defines as “data frugality”. Anyone who has dealt with privacy laws in Germany can relate. Given GDPR and the growing concern about the ethical obligations and bounds of data usage, this article introduces compromises you can consider to strike a proper balance.

There are many approaches to maintaining the privacy of associate information and still achieving analytical goals:

  1. Consider aggregating data right as you pull it from your HRIS. Doing so greatly reduces the risk of exposing individual records. A well-defined objective sets the proper level of detail at the outset.
  2. Allow anonymous survey responses to remain anonymous. Strip away any identifiable information straight away. Doing so also removes the pressure when a Director, even one in HR, asks to see the responses from their reports.
  3. Hat tip to Martin Fowler for his idea:

    Datensparsamkeit suggests that you shouldn’t store the IP address directly, perhaps instead you should hash it and only store the hash.

    Consider applying hashes to personally identifiable information so that it can still be used in analysis, but in a safely anonymized form.
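A minimal sketch of the hashing idea above, in Python. The field names, the sample records, and the `pseudonymize` helper are all hypothetical. It uses a keyed hash (HMAC-SHA256) rather than a bare hash, which is my assumption and not Fowler’s suggestion: identifiers such as email or IP addresses come from a small enough space that an unkeyed hash can be reversed by guessing.

```python
import hashlib
import hmac

# Hypothetical secret key. In practice, load it from a secure store,
# never from source code.
SECRET_KEY = b"replace-with-a-long-random-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable, keyed SHA-256 digest of an identifier."""
    normalized = value.strip().lower()  # same person -> same digest
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Illustrative survey rows (made-up data).
records = [
    {"email": "ana@example.com", "department": "Finance", "engagement": 4},
    {"email": "Ana@example.com ", "department": "Finance", "engagement": 5},
    {"email": "raj@example.com", "department": "HR", "engagement": 3},
]

# Keep only what the analysis needs; the raw email is never stored.
safe_records = [
    {
        "person_key": pseudonymize(r["email"]),
        "department": r["department"],
        "engagement": r["engagement"],
    }
    for r in records
]
```

Because the digest is stable, you can still join or count by `person_key` across data sets without ever holding the original identifier.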

As I’ve spent more and more time with HR data, I’ve grown more comfortable with less. I’ve worked with many business teams, notably marketing, that thrive on ‘more is more’, but with people data, less is often appropriate. The first goal is to respect the data of actual people, which is becoming increasingly rare; the second is to remain legally compliant. You can do both.

5 more reasons not to use Excel for People Analytics

Chief Financial Officers are now demanding their teams stop using Excel. While your C-level executives may not be demanding this of you, there are very good reasons to consider alternatives. If Finance is ready to abandon Excel, HR should certainly make the jump. Seriously, have you ever seen what a Financial Analyst builds in Excel? If not, well, just be glad.

1 Excel doesn’t do Big Data

Excel tops out at 1,048,576 rows. I believe that the majority of HR departments do not have Big Data… yet. To HR generally, ~1 million rows may feel like huge data, but it does not meet today’s definition of Big Data. In fact, that’s nowhere close.

Excel supports 16,384 columns in a worksheet. Personally, I’ve never seen any data nearly as wide as 16,000+ columns, and I never, ever want to.

I’m willing to wager a large sum that your HR data is not going to come in a wide format, but rather a long one. When your data is in a long format, even the HR data of a mid-sized organization will surpass the ~1 million row limit.

Headcount is a simple example. Let’s consider a few reasonable scenarios and see when we max out Excel.

  • Assume you have 40,000 active employees. With 25 years of annual history, you’re at 1,000,000 rows and have all but hit your limit.
  • Assume you have 10,000 employees, but you want to look at this on a monthly basis. You’ll only get 8 years’ worth of data in Excel.
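The arithmetic behind those two scenarios can be checked with a few lines of Python. This is a rough sketch that assumes one row per employee per snapshot period and ignores the header row; the function name is my own.

```python
EXCEL_MAX_ROWS = 1_048_576  # row limit of a single Excel worksheet


def periods_that_fit(employees: int, max_rows: int = EXCEL_MAX_ROWS) -> int:
    """Whole snapshot periods that fit when each period adds one row per employee."""
    return max_rows // employees


# Scenario 1: 40,000 employees, one snapshot per year.
annual_snapshots = periods_that_fit(40_000)

# Scenario 2: 10,000 employees, one snapshot per month.
monthly_snapshots = periods_that_fit(10_000)
full_years_of_months = monthly_snapshots // 12

print(annual_snapshots, monthly_snapshots, full_years_of_months)
```

Under these assumptions the first scenario fits 26 annual snapshots and the second fits 104 monthly snapshots, which is 8 full years.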

Yes, you could of course pre-process some of the information. You could have your HCM aggregate and deliver the data. This is certainly reasonable, and even advisable in certain situations. But when you want to slice your information multiple ways (by gender, department, job level), each of those is a separate request for data. Most data analysis and visualization tools work best with granular data, from which you control the various aggregations yourself. I’ve never found a case where I didn’t benefit from having more granular information. Oh, except for when using Excel…
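To illustrate why granular data is more flexible, here is a small sketch using only Python’s standard library. The rows and field names are made up; the point is that a single employee-level extract can answer several slicing questions without a new request to the HCM.

```python
from collections import Counter

# Hypothetical employee-level rows; the field names are illustrative only.
employees = [
    {"gender": "F", "department": "Finance", "job_level": 2},
    {"gender": "M", "department": "Finance", "job_level": 3},
    {"gender": "F", "department": "HR", "job_level": 2},
    {"gender": "F", "department": "Finance", "job_level": 2},
]


def headcount_by(rows, *fields):
    """Count rows for each combination of the given fields."""
    return Counter(tuple(row[field] for field in fields) for row in rows)


# Two different aggregations from the same granular data.
by_department = headcount_by(employees, "department")
by_gender_and_level = headcount_by(employees, "gender", "job_level")
```

With a pre-aggregated extract, each of those two views would have been a separate data request.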

2 I don’t like Excel graphing.

Honestly, I hate Excel graphs. This is my least favorite part of using Excel. I feel like a data visualization failure when I use Excel. I can perform advanced table calculations in Tableau, build interactive Python and R visualizations, and write complex database queries; yet I can’t manage a decent bar graph in Excel. That’s only a slight exaggeration.

Granted, I’ve never put in the time to really master Excel graphing. But I’ve no motivation to. It’s complex, limited, and I’ve already found many better options. Why torture myself further? I’ve seen the light, and it’s glorious outside of Excel.

3 Endless calculating


'Calculating 4 processors...'

Oh. my. gosh.

The amount of time I’ve suffered through Excel crunching data. Literally crunching data; leaving my work laptop sounding like it’s grinding something internally. And all I did was add a formula and apply it to the colu… *computer promptly stops responding*.

That’s all it takes to lose your Tuesday afternoon to a seemingly endless cycle of calculations. There are websites and blogs dedicated to speeding up Excel. I say it’s faster to not use Excel at all.

4 Repeatability

Every Excel user has had to use this. It’s always at the perfect moment too: just before a big presentation, one final tweak … and, NO!, No, no, no; nooooooooo! Yes, Excel has crashed again. You’re left scrambling to recover your workbook.

Sheets get deleted. Formulas get altered. And all of this happens before your data even changes. For those among us who love to build reports and dashboards in Excel, just watch when their manager asks for the most minor of cosmetic layout alterations. Their face says it all: “You just added 8 hours of unmerging, moving, and resizing 4,000 cells because of your request.”

5 Accuracy

$6 billion. That’s the amount of money JPMorgan Chase lost in 2012, in large part due to Excel errors.

88%. That’s the share of spreadsheets found to contain human errors. Nearly 9 out of 10.

Those numbers likely speak for themselves. Excel has a great ‘feature’: ‘paste as values’. I use it when I want to avoid the dreaded ‘Calculating…’ The downside is that it leaves absolutely zero evidence of the work. You could record macros, but good luck making quick changes to a macro. If you can do that, I imagine you’re already writing code elsewhere as well.

Alternatives

There’s a lengthy list, and I’ve plans to cover these in-depth for use in People Analytics.

Open-Source Languages:

Other options:

Each of these has its pros and cons. Open-source languages have endless possibilities, but you’ve got to learn to code. Tools such as Tableau and QlikView can cost thousands per license.

Results matter most

I’ll be honest, you can’t, and probably shouldn’t, avoid Excel entirely. There’s a right tool for every job. There are jobs that Excel is great for, maybe even perfect for.

I hope you’ll check out some of these, keeping an open and curious mind. Check out some of my Tutorials; I hope to convince you through examples more than through my words.

There’s also this: the best tool is the one you use.

midnight clock: Photo by Loic Djim on Unsplash

dog: Photo by Matthew Henry on Unsplash

HR Data Isn’t Big Data

HR Data is just the right size

Whatever data you have is the data you can work with. If you’re at the beginning or early arc of your Analytics Journey, you have enough data, even for some of the more advanced analytics. Getting your hands on it, cleaning it, and forming actionable insights from it may be a different story.

Big Data has become the catch-all term for any reference to analytics. HR data isn’t Big Data.

Big data is data sets that are so voluminous and complex that traditional data processing application software are inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy. There are five dimensions to big data known as Volume, Variety, Velocity and the recently added Veracity and Value. – Wikipedia

This does not describe HR data. Most organizations considering applying analytics to their data have thousands, if not just hundreds, of employees. Combine your historical HCM data with a few other sources and you may climb above a million records. Excel alone can handle that.

Take advantage of the current state

So HR doesn’t have big data. That doesn’t mean analytics isn’t worth pursuing. On the contrary, it makes HR ripe for pursuing analytics with the data sets available today. You can begin your analytics journey without worrying about capital projects and IT teams building out Hadoop clusters for big data processing, which would also have to account for the unique security requirements governing HR data. None of that is required. You can begin with simple queries of HCM data using tools your team already uses, and you’re on the path to insights.

To HR, analytics does seem like big data. HR is typically used to individual-level transactions: career planning, performance assessments, candidate interviews, compensation reviews. Working with organization-wide data, or even subsets of it, can feel like big data to HR professionals. It’s certainly a step up from traditional HR practice, and it provides a big return for the effort expended.

It won’t stay this way for long

HR data not being classified as Big Data is the state today. With the proliferation of personal activity trackers, organizational network analysis (ONA), and other emerging forms of data collection, HR data will reach the classification of big data in the not-too-distant future. All the more reason to begin your analytics journey now, so you’re ready for the shift as it comes.

HR Data isn’t Big Data. Yet.

data+people 2019 objectives

An HR-geared site should have some objectives, right? For the year, we have lofty goals to pursue and share with the world of people analytics.

First up will be a push to build out the initial round of machine learning posts: a step-by-step approach to building your first few machine learning projects with HR data. This will be done first in Python.

Which brings us to our next goal: R. Many of you want to see these items completed in R, and we’ll begin to port some of the code and examples over to R. This will be a fun learning experience on our side as well.

Thirdly, more visualization examples. People analytics can mean giving users access to the information and letting them find answers and meaning. There are so many ways today to give users direct access to explore data.

Finally, data and data science education. To the earlier point of getting information into the hands of interested parties, there should be a push, especially in HR, to understand data and acquire the data literacy and skills to better serve organizations and modern business. Why ‘especially in HR’? HRBPs are great at people-related skills, but we’re bringing data to the last business unit to embrace it, and it’s not a skill that is strong in HR… Yet.

Onward…

Beginner’s Guide: Python for Analytics | Seaborn

Beginner’s Guide to Using Python with HR Data | Exploration Series

Part Three – Seaborn

In this first tutorial series, I’m exploring the IBM HR Attrition and Performance data set. This is a great data set for demonstrating the possibilities of machine learning and other data science techniques.

Now we’ll move on to using Seaborn for our visualizations. The benefit of Seaborn is that it further abstracts the complex, underlying calls needed to visualize your data, allowing you to focus on your analysis task rather than on how to implement what you want to do. It goes even further and provides built-in functionality that would be incredibly complex to implement on your own.

Series Outline

0: basic operations & summary statistics

1: matplotlib

2: pandas visualization

3: seaborn

4: plotly

5: series summary

3: Seaborn

 

view on github




Credits
Photo: Photo by Randall Ruiz on Unsplash

Beginner’s Guide: Python for Analytics | Pandas

Beginner’s Guide to Using Python with HR Data | Exploration Series

Part Two – Pandas

In this first tutorial series, I’m exploring the IBM HR Attrition and Performance data set. This is a great data set for demonstrating the possibilities of machine learning and other data science techniques.

Next, we’ll take a look at the power of Pandas to plot our data. As a budding data [analyst/scientist/enthusiast], I’ve made Pandas my most common import and tool. Plotting directly from pandas objects makes it very easy to stay in the flow of analyzing data. Let’s get going.

Series Outline

0: basic operations & summary statistics

1: matplotlib

2: pandas visualization

3: seaborn

4: plotly

5: series summary

2: Pandas

 

view on github


