For many of the early workbooks here, we've stood on the shoulders of others, simply importing a dataset that was created for IBM's Watson HR Analytics work.
# imports
import pandas as pd
# updated 2019-08-13
# IBM has removed the file from their server
# deprecated code
# read the file
# url = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
# empl_data = pd.read_excel(url)
# read local file for demonstration
file = 'Dropbox/WFA/data/WA_Fn-UseC_-HR-Employee-Attrition.xlsx'
empl_data = pd.read_excel(file)
empl_data.head()
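Before moving on, it's worth a quick sanity check that the local copy matches the familiar schema - the standard IBM attrition file includes an 'Attrition' column of Yes/No values, which is what most of those blog posts and Kaggle kernels key on:
# quick sanity check: row/column counts, plus the Yes/No split on the
# 'Attrition' column found in the standard IBM file
empl_data.shape
empl_data['Attrition'].value_counts()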
This is great to get us started, and gave us a dataset that many others had used - in blog posts, Kaggle competitions, and otherwise. Now, we're ready for more and would like to generate our own dataset for continued development and exploration.
GitHub: https://github.com/tirthajyoti/pydbgen
Read the docs: https://pydbgen.readthedocs.io/en/latest/
From the pydbgen documentation: "Often, beginners in SQL or data science struggle with the matter of easy access to a large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one's own choice?
After all, databases break every now and then and it is safest to practice with a randomly generated one :-)"
That sums it up very well - we need data to practice on, and in a safe way. Especially when we're dealing with PII and sensitive data, as we regularly are in HR. It's so commonplace that some, unfortunately, become desensitized to the sensitive nature and handling requirements, and blunder by posting to a public S3 bucket or making a similar, but disastrous, mistake.
Generating our own fake data protects us from ourselves. pydbgen allows us to do this very quickly, and generates very realistic data.
You can install pydbgen with conda (my preferred installation method). On both Windows and Linux, pip also works (pip install pydbgen).
# load pydbgen and create a pydb generator object
import pydbgen
from pydbgen import pydbgen

db = pydbgen.pydb()

# generate a 100-row DataFrame with a wide mix of field types
df = db.gen_dataframe(
    num=100,
    fields=['name', 'street_address', 'city', 'state', 'zipcode',
            'country', 'company', 'job_title', 'phone', 'ssn',
            'email', 'month', 'year', 'weekday', 'date', 'time',
            'latitude', 'longitude', 'license_plate'],
)
df.head()
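pydbgen can also write straight to a SQLite file, which is what the documentation quote above is really getting at. A minimal sketch, based on my reading of the docs (the gen_table arguments are as I recall them there; 'fake_hr.db' and 'employees' are names of my own choosing):
# sketch: write fake rows directly into a SQLite database file,
# per the pydbgen docs ('fake_hr.db' and 'employees' are my own names)
db.gen_table(
    num=100,
    fields=['name', 'job_title', 'phone', 'email'],
    db_file='fake_hr.db',
    table_name='employees',
)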
The documentation is a little lacking. For example, it does not (as of this writing) mention the 'Domains.txt' file required to generate email addresses. The documentation does, however, point us to Faker - the library pydbgen builds upon to generate the fata (fake data). We'll explore Faker in the next section.
from faker import Faker

fake = Faker()

# each call returns a new random value
fake.name()
fake.address()
# tab-complete on fake. to browse the full list of available generators
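One note before building a larger dataset: Faker's output is random on every run. If you want the same fata each time (handy for a reproducible tutorial like this one), seed the instance first:
# seed the generator so repeated runs produce identical fake data
fake.seed_instance(42)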
# build up a fake employee DataFrame column by column
fake_df = pd.DataFrame()

name_list = []
ssn_list = []
dob_list = []
address_list = []
city_list = []
state_list = []
country_list = []
postal_list = []
id_list = []
email_list = []
username_list = []

# generate 1,000 fake employees
for i in range(1000):
    name_list.append(fake.name())
    ssn_list.append(fake.ssn())
    dob_list.append(fake.date_of_birth())
    address_list.append(fake.street_address())
    city_list.append(fake.city())
    state_list.append(fake.state_abbr())
    country_list.append(fake.country_code())
    postal_list.append(fake.postalcode())
    email_list.append(fake.email())
    id_list.append(fake.random_int())
    username_list.append(fake.user_name())

fake_df['name'] = name_list
fake_df['ssn'] = ssn_list
fake_df['dob'] = dob_list
fake_df['address'] = address_list
fake_df['city'] = city_list
fake_df['state'] = state_list
fake_df['country'] = country_list
fake_df['postal'] = postal_list
fake_df['id'] = id_list
fake_df['email'] = email_list
fake_df['username'] = username_list

fake_df
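The eleven parallel lists work, but the same table can be built in far fewer lines with a list of dictionaries, which keeps each fake employee's fields together and skips the bookkeeping. A sketch of that alternative (fake_df_alt is my own name, so fake_df above stays intact):
# alternative sketch: one dict per fake employee, one DataFrame call
fake_df_alt = pd.DataFrame(
    [
        {
            'name': fake.name(),
            'ssn': fake.ssn(),
            'dob': fake.date_of_birth(),
            'email': fake.email(),
        }
        for i in range(1000)
    ]
)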
Faker also allows for the creation of your own providers.
from faker.providers import BaseProvider
import random

# create the provider. The class name for Faker must be 'Provider'
class Provider(BaseProvider):
    def gender(self):
        num = random.randint(0, 1)
        if num == 0:
            return 'Male'
        else:
            return 'Female'

# register the new provider, then call it like any built-in generator
fake.add_provider(Provider)
fake.gender()
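BaseProvider also ships small helpers of its own. The same provider can be written with random_element, which draws from Faker's internal (and therefore seedable) random source rather than the random module:
# alternative body: random_element picks from a sequence using
# Faker's own seedable random source
class Provider(BaseProvider):
    def gender(self):
        return self.random_element(('Male', 'Female'))

# re-register after redefining the class
fake.add_provider(Provider)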
# add this to the DataFrame
gender_list = []
for i in range(1000):
    gender_list.append(fake.gender())

fake_df['gender'] = gender_list
fake_df['gender'].head()
fake_df.info()
# convert gender column to category
fake_df['gender'] = fake_df['gender'].astype('category')
fake_df.info()
fake_df['gender'].head()
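The category dtype isn't just cosmetic - with only two distinct values across 1,000 rows, the column now stores small integer codes plus a two-entry lookup table. You can see the per-column footprint with memory_usage:
# per-column memory in bytes; the categorical gender column is far
# smaller than an equivalent object (string) column would be
fake_df.memory_usage(deep=True)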
# export the file
fake_df.to_csv('~/Downloads/FATA.csv')
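One small option worth knowing: by default, to_csv writes the DataFrame index as an unnamed first column. If you don't want it in the file, pass index=False:
# write the file without the pandas index column
fake_df.to_csv('~/Downloads/FATA.csv', index=False)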
This tutorial showed a few ways in which we can generate fake data - FATA - to allow us to continue to explore and analyze HR data. You could also use these techniques to anonymize your real HR data, letting you work with names, SSNs, and the like without compromising one of the most fundamental parts of working with HR data - privacy and respect for people's information.
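As a closing sketch of that anonymization idea: build a mapping from each distinct real value to one fake value, then replace the column. Everything here is hypothetical - real_df and its 'name' column stand in for whatever your actual HR extract looks like:
# hypothetical: real_df stands in for your actual HR extract;
# map each distinct real name to one consistent fake name
name_map = {real: fake.name() for real in real_df['name'].unique()}
real_df['name'] = real_df['name'].map(name_map)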