For many of the early workbooks here, we've stood on the shoulders of others, simply importing a dataset that was created for IBM's Watson HR Analytics work.
# imports
import pandas as pd
# updated 2019-08-13
# IBM has removed the file from their server
# deprecated code
# read the file
# url = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
# empl_data = pd.read_excel(url)
# read local file for demonstration
file = 'Dropbox/WFA/data/WA_Fn-UseC_-HR-Employee-Attrition.xlsx'
empl_data = pd.read_excel(file)
empl_data.head()
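Before moving on, it's worth a quick sanity check that the local copy matches the familiar schema - the standard IBM attrition file includes an 'Attrition' column of Yes/No values, which is what most of those blog posts and Kaggle kernels key on:
# quick sanity check: row/column counts, plus the Yes/No split on the
# 'Attrition' column found in the standard IBM file
empl_data.shape
empl_data['Attrition'].value_counts()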
This is great to get us started, and gave us a dataset that many others had used - in blog posts, Kaggle competitions, and otherwise. Now, we're ready for more and would like to generate our own dataset for continued development and exploration.
GitHub: https://github.com/tirthajyoti/pydbgen
Read the docs: https://pydbgen.readthedocs.io/en/latest/
From the pydbgen documentation: "Often, beginners in SQL or data science struggle with the matter of easy access to a large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one's own choice?
After all, databases break every now and then and it is safest to practice with a randomly generated one :-)"
That sums it up very well - we need data to practice on, and in a safe way. Especially when we're dealing with PII and sensitive data, as we regularly are in HR. It's so commonplace that some, unfortunately, become desensitized to the sensitive nature and handling requirements, and blunder by posting to a public S3 bucket or making a similar, but disastrous, mistake.
Generating our own fake data protects us from ourselves. pydbgen allows us to do this very quickly, and generates very realistic data.
You can install pydbgen with conda (my preferred installation method). On both Windows and Linux, pip also works (pip install pydbgen).
# load pydbgen and create a pydb generator object
import pydbgen
from pydbgen import pydbgen

db = pydbgen.pydb()

# generate a 100-row DataFrame with a wide mix of field types
df = db.gen_dataframe(
    num=100,
    fields=['name', 'street_address', 'city', 'state', 'zipcode',
            'country', 'company', 'job_title', 'phone', 'ssn',
            'email', 'month', 'year', 'weekday', 'date', 'time',
            'latitude', 'longitude', 'license_plate'],
)
df.head()
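pydbgen can also write straight to a SQLite file, which is what the documentation quote above is really getting at. A minimal sketch, based on my reading of the docs (the gen_table arguments are as I recall them there; 'fake_hr.db' and 'employees' are names of my own choosing):
# sketch: write fake rows directly into a SQLite database file,
# per the pydbgen docs ('fake_hr.db' and 'employees' are my own names)
db.gen_table(
    num=100,
    fields=['name', 'job_title', 'phone', 'email'],
    db_file='fake_hr.db',
    table_name='employees',
)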
The documentation is a little lacking. For example, it does not (as of this writing) mention the 'Domains.txt' file required to generate email addresses. The documentation does, however, point us to Faker - the library pydbgen builds upon to generate the fata (fake data). We'll explore Faker in the next section.
from faker import Faker

fake = Faker()

# each call returns a new random value
fake.name()
fake.address()
# tab-complete on fake. to browse the full list of available generators
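One note before building a larger dataset: Faker's output is random on every run. If you want the same fata each time (handy for a reproducible tutorial like this one), seed the instance first:
# seed the generator so repeated runs produce identical fake data
fake.seed_instance(42)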
# build up a fake employee DataFrame column by column
fake_df = pd.DataFrame()

name_list = []
ssn_list = []
dob_list = []
address_list = []
city_list = []
state_list = []
country_list = []
postal_list = []
id_list = []
email_list = []
username_list = []

# generate 1,000 fake employees
for i in range(1000):
    name_list.append(fake.name())
    ssn_list.append(fake.ssn())
    dob_list.append(fake.date_of_birth())
    address_list.append(fake.street_address())
    city_list.append(fake.city())
    state_list.append(fake.state_abbr())
    country_list.append(fake.country_code())
    postal_list.append(fake.postalcode())
    email_list.append(fake.email())
    id_list.append(fake.random_int())
    username_list.append(fake.user_name())

fake_df['name'] = name_list
fake_df['ssn'] = ssn_list
fake_df['dob'] = dob_list
fake_df['address'] = address_list
fake_df['city'] = city_list
fake_df['state'] = state_list
fake_df['country'] = country_list
fake_df['postal'] = postal_list
fake_df['id'] = id_list
fake_df['email'] = email_list
fake_df['username'] = username_list

fake_df
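The eleven parallel lists work, but the same table can be built in far fewer lines with a list of dictionaries, which keeps each fake employee's fields together and skips the bookkeeping. A sketch of that alternative (fake_df_alt is my own name, so fake_df above stays intact):
# alternative sketch: one dict per fake employee, one DataFrame call
fake_df_alt = pd.DataFrame(
    [
        {
            'name': fake.name(),
            'ssn': fake.ssn(),
            'dob': fake.date_of_birth(),
            'email': fake.email(),
        }
        for i in range(1000)
    ]
)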
Faker also allows for the creation of your own providers.
from faker.providers import BaseProvider
import random

# create the provider. The class name for Faker must be 'Provider'
class Provider(BaseProvider):
    def gender(self):
        num = random.randint(0, 1)
        if num == 0:
            return 'Male'
        else:
            return 'Female'

# register the new provider, then call it like any built-in generator
fake.add_provider(Provider)
fake.gender()
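BaseProvider also ships small helpers of its own. The same provider can be written with random_element, which draws from Faker's internal (and therefore seedable) random source rather than the random module:
# alternative body: random_element picks from a sequence using
# Faker's own seedable random source
class Provider(BaseProvider):
    def gender(self):
        return self.random_element(('Male', 'Female'))

# re-register after redefining the class
fake.add_provider(Provider)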
# add this to the DataFrame
gender_list = []
for i in range(1000):
    gender_list.append(fake.gender())

fake_df['gender'] = gender_list
fake_df['gender'].head()
fake_df.info()
# convert gender column to category
fake_df['gender'] = fake_df['gender'].astype('category')
fake_df.info()
fake_df['gender'].head()
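The category dtype isn't just cosmetic - with only two distinct values across 1,000 rows, the column now stores small integer codes plus a two-entry lookup table. You can see the per-column footprint with memory_usage:
# per-column memory in bytes; the categorical gender column is far
# smaller than an equivalent object (string) column would be
fake_df.memory_usage(deep=True)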
# export the file
fake_df.to_csv('~/Downloads/FATA.csv')
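One small option worth knowing: by default, to_csv writes the DataFrame index as an unnamed first column. If you don't want it in the file, pass index=False:
# write the file without the pandas index column
fake_df.to_csv('~/Downloads/FATA.csv', index=False)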
This tutorial showed a few ways in which we can generate fake data - FATA - to allow us to continue to explore and analyze HR data. You could also use these techniques to anonymize your real HR data, letting you work with names, SSNs, and the like without compromising one of the most fundamental parts of working with HR data - privacy and respect for people's information.
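As a closing sketch of that anonymization idea: build a mapping from each distinct real value to one fake value, then replace the column. Everything here is hypothetical - real_df and its 'name' column stand in for whatever your actual HR extract looks like:
# hypothetical: real_df stands in for your actual HR extract;
# map each distinct real name to one consistent fake name
name_map = {real: fake.name() for real in real_df['name'].unique()}
real_df['name'] = real_df['name'].map(name_map)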