Washtub

Documentation

Introduction

Washtub prevents data leaks and theft by anonymizing production data for other uses. Designed specifically for developers, Washtub is built around a simple command line interface that allows developers to pull clean data from production without personally identifiable information residing on their local machine.

The basic steps involved are:

  1. Register and create an account.
  2. Initialize your database on Washtub.
  3. Wash the database.
  4. Download the cleaned, anonymized, masked data.

Heroku Integration

Washtub aims to be extremely easy for developers to use. To that end, a Heroku CLI plugin allows pulling cleaned data for local development with just a single command. This new command will replace your existing usage of pg:pull to ensure only anonymized data resides on your local machine.

To enable local command line usage with Heroku, you'll need to set an environment variable containing your Washtub API token and install the Washtub plugin for the Heroku CLI. You can obtain your API token from your Washtub dashboard after creating an account.

First, set your token:

$ heroku config:set WASHTUB_TOKEN=your_washtub_token
Setting WASHTUB_TOKEN and restarting ⬢ your-app-name... done, v999
WASHTUB_TOKEN: your_washtub_token

Next, install the plugin:

$ heroku plugins:install heroku-washtub
yarn add v1.6.0
info No lockfile found.
[1/4] Resolving packages...
[2/4] Fetching packages...
[3/4] Linking dependencies...
[4/4] Building fresh packages...
Done in 10.44s.
Installing plugin heroku-washtub... done

Use the plugin to initialize your database on Washtub. You may optionally specify a database by passing its Heroku database URL name, for example HEROKU_POSTGRESQL_COBALT_URL. Without an argument, Washtub defaults to your DATABASE_URL.

$ heroku washtub:init
Initializing washtub for your database DATABASE_URL... Done.

After initializing your database, you can confirm and set your wash strategies in the Washtub web dashboard. Read more about Initialization for an overview of the process.

Once you have initialized your database and confirmed your wash strategies through the web dashboard, you may wash your database and download a copy for local development in a single step. The CLI parameters are the same as those of heroku pg:pull: specify the Heroku database name followed by your local Postgres database name.

$ heroku washtub:wash DATABASE_URL statusgator_development

Initialization

The Washtub database initialization process involves several key steps:

  1. Connect on a read-only basis to your production database
  2. Import your schema: table names and column names and types
  3. Suggest a wash strategy for each column
  4. Disconnect from your database without making any changes

No data is modified on your running database; only the schema is read in order to suggest washing strategies. Later, when you wash your database, a new copy is made from a backup, and that copy is anonymized and exported to your local development database.

The initialization process will ingest your schema and make recommendations on appropriate washing strategies to use for each column. A washing strategy is a method of manipulating data in a given column using obfuscation, randomization or other modification to anonymize data.

To see the suggested strategies and confirm their use, sign in to Washtub and visit your dashboard, where the suggested strategies are displayed for you to review.

Only strategies that you confirm are used. If you neither confirm any suggested strategies nor set any manually, no anonymization will be performed. To confirm, review the list, choose different strategies as needed, and save your changes.
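
As a rough illustration of this rule, the following Python sketch models each column as having an optional suggested strategy and an optional confirmed strategy, and applies only the confirmed ones. The `ColumnPlan` class, the `strategies_to_apply` function, and the column names are hypothetical and are not part of Washtub's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnPlan:
    """Hypothetical record of the strategy state for one column."""
    column: str
    suggested: Optional[str] = None  # proposed during initialization
    confirmed: Optional[str] = None  # saved by the user in the dashboard


def strategies_to_apply(plans):
    """Return {column: strategy} for confirmed strategies only."""
    return {p.column: p.confirmed for p in plans if p.confirmed is not None}


plans = [
    ColumnPlan("users.email", suggested="Random Email", confirmed="Random Email"),
    ColumnPlan("users.ssn", suggested="Random SSN"),  # suggested but never confirmed
    ColumnPlan("users.notes", confirmed="Nullify"),   # set manually, no suggestion
]
print(strategies_to_apply(plans))
```

In this sketch, `users.ssn` is left untouched because its suggestion was never confirmed.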

Strategies

Random Email

Generates a sequential email address in the form user#{n}@testdomain.com. Email addresses are guaranteed to be unique within a table.
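
A minimal Python sketch of this behavior, assuming one counter per table and numbering that starts at 1 (the starting value is not documented). The function names and the dictionary-based rows are illustrative only.

```python
from itertools import count


def email_generator(domain="testdomain.com"):
    """Yield user1@..., user2@..., and so on."""
    for n in count(1):  # assumption: numbering starts at 1
        yield f"user{n}@{domain}"


def wash_emails(rows, column="email"):
    """Return copies of rows with the email column replaced."""
    emails = email_generator()  # one counter per table keeps values unique
    return [{**row, column: next(emails)} for row in rows]


rows = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 2, "email": "bob@example.com"},
]
washed = wash_emails(rows)
print(washed)
```

Because each address embeds a distinct counter value, uniqueness within the table follows without any duplicate checking.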

Random SSN

Generates a string consisting of 9 random digits. These social security numbers are not formatted with dashes, so they take the form XXXXXXXXX rather than XXX-XX-XXXX.
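
The format can be sketched in a few lines of Python. The `random_ssn` function is a hypothetical illustration, not Washtub's implementation, and it makes no attempt to produce numbers that follow real SSN allocation rules.

```python
import random


def random_ssn(rng=random):
    """Return 9 random digits as one unformatted string (no dashes)."""
    return "".join(str(rng.randint(0, 9)) for _ in range(9))


ssn = random_ssn()
print(ssn)
```

Passing a seeded `random.Random` instance as `rng` makes the output repeatable, which can be convenient in tests.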

Random First Name

Uses Ffaker to generate a random first (given) name, drawing on both the male and female names in the Ffaker library. Not guaranteed to be unique.

Random Last Name

Utilizes Ffaker to generate a random last (family) name. Not guaranteed to be unique.

Random Company Name

Utilizes Ffaker to generate a random company name. Not guaranteed to be unique.

Nullify

Useful for stripping unneeded data, this strategy sets the chosen column to NULL in every row.
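
In Python terms, the effect resembles the sketch below, where rows are plain dictionaries and `None` stands in for SQL NULL. The `nullify` helper is hypothetical. On a real table the equivalent statement would be an UPDATE that sets the column to NULL, which assumes the column permits NULL values.

```python
def nullify(rows, column):
    """Return copies of rows with the given column set to None (SQL NULL)."""
    return [{**row, column: None} for row in rows]


rows = [
    {"id": 1, "notes": "call back Tuesday"},
    {"id": 2, "notes": "VIP customer"},
]
washed = nullify(rows, "notes")
print(washed)
```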

Reinitialization

Database initialization is an idempotent process. You can run the initialization process over again on your database and it will generate the same suggestions. If you change your schema by adding or changing tables or columns, you can reinitialize and any new tables or columns will be given updated strategy suggestions.

If you override a strategy, a new one will not be suggested. Only columns with no strategy chosen or with a suggested strategy will be analyzed for appropriate wash strategies.
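
The reinitialization rules can be sketched as follows. This is an illustration under stated assumptions, not Washtub's implementation: the name-based `suggest` function is a stand-in for whatever heuristics Washtub actually uses, and the `state` layout (column name mapped to a strategy and its source) is hypothetical.

```python
def suggest(column_name):
    """Hypothetical name-based suggester; the real heuristics are not documented here."""
    hints = {
        "email": "Random Email",
        "ssn": "Random SSN",
        "first_name": "Random First Name",
    }
    return hints.get(column_name)


def reinitialize(schema_columns, state):
    """Recompute suggestions without touching user overrides.

    state maps column name -> (strategy, source), where source is
    "suggested" or "override".
    """
    new_state = {}
    for col in schema_columns:
        strategy, source = state.get(col, (None, "suggested"))
        if source == "override":
            new_state[col] = (strategy, source)          # keep the user's choice
        else:
            new_state[col] = (suggest(col), "suggested")  # re-analyze the column
    return new_state


state = {"email": ("Nullify", "override"), "ssn": ("Random SSN", "suggested")}
columns = ["email", "ssn", "first_name"]  # first_name was added to the schema
once = reinitialize(columns, state)
twice = reinitialize(columns, once)
print(once)
print(once == twice)
```

Running the function a second time on its own output changes nothing, which is the idempotence described above. The sketch also drops columns that no longer appear in the schema; that choice is an assumption.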