002 What Hot Models Look Like – For the Love of Data

Summary:
Hot models…data models that is. A survey of many of the most popular data modeling approaches in the news today. Third Normal Form, Anchor Modeling, Data Vault, Data Lakes, Data Swamps. What do they do well, what do they do badly, and which is the one true data model to rule them all? (Hint: it depends, as usual.)
Third Normal Form (3NF) (a.k.a. Naomi Sims)
History: E.F. Codd defined 3NF in 1971 while working at IBM.
Basic Concept:
“The Key, the Whole Key, and Nothing but the Key” – Bill Kent
The gold standard for purist relational database design. A table is in 3NF if it meets the following criteria (a minimal schema sketch follows the list):
  1. 1NF – a) values in a particular field must be atomic and b) a single row cannot have repeating groups of attributes
  2. 2NF – in addition to being in 1NF, all non-key attributes of the table depend on the entire primary key
  3. 3NF – in addition to being in 2NF, there are no transitive functional dependencies (non-key attributes do not depend on other non-key attributes)
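To make the three rules concrete, here is a minimal sketch of a small order-tracking schema in 3NF using Python’s built-in sqlite3 module. The table and column names are my own illustrative assumptions, not something from a specific product.

```python
import sqlite3

# Hypothetical order-tracking schema in 3NF (illustrative names only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 1NF: every column holds a single atomic value; no repeating groups.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);

-- 2NF: every non-key column depends on the whole primary key.
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL
);

-- 3NF: no transitive dependencies -- the product's price lives with the
-- product, not with every order line that happens to reference it.
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    unit_price REAL NOT NULL
);

CREATE TABLE order_line (
    order_id   INTEGER NOT NULL REFERENCES "order"(order_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
""")
conn.commit()
```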
Pros:
  • A battle-tested, well-understood modeling approach that is extremely useful for transactional (OLTP) applications
  • Easy to insert, update, delete data because of referential integrity
  • Avoids redundancy, requiring less space and fewer points of contact for data changes
  • Many software tools exist to automatically create, reverse engineer, and analyze databases according to 3NF
  • Writing to a 3NF DB is very efficient
Cons:
  • Reading from a DB in 3NF is not as efficient
  • Not as easily accessed by end-users because of the increased number of joins
  • More difficult to produce analytics (trends, period-to-date aggregations, etc.)
  • Even transactional systems are often slightly de-normalized from 3NF for performance or auditability
  • Some people feel that 3NF is no longer as appropriate in an era of cheap storage, incredibly fast computing, and APIs
 
[Image: example 3NF schema – Source: ewebarchitecture.com]
 
Anchor Modeling (incorporates Sixth Normal Form [6NF]) (a.k.a. Gisele Bundchen)
History: Created in 2004 in Sweden
Basic Concepts: Mimics a temporal database using four core constructs (a table sketch follows the list):
  • anchors – entities or events
    • Example: A person
  • attributes – properties of anchors
    • Example: A person’s name (attributes can carry history, such as a favorite color that changes over time)
  • ties – relationships between anchors
    • Example: Siblings
  • knots – shared properties, such as states or reference tables – combination of an anchor and a single attribute (no history)
    • Example: Gender – only male/female
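As a rough illustration of how the four constructs can translate into tables (again with sqlite3; the naming convention and columns are my own assumptions, not the official Anchor Modeling notation):

```python
import sqlite3

# Hypothetical anchor-model tables for the "person" example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Anchor: one row per person, identity only.
CREATE TABLE PE_Person (
    PE_ID INTEGER PRIMARY KEY
);

-- Attribute: historized property of the anchor (valid_from carries history).
CREATE TABLE PE_NAM_Person_Name (
    PE_ID      INTEGER NOT NULL REFERENCES PE_Person(PE_ID),
    name       TEXT NOT NULL,
    valid_from TEXT NOT NULL,
    PRIMARY KEY (PE_ID, valid_from)
);

-- Knot: small shared value set, no history.
CREATE TABLE GEN_Gender (
    GEN_ID INTEGER PRIMARY KEY,
    label  TEXT NOT NULL UNIQUE
);

-- Knotted attribute: anchor plus a single knotted value.
CREATE TABLE PE_GEN_Person_Gender (
    PE_ID  INTEGER PRIMARY KEY REFERENCES PE_Person(PE_ID),
    GEN_ID INTEGER NOT NULL REFERENCES GEN_Gender(GEN_ID)
);

-- Tie: relationship between two anchors (siblings).
CREATE TABLE PE_PE_Sibling (
    PE_ID_1 INTEGER NOT NULL REFERENCES PE_Person(PE_ID),
    PE_ID_2 INTEGER NOT NULL REFERENCES PE_Person(PE_ID),
    PRIMARY KEY (PE_ID_1, PE_ID_2)
);
""")
conn.commit()
```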
Pros:
  • Incremental change approach – previous versions of a schema are always encompassed in new changes, so backwards compatibility is always preserved
  • Reduced storage requirements by using knots
Cons:
  • The model spawns many tables in the database
  • Joins become very complex, and the model is hard for end users to understand
  • Daunting for new technical staff to come up to speed initially
 
[Image: example Anchor Model – Source: bifuture.blogspot.com]
 
 
Data Vault (DV) (a.k.a. Heidi Klum)
History: Dan Linstedt started implementing data vaults in 1990 and published the first version of the methodology (DV1.0) in 2000. He published an updated version (DV2.0) in 2013. The methodology is proprietary: Dan restricts who can train others by maintaining a copyright on the methodology and requiring anyone who trains others to be Data Vault certified. You can still implement data vaults; you just cannot train others on the methodology without being certified.
Basic Concept:
“A single version of the facts (not a single version of the truth)”
“All the data, all the time” – Dan Linstedt
The data vault consists of three primary structures plus supporting structures such as reference tables and point-in-time (PIT) and bridge tables. The three main structures are (a sketch follows the list):
  1. Hubs – a list of unique business keys that change infrequently with no other descriptive attributes (except for meta data about load times and data sources). A good example of this is a car or a driver.
  2. Links – relationships or transactions between hubs. These only define the link between entities and can easily support many-to-many relationships; again no descriptive attributes on these tables other than a few meta-attributes. An example of this would be a link between cars and their drivers.
  3. Satellites – Satellites may attach to hubs or links and are descriptive attributes about the entity to which they connect. A satellite for a car hub could describe the year, make, model, current value, etc. These often have some sort of effective dating.
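Here is a rough sketch of those three structures for the car-and-driver example, again using sqlite3; the column names and load metadata shown are illustrative assumptions rather than a canonical Data Vault layout.

```python
import sqlite3

# Hypothetical hub/link/satellite tables for the car-and-driver example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: unique business keys plus load metadata only.
CREATE TABLE hub_car (
    car_hk     TEXT PRIMARY KEY,   -- hash of the business key (DV2.0 style)
    vin        TEXT NOT NULL UNIQUE,
    load_dts   TEXT NOT NULL,
    record_src TEXT NOT NULL
);

CREATE TABLE hub_driver (
    driver_hk  TEXT PRIMARY KEY,
    license_no TEXT NOT NULL UNIQUE,
    load_dts   TEXT NOT NULL,
    record_src TEXT NOT NULL
);

-- Link: relationship between hubs, no descriptive attributes.
CREATE TABLE link_car_driver (
    car_driver_hk TEXT PRIMARY KEY,
    car_hk        TEXT NOT NULL REFERENCES hub_car(car_hk),
    driver_hk     TEXT NOT NULL REFERENCES hub_driver(driver_hk),
    load_dts      TEXT NOT NULL,
    record_src    TEXT NOT NULL
);

-- Satellite: descriptive, effective-dated attributes hanging off the hub.
CREATE TABLE sat_car_details (
    car_hk     TEXT NOT NULL REFERENCES hub_car(car_hk),
    load_dts   TEXT NOT NULL,
    make       TEXT,
    model      TEXT,
    model_year INTEGER,
    PRIMARY KEY (car_hk, load_dts)
);
""")
conn.commit()
```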
General best practices:
  • Separate attributes from different source systems into their own satellites, at least in a raw data vault. With this approach it is common to have a raw data vault that retains source-system-specific information with all history and attributes, plus a second, downstream business data vault that contains only the relevant attributes, history, or merged data sets that have meaning to the users of that vault.
    • Having a raw vault lets you preserve all historical data and rebuild the business vault if needs change, without going back to source systems and without losing data that is no longer available in the source system.
  • Track all changes to all elements so that your data vault contains a complete history of all changes.
  • Start small with a few sources and grow over time. You don’t have to adopt a big bang approach and you can derive value quickly.
  • It is acceptable to add new satellites when changes occur in the source system. This allows you to iteratively develop your ETL without breaking previous ETL routines already created and tested.
DV2.0 – DV1.0 was merely the model. DV2.0 is:
  • An updated modeling approach. Key changes include:
    • Numeric IDs are replaced with hash values, created in the staging area, that support better integration with NoSQL repositories
    • Because hashes are used, data loads can be parallelized even further: you do not have to look up a surrogate ID if you have the business key to hash from when bringing in data, which means hubs, links, and satellites can in some cases be loaded at the same time (see the hashing sketch after this list)
    • Referential integrity is disabled during loading
  • Recommended architectures around staging areas, marts, virtualization, and NoSQL
  • Additional methodology recommendations around Agile, Six Sigma, CMMI, TQM, etc.
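To illustrate the hash-key idea, here is a small Python sketch. The normalization rules (trim, upper-case, delimiter) and the choice of MD5 are my assumptions; DV2.0 itself leaves room for different hash functions. The point is that the same business key always produces the same hash, so hubs, links, and satellites can be loaded in parallel without surrogate-ID lookups.

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic hash key from one or more business keys.

    Keys are trimmed, upper-cased, and joined with a delimiter before
    hashing, so the same logical key always produces the same hash no
    matter which load process computes it.
    """
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hypothetical keys: the hub row for a car and the link row joining it to a
# driver can be built independently -- no numeric surrogate-ID lookup needed.
car_hk = hash_key("1HGCM82633A004352")                            # hub_car key from the VIN
driver_hk = hash_key("D123-4567-8901")                            # hub_driver key from the license
car_driver_hk = hash_key("1HGCM82633A004352", "D123-4567-8901")   # link key

print(car_hk, driver_hk, car_driver_hk)
```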
Pros:
  • Preserves all data, all the time – this provides the capability for tremendous analysis and responding to changing business needs. The approach allows you to obtain data from multiple sources iteratively and rapidly, preserving backwards compatibility
  • Works extremely well with massively parallel processing (MPP) databases and hardware
  • Can be loaded extremely rapidly, particularly using the DV2.0 modeling approach
  • Lends itself very well to ETL and DW automation/virtualization
  • DV2.0 covers a wide spectrum of modeling needs from staging and marts to methodology
Cons:
  • The data model can spawn a lot of tables and make queries very complicated very quickly.
  • The raw data vault is really not meant for end users to query/explore directly
  • Iterative additions make the data model more complicated
  • Although storage may be cheap, keeping all changes for all data in all sources can lead to data sprawl. This also makes a pared down information mart almost a necessity.
  • Raw DV data is not cleansed, and data from multiple sources is not blended when stored
[Image: example Data Vault model]
Data Lake (DL) (a.k.a. Brooklyn Decker)
History: Term was coined by Pentaho CTO James Dixon in a blog post in 2010 referring to Pentaho’s data architecture approach to storing data in Hadoop.
Basic Concept: A massive big data repository, typically built on Hadoop/HDFS. Key points (a schema-on-read sketch follows the list):
  1. Schema-less – data is written to the lake in its raw form without cleansing
  2. Ingests different types of data (relational, event-based, documents, etc.) in batch and/or real-time streaming
  3. Automated meta data management – a best practice is to use tools to automatically catalog meta data to track available attributes, last access times, data lineage, and data quality
  4. Typically multiple products are used to load data into and read data from the lake
  5. Rapid ability to ingest new data sources
  6. Typically only a destination; operational systems do not usually source data from the lake
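As a toy illustration of schema-on-read (my own example; a real lake would more likely use Spark, Hive, or similar rather than plain Python), raw events land untouched and each consumer applies whatever structure it needs at read time:

```python
import json

# Raw events landed in the lake as-is -- no cleansing, no enforced schema.
raw_events = [
    '{"user": "a", "action": "click", "ts": "2016-01-01T10:00:00"}',
    '{"user": "b", "page": "/home", "duration_ms": 1234}',
    'not even valid JSON -- the lake stores it anyway',
]

# Schema on read: this consumer projects only the fields it cares about,
# skipping records that do not fit its particular view of the data.
clicks = []
for line in raw_events:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # another consumer might still want this record
    if event.get("action") == "click":
        clicks.append({"user": event["user"], "ts": event["ts"]})

print(clicks)  # [{'user': 'a', 'ts': '2016-01-01T10:00:00'}]
```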
Pros:
  • Useful when you do not know what attributes will be needed or used.
  • Schema on Read – can ingest any type of data and allow different users to assess value during analysis
  • Extremely large scale at low to moderate cost
  • Can and will use a variety of tools/technologies to analyze/visualize/massage data into a useful form
Cons:
  • Can be seen as a vast wasteland of disorganized data, particularly without good meta data
  • Consumers must understand raw data in various systems to know how to integrate and cleanse it in order to derive meaningful information
  • High likelihood that different consumers will perform very similar operations to retrieve data (i.e., overlap and duplication of effort); slight differences between groups can also force time-consuming reconciliation
  • Uncleansed data and multiple versions of the same data may possibly lead to duplication if not handled/filtered carefully
  • It isn’t SQL – Some users will have to use more than just SQL to derive useful information from data
    • Offloading ETL can require significant rework of existing processes to move to something like Hive
  • Using multiple tool sets can lead to training and supportability challenges if not governed properly
  • Data curation can be very challenging
[Image: example Data Lake]
 
Data Swamp (DS) (a.k.a. Tyra Banks)
History: I’m not including a lot of history here, because this is really an extension of a Data Lake (gone bad).
Basic Concept: A data swamp is a data lake that has been poorly maintained or documented, lacks meta data, or has so much raw data that you don’t know where to start for insights. Or, it could be a combination of several of those points. When you start tracking tons of data from all different sources, but you don’t know who is using what, how to merge data sets, or how to use most of the data in your “data lake”, you’ve really got a data swamp.
Pros:
  • Hey, you must’ve done something right to get all that data into the repository…?
  • At least you haven’t lost data that you can’t go back and get.
  • If it were easy, everyone would be doing it 🙂
Cons:
  • You’ve likely spent a lot of time and effort putting in a data lake/HDFS/Hadoop/Hive/etc. and you’re struggling to operate it at scale or to answer the questions you set out to answer.
  • You need meta data to clue users into what is most useful, relevant, or recent
  • You probably need to look into key use cases (low hanging fruit) and start from that point as a place to begin using/resuscitating your repository.
*** The assignment of model names to each data model was an incredibly (un)scientific process of googling various terms like “most famous supermodel <year>”, “<year> top supermodel”, etc. and teasing out the most likely #1. Feel free to disagree and let me know your vote and how you obtained it.