002 What Hot Models Look Like – For the Love of Data

Summary:
Hot models…data models that is. A survey of many of the most popular data modeling approaches in the news today. Third Normal Form, Anchor Modeling, Data Vault, Data Lakes, Data Swamps. What do they do well, what do they do badly, and which is the one true data model to rule them all? (Hint: it depends, as usual.)
Third Normal Form (3NF) (a.k.a. Naomi Sims)
History: E.F. Codd defined 3NF in 1971 while working at IBM.
Basic Concept:
“The Key, the Whole Key, and Nothing but the Key” – Bill Kent
The gold standard for purist relational database design. A table is in 3NF if it meets the following criteria (a minimal schema sketch follows the list):
  1. 1NF – a) values in a particular field must be atomic and b) a single row cannot have repeating groups of attributes
  2. 2NF – in addition to being in 1NF, all non-key attributes of the table depend on the entire primary key
  3. 3NF – in addition to being in 2NF, there are no transitive functional dependencies (non-key attributes do not depend on other non-key attributes)
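To make the three rules concrete, here is a minimal sketch of a small order-tracking schema in 3NF using Python’s built-in sqlite3 module. The table and column names are my own illustrative assumptions, not something from a specific product.

```python
import sqlite3

# Hypothetical order-tracking schema in 3NF (illustrative names only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 1NF: every column holds a single atomic value; no repeating groups.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);

-- 2NF: every non-key column depends on the whole primary key.
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL
);

-- 3NF: no transitive dependencies -- the product's price lives with the
-- product, not with every order line that happens to reference it.
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    unit_price REAL NOT NULL
);

CREATE TABLE order_line (
    order_id   INTEGER NOT NULL REFERENCES "order"(order_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
""")
conn.commit()
```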
Pros:
  • A battle-tested, well-understood modeling approach that is extremely useful for transactional (OLTP) applications
  • Easy to insert, update, delete data because of referential integrity
  • Avoids redundancy, requiring less space and fewer points of contact for data changes
  • Many software tools exist to automatically create, reverse engineer, and analyze databases according to 3NF
  • Writing to a 3NF DB is very efficient
Cons:
  • Reading from a DB in 3NF is not as efficient
  • Not as easily accessed by end-users because of the increased number of joins
  • More difficult to produce analytics (trends, period-to-date aggregations, etc.)
  • Even transactional systems are often slightly de-normalized from 3NF for performance or auditability
  • Some people feel that 3NF is no longer as appropriate in an era of cheap storage, incredibly fast computing, and APIs
 
[Image: example 3NF schema – Source: ewebarchitecture.com]
 
Anchor Modeling (incorporates Sixth Normal Form [6NF]) (a.k.a. Gisele Bundchen)
History: Created in 2004 in Sweden
Basic Concepts: Mimics a temporal database using four core constructs (a table sketch follows the list):
  • anchors – entities or events
    • Example: A person
  • attributes – properties of anchors
    • Example: A person’s name (attributes can carry history, such as a favorite color that changes over time)
  • ties – relationships between anchors
    • Example: Siblings
  • knots – shared properties, such as states or reference tables – combination of an anchor and a single attribute (no history)
    • Example: Gender – only male/female
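As a rough illustration of how the four constructs can translate into tables (again with sqlite3; the naming convention and columns are my own assumptions, not the official Anchor Modeling notation):

```python
import sqlite3

# Hypothetical anchor-model tables for the "person" example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Anchor: one row per person, identity only.
CREATE TABLE PE_Person (
    PE_ID INTEGER PRIMARY KEY
);

-- Attribute: historized property of the anchor (valid_from carries history).
CREATE TABLE PE_NAM_Person_Name (
    PE_ID      INTEGER NOT NULL REFERENCES PE_Person(PE_ID),
    name       TEXT NOT NULL,
    valid_from TEXT NOT NULL,
    PRIMARY KEY (PE_ID, valid_from)
);

-- Knot: small shared value set, no history.
CREATE TABLE GEN_Gender (
    GEN_ID INTEGER PRIMARY KEY,
    label  TEXT NOT NULL UNIQUE
);

-- Knotted attribute: anchor plus a single knotted value.
CREATE TABLE PE_GEN_Person_Gender (
    PE_ID  INTEGER PRIMARY KEY REFERENCES PE_Person(PE_ID),
    GEN_ID INTEGER NOT NULL REFERENCES GEN_Gender(GEN_ID)
);

-- Tie: relationship between two anchors (siblings).
CREATE TABLE PE_PE_Sibling (
    PE_ID_1 INTEGER NOT NULL REFERENCES PE_Person(PE_ID),
    PE_ID_2 INTEGER NOT NULL REFERENCES PE_Person(PE_ID),
    PRIMARY KEY (PE_ID_1, PE_ID_2)
);
""")
conn.commit()
```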
Pros:
  • Incremental change approach – previous versions of a schema are always encompassed in new changes, so backwards compatibility is always preserved
  • Reduced storage requirements by using knots
Cons:
  • The model spawns many tables in the database
  • Joins become very complex, and the model is hard for end users to understand
  • Daunting for new technical staff to come up to speed initially
 
[Image: example Anchor Model – Source: bifuture.blogspot.com]
 
 
Data Vault (DV) (a.k.a. Heidi Klum)
History: Dan Linstedt started implementing data vaults in 1990 and published the first version of the methodology (DV1.0) in 2000. He published an updated version (DV2.0) in 2013. The methodology is proprietary: Dan restricts who can train others by maintaining a copyright on the methodology and requiring anyone who trains others to be Data Vault certified. You can still implement data vaults; you just cannot train others on the methodology without being certified.
Basic Concept:
“A single version of the facts (not a single version of the truth)”
“All the data, all the time” – Dan Linstedt
The data vault consists of three primary structures plus supporting structures such as reference tables and point-in-time (PIT) and bridge tables. The three main structures are (a sketch follows the list):
  1. Hubs – a list of unique business keys that change infrequently with no other descriptive attributes (except for meta data about load times and data sources). A good example of this is a car or a driver.
  2. Links – relationships or transactions between hubs. These only define the link between entities and can easily support many-to-many relationships; again no descriptive attributes on these tables other than a few meta-attributes. An example of this would be a link between cars and their drivers.
  3. Satellites – Satellites may attach to hubs or links and are descriptive attributes about the entity to which they connect. A satellite for a car hub could describe the year, make, model, current value, etc. These often have some sort of effective dating.
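Here is a rough sketch of those three structures for the car-and-driver example, again using sqlite3; the column names and load metadata shown are illustrative assumptions rather than a canonical Data Vault layout.

```python
import sqlite3

# Hypothetical hub/link/satellite tables for the car-and-driver example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: unique business keys plus load metadata only.
CREATE TABLE hub_car (
    car_hk     TEXT PRIMARY KEY,   -- hash of the business key (DV2.0 style)
    vin        TEXT NOT NULL UNIQUE,
    load_dts   TEXT NOT NULL,
    record_src TEXT NOT NULL
);

CREATE TABLE hub_driver (
    driver_hk  TEXT PRIMARY KEY,
    license_no TEXT NOT NULL UNIQUE,
    load_dts   TEXT NOT NULL,
    record_src TEXT NOT NULL
);

-- Link: relationship between hubs, no descriptive attributes.
CREATE TABLE link_car_driver (
    car_driver_hk TEXT PRIMARY KEY,
    car_hk        TEXT NOT NULL REFERENCES hub_car(car_hk),
    driver_hk     TEXT NOT NULL REFERENCES hub_driver(driver_hk),
    load_dts      TEXT NOT NULL,
    record_src    TEXT NOT NULL
);

-- Satellite: descriptive, effective-dated attributes hanging off the hub.
CREATE TABLE sat_car_details (
    car_hk     TEXT NOT NULL REFERENCES hub_car(car_hk),
    load_dts   TEXT NOT NULL,
    make       TEXT,
    model      TEXT,
    model_year INTEGER,
    PRIMARY KEY (car_hk, load_dts)
);
""")
conn.commit()
```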
General best practices:
  • Separate attributes from different source systems into their own satellites, at least in a raw data vault. With this approach it is common to have a raw data vault that retains source-system-specific information with all history and attributes, plus a second, downstream business data vault that contains only the relevant attributes, history, or merged data sets that have meaning to the users of that vault.
    • Having a raw vault lets you preserve all historical data and rebuild the business vault if needs change, without going back to source systems and without losing data that is no longer available in the source system.
  • Track all changes to all elements so that your data vault contains a complete history of all changes.
  • Start small with a few sources and grow over time. You don’t have to adopt a big bang approach and you can derive value quickly.
  • It is acceptable to add new satellites when changes occur in the source system. This allows you to iteratively develop your ETL without breaking previous ETL routines already created and tested.
DV2.0 – DV1.0 was merely the model. DV2.0 is:
  • An updated modeling approach. Key changes include:
    • Numeric IDs are replaced with hash values, created in the staging area, that support better integration with NoSQL repositories
    • Because hashes are used, data loads can be parallelized even further: you do not have to look up a surrogate ID if you have the business key to hash from when bringing in data, which means hubs, links, and satellites can in some cases be loaded at the same time (see the hashing sketch after this list)
    • Referential integrity is disabled during loading
  • Recommended architectures around staging areas, marts, virtualization, and NoSQL
  • Additional methodology recommendations around Agile, Six Sigma, CMMI, TQM, etc.
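To illustrate the hash-key idea, here is a small Python sketch. The normalization rules (trim, upper-case, delimiter) and the choice of MD5 are my assumptions; DV2.0 itself leaves room for different hash functions. The point is that the same business key always produces the same hash, so hubs, links, and satellites can be loaded in parallel without surrogate-ID lookups.

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic hash key from one or more business keys.

    Keys are trimmed, upper-cased, and joined with a delimiter before
    hashing, so the same logical key always produces the same hash no
    matter which load process computes it.
    """
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hypothetical keys: the hub row for a car and the link row joining it to a
# driver can be built independently -- no numeric surrogate-ID lookup needed.
car_hk = hash_key("1HGCM82633A004352")                            # hub_car key from the VIN
driver_hk = hash_key("D123-4567-8901")                            # hub_driver key from the license
car_driver_hk = hash_key("1HGCM82633A004352", "D123-4567-8901")   # link key

print(car_hk, driver_hk, car_driver_hk)
```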
Pros:
  • Preserves all data, all the time – this provides the capability for tremendous analysis and responding to changing business needs. The approach allows you to obtain data from multiple sources iteratively and rapidly, preserving backwards compatibility
  • Works extremely well with massively parallel processing (MPP) databases and hardware
  • Can be loaded extremely rapidly, particularly using the DV2.0 modeling approach
  • Lends itself very well to ETL and DW automation/virtualization
  • DV2.0 covers a wide spectrum of modeling needs from staging and marts to methodology
Cons:
  • The data model can spawn a lot of tables and make queries very complicated very quickly.
  • The raw data vault is really not meant for end users to query/explore directly
  • Iterative additions make the data model more complicated
  • Although storage may be cheap, keeping all changes for all data in all sources can lead to data sprawl. This also makes a pared down information mart almost a necessity.
  • Raw DV data is not cleansed, and data from multiple sources is not blended when stored
[Image: example Data Vault model]
Data Lake (DL) (a.k.a. Brooklyn Decker)
History: Term was coined by Pentaho CTO James Dixon in a blog post in 2010 referring to Pentaho’s data architecture approach to storing data in Hadoop.
Basic Concept: A massive big data repository, typically built on Hadoop/HDFS. Key points (a schema-on-read sketch follows the list):
  1. Schema-less – data is written to the lake in its raw form without cleansing
  2. Ingests different types of data (relational, event-based, documents, etc.) in batch and/or real-time streaming
  3. Automated meta data management – a best practice is to use tools to automatically catalog meta data to track available attributes, last access times, data lineage, and data quality
  4. Typically multiple products are used to load data into and read data from the lake
  5. Rapid ability to ingest new data sources
  6. Typically only a destination; operational systems do not usually source data from the lake
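As a toy illustration of schema-on-read (my own example; a real lake would more likely use Spark, Hive, or similar rather than plain Python), raw events land untouched and each consumer applies whatever structure it needs at read time:

```python
import json

# Raw events landed in the lake as-is -- no cleansing, no enforced schema.
raw_events = [
    '{"user": "a", "action": "click", "ts": "2016-01-01T10:00:00"}',
    '{"user": "b", "page": "/home", "duration_ms": 1234}',
    'not even valid JSON -- the lake stores it anyway',
]

# Schema on read: this consumer projects only the fields it cares about,
# skipping records that do not fit its particular view of the data.
clicks = []
for line in raw_events:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # another consumer might still want this record
    if event.get("action") == "click":
        clicks.append({"user": event["user"], "ts": event["ts"]})

print(clicks)  # [{'user': 'a', 'ts': '2016-01-01T10:00:00'}]
```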
Pros:
  • Useful when you do not know what attributes will be needed or used.
  • Schema on Read – can ingest any type of data and allow different users to assess value during analysis
  • Extremely large scale at low to moderate cost
  • Can and will use a variety of tools/technologies to analyze/visualize/massage data into a useful form
Cons:
  • Can be seen as a vast wasteland of disorganized data, particularly without good meta data
  • Consumers must understand raw data in various systems to know how to integrate and cleanse it in order to derive meaningful information
  • High likelihood that different consumers will perform very similar operations to retrieve data (i.e., overlap and duplication of effort); slight differences between groups can also force time-consuming reconciliation
  • Uncleansed data and multiple versions of the same data may possibly lead to duplication if not handled/filtered carefully
  • It isn’t SQL – Some users will have to use more than just SQL to derive useful information from data
    • Offloading ETL can require significant rework of existing processes to move to something like Hive
  • Using multiple tool sets can lead to training and supportability challenges if not governed properly
  • Data curation can be very challenging
[Image: example Data Lake]
 
Data Swamp (DS) (a.k.a. Tyra Banks)
History: I’m not including a lot of history here, because this is really an extension of a Data Lake (gone bad).
Basic Concept: A data swamp is a data lake that has been poorly maintained or documented, lacks meta data, or has so much raw data that you don’t know where to start for insights. Or, it could be a combination of several of those points. When you start tracking tons of data from all different sources, but you don’t know who is using what, how to merge data sets, or how to use most of the data in your “data lake”, you’ve really got a data swamp.
Pros:
  • Hey, you must’ve done something right to get all that data into the repository…?
  • At least you haven’t lost data that you can’t go back and get.
  • If it were easy, everyone would be doing it 🙂
Cons:
  • You’ve likely spent a lot of time and effort putting in a data lake/HDFS/Hadoop/Hive/etc. and you’re struggling to operate it at scale or to answer the questions you set out to answer.
  • You need meta data to clue users into what is most useful, relevant, or recent
  • You probably need to look into key use cases (low hanging fruit) and start from that point as a place to begin using/resuscitating your repository.
*** The assignment of model names to each data model was an incredibly (un)scientific process of googling various terms like “most famous supermodel <year>”, “<year> top supermodel”, etc. and teasing out the most likely #1. Feel free to disagree and let me know your vote and how you obtained it.