For the Love of Data
https://www.fortheloveofdata.com/feed/podcast/

We love data and how it intersects with news, products, technologies, and companies. Listen to our podcast and join the discussion to stay informed on the latest and greatest in the world of BI and analytics.

For the Love of Data is a monthly podcast devoted to all things data, from industry news and new products to cool data visualizations. Host Robert Furr and others hold discussions, interviews, reviews, and arguments to determine where the information technology industry is heading, with an emphasis on Business Intelligence (BI), Information Management (IM), and data analytics. Data science, analytics, strategy, and governance are just a few of the topics on the table. SQL, NoSQL, Tableau, R, Oracle, MySQL, SQL Server... these are just a few of the many tools we will noodle on during each episode.

E035 – Your Data and an Announcement (Mon, 03 Dec 2018 – 19:28)
https://www.fortheloveofdata.com/e35/

Announcement:
  • Celebrating FTLOD’s 3 year anniversary this month
  • Covered a diverse range of topics, from BBQ and chocolate to algorithms and graph databases
  • Future episodes will be more ad hoc, published whenever I come across an interesting topic
  • Please stay subscribed
  • Please reach out on Twitter or LinkedIn to let me know what your favorite episode has been

 

The Importance of Your Data

  • Quotes:
    • “With great power comes great responsibility.” – Amazing Fantasy #15
  • Data Commercialization
    • Ford’s CEO recently suggested that the data collected by the company’s financial services arm also represents a valuable, low-overhead asset. [1]
    • Not just driving data, but also data gathered during the purchase process, such as marital status, income, etc.
    • However, in desperation to maintain profits, what would some companies do?
    • Know how your data is being used.
    • Tim Cook recently criticized Google, Facebook, and others (not by name) for creating a “data industrial complex” in which our personal information “is being weaponized against us with military efficiency.” [2]
    • Talked about the echo chamber that social networks and algorithms can create
    • However, this is not all data doomsday
      • Data is helping us achieve better, deeper, faster insights than ever before
      • We are bettering our health, optimizing economies, and identifying connections that we never could have before
      • All this reward comes with some risks that we need to manage and be aware of
  • Data Breaches
    • Marriott disclosed a 500MM-record breach. Not the biggest ever, but hackers had access since 2014.
    • Names, phone numbers, email addresses, passport numbers, dates of birth, and arrival and departure information were exposed. For millions of others, credit card numbers and card expiration dates were potentially compromised. [3]
  • What to do to protect yourself if your data is part of a breach: [4]
    • Sign up for services like SpyCloud (it is free)
    • Change your password – and ideally switch to unique passphrases
    • Monitor your accounts for suspicious activity
    • Open a separate credit card for online transactions
    • Limit the information you share
    • Avoid saving credit card information on websites
    • Be vigilant

Music:

Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

Sources:

  1. https://threatpost.com/ford-eyes-use-of-customers-personal-data-to-boost-profits/139209/
  2. https://www.nytimes.com/2018/10/26/technology/apple-time-cook-europe.html
  3. https://www.cnn.com/2018/11/30/tech/marriott-hotels-hacked/index.html
  4. https://www.cnn.com/2018/11/30/tech/marriott-breach-what-to-do/index.html
  5. https://answers.kroll.com/

 

E034 – Using Data to Make Perfect Chocolate – Part 2 (Wed, 31 Oct 2018 – 31:11)
https://www.fortheloveofdata.com/e34/

In the second part of this two-part episode, we do a data deep dive into a decadent vat of chocolate. We talk about various stats and data with Brian Mikiten, former process engineer and founder of Casa Chocolates in San Antonio, TX. We also cover the types of chocolate and how much of chocolate making is an art vs. a science. See part one for the history of chocolate and an overview of how to make it.

  • Types
    • White
    • Dark
    • Milk
    • Ruby – created in 2017 from Ruby cocoa beans by Barry Callebaut in Switzerland

      Photo via bakemag.com
  • Chocolate data
    • World Chocolate Day is July 7th. US National Chocolate Day is October 28.
    • Infographic: The World's Biggest Chocolate Consumers | Statista
    • The United States accounts for 20% of the world’s chocolate consumption.
    • On the average Valentine’s Day, nearly $400 million of chocolate is purchased around the world, accounting for 5% of the industry’s total sales.
    • 22% of all chocolate is consumed between 8 p.m. and midnight.
    • Chocolate significantly reduces theta activity in the brain, which is associated with relaxation, which may be why we want to eat chocolate when we’re feeling stressed out.
    • Myth: Chocolate is high in caffeine (contains ~6mg/bar, same as decaf coffee)
    • More than 70% of Americans prefer milk chocolate
    • In 2011, Thorntons created the world’s largest chocolate bar, which weighed in at 12,770 lbs. It measured 13 ft. by 13 ft. by 1 ft.
    • Top companies by sales (via https://www.icco.org)
    • $ / ton by date
    • Top 10 World Cocoa Producers
  • Science / data driven production of chocolate
    • Equipment used
    • Variables evaluated / controlled
    • What is your test process?
  • Science vs. Art of chocolate making
  • Bean profiles
  • Brian’s background and history of Casa Chocolate
  • What Casa Chocolate’s approach is to making chocolate
  • Tips for getting started at home
  • Where people can find out more about Brian and Casa Chocolate

Music:

Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

Sources

E033 – Using Data to Make Perfect Chocolate – Part 1 (Wed, 24 Oct 2018 – 1:03:01)
https://www.fortheloveofdata.com/e33/

In the first part of this two-part episode, we do a data deep dive into a decadent vat of chocolate. We talk about history and how to make chocolate. In part two, we will talk about various stats and data with Brian Mikiten, former process engineer and founder of Casa Chocolates in San Antonio, TX.

  • History of chocolate
    • Evidence dates back as early as 1500 BC
    • Fermented beverages date back to 350BC
    • Believed to have originated with Mesoamericans
    • Made its way to Europe, where sugar was added in the 16th century
    • In 1828, Dutch chemist Coenraad Johannes van Houten used alkaline salts to process into “Dutch cocoa”
    • 1847 – J.S. Fry and Sons created the first chocolate bar
    • 1876 – Swiss chocolatier Daniel Peter added milk powder to create milk chocolate
    • 2/3 of cocoa today is produced in Western Africa
    • Fair trade certification indicates that the chocolate was produced without child or slave labor
  • Overview of chocolate making
    • Harvesting – pods contain ~40 cacao beans

      Photo via TripAdvisor.com
    • Roasting


    • Cracking
    • Winnowing
    • Grinding
    • Conching
    • Tempering
    • Molding

Music:

Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

Sources

E032 – 2018 State of DevOps Report (Sun, 30 Sep 2018 – 45:52)
https://www.fortheloveofdata.com/e32/
  • Intro
  • Greg’s Background
  • Intro to DevOps
  • Tools you’ve used
  • Intro to the report & this year vs. previous years
    • Feels more general and high-level than some of the previous reports (no MTTR mentioned for instance)
  • Who took the survey?
    • Surveyed over 30,000 people in 7 years (~4,300 / yr)
    • Technology is overrepresented – 38% of total respondents
    • Energy & Resources was only 2%
    • Tech + FS = 50%
    • Infosec = only 3% of people
    • 29% were dedicated DevOps (14% IT, 15% Dev/Eng)
  • Keywords

    • Word frequencies in the report: “data” = 43, “security” = 65, “agile” = 7, “DevOps” = 328
    • Top 10 words in the word cloud (after removing the “Puppet | State of DevOps” footer on each page):
      • DevOps (245)
      • teams (210)
      • Stage (149)
      • practices (123)
      • can (111)
      • team (99)
      • organizations (79)
      • services (75)
      • business (73)
      • success (69)
    • No DataOps, no SecOps or DevSecOps
  • C-suite seems out of touch with conditions on the ground
    • Differences in perception – p. 30
    • C-suite responses sometimes overstate the team’s view by a factor of 2x
  • Stages in the report:

    • First
      • Stage 0: Build the foundation
    • Second
      • Stage 1: Normalize the technology stack
      • Stage 2: Standardize and reduce variability
      • Stage 3: Expand DevOps practices
    • Third
      • Stage 4: Automate infrastructure delivery
      • Stage 5: Provide self-service capabilities
  • CAMS = Culture, Automation, Measurement, Sharing
  • Principal Industries:
    • Top: Tech, Financial Services, Manufacturing/Industry
    • Bottom: Non-Profit, Energy/Resources, Media
    • Trend: Most to least competition?
  • Music:

    Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

    Sources:

E031 – Data Collaboration with Cursor (Thu, 30 Aug 2018 – 40:28)
https://www.fortheloveofdata.com/e31/

Learn about Cursor, a new platform for collaboration around data, hosted platforms and BI artifacts. I sat down with Adam Weinstein, CEO and Co-Founder of Cursor, to learn about the platform.

    About Cursor

    Cursor offers a data search and analytics hub that makes disparate data accessible and actionable, enabling technical and business users alike to effortlessly get answers, collaborate and gain insights. Founded by a trio of data leaders from Salesforce, LinkedIn, and Pandora, Cursor’s easy-to-deploy software has been adopted by teams at Apple, Atlassian, Deloitte, Incedo, LinkedIn, NovumRx, and Slack. Cursor is based in San Francisco, CA.

    Cursor Press

    Topics:

    1. What is Adam’s background?
    2. How BI has evolved over the past 10-20 years.
    3. What are some of the most pressing challenges for organizations today?
    4. What should people be doing today, outside of a specific tool, to get better at collaborating?
    5. How can Cursor help with those challenges?
    6. How is content secured on the platform? (separating data from metadata)
    7. Where can people find out more about Cursor?
    8. What’s next for Cursor as far as features or a roadmap?
    9. What are some tools that Adam can’t live without in his daily work?

    Music:

    Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

E030 – July 2018 News Roundup (Tue, 31 Jul 2018 – 15:37)
https://www.fortheloveofdata.com/e30/

    This month’s episode is a roundup of news from a variety of sources covering three main topics:

    1. BI / Dataviz Tools
    2. Databases and Platforms
    3. Tools and Frameworks

    Note: Most of the text extracts below are direct quotations from news sources cited in the source list at the bottom of these show notes. This episode is a compilation from those sources.

    BI / Dataviz Tools

    PowerBI enhancements (7/12/18)

    • Microsoft has updated its Power BI analytics service in an effort to expand data prep capabilities and unify data analytics across platforms.
    • “Using the Power Query experience familiar to millions of Power BI Desktop and Excel users, business analysts can ingest, transform, integrate and enrich big data directly in the Power BI web service – including data from a large and growing set of supported on-premises and cloud-based data sources, such as Dynamics 365, Salesforce, Azure SQL Data Warehouse, Excel and SharePoint,” the post reads.
    • Power BI now supports data in Azure Data Lake Storage, and integrates with SQL Server Analysis Services and SQL Server Reporting Services.
    • Microsoft today announced the general availability of Visio Visual for Power BI. Based on the feedback collected from the customers during the preview period, Microsoft has made the following changes to the Visio Visual:
      • Support for Power BI Mobile app
      • The ability to change the diagram link embedded earlier and to copy an embedded link to the clipboard
      • Configurable auto-zoom settings that can be turned on and off
      • Support for complex diagrams using layers
      • Overall performance improvements

    Tableau acquires Empirical Systems

    • Tableau last month announced the acquisition of Empirical Systems, an artificial intelligence (AI) startup with an automated discovery and analysis engine designed to spot influencers, key drivers, and exceptions in data.

    Looker Enhances Data Science Capability with Integration for Google Cloud BigQuery ML

    • With Looker and BQML, data teams can now save time and eliminate unnecessary processes by creating machine learning (ML) models directly in Google BigQuery via Looker – without the need to transfer data into additional ML tools. BQML predictive functionality will also be integrated into new or existing Looker Blocks allowing users to surface predictive measures in dashboards and applications.

    DBs and Platforms

    MemSQL Unveils Significant Update to Database for Real-time Modern Applications and Analytical Systems (Version 6.5 released)

    • Queries are now up to four times faster than the previous MemSQL version (which was already 10x faster than legacy database providers), enabling insights in milliseconds across billions of rows.
    • New automated workload optimization capabilities provide a consistent database response under ultra-high concurrency without the need for manual tuning or specialized DBA resources.
    • Additions to the MemSQL industry-leading “transform-as-you-ingest” capabilities allow customers to use stored procedures for in-database transformations to easily build real-time data pipelines.
    • Resource optimization improvements for multi-tenant deployments deliver greater control and scalability for varied database sizes whether on-premises or in the cloud.

    Hortonworks Data Platform 3.0

    • Even a Hadoop stalwart such as Hortonworks Inc. sees the writing on the wall, which is why, in its recent 3.0 release, it emphasized heterogeneous object storage. The new Hortonworks Data Platform 3.0 supports data storage in all of the major public-cloud object stores, including Amazon S3, Azure Storage Blob, Azure Data Lake, Google Cloud Storage and AWS Elastic MapReduce File System.
    • HDP’s latest storage enhancements include a consistency layer, NameNode enhancements to support scale-out persistence of billions of files with lower storage overhead, and storage-efficiency enhancements such as support for erasure coding across heterogeneous volumes. HDP workloads access non-HDFS cloud storage environments via the Hadoop Compatible File System API.
    • My thoughts: Are Hadoop and HDFS Dying???
    • As we are heading into the fourth industrial revolution, HDP 3.0 is a giant leap for the Big Data ecosystem, with major changes across the stack and expanded eco-system (Deep Learning and 3rd Party Dockerized Apps). HDP 3.0 can be deployed both on-premise and in the major cloud platforms – AWS, Microsoft Azure, and Google Cloud. Many of the HDP 3.0 new features are based on Apache Hadoop 3.1 and include containerization, GPU support, Erasure Coding and Namenode Federation. In order to provide a Trusted Data Lake, we are installing Apache Ranger and Apache Atlas by default with HDP 3.0. In order to streamline the stack, we have removed components such as Apache Falcon, Apache Mahout, Apache Flume, and Apache Hue, and absorbed Apache Slider functionalities into Apache YARN.

    Tools and Frameworks

    Python 3.7.0 is now available

    • Data classes that reduce boilerplate when working with data in classes.
    • A potentially backward-incompatible change involving the handling of exceptions in generators.
    • A “development mode” for the interpreter.
    • Nanosecond-resolution time objects.
    • UTF-8 mode that uses UTF-8 encoding by default in the environment.
    • A new built-in for triggering the debugger.
    • Easier access to debuggers through a new breakpoint() built-in
    • Simple class creation using data classes
    • Customized access to module attributes
    • Improved support for type hinting
    • Higher precision timing functions
    • More importantly, Python 3.7 is fast.
      • Each new release of Python comes with a set of optimizations. In Python 3.7, there are some significant speed-ups, including:
        • There is less overhead in calling many methods in the standard library.
        • Method calls are up to 20% faster in general.
        • The startup time of Python itself is reduced by 10-30%.
        • Importing typing is 7 times faster.
    • You can easily get an idea of how much time the imports in your script take using -X importtime; a rough sketch of what that looks like follows:
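    A minimal sketch of the idea (mine, not from the show notes; the json module below is just a placeholder):

      # The -X importtime flag (new in Python 3.7) prints a per-module
      # import-time breakdown to stderr, e.g.:
      #
      #     python3 -X importtime -c "import json"
      #
      # A rough in-script equivalent for timing a single import with the stdlib:
      import time

      start = time.perf_counter()
      import json  # placeholder: swap in the module you care about

      print(f"importing json took {time.perf_counter() - start:.4f} s")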

    Apache OpenNLP 1.9.0 released

    • The Apache OpenNLP team is pleased to announce the release of Apache OpenNLP 1.9.0.
    • The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
    • It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
    • Apache OpenNLP 1.9.0 binary and source distributions are available for download from our download page
    • The OpenNLP library is distributed via Maven Central as well. See the Maven Dependency page for more details
    • What’s new in Apache OpenNLP 1.9.0
      • This release introduces new features, improvements and bug fixes. Java 1.8 and Maven 3.3.9 are required.
      • Additionally the release contains the following changes:
        • Brat Document Parser should support name type filters
        • Brat format support fails on multi fragment annotations
        • Remove MD5 hashes from Release process
        • Use String[] instead of StringList in LanguageModel API
        • BRAT Annotator service Fails to start
        • Token model creation fails without at least one <SPLIT> tag
        • Update Penn Treebank URL
        • Explain the new format of feature generator XML config
        • Unify code to sum up input context features
        • FeatureGeneratorUtil can recognize Japanese Hiragana and Katakana letters

    TensorFlow 1.9.0

     

    PYPL Language Rankings: Python ranks #1, R at #7 in popularity

    Music:

    Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

    Sources:

E29 – Is Data The New Oil? (Fri, 29 Jun 2018 – 59:55)
https://www.fortheloveofdata.com/e29/

Is Data the New Oil?
    • Concept originated by Clive Humby, the British mathematician who established Tesco’s Clubcard loyalty program. Humby highlighted the fact that, although inherently valuable, data needs processing, just as oil needs refining before its true value can be unlocked.
    • Why it is the new oil
      • Valuable commodity
      • Different uses among many applications
      • Currently the big buzz of most large companies (Google, Facebook, Apple, etc.)
      • Quantity is generally better in both
      • AI is the darling of so many industries right now, and it is entirely dependent on data
      • There are ethical concerns with how we source and use data, just like there were and are geopolitical and ethical concerns with how we source and use oil
      • Certain things cannot function (currently) without oil (passenger airplanes, boats)
        • Same with data: Oil & Gas, Netflix, Agriculture, Manufacturing, Healthcare, a general enabler
    • Why it isn’t the new oil
      • Oil is finite, but data is not
        • Rob’s Counterpoint: There is a shelf life on data that makes it less usable over time
      • Data does not have a standard price benchmark like oil
      • Not a physical asset; can be duplicated or shared relatively easily
      • Oil requires huge amounts of resources to recover and transport
        • Rob’s Counterpoint: building a successful “app” with the scale to generate meaningful data does have some costs, albeit not the scale of oil
      • Data is more useful the more that it is used, whereas oil loses energy the more it is used/processed
        • Rob’s Counterpoint: Oil is not useful by itself to most people; it’s really the product oil becomes or enables that is useful

    The Data of Oil

    • Difference between operating on the surface vs. subsea when a small tubing error occurs:
      • Surface: 2-3 hours of downtime; a few thousand dollars to fix
      • Subsea: 3 months of downtime and $40-50mm to fix, not including lost revenue due to deferred production (e.g., a 15,000 bpd well * $67/barrel * 90 days = $90.45mm)
    • A good-sized offshore platform generates revenue greater than the GDP of the entire country of Belize ($2.3bn vs. $1.8bn)
    • Of all the oil we can find, we generally only recover 10-20% in a field with current technology
    • 45-50% of oil generated in the US is used for transportation
    • US consumption is about 2½ gallons of crude oil per person per day
    • The U.S. has 4% of the world’s population but uses 25% of the world’s oil
    • Total oil consumption around the world is about 84,249,000 barrels per day
    • Top 3 countries by proven oil reserves are: Venezuela, Saudi Arabia, Canada; US is #10
    • Gas is 12,200 Wh/kg vs. Li-Ion at 265 Wh/kg (~46x more energy dense)
    • MTTF (Mean time to failure) – 500 years on some parts – needed to operate in subsea environments for 30 years
    • Force of 20,000 PSI of subsea pressure on a 10.5-inch dinner plate (a quick arithmetic check appears below):
      • Force = area * pressure = Pi * R^2 * 20,000 PSI = ¼ * Pi * D^2 * 20,000 PSI
      • 0.25 * Pi * 10.5 * 10.5 * 20,000 ≈ 1,731,803
      • ≈ 1,731,803 pounds on a single dinner plate (roughly the weight of nine 737 jets)
    • Length Records
      • Analogy: Standing on top of the Empire State Building in NYC and trying to put a straw in a coke can sitting on the sidewalk below
      • Deepest well (scientific study) = Kola Superdeep Borehole = 40,230 ft.
      • Chayvo Well (Sakhalin-I project) – the current world record holder for the longest well: a depth of 44,291 ft. with a horizontal reach of 39,478 ft.
      • Deepwater Horizon – drilled the deepest oil well in history, to 35,050 ft. vertical depth

      Chart: Well depth by year
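    A quick check of a few of the back-of-the-envelope numbers above, using only the figures quoted in the episode (my arithmetic, not from the show):

      import math

      # Deferred production for a subsea failure: 15,000 bpd * $67/barrel * 90 days.
      deferred_usd = 15_000 * 67 * 90
      print(f"deferred revenue ≈ ${deferred_usd / 1e6:.2f}mm")  # ≈ $90.45mm

      # 20,000 PSI of subsea pressure acting on a 10.5-inch dinner plate.
      diameter_in = 10.5
      area_sq_in = 0.25 * math.pi * diameter_in ** 2
      force_lbs = area_sq_in * 20_000
      print(f"force on the plate ≈ {force_lbs:,.0f} lbs")  # ≈ 1,731,803 lbs

      # Energy density ratio: gasoline at 12,200 Wh/kg vs. Li-ion at 265 Wh/kg.
      print(f"energy density ratio ≈ {12_200 / 265:.0f}x")  # ≈ 46x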

    Music:

    Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

    Sources:

    Is Data the New Oil?

    The Data of Oil Sources:

    Other Data is the New Oil Sources:

    Other Oil Sources:

E028 – Bimodal BI and Data Virtualization (Sun, 27 May 2018 – 48:27)
https://www.fortheloveofdata.com/e28/

Today we’re back with another guest from the Netherlands. I’m not sure what it is about the Dutch, but they’ve been on a roll with some helpful thought leadership when it comes to data. My guest is Rick van der Lans, a highly-respected analyst, consultant, author, and international lecturer specializing in data warehousing, business intelligence, big data, and database technology.

    I came across one of Rick’s whitepapers a few months ago on data virtualization. We got in touch and sat down to talk more in depth about the topic. Rick has a lot of data street cred. For many years, he has served as the chairman of the annual European Enterprise Data and Business Intelligence Conference in London and the annual Data Warehousing and Business Intelligence Summit in The Netherlands. He has written tons of articles, blogs, and several books, including the first book on SQL. There will be links to some of the places and things Rick has written and other info in the show notes below.

    Topics:

    • Rick’s background: author, blogger, consultant – worked on data virtualization (DV) for last 6-7 years
    • How did Rick get interested in DV?
    • Classical data warehouses vs. logical data warehouses
    • What is bi-modal BI? (term introduced by Gartner in 2014)
      • Agile/Self-Service vs. longer, more cautious approach
    • Bi-modal BI vs. the Data Quadrant
    • Comparison of major Data Virtualization Vendors
      • Denodo
      • Tibco DV Manager (bought from Cisco recently)
      • Red Hat
      • Data Virtuality Ultrarep
      • Others (AtScale, Cero, StoneBond, IBM – new entry acquired from Rocket Software)
      • Some are more mature, some are newer (Denodo vs. Tibco = green apples vs. red apples)
    • Companies rolling their own DV (in-memory / views vs. a dedicated tool)
    • DV products are not DB views on steroids
    • Lineage / impact analysis and other features
    • Caching vs. materialization – cached data can be stored in a virtual table in an intermediary data store. This can help performance or prevent interference with a transactional source (e.g., keeping results consistent for an entire week).
    • How DV can help organizations that are struggling
    • How DV may not be a silver bullet
    • How are different industries embracing these principles?
    • What patterns do you see in companies embracing these principles?
    • Which companies should not use this? DV is not great at this time for unstructured audio/video or auto-tagging of images
    • Why a classical DWH experienced person may fail at DV
    • What are the warning signs that a DV implementation is going off the rails?
      • Fuzzy logic needed to combine disparate sources
      • Not an integration cure-all
      • How you deploy these with projects
    • How to get started? (pick a single, sexy report as a starting point)
    • Where do you go next?  (how to unify other data delivery systems, data marketplaces, API gateways)
    • How to avoid misconceptions about DV (it is slow, only about integration, etc.)
    • How to contact Rick
    • The first book on SQL

    Places to find Rick’s work:

    He has published blogs for the following websites:

    He has written the following books:

    Music:

    Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

    Sources:

E027 – The Data Quadrant (Sat, 28 Apr 2018 – 55:26)
https://www.fortheloveofdata.com/e27/

My guest in today’s episode is Ronald Damhof (@ronalddamhof), the creator of the Data Quadrant. This quadrant is a sense-making framework in the complex world of data that enables a common frame of reference between managers, domain experts and engineers. The model is used by many organisations to formulate data strategy and justify investments in the data domain. It is used as the strategic underpinning for a data architecture, it guides the ‘rules of the game’ and it separates the fundamental concerns in data. Furthermore, it explains how an organisation can toggle between the need to innovate with data and the need to deploy and use data at scale: repeatedly, safely, lawfully, with constant quality and robustness.

     

     

    Data Quadrant

    Topics:

    • Ronald’s background as a “data fundamentalist”
    • His concept of a full scale data architect
    • The push / pull point, from 1950s Toyota, applied to data
    • Development styles from systemic to opportunistic
    • Data Vault’s influence on the quadrant
    • Where data modelling (Q1) and data lakes (Q3) fit into the quadrants
    • Where should you start? Q1/Q2 or Q3/Q4
    • 90% of organizations in the Netherlands are using Data Vault

    General recommendations on tools by quadrant:

    • Q1
      • Automation – Wherescape or custom
      • Federalization – mainly still RDBMS
    • Q2
      • API’ing the data
      • Losing faith in datasets and data marts
    • Q3
      • Fast infra
      • Doesn’t believe Hadoop is a good fit for most orgs.
      • Likes fast analytical DBs like Vertica or MonetDB?
    • Q4
      • Open source
      • R, Python, Git, Dataiku
      • Abstraction layer away from code is helpful
      • Azure Platform

    Ronald Damhof’s background:

    • Primary degree in Economics
    • Certified Data Vault Grand Master
    • Data Architect at the Dutch Central Bank in the Netherlands

    Music

    Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

    Sources:

E026 – The Four Types of Automation (Fri, 30 Mar 2018 – 27:30)
https://www.fortheloveofdata.com/e26/


     

    Introductory Product Models:

    • Only partner implementations (BluePrism)
    • Limited Features (WorkFusion)
    • Customer Revenue Limited (UI Path)
    • Single License (Softomotive)

    Music

    Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

    Sources:

    1. https://irpaai.com/definition-and-benefits/
    2. https://www.edgeverve.com/wp-content/uploads/2017/02/forrester-wave-robotic-process-automation.pdf
    3. https://www.gartner.com/doc/reprints?id=1-3U26FK2&ct=170222&st=sb
    4. http://www.uipath.com/hubfs/News_photos/Forrester_Wave_RPA_Report.png?t=1522186102828
    5. http://images.abbyy.com/India/market_guide_for_robotic_pro_319864%20(002).pdf
    6. https://www.uipath.com/community
    7. https://www.workfusion.com/rpaexpress
    8. https://idm.net.au/article/0011800-which-rpa-software-should-i-use
E025 – The Hype of AI (Wed, 28 Feb 2018 – 1:04:47)
https://www.fortheloveofdata.com/e25/

Thank you my friend and fellow Capco cohort, Daragh Fitzpatrick, for joining me on this episode of FTLOD where we cut through the hype of AI to understand some of the key challenges and opportunities facing consumers and businesses alike when working with or alongside AI.

    Given that we’re talking about AI, I also have a twist for today’s interview–transcription! Today’s episode is transcribed here using machine learning from webASR, a free service provided through the University of Sheffield’s Machine Intelligence for Natural Interfaces (MINI).

    Note: The transcription is wonderful as a starting point and for a free service, but it does diverge from the actual conversation fairly significantly at times. Please listen to the episode as you read along.

    Topics:

    • Definition of AI and the singularity–should we be concerned?
    • What’s going on in the AI space?
    • Typical use cases in industry
    • RPA vs. AI and different use cases
    • Recommendation systems
    • Challenges in profiling users or customers
    • Ethical challenges and consequences of bad AI or black box AI
    • AI is like fire: it can be highly useful, but it can also be a weapon and burn you.
    • Perceptions of AI that are overhyped
    • Not every product or service needs AI to be good
    • At what point does intelligence begin?
    • The fourth industrial revolution and its impact on society
    • How to responsibly introduce life-altering AI
    • Will AI supplement our lives and give us a better quality of life, or will it make us do more, faster, stronger?
    • Advancements in how AI plays the game Dota

    Some of the items we discuss are available in the following places:

    1. https://blog.1871.com/the-1871-fintech-forum-a-discussion-around-the-reality-of-todays-automation-practices
    2. https://blog.1871.com/1871-fintech-forum-future-of-data-and-analytics
    3. https://samharris.org/podcasts/116-ai-racing-toward-brink/
    4. http://fortune.com/2018/02/20/nasdaq-delist-long-blockchain-bitcoin-iced-tea/
    5. https://en.wikipedia.org/wiki/Fourth_Industrial_Revolution

    Music

    Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

E024 – Will machine learning kill traditional database indexes? (Wed, 31 Jan 2018 – 23:45)
https://www.fortheloveofdata.com/e24/

In this episode my friend Vikas Popuri and I chat about Google’s paper comparing ML models to traditional DB indexes.

    Background:

    • Google used learned indexes (machine learning models) to access data and compared these to B-Tree, Hash, and Bloom filter indices (a rough sketch of the idea follows this list)
    • Trained a model using multiple stages where the earlier stages could approximate a location and later stages would work with a subset to improve accuracy. Each stage could choose a different model to advance the search further.
    • FYI, the diagram below looks like a decision tree, but it is not. Each stage/model could have different distributions and could repeat the model used above or below.

    • They achieved access time and space savings across the board, even without using GPUs or TPUs (Tensor Processing Units)
    • “Retraining the model” – the tests were performed on a static data set, so no retraining or index maintenance was required.
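    A rough single-stage sketch of the idea (my illustration, not the paper’s recursive multi-stage implementation): fit a model that maps a key to its approximate position in a sorted array, then search only within the model’s worst-case error bound.

      import bisect

      import numpy as np

      rng = np.random.default_rng(0)
      keys = np.sort(rng.integers(0, 10_000_000, size=100_000))
      positions = np.arange(len(keys))

      # "Train" the index: least-squares fit of position ≈ slope * key + intercept.
      slope, intercept = np.polyfit(keys, positions, deg=1)
      max_err = int(np.ceil(np.max(np.abs(slope * keys + intercept - positions))))

      def lookup(key: int):
          """Return an index i with keys[i] == key, or None if the key is absent."""
          guess = int(slope * key + intercept)
          lo = max(0, guess - max_err)
          hi = min(len(keys), guess + max_err + 1)
          # Binary search only inside the model's error window.
          i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)
          return i if i < len(keys) and keys[i] == key else None

      print(lookup(int(keys[1234])))  # position of an existing key

    The search window is the model’s maximum training error, which is what keeps lookups exact even though the model itself is only approximate.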

    Observations / Questions:

    • Used Tensorflow with Python as the front end — apparently a lot of initial overhead with this as a test stack.
    • B-Tree indexes to some extent are a model, especially if they don’t store every key and instead store the first key in a page.
    • The paper made some rudimentary assumptions, such as using a random hash function.
      • What if the data is not static? How long would it take to retrain the model vs. maintain an index?
      • What if data profiling caused you to index certain attributes and not others?
      • What are the best practices with this newer approach?
    • The power of being able to use different models at different stages is intriguing. You could also potentially maintain traditional indexes as a backup/failsafe, which would bound worst-case performance at that of a B-Tree.
    • Load times – The folks from Google commented that they could retrain a simple model on a 200M data set in “just [a] few seconds if implemented in C++”
    • Recursive question: do you need an optimizer to optimize the optimization path?
    • Room for improvement:
      • GPUs/TPUs
      • Incorporating common queries into the model to know what questions people are asking

    Music

    Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

    Sources:

E023 – 2017 Data Digest (Sat, 30 Dec 2017 – 23:57)
https://www.fortheloveofdata.com/e23/

This episode reflects on some of the hottest topics from 2017 and the impact their data has on our lives this year and into 2018.

    Cryptocurrency

    Many of these data points come from the sources listed at the bottom of these show notes.

    • Since the year began, the aggregate market cap of all cryptocurrencies combined has increased by more than 3,200% as of Dec. 18 (a quick check of what that implies follows the table below)
    • Bitcoin went through the roof, hitting an all-time high of 1 BTC = $19,891 on 12/17/2017.

    • BTC makes up 54% of the aggregate $589 billion market cap of all cryptocurrencies
    • The graphics-card hardware needs of miners have been a big reason why NVIDIA and Advanced Micro Devices have seen double-digit percentage surges in sales recently
    • Back on Dec. 10, CBOE Global Markets (NASDAQ:CBOE) became the first to introduce bitcoin futures trading, with CME Group (NASDAQ:CME) following a week later
    • 612 new cryptocurrencies began trading in 2017
    • Top 10 cryptocurrencies in 2017 as of 12/29 according to BitInfoCharts.com (pretty similar list on AtoZForex.com):
    Cryptocurrency     | Price (USD) | 12h / 7d change (USD)             | Price (BTC)  | 12h / 7d change (BTC) | First trade | Exchange volume 24h
    BTC (Bitcoin)      | $15,030.33  | +9.79% ($1,340) / +9.56% ($1,312) | 1 BTC        | +0% / +0%             | 2010-07-17  | 100,316.59 BTC; 1,250,728,823.58 USD
    XRP (Ripple)       | $1.40       | +11.79% ($0.15) / +28.92% ($0.31) | 0.000093 BTC | +1.82% / +17.67%      | 2014-08-14  | 462,239,606 XRP; 36,699.78 BTC; 551,610,001.98 USD
    ETH (Ethereum)     | $750.82     | +9.3% ($63.9) / +12.07% ($80.9)   | 0.05 BTC     | -0.45% / +2.29%       | 2014-09-30  | 784,632 ETH; 34,510.75 BTC; 518,708,138.27 USD
    BCH (Bitcoin Cash) | $2,571.87   | +8.04% ($191) / +6.17% ($149)     | 0.171 BTC    | -1.59% / -3.1%        | 2017-08-01  | 209,597 BCH; 33,824.21 BTC; 508,389,215 USD
    LTC (Litecoin)     | $255.88     | +11.54% ($26.5) / +0.89% ($2.26)  | 0.017 BTC    | +1.59% / -7.91%       | 2012-07-13  | 1,156,615 LTC; 18,070.26 BTC; 271,602,057.82 USD
    IOT (IOTA)         | $3.87       | +11.04% ($0.38) / +2.12% ($0.08)  | 0.00026 BTC  | +1.14% / -6.8%        | 2017-08-30  | 37,838,946 IOT; 9,288.14 BTC; 139,603,822.48 USD
    XMR (Monero)       | $366.67     | +7.1% ($24.3) / +9.42% ($31.6)    | 0.024 BTC    | -2.46% / -0.13%       | 2014-06-04  | 275,568 XMR; 6,354.54 BTC; 95,510,916.21 USD
    DASH (Dash)        | $1,126.09   | +10.81% ($110) / +2.5% ($27.5)    | 0.075 BTC    | +0.93% / -6.45%       | 2014-02-20  | 85,797 DASH; 6,073.26 BTC; 91,283,097.97 USD
    XVG (VERGE)        | $0.168      | +41.89% ($0.05) / +60.02% ($0.06) | 0.000011 BTC | +29.23% / +46.05%     | 2016-02-18  | 606,321,139 XVG; 5,940.19 BTC; 89,283,020.8 USD
    ICX (ICON)         | $5.73       | +4.78% ($0.26) / +183.99% ($3.72) | 0.00038 BTC  | -4.56% / +159.21%     | 2017-11-11  | 13,061,177 ICX; 5,072.57 BTC; 76,242,347.41 USD
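    Two quick sanity checks on the aggregate figures quoted above (my arithmetic; the market-cap snapshot is from Dec. 18 while the table prices are from Dec. 29, so they will not reconcile exactly):

      # Aggregate crypto market cap and BTC dominance, per the bullets above.
      total_market_cap_usd = 589e9
      btc_share = 0.54
      print(f"implied BTC market cap ≈ ${btc_share * total_market_cap_usd / 1e9:.0f}bn")  # ≈ $318bn

      # "Increased by more than 3,200%" means the total grew to roughly 33x its
      # starting value, which implies a starting market cap of about:
      growth_multiple = 1 + 32.0
      print(f"implied Jan 1 market cap ≈ ${total_market_cap_usd / growth_multiple / 1e9:.1f}bn")  # ≈ $17.8bn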

    Data Breaches

    1. Equifax – 9/7/2017 – 143mm US consumers affected
      1. Equifax’s market value plunged nearly $4bn in the aftermath
      2. https://www.equifaxsecurity2017.com/
    2. RNC Voter List – nearly every registered voter, ~200mm Americans
    3. Yahoo’s 2013 breach revelation – affected accounts went from 1bn to 3bn
    4. Uber – 57mm user accounts and drivers, paid to keep it under wraps
    5. 560mm Passwords – a massive list of 560mm credentials compiled into one database of breaches from at least 10 services

    You can check if your account is part of a compromise at have i been pwned or SpyCloud.

     

    World Affairs

    The World Bank has a fascinating article with 12 charts covering food assistance, climate change, education, nutrition, elections, energy and a tribute to Hans Rosling, who made us see the world in new ways with breathtaking visualizations.

    Other Data Tidbits

    • Most popular Instagram Post: Beyonce – https://www.instagram.com/p/BP-rXUGBPJa/
    • Most retweeted Twitter post: Carter’s quest for Wendy’s Chicken Nuggets – https://twitter.com/carterjwm/status/849813577770778624/photo/1
    • Oracle bought API management firm Apiary. Be on the lookout for how that evolves for the tool and for Oracle
    • RPA saw continued growth and implementations. Expect more in 2018.
    • Kubernetes is becoming the de facto standard for container management and was moved to “Adopt” on the ThoughtWorks Technology Radar. Expect it to continue to gain steam and start influencing data solutions more in 2018.

    Music:

    Auld Lang Syne by Fresh Nelly, from Free Music Archive.

    Sources:

    1. https://www.coindesk.com/price/
    2. https://www.investing.com/currencies/btc-usd-historical-data
    3. https://bitinfocharts.com/new-cryptocurrencies-2017.html
    4. https://atozforex.com/news/top-10-cryptocurrency-2017/
    5. https://www.fool.com/investing/2017/12/19/16-cryptocurrency-facts-you-should-know.aspx
    6. http://cryptocurrencyfacts.com/
    7. https://gizmodo.com/the-great-data-breach-disasters-of-2017-1821582178
    8. https://www.equifaxsecurity2017.com/
    9. http://clark.com/personal-finance-credit/equifax-data-breach-a-look-back-at-our-biggest-story-of-2017/
    10. http://beta.latimes.com/business/hiltzik/la-fi-hiltzik-equifax-breach-20170908-story.html
    11. https://haveibeenpwned.com/
    12. https://www.instagram.com/p/BP-rXUGBPJa/
    13. https://www.usnews.com/news/national-news/articles/2017-12-12/twitters-top-10-most-retweeted-tweets-of-2017
    14. http://www.worldbank.org/en/news/feature/2017/12/15/year-in-review-2017-in-12-charts
    15. https://www.youtube.com/watch?v=YpKbO6O3O3M
    16. https://www.informationweek.com/strategic-cio/digital-business/2017-year-in-review—exponential-automation/a/d-id/1330648?
    17. https://www.thoughtworks.com/radar/platforms/kubernetes

     

E022 – Tech Spec – Tableau Project Maestro Data Prep (Thu, 30 Nov 2017 – 22:59)
https://www.fortheloveofdata.com/e22/

Zip file of all the sample data, Maestro flows, and Tableau workbook I used to get a first impression: E022_maestro_demo_files.

    Screenshots

    Sample Flow from Tableau

    Field Selection

    Data Profiling

    Filters

    Join Clause

    Refresh / Run Flow

    File Output Options

    Pros:

    1. Has the clean, intuitive feel of Tableau. I did my hands-on test with no training or previous exposure
2. Lots of features for a first release – joins, unions, type conversion, calculated fields, data connectors, etc. (a rough pandas sketch of this kind of flow appears after the cons list below)
    3. Easy to click into any part of your flow and see data
    4. Ability to edit inline – much like tweaking an Excel pivot table
    5. Data profiling is a nice visual cue to begin working with data
    6. Ability to sort, filter, rename, add calculated fields anywhere along the way
    7. Great for quick and dirty data prep that you know is heading into Tableau for ad-hoc analysis

    Cons:

    1. Ability to sort, filter, rename, add calculated fields anywhere along the way – this can get messy for others to come behind you to maintain or see what is happening
    2. Reconciliation issues between reports will now be complicated by similar flows doing slightly different things
3. You have to remove header fields from Excel if you want Maestro to latch onto and display field names from the table. By default, it looks at the first row and gives generic names if column headings aren’t there (i.e., F1, F2, …)
    4. Can only have one flow open at any time
    5. Performance seems a tiny bit slow on my example with ~13,000 rows. Curious to see how it will perform against larger data sets, RDBMS, and big data connectors
    6. Only outputs to TDE or Hyper formats currently. No ability to save as CSV, XLSX, PDF, or write back to a data store
    7. Unable to source data from a TDE or Tableau Workbook
    8. No reuse of common transformations or logic across different flows
9. No community-generated content yet – since it is very new, you can’t Google for answers or YouTube videos. Established, mature ETL and data prep tools will continue to have a leg up on this front for a while.
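For anyone unfamiliar with this style of self-service data prep, here is a rough sketch of the same kind of flow (join, filter, calculated field) in pandas. The file and column names are hypothetical stand-ins loosely modeled on the EIA and Census sample data linked in the sources; it is meant to show the shape of the work, not Maestro’s internals.

# A rough pandas equivalent of a simple Maestro-style prep flow
# (join, filter, calculated field). File and column names here are
# hypothetical -- this only illustrates the kind of work the tool does.
import pandas as pd

plants = pd.read_excel("eia923_generation.xlsx")    # plant-level generation (needs an Excel engine like openpyxl)
states = pd.read_excel("state_population.xlsx")     # census population estimates

# Join the two sources on a shared key (Maestro's join step)
merged = plants.merge(states, on="state", how="inner")

# Filter rows (Maestro's filter step)
merged = merged[merged["net_generation_mwh"] > 0]

# Add a calculated field (Maestro's calculated field step)
merged["mwh_per_capita"] = merged["net_generation_mwh"] / merged["population"]

# Write the result out for use in Tableau (Maestro itself outputs TDE/Hyper)
merged.to_csv("prepped_output.csv", index=False)

The selling point of Maestro is that each of these steps becomes a visual node you can click into and profile, rather than a line of code.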

    Music

    Deep Sky Blue by Graphiqs Groove

    Sources:

    1. https://www.tableau.com/project-maestro
    2. https://prerelease.tableau.com/
    3. https://www.eia.gov/electricity/data/eia923/
    4. https://www2.census.gov/programs-surveys/popest/tables/2010-2016/state/totals/nst-est2016-01.xlsx
E021 – Data Deep Dive – Halloween Spending and Candy
https://www.fortheloveofdata.com/e21/ (Tue, 31 Oct 2017)

Just in time for Halloween this year, we take a look at the way people will spend their money on Candy and other goods during this spooky time.

    Spending

    People in the US are expected to spend $9.1 billion on Halloween this year, according to a study by the National Retail Federation.

    Several predictions about this year’s Halloween season include:

    • U.S. consumers are projected to drop $82.93 on average, up almost 12 percent from $74.34 last year.
• More than 171 million consumers are expected to take part in Halloween festivities.
    • Adults ages 18-34 are projected to spend on average $42.39, compared with $31.03 for all adults.

    According to the survey, consumers plan to spend:

    • $3.4 billion on costumes (purchased by 69 percent of Halloween shoppers),
    • $2.7 billion on candy (95 percent),
    • another $2.7 billion on decorations (72 percent)
    • and $410 million on greeting cards (37 percent).

    Among Halloween celebrants:

    • 71 percent plan to hand out candy,
    • 49 percent will decorate their home or yard,
    • 48 percent will wear costumes,
    • 46 percent will carve a pumpkin,
    • 35 percent will throw or attend a party,
    • 31 percent will take their children trick-or-treating,
    • 23 percent will visit a haunted house and 16 percent will dress pets in costumes.

    Top Costumes

    More than 3.7 million children plan to dress as their favorite action character or superhero, 2.9 million as Batman characters and another 2.9 million as their favorite princess while 2.2 million will dress as a cat, dog, monkey or other animal.

    Proving that Halloween isn’t just for kids, a record number of adults (48 percent) plan to dress in costume this year. More than 5.8 million adults plan to dress like a witch, 3.2 million as their favorite Batman character, 3 million as an animal (cat, dog, cow, etc.), and 2.8 million as a pirate.

    Pets won’t be left behind when it comes to dressing up for Halloween. Ten percent of pet lovers will dress their animal in a pumpkin costume, while 7 percent will dress their cat or dog as a hot dog and 4 percent as a dog, lion or pirate.

    Candy

    CandyStore.com released data from 10 years of bulk candy online sales that show favorite candies by state.

    STATE TOP CANDY POUNDS 2ND PLACE POUNDS 3RD PLACE POUNDS
    AL Candy Corn 55274 Hershey’s Mini Bars 54369 Tootsie Pops 42533
    AK Twix 4678 Blow Pops 4578 Kit Kat 3892
    AZ Snickers 904633 Hershey Kisses 817463 Hot Tamales 527843
    AR Jolly Ranchers 225990 Butterfinger 215897 Hot Tamales 89027
    CA M&M’s 1548990 Salt Water Taffy 1345782 Skittles 1034527
    CO Milky Way 5620 Twix 5478 Hershey Kisses 4087
    CT Almond Joy 2457 Milky Way 1985 M&M’s 1023
    DE Life Savers 20748 Skittles 18072 Candy Corn 10217
    FL Skittles 630938 Snickers 587385 Reese’s Cups 224637
    GA Swedish Fish 130647 Hershey Kisses 109672 Jolly Ranchers 55049
    HI Skittles 267872 Hershey Kisses 264728 Milky Way 139874
    ID Candy Corn 85903 Starburst 60826 Reese’s Cups 39847
    IL Sour Patch Kids 155782 Kit Kat 151786 Reese’s Cups 95627
    IN Hot Tamales 95092 Starburst 78920 Snickers 34589
    IA Reese’s Cups 58974 M&M’s 53982 Butterfinger 25782
    KS Reese’s Cups 231476 M&M’s 230082 Dubble Bubble Gum 159092
    KY Tootsie Pops 67829 3 Musketeers 60273 Reese’s Cups 30865
    LA Lemonheads 102833 Reese’s Cups 89738 Jolly Ranchers 45092
    ME Sour Patch Kids 58290 M&M’s 45938 Starburst 16782
    MD Milky Way 38782 Reese’s Cups 30748 Blow Pops 12093
    MA Sour Patch Kids 75638 Butterfinger 73892 Salt Water Taffy 45982
    MI Candy Corn 146782 Skittles 135982 Starburst 87740
    MN Tootsie Pops 195783 Skittles 194672 Almond Joy 98726
    MS 3 Musketeers 109783 Snickers 103993 Butterfinger 57829
    MO Milky Way 42739 Dubble Bubble Gum 34751 Butterfinger 24780
    MT Dubble Bubble Gum 24675 M&M’s 14673 Twix 13784
    NE Sour Patch Kids 106728 Salt Water Taffy 78624 M&M’s 23674
    NV Hershey Kisses 322884 Candy Corn 203746 Skittles 167837
    NH Snickers 63876 Starburst 62468 Salt Water Taffy 25987
    NJ Skittles 159324 Tootsie Pops 157893 M&M’s 110673
    NM Candy Corn 83562 Milky Way 65682 Jolly Ranchers 45721
    NY Sour Patch Kids 200008 Candy Corn 101292 Reese’s Cups 56776
    NC M&Ms 96110 Reese’s Cups 95763 Candy Corn 62308
    ND Hot Tamales 65782 Jolly Ranchers 61829 Candy Corn 51827
    OH Blow Pops 150324 M&M’s 146782 Starburst 105752
    OK Snickers 20938 Dubble Bubble Gum 10283 Butterfinger 8892
    OR Reese’s Cups 90826 M&M’s 67626 Tootsie Pops 42774
    PA M&M’s 290762 Skittles 281847 Hershey’s Mini Bars 150372
    RI Candy Corn 17862 M&M’s 13894 Twix 9003
    SC Candy Corn 114783 Skittles 98782 Hot Tamales 41892
    SD Starburst 24783 Jolly Ranchers 22983 Candy Corn 7827
    TN Tootsie Pops 59837 Salt Water Taffy 34859 Skittles 20938
    TX Starburst 1952361 Reese’s Cups 1927663 Almond Joy 837525
    UT Jolly Ranchers 475221 Reese’s Cups 29823 Tootsie Pops 198564
    VT Milky Way 29837 M&M’s 27811 Skittles 17662
    VA Snickers 26783 Hot Tamales 26178 Candy Corn 18726
    WA Tootsie Pops 223850 Salt Water Taffy 210981 Hershey Kisses 78662
    DC M&M’s 26092 Tootsie Pops 21364 Blow Pops 14763
    WV Blow Pops 43776 Hershey’s Mini Bars 23554 Milky Way 18911
    WI Starburst 116788 Butterfinger 115982 Jolly Ranchers 42998
    WY Reese’s Cups 32889 Salt Water Taffy 26555 Skittles 20812
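For the curious, a ranking like the table above is essentially a “sum pounds by state and candy, then keep the top three per state” aggregation. Here is a minimal pandas sketch of that derivation, assuming a hypothetical orders file with state, candy, and pounds columns (this is not CandyStore.com’s actual data or schema):

# Minimal sketch: derive a top-3-candies-per-state table from raw bulk-order
# data. Column names (state, candy, pounds) are assumptions for illustration.
import pandas as pd

orders = pd.read_csv("bulk_candy_orders.csv")   # hypothetical: one row per order

top3 = (orders.groupby(["state", "candy"], as_index=False)["pounds"].sum()
              .sort_values(["state", "pounds"], ascending=[True, False])
              .groupby("state")
              .head(3))                          # keep the 3 biggest candies per state
print(top3)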

     

    FiveThirtyEight took a different approach by analyzing data from 269,000 head-to-head matchups between candies. Their findings:

    Reese’s took 4 of the top 10 spots!

They boiled the results down to a handful of candy attributes; see the FiveThirtyEight article in the sources for the full breakdown.
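FiveThirtyEight’s model was more sophisticated than this, but the core idea of turning head-to-head matchups into a ranking can be illustrated with a simple win-rate calculation. The matchup data below is made up:

# Illustrative only: rank candies by win rate across head-to-head matchups.
# FiveThirtyEight's analysis was more involved; the data here is invented.
from collections import defaultdict

matchups = [("Reese's", "Candy Corn", "Reese's"),
            ("Snickers", "Reese's", "Reese's"),
            ("Candy Corn", "Snickers", "Snickers")]   # (candy_a, candy_b, winner)

wins, games = defaultdict(int), defaultdict(int)
for a, b, winner in matchups:
    games[a] += 1
    games[b] += 1
    wins[winner] += 1

ranking = sorted(games, key=lambda c: wins[c] / games[c], reverse=True)
for candy in ranking:
    print(candy, round(wins[candy] / games[candy], 2))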

    Music

    In This Creepy, Sleepy Backward Town by Squire Tuck via Free Music Archive

    Sources

    1. https://nrf.com/media/press-releases/halloween-spending-reach-record-91-billion
    2. https://www.candystore.com/blog/facts-trivia/halloween-candy-map-popular/
    3. https://www.candyindustry.com/blogs/14-candy-industry-blog/post/87484-halloween-scary-good-for-candy-sales
    4. http://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/
    5. http://freemusicarchive.org/music/Squire_Tuck/Happy_Halloween_1583/In_This_Creepy_Sleepy_Backward_Town_1_-_29102016_1146
E020 – How Crisis Text Line uses data to save lives
https://www.fortheloveofdata.com/e20/ (Wed, 27 Sep 2017)

If you’re in crisis, text 741741 if you’re in the US to talk with a counselor now. In this episode we speak with the people behind Crisis Text Line and Crisis Trends, two services that use data to make a difference for those going through a crisis or looking for someone with whom to talk.

    Overview

    Key Stats

    • Over 1 million messages transmitted per month
    • 75% of texters are under 25
    • 10% under age 13
    • 65% say they have shared something with Crisis Text Line that they haven’t shared with anyone else
    • Usually at least one active rescue per day
    • Take people based on severity and have the ability to initiate an active rescue (via 911)
• Words like ibuprofen, aspirin, tylenol are more indicative of active rescue need than the words die, overdose, suicide (a toy sketch of this kind of word statistic follows after this list)
      • 🙁 emoji is 4x more of an indicator
    • Roots of CTL go back to 1906 when Save-A-Life League started via newspaper ads
      • The Samaritans was the first phone suicide hotline and started in November 1953
    • Founded by Nancy Lublin, who is also the CEO of DoSomething.org, in 2011
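A toy illustration of the idea behind stats like the ibuprofen/aspirin one above: compare how often a word shows up in conversations that ended in an active rescue versus conversations overall (a simple “lift”). This is not Crisis Text Line’s actual model, and the sample conversations are invented:

# Toy sketch: how much more likely is a word to appear in conversations that
# led to an active rescue vs. conversations overall ("lift")? Data is made up;
# this is not Crisis Text Line's actual model.
convos = [("i took ibuprofen and aspirin", True),
          ("i want to die", False),
          ("took a lot of tylenol tonight", True),
          ("thinking about overdose", False)]          # (text, active_rescue)

def lift(word):
    has_word = [rescue for text, rescue in convos if word in text]
    if not has_word:
        return 0.0
    p_rescue_given_word = sum(has_word) / len(has_word)
    p_rescue_overall = sum(r for _, r in convos) / len(convos)
    return p_rescue_given_word / p_rescue_overall

for w in ["ibuprofen", "tylenol", "die", "overdose"]:
    print(w, round(lift(w), 2))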

    • Introductions – background, how they got their start, how they got involved in CrisisTextLine
      • Staci – volunteer
      • Scotty – Data Scientist
    • History of Crisis Text Line and high-level structure (where they operate, # of locations, # of employees / volunteers)
    • Staci’s experience
      • What was training like?
• Where does she take sessions and how often?
• How does she feel after a session?
      • Her experience as a counselor and thoughts on the impact, data, etc.
    • What ways they collect data
      • #s of texters
      • UI platform for counselors
      • Types of data they collect
      • Types of technologies used to collect/manage it – both publicly, behind the scenes, for presentations, etc.
    • What ways they use data
      • CrisisTrends.org site
      • Anonymity, opt-in/opt-out options and how frequent each occur
    • Key stats they feel are most important/surprising/alarming, etc.
      • How has data made an impact to those in need?
      • How has data made an impact to counselors?
      • How has data made an impact to the organization?
      • How has data made an impact to the crisis advocacy sector as a whole?
• What ways other people can use their data
      • Do they encourage that visitors explore to find their own insights?
      • Will data be available by zip code at some point?
    • Data Science
      • What tools and techniques do they see being most important in the near term?
      • What do they see as becoming less important in the near term?
      • What is something they could have told their earlier selves that would have made their path to this point easier?
    • Organization Info
      • How someone can get involved
      • What they need most
      • What is in store for the future? New technologies, platforms for contact, etc.
      • How someone can contact them

    Music

    Deep Sky Blue by Graphiqs Groove

    Sources

    1. https://youtu.be/KOtFDsC8JC0 – TED talk about origin
    2. https://www.crisistextline.org/
    3. https://crisistrends.org/
    4. http://www.newyorker.com/magazine/2015/02/09/r-u
E019 – Tech Spec – Cognos Analytics (11.0.6)
https://www.fortheloveofdata.com/e019/ (Sun, 20 Aug 2017)

Join me as I chat with my colleague and Cognos guru John Frazier about the latest release of Cognos, leading up to the anticipated release of the next version, 11.0.7, near the end of Q3.

    The latest version of Cognos (11.0.6) debuted on March 21, 2017. You can sign up for a perpetually free trial (like Tableau Online) here.

    Version 11 was originally released in December 2015 and was mainly a UI redesign on top of Cognos 10 features. Analysis and Query Studios will eventually be deprecated.

    New Features in 11 vs. 10

    • New UI – responsive web design on UI, but not on reports
    • Better self-service capabilities and collaboration for teams
• Upload data files – upload delimited text or Excel files to be stored in a columnar format (Parquet) on the file system (not in memory or in the DB). These are immediately usable in dashboards and don’t require entry into FM. (A short illustration of the Parquet format follows after this list.)
    • Data modules (intent based modeling based on Watson) similar to FM packages
      • Note: Dashboards only use uploaded files and data modules
    • Available on cloud
    • Mobile and desktop from a single report
    • Active reports as prompts
    • Free cloud trial
    • Admin console is unchanged
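“Stored in a columnar format (Parquet)” just means the uploaded file is persisted column-by-column, which makes it quick to scan for dashboards. Cognos handles this internally, but as a rough illustration of what a Parquet round trip looks like (assuming pandas with the pyarrow engine installed; this is not Cognos code):

# Rough illustration of Parquet (columnar) storage, the format Cognos uses for
# uploaded data files. Uses pandas + pyarrow; not Cognos's internal code.
import pandas as pd

df = pd.DataFrame({"region": ["East", "West"], "sales": [120, 95]})
df.to_parquet("uploaded_file.parquet")             # written column-by-column on disk
roundtrip = pd.read_parquet("uploaded_file.parquet")
print(roundtrip)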

    New Features in 11.0.6

    • Mapping enhancements
      • Multiple admin boundaries, add’l postal code support
    • Dashboarding enhancements
      • Direct access to OLAP packages (Framework packages accessible since 11.0.5)
      • Widgets using data from the same source are connected by default
      • New grid widget
      • Color gradient by measure
      • Date filters can include blanks
    • Portal enhancements
      • Share/embed through overflow menu
      • Folder customizations can be done directly through the UI more easily (without uploading JSON configs)
      • Create shortcuts and report views
    • Storytelling enhancements
      • New guided journey templates
      • New animations (side fade, slide, scale, zoom, pivot)
      • Better pins (smart named, better search and filter)
      • Timelines – smart names
      • Change scene template while working on your story/dashboard
    • Reporting enhancements
      • Better lineage support for FM packages
      • Business glossary (w/IBM InfoSphere Information Governance Catalog integration)
      • Better freeze list column heading control
      • Better query support when editing data modules
      • Report templates – can save for your team or save as style reference reports
    • Support for Planning Analytics
      • Dashboard support for TM1 / Planning cubes
      • REST connectivity to planning analytics
      • Support for attribute hierarchies
      • Support for localized Planning Analytics cubes
    • Data server enhancements
      • Support for Google BigQuery and Google Cloud SQL via the BigQuery JDBC and MySQL JDBC drivers, respectively.
      • JDBC URL for Data Server Connections
      • Test connection feedback (this is not just in admin console now)

    John’s Likes/Dislikes with v11:

    • For those who are “used” to ReportStudio there is a pretty “steep” learning curve to locate where particular tools or components have been moved.
    • To be fair, ReportStudio had some counter-intuitive placements for some of these same tools (e.g. Hierarchy of design elements, etc.) that caused major headaches for new report designers.
    • Overall the new interface is more “intuitive” and the novice report developers I’ve worked with have picked it up remarkably quickly.
    • There are some changes that are really “nice” – like being able to see which Lists/Graphs use a particular query right from the query tree without having to “search” for where it is used on the “right click” menu.

    Music

    Deep Sky Blue by Graphiqs Groove

    Sources

    1. https://www.ibm.com/analytics/us/en/technology/products/cognos-analytics/
    2. https://www.ibm.com/communities/analytics/cognos-analytics-blog/the-latest-release-of-cognos-analytics-is-here/
    3. http://newintelligence.ca/top-12-reasons-to-upgrade-to-cognos-analytics-a-k-a-cognos-11/
    4. https://www.ibm.com/support/knowledgecenter/SSEP7J_11.0.0/com.ibm.swg.ba.cognos.ca_new.doc/c_ca_nf_deprecated.html
    5. https://www.ibm.com/support/knowledgecenter/en/SSEP7J_11.0.0/com.ibm.swg.ba.cognos.ca_new.doc/c_ca_nf_11_0_x.html
    6. https://www.slideshare.net/senturus/cognos-analytics-version-11-questions-answered
E018 – Tech Spec – Sia, ultimate blockchain file storage
https://www.fortheloveofdata.com/e018/ (Sun, 30 Jul 2017)

What if you could store your data in the cloud, encrypted, for a fraction of the cost of Amazon S3, Google, or Azure? With Sia, a decentralized file storage solution that leverages blockchain, you can. Learn more about how it works in this episode.

    Blockchain Overview

A blockchain is a permissionless distributed database that maintains a continuously growing list of transactional data records. The system’s design means it is hardened against tampering and revision, even by operators of the nodes that store data. The initial and most widely known application of blockchain technology is the public ledger of transactions for Bitcoin, but its structure has been found to be highly effective for other financial vehicles.

• CONSENSUS BUILDING – The ability for a significant number of nodes to converge on a single consensus of the most up-to-date version of a large data set such as a ledger
• TRANSACTION VALIDITY – The ability for any node that creates a transaction to determine whether the transaction is valid, able to take place, and become final (i.e., that there were no conflicting transactions)
• AUTOMATED RESOLUTION – An automated form of resolution that ensures that conflicting transactions (such as two or more attempts to spend the same balance in different places) never become part of the confirmed data set

    Blockchain block detail

    [Illustration by Matthäus Wander (Wikimedia)]

    • Timestamp: The time when the block was found.
    • Reference to Parent (Prev_Hash): This is a hash of the previous block header which ties each block to its parent, and therefore by induction to all previous blocks. This chain of references is the eponymic concept for the blockchain.
    • Merkle Root (Tx_Root): The Merkle Root is a reduced representation of the set of transactions that is confirmed with this block. The transactions themselves are provided independently forming the body of the block. There must be at least one transaction: The Coinbase. The Coinbase is a special transaction that may create new bitcoins and collects the transactions fees. Other transactions are optional.
    • Target: The target corresponds to the difficulty of finding a new block. It is updated every 2016 blocks when the difficulty reset occurs.
    • The block’s own hash: All of the above header items (i.e. all except the transaction data) get hashed into the block hash, which for one is proof that the other parts of the header have not been changed, and then is used as a reference by the succeeding block.

Why You Can’t Cheat at Bitcoin:

1. Say everybody is working on block 91.
2. But one miner wants to alter a transaction in block 74.
3. He’d have to make his changes and redo all the computations for blocks 74-90 and then do block 91. That’s 18 blocks of expensive computing.
4. What’s worse, he’d have to do it all before everybody else in the Bitcoin network finished just the one block (number 91) that they’re working on.
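To make the chaining concrete, here is a minimal Python sketch of hash-linked blocks (simplified headers, no mining difficulty). Because each block’s hash covers its parent’s hash, editing an old transaction changes that block’s hash and breaks every block after it, which is exactly the work the would-be cheater has to redo:

# Minimal sketch of hash-linked blocks (no mining difficulty, simplified
# headers). Shows why editing an old block invalidates its descendants.
import hashlib, json, time

def block_hash(block):
    header = {k: block[k] for k in ("prev_hash", "tx_root", "timestamp")}
    return hashlib.sha256(json.dumps(header, sort_keys=True).encode()).hexdigest()

def merkle_root(txs):
    # Simplified stand-in for a real Merkle tree: hash the concatenated tx hashes.
    return hashlib.sha256("".join(hashlib.sha256(t.encode()).hexdigest() for t in txs).encode()).hexdigest()

chain = []
prev = "0" * 64
for txs in (["coinbase-1"], ["coinbase-2", "alice->bob 5"], ["coinbase-3"]):
    block = {"prev_hash": prev, "tx_root": merkle_root(txs), "timestamp": time.time(), "txs": txs}
    prev = block_hash(block)
    chain.append(block)

def valid(chain):
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))

print(valid(chain))                        # True
chain[1]["txs"][1] = "alice->mallory 5"    # tamper with an old transaction...
chain[1]["tx_root"] = merkle_root(chain[1]["txs"])   # ...which forces a new tx_root
print(valid(chain))                        # False: the next block's prev_hash no longer matches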

    Sia Overview

• Decentralized network that places encrypted pieces of your data on dozens of nodes (a toy sketch of the idea follows after this overview list)
    • Aims to be fastest, cheapest, most secure storage solution and compete with AWS, GCP, Azure
    • Users pay in Siacoins, a cryptocurrency like Bitcoin
      • Must go USD -> Bitcoin -> Siacoin -> Wallet -> File Upload
    • Open source
    • Started by David Vorick and Luke Champine through a VC backed Boston-based company called Nebulous Inc
    • Origins in the HackMIT 2013 conference
    • Uses ASICs (application specific integrated circuits) for mining
      • These are purpose built integrated circuits, not general multi-use devices
• Evolution from CPU -> GPU -> ASIC
      • Faster and less vulnerable to attacks than GPUs
      • Why? See here.
• Created a company to make ASICs called Obelisk.
      • ~$2,500 per machine
• Current price is about 124 Siacoin to $1 USD
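As a very loose illustration of “encrypted pieces of your data on dozens of nodes,” the sketch below splits data into chunks and assigns redundant copies to multiple hosts. Sia itself uses real encryption and erasure coding rather than simple replication, so treat this only as the shape of the idea:

# Toy sketch of splitting data into redundant pieces across hosts. Sia itself
# uses real encryption and erasure coding; this is only illustrative.
import hashlib

def split_and_place(data: bytes, hosts, chunk_size=64, copies=3):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    placement = {}
    for idx, chunk in enumerate(chunks):
        digest = hashlib.sha256(chunk).hexdigest()
        # Pick `copies` hosts per chunk, rotating so no single host holds everything.
        chosen = [hosts[(idx + j) % len(hosts)] for j in range(copies)]
        placement[digest] = chosen
    return placement

hosts = [f"host-{n}" for n in range(1, 7)]
print(split_and_place(b"some file contents " * 20, hosts))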

    Pros

    • Decentralized, peer-to-peer
    • Encrypted and immutable
    • Hosts can earn money by renting free disk space to renters
      • Must maintain 95% uptime to preserve collateral

    Possible Issues

    • Renters uploading illegal content to hosts
      • However, renters would have to pay for the bandwidth leechers use to download files
    • Slow at this point
    • Low number of users

    Music

    Deep Sky Blue by Graphiqs Groove

    Sources:

E017 – Tech Spec – Tableau 10.3 New Features
https://www.fortheloveofdata.com/e017/ (Thu, 29 Jun 2017)

In this episode we cover the new features in Tableau 10.3. This version debuted on May 31st, and a 10.3.1 update was released on 6/21/17.

    1. Data Driven Alerts
      1. Only on Tableau Server
      2. Receive an alert when a mark crosses a visual threshold
      3. Can use on any viz with a continuous numeric axis
      4. Can sign up yourself and others; then each person can self-administer
      5. Default check rate is 60 minutes or when an extract is refreshed. Can customize with this command:

tabadmin set dataAlerts.checkIntervalInMinutes <value in minutes>

    tabadmin restart

2. Tableau Bridge – Limited Release
      1. Connect to live, on-premise data from Tableau Online
2. Replaces the sync client – is basically the sync client + live query functionality. The client is installed and run behind your firewall and pushes data to Tableau Online.
      3. Live connections must be enabled by administrators. Limited to RDBMSs (MySQL, SQL Server, etc.)
4. Oracle cloud hosted DBs must use Tableau Bridge
5. Must run as a service to enable live connections
6. Must embed credentials in Tableau Bridge if you want it to automatically update on a schedule
7. Will restart every hour minimum. You can set this window with this command:

tabonlinesyncclientcmd.exe SetDataSyncRestartInterval --restartInterval=<value in seconds>

8. Best Practices (https://www.tableau.com/about/blog/2017/5/introducing-tableau-bridge-live-queries-premises-data-tableau-online-70767)
      1. Split bridges into two machines: one for extract refreshes and another for live queries, unless usage is extremely low
      2. Run the bridge continuously (ideally on a VM in a data center)
      3. Tune dashboards and queries to leverage extracts for summarized data
3. Smart Table and Join Recommendations – Machine Learning will recommend tables and joins (even on non-similar names) based on previous usage metrics
4. PDF Connector
      1. Connect to PDFs, identify tables, and pull data out
      2. Less copying/pasting/massaging of data to get it ready for Tableau
      3. Somewhat limited at this time, but continuing to be developed
5. More Union support in more connectors
      1. DB2
      2. Hadoop
      3. Teradata
      4. Netezza
6. New connectors
      1. Amazon Athena
      2. MongoDB BI
      3. OneDrive
      4. ServiceNow
      5. Dropbox
      6. JSON – scan entire file, not just a sample
7. Automatic Query Caching – Tableau server can pre-cache queries in recent workbooks after an extract refresh to speed up performance on initial load.
8. Miscellaneous
      1. More options in Web Authoring (drills, formats, changing displays)
      2. Story points navigator – more streamlined
      3. Mobile – Android improvements, banner to Tableau Mobile, universal linking that allows you to click and open in Tableau Mobile
      4. Tooltip selections – highlight data from tooltip links
      5. Latest date filter
      6. Distribute evenly
      7. Maps – French, Netherlands, Australian, and New Zealand updates
      8. Apply table calc filters to totals
      9. Custom subscriptions – days/hours, etc.
      10. APIs – various REST updates (tags on sources and views, switch sites, get sites list, etc.)

    Music is Deep Sky Blue by Graphiqs Groove

    Sources

    1. https://www.tableau.com/new-features/10.3
    2. https://www.tableau.com/about/blog/2017/4/save-time-data-driven-alerts-tableau-103-67888
E016 – For the Love of Sunscreen
https://www.fortheloveofdata.com/e016/ (Wed, 31 May 2017)

In this episode, data sheds some (sun)light on what Rob did wrong on a recent trip to the Caribbean and explains the terrible sunburn he has right now. Just in time for Memorial Day and Summer, we take a look at many recent findings and how they will lead us to a healthier outdoor lifestyle.

    A LOT of this content came from the Environmental Working Group (EWG). Please visit their site for more great info and the source of much of this episode.

EWG recently released its 2017 EWG Sunscreen Guide with research and guidance on sunscreen efficacy, ingredients, and health risks. It is chock full of great information to keep you safe and dispels many misconceptions that most people hold.

    Why are sun rays harmful?

    • UV radiation penetrates the skin and produces genetic mutations that can cause cancer
    • UVA
      • Less intense than UVB, but 30-50x more prevalent
      • Dominant tanning ray
      • UVA rays penetrate deeper, suppress the immune system, cause harmful free radicals to form, and are associated with higher risk of melanoma
    • UVB
      • UVB rays are the primary cause of sunburns and non-melanoma skin cancer.
      • Most intense from 10AM-4PM April through October
      • Most reflected by snow or ice
      • The chemicals in sunscreen help combat UVB rays more than UVA

    Why the Sun (UV Exposure) is Harmful3

• The rate of new melanoma cases among American adults has tripled since the 1970s, from 7.9 per 100,000 people in 1975 to 25.2 per 100,000 in 2014 (NCI 2017)
    • Melanoma death rate for white American men, the highest risk group, has escalated sharply, from 2.6 deaths per 100,000 in 1975 to 4.4 in 2014
    • Since 2003, the rates of new melanoma cases among both men and women have been climbing by 1.7 and 1.4 percent per year, respectively, according to the federal Centers for Disease Control and Prevention (CDC 2016)
    • More than 3 million Americans develop skin cancer each year (ACS 2017)
    • Most cases involve one of two disfiguring but rarely fatal forms of skin cancer – basal and squamous cell carcinomas. Studies suggest that basal and squamous cell cancers are strongly related to UV exposure over years.
      • Several researchers have found that regular sunscreen use lowers the risk of squamous cell carcinoma (Gordon 2009, van der Pols 2006) and diminishes the incidence of actinic keratosis – sun-induced skin changes that may advance to squamous cell carcinoma (Naylor 1995, Thompson 1993)
      • Researchers have not found strong evidence that sunscreen use prevents basal cell carcinoma (Green 1999, Pandeya 2005, van der Pols 2006, Hunter 1990, Rosenstein 1999, Rubin 2005).
      • Both UVA and UVB rays can cause melanoma, as evidenced by laboratory studies on people with extreme sun exposures. In the general population, there is a strong correlation between melanoma risk and a person’s number of sunburns, particularly those during childhood (Dennis 2010).
      • The use of artificial tanning beds dramatically increases melanoma risk (Coleho 2010).
    • People who rely on sunscreens tend to burn, and sunburns are linked to cancer.
      • When people use sunscreen properly to prevent sunburn, they often extend their time in the sun. They may prevent burns, but they end up with more cumulative exposure to UVA rays, which inflict subtler damage (Autier 2009, Lautenschlager 2007).

However, the research on the link between sunscreen use and melanoma risk isn’t conclusive.

    • Scientists don’t know conclusively whether sunscreen can help prevent melanoma. There are studies on both sides that say it helps or it does not.
    • Several factors suggest that regular sun exposure may not be as harmful as intermittent and high-intensity sunlight. Paradoxically, outdoor workers report lower rates of melanoma than indoor workers (Radespiel-Troger 2009).
      • Melanoma rates are higher among people who live in northern American cities with less year-round UV intensity than among residents of sunnier cities (Planta 2011).
      • Researchers speculate that higher vitamin D levels for people with regular sun exposure may play a role in reduced melanoma risk (Godar 2011, Newton-Bishop 2011, Field 2011).
        • So DRINK MILK!
      • The consensus among researchers is that the most important step people can take to reduce their melanoma risk is to avoid sunburn but not all sun exposure (Planta 2011).

    What is SPF?

    • SPF = Sun Protection Factor
• How much longer it will take for the sun to redden skin than without it (i.e., SPF 15 = 15x longer for the sun to redden you); a quick check of the math follows below
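The percentages quoted later in the high-SPF section (SPF 50 blocks about 98% of UVB, SPF 100 about 99%) follow from the usual approximation that an SPF-N product lets roughly 1/N of the UVB through. A quick check in Python:

# Quick check of the usual SPF approximation: an SPF-N product transmits
# roughly 1/N of the UVB, so it blocks about 1 - 1/N.
for spf in (15, 30, 50, 100):
    print(f"SPF {spf}: blocks ~{(1 - 1 / spf):.1%} of UVB")
# SPF 50 -> ~98.0%, SPF 100 -> ~99.0%, matching the figures cited below.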

    • IBISWorld, a market research company, reports that sunscreen product sales grew 2.6 percent a year between 2011 and 2016, and generated $394 million annually (IBISWorld 2016)3

    Effects by Age

    • Baby skin is thinner and absorbs more water
    • Infant and toddler skin has less melanin, which protects from UV light
    • The older you are the thicker and more pigmented you get, which is more protective
    • Very few studies are done on the effects on small children
    • Adults older than 60 are also more sensitive to sunlight

    Tanning beds are BAD!

    • Emit up to 12x the UVA of the sun
    • People who use tanning beds are 1.5-2.5x more likely to get cancer.
    • The risk of melanoma goes up when you use a tanning bed at any age, but the  International Agency for Research on Cancer calculates that if you start using tanning beds before age 30, your risk of developing melanoma jumps by 75 percent3.

    Vitamin A is a bad ingredient

    Vitamin A in the form of retinyl palmitate can harm skin when combined with sunlight. Luckily its usage has been falling.

    Sprays are convenient, but not the best option

Inhaling the chemicals in the spray can be bad, most people apply too light a coat, and people miss spots. Despite this, their use is on the rise, increasing 27%.

    High SPFs are deceiving2

    • Correctly applied SPF 50 blocks 98% of UVB rays; SPF 100 blocks 99%
    • The higher the SPF, the more UVB it blocks, but the less UVA it blocks
    • The way sunscreens are measured may not reflect real world conditions
• In lab measurements, small changes in light can change an SPF 100 sunscreen’s rating to SPF 37
    • People spend more time in the sun when they wear a higher SPF
    • Higher doses of ingredients may be harmful when absorbed into the skin
    • If you don’t apply enough, or misapply, an SPF 100 sunscreen’s actual rating could be as low as SPF 3.2. T-Shirts are SPF 5.
    • Most countries cap advertisements at 50+ (Europe, Japan, Canada, etc.); Australia caps at 30

    European Sunscreens > American Sunscreens?

    Several European companies have developed chemicals that are better at blocking UVA, but these have not yet been approved by the FDA. Europe also requires that the advertised SPF (which is its UVB rating) be no more than 3x the UVA rating.

    Tips to Stay Safe in the Sun

    Know how intense the sun is

    Check a site like http://sunburnmap.com/

    Know your ingredients and pick the right SPF

Know what protects you best. Check if a sunscreen’s claims are accurate, and check how harmful the ingredients may be at http://www.ewg.org/sunscreen/

FDA-Approved Sunscreens (columns: Active Ingredient / UV Filter Name, Range Covered, Side Effects)
UVA1: 340-400 nm
UVA2: 320-340 nm
UVB: 290-320 nm
    Chemical Absorbers:
    Aminobenzoic acid (PABA) UVB
    Avobenzone UVA1 Relatively high skin allergen
    Cinoxate UVB
    Dioxybenzone UVB, UVA2
    Ecamsule (Mexoryl SX) UVA2
    Ensulizole (Phenylbenzimiazole Sulfonic Acid) UVB
    Homosalate UVB Slight skin penetration; disrupts some hormones
    Meradimate (Menthyl Anthranilate) UVA2
    Octocrylene UVB Relatively high allergen
    Octinoxate (Octyl Methoxycinnamate) UVB Slight skin penetration; acts like hormone in body; moderate allergen
    Octisalate ( Octyl Salicylate) UVB
    Oxybenzone UVB, UVA2 Penetrates skin significantly; acts like estrogen in the body; relatively high allergen
    Padimate O UVB
    Sulisobenzone UVB, UVA2
    Trolamine Salicylate UVB
    Physical Filters:
    Titanium Dioxide UVB, UVA2 Inhalation concerns
    Zinc Oxide UVB,UVA2, UVA1 Inhalation concerns

    Table From http://www.skincancer.org/prevention/uva-and-uvb

    Follow these tips

    • Seek the shade, especially between 10 AM and 4 PM.
    • Do not burn.
    • Avoid tanning and UV tanning booths.
    • Cover up with clothing, including a broad-brimmed hat and UV-blocking sunglasses.
    • Use a broad spectrum (UVA/UVB) sunscreen with an SPF of 15 or higher every day. For extended outdoor activity, use a water-resistant, broad spectrum (UVA/UVB) sunscreen with an SPF of 30 or higher.
    • Apply 1 ounce (2 tablespoons) of sunscreen to your entire body 30 minutes before going outside. Reapply every two hours, or immediately after swimming or excessive sweating.
    • Keep newborns out of the sun. Sunscreens should be used on babies over the age of six months.
    • Examine your skin head-to-toe every month.
    • See your physician every year for a professional skin exam.
    • Don’t forget to sunscreen your lips

    Most tips From http://www.skincancer.org/prevention/uva-and-uvb


    Other places to protect yourself

    • Car windows block a lot of UVB, but not UVA
      • Two studies found significantly more melanoma on the left side of the body/face, suggesting long exposure in cars puts you at more risk
      • Car windshields block a lot of UVB and UVA because of the plastic in the middle (around SPF 50); side windows do not do so well (around SPF 16)
      • Transparent window films block out almost 100% of both UVA and UVB
    • Skip the sunroof and convertible
    • Check office windows and skylights to see if they are glass or plastic and if they are treated with a UV film

    Tips if you get a Sunburn17

    • Take frequent cool baths or showers to help relieve the pain. As soon as you get out of the bathtub or shower, gently pat yourself dry, but leave a little water on your skin. Then, apply a moisturizer to help trap the water in your skin. This can help ease the dryness.
    • Use a moisturizer that contains aloe vera or soy to help soothe sunburned skin. If a particular area feels especially uncomfortable, you may want to apply a hydrocortisone cream that you can buy without a prescription. Do not treat sunburn with “-caine” products (such as benzocaine), as these may irritate the skin or cause an allergic reaction.
    • Consider taking aspirin or ibuprofen to help reduce any swelling, redness and discomfort.
    • Drink extra water. A sunburn draws fluid to the skin’s surface and away from the rest of the body. Drinking extra water when you are sunburned helps prevent dehydration.
    • If your skin blisters, allow the blisters to heal. Blistering skin means you have a second-degree sunburn. You should not pop the blisters, as blisters form to help your skin heal and protect you from infection.
    • Take extra care to protect sunburned skin while it heals. Wear clothing that covers your skin when outdoors. Tightly-woven fabrics work best. When you hold the fabric up to a bright light, you shouldn’t see any light coming through.

    Tips from https://www.aad.org/public/skin-hair-nails/injured-skin/treating-sunburn

    Music

    “Wear Sunscreen Commencement Speech” by Mike Harper, KNVE

    Sources

    1. https://www.ewg.org/sunscreen/report/executive-summary/
    2. http://www.ewg.org/sunscreen/report/whats-wrong-with-high-spf/
    3. http://www.ewg.org/sunscreen/report/skin-cancer-on-the-rise/
    4. http://www.ewg.org/sunscreen/best-kids-sunscreens/
    5. http://www.ewg.org/sunscreen/worst-kids-sunscreens/
    6. https://www.ewg.org/sunscreen/best-sunscreens/best-beach-sport-sunscreens/
    7. http://www.ewg.org/sunscreen/about-the-sunscreens/730906/
    8. http://sunburnmap.com/
    9. https://www.cdc.gov/mmwr/pdf/wk/mm6118.pdf
    10. http://lifehacker.com/sunscreen-showdown-creams-vs-sprays-1784495399
    11. http://www.skincancer.org/prevention/uva-and-uvb
    12. http://www.bananaboat.com/sun-safety/spf-chart
    13. https://sydology.com/2014/07/03/sun-smarts/spf-chart/
    14. http://www.npr.org/sections/health-shots/2011/06/06/137010355/a-babys-skin-is-no-match-for-the-sun
    15. http://www.webmd.com/skin-problems-and-treatments/tc/sunburn-topic-overview#1
    16. http://www.everydayhealth.com/skin-and-beauty/sunscreen-mistakes-that-hurt-your-skin.aspx
    17. https://www.aad.org/public/skin-hair-nails/injured-skin/treating-sunburn
    18. http://www.nytimes.com/2011/04/05/health/05really.html
    19. http://www.autoblog.com/2013/09/06/not-all-car-windows-protect-against-uv-rays/
E015 – BBQ Showdown (Pellet Grill vs Big Green Egg)
https://www.fortheloveofdata.com/e015/ (Sun, 30 Apr 2017)

Join me and my special guest, Colby “meat whore” Pritchett (@colbypritchett) on this BBQ showdown where we pit the Big Green Egg against the Green Mountain Grill Pellet Smoker. We also cover the history, styles, stats, and health facets of different types of BBQ.

     History

• BBQ derives from the Spanish word ‘barbacoa’, but where the word actually originated is still debated.
• BBQ dates back to the colonial era. George Washington even attended BBQs.
    • Woods commonly selected for their flavor include mesquite, hickory, maple, guava, kiawe, cherry, pecan, apple and oak. Woods to avoid include conifers. These contain resins and tars, which impart undesirable resinous and chemical flavors.
    • The most popular foods for cooking on the grill are, in order: burgers (85 percent), steak (80 percent), hot dogs (79 percent) and chicken (73 percent).
• May is National BBQ Month
• Only 10% of grill owners have a backyard kitchen equipped with premium furniture and lighting
• The longest barbecue measured 8,000 m (26,247 ft) and was created by the people of Bayambang, Pangasinan, Philippines on 4 April 2014. The record attempt took place during the Malangsi Fish-tival in order to celebrate the 400th anniversary of the city Bayambang. The barbecue was made up of 8,000 grills connected to each other, each measuring 1m in length, 58 cm in height and 21 cm in width. 50,000 kg of fish, 2,000 kg of salt, 480 blocks of ice and 6,000 bags of charcoal were used. 8,000 people were involved.

     Styles

    There are different regional barbecue styles all across the country. Although they all cook their meat low and slow, that’s where the similarities stop. Some cook pig, some smoke different cuts of beef, some lamb, and some chicken. Sauces are also varied: some are vinegar and pepper-based; others utilize brown sugar and molasses; in some, mustard is the predominant flavor; and tomato is the primary flavor in others. While there are plenty of nuances and micro-regional styles, there are four styles that anyone who claims to be a barbecue lover should know about.

    In North Carolina, barbecue revolves around the pig: the “whole hog” in the east and the shoulder in the west. The pork is chopped up and usually mixed with a vinegar-based sauce that’s heavy on the spices and contains only a small amount of tomato sauce, if any.

    In Memphis, it’s all about the ribs. Wet ribs are slathered with barbecue sauce before and after cooking, and dry ribs are seasoned with a dry rub. You’ll also find lots of barbecue sandwiches in Memphis: chopped pork on a bun topped with barbecue sauce, pickles, and coleslaw.

    Kansas City barbecue uses a wide variety of meat (but especially beef) and here it’s all about the sauce, which is thick and sweet. Kansas City is a barbecue melting pot, so expect to find plenty of ribs, brisket, chicken, and pulled pork there, all served with plenty of sauce. Brisket burnt ends are also a specialty here.

    And there are a few different styles native to Texas, but the most famous variety is the Central Texas Hill Country “meat market” style: heavy on the beef brisket, which has been given a black pepper-heavy rub. Sauce and side dishes usually play second fiddle, because in Texas it’s all about the meat, be it ginormous beef ribs, pork ribs, chicken, brisket, or sausage.

    – http://lehighvalleymarketplace.com/get-sauced-the-nations-top-bbq-regions/

     

    Brisket Cuts

    • USDA Utility, Cutter, Canner Beef. These are the lowest grades of beef and used primarily by processors for soups, canned chili, sloppy Joe’s, etc. You will not likely see them in a grocery.
    • USDA Standard or Commercial Beef. Practically devoid of marbling. If it does not have a grade on the label it is probably standard or commercial. These grades are fine for stewed or ground meat, but they are a bad choice for the grill. About 2% fat.
    • USDA Select Beef. Slight marbling. If you know what you are doing you can make this stuff tender. Otherwise, get a higher grade. About 2 to 4% fat.
• USDA Choice Beef. Noticeable marbling, but not a lot. This is a good option for backyard cooks. About half of all beef is marked USDA Choice. There are actually three numbered sublevels of USDA Choice. Certified Angus Beef (CAB) is limited to only the top two levels. Reliable sources tell me that Walmart “Choice Premium” is USDA Choice. The word “premium” is all about marketing and not to be confused with USDA Prime. 4-10% fat. A 12 ounce ribeye typically sells for about $8 to 10 retail at the time of this writing in 2010, and prices fluctuate depending on supply and demand as well as weather which impacts the cost of feed.
• USDA Prime Beef. Significant “starry night” marbling. Often from younger cattle. Prime is definitely better tasting and more tender than Choice. Only about 3% of the beef is prime and it is usually reserved for the restaurant trade. About 10 to 13% fat, about $20-30 for a 12 ounce ribeye at retail. A dry aged steak can be 15-18% fat and $30-35 or more for a 12 ounce ribeye.
    • Black Angus. Black Angus cattle are considered by many to be an especially flavorful breed. Alas, it is almost impossible to know if what you are buying really is Angus.
    • Certified Angus Beef. The Certified Angus Beef (CAB) brand is a trademarked brand designed to market quality beef. To wear the CAB logo, the carcass is supposed to pass 10 quality control standards and CAB must be either USDA Prime or one of the two upper sublevels of USDA Choice. Most of it is USDA Choice. CAB costs a bit more because the American Angus Association charges a fee to “certify” the cattle and higher markups take place on down the line.
    • Interestingly, CAB does not actually certify that the beef labeled Certified Angus Beef is from the highly regarded Angus breed. Their major control is that the cattle must have a black hide, which is a genetic indicator that there are Angus genes in the cattle, but not a guarantee.
• Wagyu Beef. Wagyu cattle have Japanese blood lines and are now raised in the US and other countries. Their genetic heritage can be any of a number of Japanese cattle breeds. American Wagyu does not have to adhere to the same standards as Kobe beef (below), and many of the Wagyu are cross bred with local breeds to make them better adapted to the local climates and diseases. Wagyu and Angus crosses are frequent, and they make mighty fine meat. Wagyu is usually extremely marbled, usually 4 to 10 BMS, more than USDA Prime, but not as much as Kobe, and the flavor and texture is distinctive. It is also about twice the price of USDA Prime. One can only wonder how long before the cross breeding and lack of enforceable standards dilute the quality.

    Nutrition Facts

    Brisket Sales

    • Beef Brisket unit sales (in millions of pounds)

    • 2014 Brisket Sales by Holiday in US (millions of pounds)

    • 538 – Where’s the Beef
      • US Cattle Herds are shrinking (97mm in ’07 –> 88.5mm in ’14)
        • Fertilizer, fuel, and feed rose
        • Droughts hit
      • Prices are rising

2016 Sales by Restaurant (columns: Pecan Lodge, Ten 50 BBQ, Franklin’s Austin)
    Brisket 6,700 2,100 10,662
    Sausage 1,525 2,000 1,200
    Ribs 1,823
    Mac & Cheese 4,000
    Potato Salad 75
    Beans 1,600 600
    Peach Cobbler 340
    Sides 600
    Torpedos 6,500
    Rolls/Bread 4,200 4,000
Notes: Brisket is their single largest expense – more than rent, electricity, etc.
    • Dickey’s – uses BigData and near real-time analytics of store data (synced every 20 min.) to analyze sales trends, inventory, etc.
• If ribs aren’t selling well, they can send a text message coupon out to affect sales (a toy version of this kind of check is sketched after this list)
      • Tools: iOLAP vendor, implemented Yellowfin BI and Syncsort DMX ETL on Amazon Redshift
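The Dickey’s workflow above boils down to a near-real-time threshold check on item-level sales. Here is a toy version of that check (the numbers, thresholds, and promo step are all hypothetical):

# Toy version of a near-real-time sales check like the Dickey's example above.
# Thresholds, data, and the notification step are hypothetical.
todays_sales = {"ribs": 42, "brisket": 180, "pulled pork": 95}     # units sold so far today
expected_by_now = {"ribs": 90, "brisket": 150, "pulled pork": 100} # typical pace for this time of day

for item, sold in todays_sales.items():
    if sold < 0.75 * expected_by_now[item]:
        # In a real pipeline this would trigger the coupon / text-message campaign.
        print(f"{item} is lagging ({sold} vs ~{expected_by_now[item]} expected) - send promo")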

    Other Stats

    •  75% of U.S. adults own a grill or smoker.
    • The majority of grill owners (63%) use their grill or smoker year-round and 43% cook at least once a month through winter.
    • Nearly a third of current owners plan to grill with greater frequency this year.
    • Barbecuing isn’t just an evening activity: 11% of grill owners prepared breakfast in the past year.
    • The five most popular days to barbecue, in order are: July Fourth; Labor Day & Memorial Day (tied); Father’s Day; Mother’s Day.
    • The top three reasons for cooking outdoors, in order are: to improve flavor; for personal enjoyment; for entertaining family and friends.
    • Gas grills are easily the most popular style, the choice of 62% of households that own a grill.

    Pellet Grills

    • Traeger patent granted in 1986 and expired in 2006
    • Continuous fuel source like gas; indirect heating like a traditional smoker, so no flame ups, precise temperature control
    • For people who approach cooking as a science rather than an art (but there’s still art to it)
    • Induction fan makes grill like a convection oven
• Hopper -> Auger -> firebox -> induction fan (a toy sketch of the temperature control loop follows after this list)
    • Pro Tips:
      • MAKE SURE YOU DON’T RUN OUT OF FUEL
      • Have your vent open almost all the way
      • Turn off in proper way to prevent clogs and lock-ups
        • Taking it apart to clean it is fraught with peril
      • It may still have hot spots like any other grill or oven
      • Use food grade pellets, not cheap ones for heaters (these can be scrap wood, shredded pallets, etc.)
• Wifi sounds cool and it is, but sometimes it is temperamental and easier just to go without it, particularly if you’re in a hurry
      • Use your own remote thermometers to watch different parts of grill and multiple pieces of meat at once
      • I still use a gas grill to do direct heat or searing
      • Can use a thermal blanket to insulate during winter or in cold locations – will use less pellets when you do this
      • Get the smallest grill you can stand. The bigger the grill the more pellets required to cook, so you may just be paying to heat air
    • What to look for:
      • Variable temperature setting (not three positions)
      • Hopper capacity
      • Meat probes
      • Shelves and hooks
      • Wifi / smart phone connectivity – verify whether it is only on local Wi-fi or internet capable
      • Larger temperature range offers more options for cold smoking, steaks, etc.
      • Some have options for pizza stones, sear plates, etc.
    • Cool infographic about pellet grills

    – Infographic from Grilling with Rich
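A toy sketch of how the hopper/auger/fan setup holds a set temperature: feed pellets when the pit runs cool, back off when it runs hot. Real pellet-grill controllers use PID logic and variable fan speed; the numbers here are invented purely for illustration.

# Toy sketch of pellet-grill temperature control: feed pellets when the pit is
# below the setpoint, idle when above. Real controllers use PID logic and fan
# control; thresholds and temperatures here are invented.
def control_step(setpoint_f, pit_temp_f):
    error = setpoint_f - pit_temp_f
    if error > 5:
        return "run auger (feed pellets)"
    elif error < -5:
        return "pause auger (let the fire die down)"
    return "hold (small corrections via the fan)"

for temp in (180, 222, 226, 240):
    print(temp, "->", control_step(225, temp))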

     

    Big Green Egg

    • The design is based on ancient clay cooking vessels up to 3,000 years old.
    • Kamado style clay pot grills with removable lids originated in Japan. Kamado means “cooking range” or “stove” in Japanese.
• Very fuel efficient as they hold heat extremely well regardless of the weather. The fact that it holds heat and traps in moisture causes the meat to stay juicy and not dry out.
    • US Air Force servicemen started bringing Kamado style grills back to the US after World War II.
    • In the 1960s people started manufacturing them in the US.
    • Ed Fisher discovered these grills overseas and returned to the US to start the Big Green Egg company in 1974.

    Health Tips

    • Grilling Danger #1: Char
      • While char marks in grilled meat look appealing and give a tasty flavor, the char is laden with cancer-causing compounds called heterocyclic amines (HCAs) that form when meat and high heat are combined to create a blackened crust. The more char that’s created, the more carcinogens result that coat your food. High levels of HCAs can cause cancer in laboratory animals exposed to them, and epidemiological studies show that eating charred meats may be associated with an increased risk of colorectal, pancreatic and prostate cancer.
    • Grilling Danger #2: Smoke
      • Barbecue smoke contains polycyclic aromatic hydrocarbons (PAHs), toxic chemicals that can damage your lungs. As meat cooks, drippings of fat hit the coals and create PAHs, which waft into the air. If you are a grill chef who loves to stand over the barbeque, you are inhaling these toxins. The smoky smell on your clothes and in your hair is also coating the inside of your lungs. The more your grill smokes, the more PAH is generated. The toxins are absorbed along with that delicious smoky flavor right into your food.
    • Grilling Danger #3: Harmful byproducts
      • When food is cooked at very high temperatures, a chemical chain reaction can occur that creates inflammatory products called advanced glycation end products (AGEs) that are harmful to your cells and associated with cellular stress and aging. As suggested by the name ‘end product,’ your body cannot digest them or get rid of them easily. Over time, AGEs accumulate in your organs and cause damage. Where do you find AGEs in the barbeque? In the char.
    • How to avoid the dangers
      • Use marinades and rubs – Coating the meat in herbs with a rub containing rosemary, thyme, or pepper, or smothering it with thick marinades, not only adds delicious flavor but can also reduce the creation of carcinogens from grilling by up to 96%. A tasty marinade also reduces dripping fat and smoke and helps prevent char, thereby lowering the amount of all 3 threats – HCAs, PAHs, and AGEs – in your food. Take-home message: Boosting flavor can reduce risk.
      • Pre-cook your meat – An easy way to decrease the toxins created by barbecuing is to pre-cook your meat halfway over low heat in a skillet or the oven before putting it on the grill. Precooking removes some of the fat that can drip and smoke, and it greatly reduces the amount of time your meat sits on the grill being exposed to toxins. Less time at high heat also means fewer AGEs are created in your meat. Extra bonus: with precooking, you can barbeque the food much faster to feed the hungry troops.
        • Marinate the food in alcohol before barbecuing it. According to research published in the Journal of Agricultural and Food Chemistry, soaking meat in a marinade of beer – especially stout or black beer – reduces the creation of PAHs (cancer-causing compounds) by around 50% when it’s grilled.
      • Reduce drippings – Using a simple piece of aluminum foil as a protective barrier under the meat helps prevent drippings from smoking, thereby reducing the amount of PAH blowing into your food and your lungs. Keeping drippings in the foil can also help to keep your food moist. Another great way to reduce drippings is to choose leaner cuts of meat and trim off any excess fat before you put them on the grill.
      • Grill veggies – Grilled vegetables do not contain the HCA carcinogens even when charred. Vegetable kabobs made with peppers, cherry tomatoes and red onions are great on the grill, and offer many healthy nutrients and cancer fighting substances you can’t get from a steak or chicken breast.

     

    Music: Good BBQ by the Riptones via FreeMusicArchive.org

     

    Sources:

    1. https://www.forbes.com/sites/bernardmarr/2015/06/02/big-data-at-dickeys-barbecue-pit-how-analytics-drives-restaurant-performance/#51eee8106d95
    2. https://www.statista.com/statistics/542950/beef-brisket-unit-sales-us/
    3. https://redcedarbison.com/wp-content/uploads/2014/05/NutChart_txt_2013.jpg
    4. http://amazingribs.com/recipes/beef/zen_of_beef_grades.html
    5. http://www.dallasobserver.com/restaurants/pecan-lodge-s-justin-and-diane-fourton-on-the-challenge-of-great-barbecue-7464853
    6. https://www.statista.com/statistics/542985/beef-brisket-unit-sales-us-summer-holiday/
    7. http://austin.eater.com/2016/6/15/11944024/austin-barbecue-statistics
    8. https://fivethirtyeight.com/features/wheres-the-beef/
    9. http://dallas.eater.com/2016/6/16/11952242/dallas-barbecue-joints-by-the-numbers
    10. http://barbecuebible.com/2016/01/05/bbq-trends-2016/
    11. https://www.forbes.com/sites/larryolmsted/2016/04/28/the-united-states-of-barbecue-americas-love-affair-with-backyard-cooking/#55d7001f5a1d
    12. http://www.motherjones.com/environment/2016/06/july-4-independence-day-grill-bbq-statistics-fires-injuries-carbon
    13. https://en.wikipedia.org/wiki/Pellet_grill
    14. http://grillingwithrich.com/infographic-the-history-of-pellet-grills/
    15. http://barbecuebible.com/2015/02/20/new-pellet-grills/
    16. http://www.traegergrills.com/blog/history-of-the-bbq
    17. https://www.firecraft.com/article/history-of-pellet-grills
    18. http://www.thedailymeal.com/eat/10-things-you-didn-t-know-about-barbecue
    19. https://mobile-cuisine.com/did-you-know/barbecue-fun-facts/
    20. http://www.brickmarketdeli.com/2016/05/fun-facts-about-grilling/
    21. http://www.guinnessworldrecords.com/world-records/longest-barbecue
    22. http://eggheadforum.com/discussion/76256/pros-and-cons
    23. http://www.mirror.co.uk/lifestyle/health/your-bbq-could-give-you-3937181
    24. http://blog.doctoroz.com/oz-experts/the-hidden-dangers-of-grilling
    25. http://lehighvalleymarketplace.com/get-sauced-the-nations-top-bbq-regions/
    26. https://www.bbqguys.com/bbq-learning-center/buying-guides/kamado-grills-history
    27. https://en.wikipedia.org/wiki/Big_Green_Egg
    28. http://biggreenegg.com/about/
    29. http://barbecuebible.com/2015/08/25/where-did-kamado-grills-come-from/
    Join me and my special guest, Colby “meat whore” Pritchett (@colbypritchett) on this BBQ showdown where we pit the Big Green Egg against the Green Mountain Grill Pellet Smoker. We also cover the history, styles, stats, and health facets of different types of BBQ. For the Love of Data full false 1:01:22
    E014 – For the Love of Allergies https://www.fortheloveofdata.com/e014/?utm_source=rss&utm_medium=rss&utm_campaign=e014 Tue, 28 Mar 2017 03:00:46 +0000 http://www.fortheloveofdata.com/?p=221 https://www.fortheloveofdata.com/e014/#respond https://www.fortheloveofdata.com/e014/feed/ 0 Achooo! Did you know that seasonal allergies affect about 50 million people in the US, penicillin kills about 400 people/year, and some people are allergic to cockroaches?! Learn all about allergies in this episode. A note about this episode’s content: Most of the allergy information in this episode is very short statistics that were commonly […] Achooo! Did you know that seasonal allergies affect about 50 million people in the US, penicillin kills about 400 people/year, and some people are allergic to cockroaches?! Learn all about allergies in this episode.

    A note about this episode’s content:

    Most of the allergy information in this episode is very short statistics that were commonly repeated in several sources. In many cases, I simply collected these statements and presented them below. Unless specifically noted below, please consider all the information as referenced from another source. See list of sources at the bottom of the show notes.

    Allergies Defined

    An allergy is when your immune system reacts to a foreign substance, called an allergen. It could be something you eat, inhale into your lungs, inject into your body or touch. This reaction could cause coughing, sneezing, itchy eyes, a runny nose and a scratchy throat. In severe cases, it can cause rashes, hives, low blood pressure, breathing trouble, asthma attacks and even death.1

    There is no cure for allergies. You can manage allergies with prevention and treatment. More Americans than ever say they suffer from allergies. It is among the country’s most common, but overlooked, diseases.1

    Who is affected?

    • In about 50% of all homes in the U.S., there are at least 6 detectable allergens present in the environment.
    • Nasal allergies affect about 50 million people in the United States. (30% of adults, 40% of children)
    • Odds that a child with one allergic parent will develop allergies: 33%.
    • Odds that a child with two allergic parents will develop allergies: 70%.
    • Allergies have been increasing steadily for the past 50 years
    • Most common health issue for kids
    • Percentage of the U.S. population that tests positive to one or more allergens: 55%
    • Females are slightly more likely to have food allergies than males, with reported reaction rates of 4.1% and 3.8%, respectively.
    • Non-Hispanic white children have the highest rate of reported food allergies at 4.1%, followed by non-Hispanic black children at 4.0% and Hispanic children at 3.1%.

    Lethal enforcers

    • The most common triggers for anaphylaxis, a life-threatening reaction, are medicines, food and insect stings.
    • Medicines cause the most allergy related deaths.
    • African-Americans and the elderly have the most deadly reactions to medicines, food or unknown allergens.
    • Deadly reactions from venom are higher in older white men.
    • Deadly drug reactions have increased substantially over the years.

    It ain’t cheap

    • In 2010, Americans with nasal swelling spent about $17.5 billion on health costs.
    • They have also lost more than 6 million work and school days and made 16 million visits to their doctor.
    • Food allergies cost about $25 billion each year.

    Heyyyyy… Fever (Allergic Rhinitis)

    • Worldwide, allergic rhinitis affects between 10 percent and 30 percent of the population.
    • 7.8% of adults get hay fever
    • In 2010, white children were more likely to have hay fever than African-American children.
    • Global warming may have added four weeks to pollen season in the last 10-15 years

    Ragweed pollen count by year is on the rise.

    Allergies around the US

    Pollen map from pollen.com

    Pollen.com details on Dallas, TX

    Pollen.com Dallas, TX history.

    The Eczema-Allergy Connection9

    Eczema can flare up when you are around allergens. Children with eczema are also more likely to have food allergies, such as to eggs, nuts, or milk. Food allergies often make eczema symptoms worse for kids but not for adults.

    • Genes – a gene flaw that causes a lack of a type of protein, called filaggrin, weakens the skin barrier and makes it easier for allergens to get into the body.
    • How the body reacts to allergens – people with eczema may have small gaps in the skin that make it dry out quickly and let germs and allergens into the body. Allergens cause inflammation and lead to eczema.
    • Too many antibodies – people with eczema have above average levels of Immunoglobulin E (IgE), a type of antibody that plays a role in the body’s allergic response.

    Tips to avoid Hay Fever8

    1. Reduce your stress – less stress = milder symptoms
    2. Exercise more – a survey found that people who exercise have the mildest symptoms and this reduces stress, too. However, avoid exercising outdoors when the pollen count is high (early morning and early evening). Better yet, exercise indoors if symptoms are severe.
    3. Eat well
      1. Healthy diets = milder symptoms.
      2. However, foods that can worsen hay fever symptoms for some people include apples, tomatoes, stone fruits, melons, bananas and celery.
      3. Eat foods rich in omega 3 and 6 essential fats which can be found in oily fish, nuts, seeds, and their oils. These contain anti-inflammatory properties, and may help reduce symptoms of hay fever.
    4. Cut down on alcohol – beer, wine and spirits contain histamine, the chemical that sets off allergy symptoms in your body. Alcohol also dehydrates you, making your symptoms seem worse.
    5. Sleep well = mildest symptoms. People who get seven hours of sleep or more report fewer symptoms than those getting five hours of sleep or less a night.
    6. Get pricked – Immunotherapy (allergy shots) helps reduce hay fever symptoms in about 85% of people with allergic rhinitis.3

    Other allergies

    Skin in the game

    Skin allergies include skin inflammation, eczema, hives, chronic hives and contact allergies. Plants like poison ivy, poison oak and poison sumac are the most common skin allergy triggers. But skin contact with cockroaches and dust mites, certain foods or latex may also cause skin allergy symptoms.

    • In 2012, 8.8 million children had skin allergies.
    • Children age 0-4 are most likely to have skin allergies.
    • In 2010, African-American children in the U.S. were more likely to have skin allergies than white children.

    That PB&J that is to die for…literally. (Food Allergies)

    Children have food allergies more often than adults. Eight foods cause most food allergy reactions. They are milk, soy, eggs, wheat, peanuts, tree nuts, fish and shellfish.

    • Percentage of the people in the U.S. who believe they have a food allergy: up to 15%.
    • Percentage of the people in the U.S. who actually have a food allergy: 3% to 4%.
    • Peanut is the most common allergen. Milk is second. Shellfish is third.
    • Peanut and tree nut allergies affect about 1% of the US.
    • In 2014, 4 million children in the US had food allergies.
    • 8% of children have a food allergy
      • Also, 38.7% of food-allergic children have a history of severe reactions.
      • 30.4% are allergic to multiple foods.

    Bad medicine (Drug Allergies)

    • Penicillin is the most common allergy trigger for those with drug allergies. Up to 10 percent of people report being allergic to this common antibiotic.
    • Penicillin kills about 400 people / year.
    • Bad drug reactions may affect 10 percent of the world’s population. These reactions affect up to 20 percent of all hospital patients.

    No glove love

    • Only about 1 percent of people in the U.S. have a latex allergy.
    • However, health care workers are becoming more concerned about latex allergies. About 8-12 percent of health care workers will get a latex allergy.
    • Approximately 220 cases of anaphylaxis and 3 deaths per year are due to latex allergy.

    Bug me not

    People who have insect allergies are often allergic to bee and wasp stings and poisonous ant bites. Cockroaches and dust mites may also cause nasal or skin allergy symptoms.

    • Insect sting allergies affect 5 percent of the population.
    • At least 40 deaths occur each year in the United States due to insect sting reactions.
    • Adults are about 4x more likely to die from an insect sting than a kid. Basically, if you still have a reaction when you’re an adult, it affects you hard.
    • Venom immunotherapy is 97% effective in preventing insect sting reactions in sensitive patients

    Music:

    datagroove by Goto80

    Sources:

    1. http://www.aafa.org/page/allergy-facts.aspx
    2. http://www.aaaai.org/about-aaaai/newsroom/allergy-statistics
    3. http://acaai.org/news/facts-statistics/allergies
    4. http://www.webmd.com/allergies/allergy-statistics
    5. http://www.healthline.com/health/allergies/statistics#1
    6. http://www.allergyassociatesinc.com/allergy-statistics/
    7. https://www.pollen.com
    8. http://www.nhs.uk/Livewell/hayfever/Pages/5lifestyletipsforhayfever.aspx
    9. http://www.webmd.com/skin-problems-and-treatments/eczema/treatment-16/eczema-allergies-link
    10. http://www.businessinsider.com/pollen-season-gets-worse-each-year-2015-6
    Achooo! Did you know that seasonal allergies affect about 50 million people in the US, penicillin kills about 400 people/year, and some people are allergic to cockroaches?! Learn all about allergies in this episode. For the Love of Data full false 20:57
    013 – For the Love of Graph Databases https://www.fortheloveofdata.com/e013/?utm_source=rss&utm_medium=rss&utm_campaign=e013 Tue, 21 Feb 2017 17:30:31 +0000 http://www.fortheloveofdata.com/?p=206 https://www.fortheloveofdata.com/e013/#respond https://www.fortheloveofdata.com/e013/feed/ 0 Where did graphs come from? (Graph Theory History) In its simplest form, Graph Theory defines a graph as a construct made up of vertices, nodes, or points which are connected by edges, arcs, or lines.1 The connections may be directed, indicating a direction from one node to another, or undirected. Properties are attributes associated with […] Where did graphs come from? (Graph Theory History)

    In its simplest form, Graph Theory defines a graph as a construct made up of vertices, nodes, or points which are connected by edges, arcs, or lines.1 The connections may be directed, indicating a direction from one node to another, or undirected. Properties are attributes associated with nodes that describe the node in some detail.

    Graph theory is applied in many disciplines, from linguistics to computer science, physics, and chemistry. Popular uses are discussed below. Leonhard Euler published “Seven Bridges of Königsberg” in 1736; this is commonly regarded as the first paper on graph theory. James Joseph Sylvester first introduced the term “graph” in an 1878 paper, and the first graph theory textbook was published in 1936.1

    There are various algorithms that define how to best traverse through a graph from one node to another based on the edges between them.
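
    To make these definitions concrete, here is a minimal sketch in Python of a tiny directed property graph, with nodes, their properties, and labeled edges stored in plain dictionaries and tuples. It is an illustration of the concepts only, not how any particular graph database stores data.

```python
# Nodes carry properties; directed, labeled edges connect them.
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob":   {"label": "Person", "age": 29},
    "acme":  {"label": "Company", "founded": 1999},
}

edges = [                      # (source, relationship, target)
    ("alice", "KNOWS", "bob"),
    ("alice", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
]

def neighbors(node, relationship=None):
    """Follow outgoing edges from `node`, optionally filtering by relationship type."""
    return [dst for src, rel, dst in edges
            if src == node and (relationship is None or rel == relationship)]

print(neighbors("alice"))            # ['bob', 'acme']
print(neighbors("alice", "KNOWS"))   # ['bob']
```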

    So what…I’ve never used a Graph Database.

    • Have you ever used Google? If so, then you’ve used the most well-known recent implementation of a graph database.
    • Google, Facebook, and LinkedIn all use proprietary forms of graph databases to underpin parts of their websites.

    How Google uses Graphs

    In the original 1998 academic paper that Sergey Brin and Lawrence Page wrote, they described PageRank, the graph portion of their first implementation of Google.

    Basically, all webpages are treated as nodes. The hyperlinks between the pages are edges, and an algorithm assigns a weight to the credibility of each page. The more links a page has from credible sources, the higher that page’s credibility becomes. A search is a) broken down into a series of words, b) matched against the pages that most closely correlate to those words, and c) the resulting pages are ranked according to their credibility, or PageRank.
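
    The ranking step can be illustrated with a stripped-down PageRank iteration over a toy link graph. This is a sketch of the published algorithm in Python, not Google’s production implementation.

```python
# Minimal PageRank by power iteration; links[page] = pages that `page` links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages          # dangling pages spread rank evenly
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```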

    As of mid-2016, the size of Google’s index was about 130 trillion pages. Google has a nice infographic site on how search works here.

    What’s so good about a graph database?

    For use cases involving complex relationships and traversals of them, graph databases are a great choice. They can provide10:

    • Flexible and agile – a graph database should closely match the structure of the data it uses. This allows developers to start work sooner without the added complexity of mapping data across tables. Neo4j calls this ‘whiteboard friendliness’ – meaning what you draw as the design on your whiteboard is how the data is stored in your database.
    • Greater performance – compared to NoSQL stores or relational databases, graph databases offer much faster access to complex connected data, mainly because they lack expensive ‘join’ operations. In one example, a graph database was 1000x faster than a relational database when working with a query depth of four (a rough traversal sketch of this idea appears after this list).

      [Caveat: I did not perform this comparison, but I imagine a properly indexed instance of an Oracle database could complete this query in a decent amount of time, perhaps not as fast as Neo4j, but I bet it would at least finish the query.]
    • Lower latency – users of graph databases experience lower levels of latency. As the nodes and links ‘point’ to one another, millions of related records can be traversed per second and query response time remains constant irrespective of the overall database size.

      – Sample graph query
    • Good for semi-structured data – graph databases are schema free, meaning patchy data, or data with exceptional attributes, don’t pose a structural problem.

    (All of these bullets above are from https://cambridge-intelligence.com/keylines/graph-databases-data-visualization/)
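
    The “query depth of four” comparison above comes down to following relationships directly instead of repeatedly joining tables. Here is a rough Python sketch of that idea using an in-memory adjacency list and a breadth-first traversal; real graph databases achieve this with index-free adjacency on disk, so treat this purely as an illustration.

```python
from collections import deque

# friends[person] = people they are directly connected to (a toy social graph)
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "eve"],
    "dan": ["bob"],
    "eve": ["cat", "fay"],
    "fay": ["eve"],
}

def connections_within(start, max_depth):
    """Breadth-first traversal: everyone reachable within `max_depth` hops of `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue
        for other in friends.get(person, []):
            if other not in seen:
                seen.add(other)
                found.append(other)
                queue.append((other, depth + 1))
    return found

print(connections_within("ann", 4))   # the graph equivalent of a depth-four "friends" query
```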

    When should you use a graph database?

    The most popular and hottest use cases of graph DBs at the moment are:

    • Social network connections
    • Credit card fraud analysis
    • Recommendation engines
    • Master Data Management (MDM) – i.e., 360-degree view of customer
    • Logistics planning for transportation, traffic, shipping, etc.
    • Computer/telecom network planning and analysis

    These boil down to the following uses10:

    • Path finding: Their traversal efficiency makes graph databases an effective path-finding mechanism. Links can be weighted, or assigned relative distances or times, to ascertain the shortest and most efficient routes between two nodes in a network (see the shortest-path sketch after this list).
    • Mapping dependencies: networks of computers and hardware can be modeled as graphs to find components with many dependents that may be potential weak points or vulnerabilities. Other dependency networks, for example corporate or investment structures, can be mapped in a similar manner.
    • Communications: Communications between people can be stored as graphs. Applying network analysis measures can help find influential individuals.
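
    As a concrete illustration of the path-finding use case, here is a small weighted shortest-path sketch in Python (Dijkstra’s algorithm over an in-memory graph with made-up distances). A production deployment would typically use the graph database’s built-in path functions instead.

```python
import heapq

# Weighted, directed edges: graph[node] = [(neighbor, distance), ...]  (toy data)
graph = {
    "warehouse": [("hub_a", 4), ("hub_b", 2)],
    "hub_a": [("store", 5)],
    "hub_b": [("hub_a", 1), ("store", 8)],
    "store": [],
}

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm: returns (total_distance, path), or (None, []) if unreachable."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (dist + weight, neighbor, path + [neighbor]))
    return None, []

print(shortest_path(graph, "warehouse", "store"))
# -> (8, ['warehouse', 'hub_b', 'hub_a', 'store'])
```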

    The Panama Papers13,14

    In 2016, 11.5 million documents comprising 2.6 TB of information were leaked from a Panamanian law firm (Mossack Fonseca). These documents were scanned and processed into the Neo4j graph database, where investigative journalists used graph visualizations to uncover hidden insights and relationships that would have otherwise been missed.

    See the articles at Neo4J for more information on how this information was analyzed.

    What graph databases should I use9?

    Neo4j is far and away the most popular graph database; Neo4j and several of the other top graph DBs are all open source. Below is the popularity ranking for these databases from DB-engines.com: Neo4j is first with a score of 36.27, followed by OrientDB (5.87) and Titan (5.08).

    Rank (Feb 2017 / Jan 2017 / Feb 2016) | DBMS | Database Model | Score (Feb 2017) | vs. Jan 2017 | vs. Feb 2016
    1. / 1. / 1. | Neo4j | Graph DBMS | 36.27 | +0.00 | +3.98
    2. / 2. / 2. | OrientDB | Multi-model | 5.87 | +0.06 | -0.55
    3. / 3. / 3. | Titan | Graph DBMS | 5.08 | -0.42 | -0.27

    Tips for converting from an RDBMS to a graph (from Neo4j)12 – a minimal mapping sketch follows the list:

    • Each entity table is represented by a label on nodes
    • Each row in an entity table is a node
    • Columns on those tables become node properties.
    • Remove technical primary keys, keep business primary keys
    • Add unique constraints for business primary keys, add indexes for frequent lookup attributes
    • Replace foreign keys with relationships to the other table, remove them afterwards
    • Remove data with default values, no need to store those
    • Data in tables that is denormalized and duplicated might have to be pulled out into separate nodes to get a cleaner model.
    • Indexed column names might indicate an array property (like email1, email2, email3)
    • Join tables are transformed into relationships, columns on those tables become relationship properties
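
    Here is a minimal sketch of the row-to-node and join-table-to-relationship mapping described in the list above, using plain Python dictionaries and made-up table data. It illustrates the mapping rules only and is not Neo4j’s import tooling.

```python
# Hypothetical relational data: two entity tables and one join table.
customers = [
    {"customer_id": 1, "name": "Ann", "status": None},   # None/default values get dropped
    {"customer_id": 2, "name": "Bob", "status": "gold"},
]
orders = [{"order_id": 10, "total": 42.50}]
customer_orders = [{"customer_id": 2, "order_id": 10, "placed_on": "2017-02-01"}]

def row_to_node(label, row):
    """Entity row -> node: the table becomes the label, columns become properties."""
    return {"label": label, "properties": {k: v for k, v in row.items() if v is not None}}

nodes = ([row_to_node("Customer", r) for r in customers] +
         [row_to_node("Order", r) for r in orders])

# Join-table rows -> relationships; leftover columns become relationship properties.
relationships = [
    {"from": ("Customer", r["customer_id"]),
     "type": "PLACED",
     "to": ("Order", r["order_id"]),
     "properties": {"placed_on": r["placed_on"]}}
    for r in customer_orders
]

print(nodes[0])          # {'label': 'Customer', 'properties': {'customer_id': 1, 'name': 'Ann'}}
print(relationships[0])  # {'from': ('Customer', 2), 'type': 'PLACED', 'to': ('Order', 10), ...}
```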

    Music:

    Music for today’s podcast is Cyanos by Graphiqs Groove via FreeMusicArchive.org.

    Sources:

    1. https://en.wikipedia.org/wiki/Graph_theory
    2. https://en.wikipedia.org/wiki/Graph_database
    3. https://blogs.cornell.edu/info2040/2011/09/20/pagerank-backbone-of-google/
    4. http://ilpubs.stanford.edu:8090/361/1/1998-8.pdf
    5. https://neo4j.com/why-graph-databases/
    6. https://en.wikipedia.org/wiki/Neo4j
    7. https://academy.datastax.com/resources/getting-started-graph-databases
    8. http://www.predictiveanalyticstoday.com/top-graph-databases/
    9. http://db-engines.com/en/ranking/graph+dbms
    10. https://cambridge-intelligence.com/keylines/graph-databases-data-visualization/
    11. http://bitnine.net/rdbms-vs-graph-db/?ckattempt=2
    12. https://neo4j.com/developer/graph-db-vs-rdbms/
    13. https://neo4j.com/blog/icij-neo4j-unravel-panama-papers/
    14. https://neo4j.com/blog/analyzing-panama-papers-neo4j/
    Where did graphs come from? (Graph Theory History) In its simplest form, Graph Theory defines a graph as a construct made up of vertices, nodes, or points which are connected by edges, arcs, or lines.1 The connections may be directed, indicating a direction from one node to another, or undirected. For the Love of Data full false 20:03
    012 -For the Love of Guns https://www.fortheloveofdata.com/012-for-the-love-of-guns-for-the-love-of-data/?utm_source=rss&utm_medium=rss&utm_campaign=012-for-the-love-of-guns-for-the-love-of-data Fri, 27 Jan 2017 10:15:09 +0000 http://www.fortheloveofdata.com/?p=189 https://www.fortheloveofdata.com/012-for-the-love-of-guns-for-the-love-of-data/#respond https://www.fortheloveofdata.com/012-for-the-love-of-guns-for-the-love-of-data/feed/ 0 Sheer Quantity 3% of gun owners own almost 50% of all civilian guns. These 7.7mm “super owners”  own between 8-140 guns (on average 17)15 In 2013, U.S. gun manufacturers built 10,844,792 guns, and we imported an additional 5,539,539; the number dropped slightly in 2014.16 There are over 300 million guns owned by civilians (legal and […] Sheer Quantity
    • 3% of gun owners own almost 50% of all civilian guns. These 7.7 million “super owners” own between 8 and 140 guns (17 on average)15
    • In 2013, U.S. gun manufacturers built 10,844,792 guns, and we imported an additional 5,539,539; the number dropped slightly in 2014.16
    • There are over 300 million guns owned by civilians (legal and illegal)11
    • The government holds approximately 2.7 million guns
    NPR.org stacked bar chart showing firearms by type and year

    History

    National Firearms Act of 1934 (NFA)4

    In 1934 Congress passed a law taxing the makers and distributors of firearms as a way to curtail the usage of weapons commonly used in gang activity at the time. It also required firearms to be registered with the Secretary of the Treasury and compelled holders of unregistered firearms to register them, exposing them to prosecution for having possessed an unregistered firearm. In 1968 this registration provision was ruled to violate the 5th Amendment’s protection against self-incrimination, and at that point the NFA was unenforceable.

    Gun Control Act of 1968 (GCA)1

    The assassination of JFK prompted this law because Oswald’s weapon was purchased from a mail-order catalog. The NRA supported this measure, and its passage in October 1968 came after recent assassinations of MLK and Robert Kennedy. The bill banned mail-order sales and prevented felons, drug users and mentally ill citizens from owning guns. The bill required firearms sellers to be licensed and prevented various interstate transactions unless they took place under a federally licensed dealer.

    The bill established that persons over 18 could purchase rifles and shotguns, while purchasers of handguns had to be over 21. People would have to fill out Form 4473, the Firearms Transaction Record, when purchasing a gun from a dealer to certify that they do not fall into any of the prohibited categories6. The bill also required that all guns made or imported into the US bear a serial number, and removal of that identifier became a felony offense. Furthermore, this bill closed the loophole in the NFA by preventing the registration of a firearm from being used as evidence in any crime occurring before the time of registration.

    President Johnson, who asked for provisions of the bill, wanted it to also license individuals and said it fell short of protecting Americans at a time when 160 million guns existed in the US2. Johnson stated that the gun lobby defeated this measure. In 1993, the Brady Bill enhanced this by requiring more stringent background checks before selling a gun to a purchaser.

    Firearm Owners’ Protection Act of 1986 (FOPA)3

    In 1982 a Senate subcommittee report found that 75% of ATF prosecutions regarding firearms targeted ordinary, law-abiding citizens on technicalities or entrapment. This report and lobbying prompted the passage of FOPA in 1986. The law loosened restrictions on interstate gun sales and mailed ammunition, banned machine guns made after the bill passed from being sold to the general public, and limited ATF inspections to once a year, generally.

    Registry Prohibition

    • A key part of FOPA was the restriction that the government cannot require firearms, their owners, or transactions involving firearms to be reported to any government entity.
    • The ATF is barred from consolidating or centralizing dealer records
      • The bureau consolidated 252 million records of active shop owners from 2000-2016, but had to delete them after the GAO found they did not comply with FOPA
    • According to Pew Research Center, most Americans favor a federal database to track gun sales (70% overall, 55% Republicans)13

    Partisan Views of Gun Proposals

    Traces:

    • 1,500 / day or 373,000 / year in 2015 by 50 Bureau of Alcohol, Tobacco, and Firearms (ATF) agents
    • Urgent traces done in 24 hours; average trace takes 5 business days
    • These records are stored in 15,000 boxes
    • Required to be “unsearchable” – no keyword searches, sorting by date or anything else.
    • Some records are on toilet paper or napkins (a snub by shop owners who dislike the reporting requirement)
    • As of 2013, 70% of traces ID the buyer of a gun14
    • 285 million records from closed up shops saved in 25 “data systems”7 known as the Firearms Tracing System (FTS)
      • All bullets below are from https://en.wikipedia.org/wiki/Firearm_Owners_Protection_Act
      • Multiple Sale Reports. Over 460,000 (2003) Multiple Sales reports (ATF F 3310.4 – a registration record with specific firearms and owner name and address – increasing by about 140,000 per year). Reported as 4.2 million records in 2010.
      • Suspect Guns. All guns suspected of being used for criminal purposes but not recovered by law enforcement. This database includes (ATF’s own examples) individuals purchasing large quantities of firearms, and dealers with improper record keeping. May include guns observed by law enforcement in an estate, or at a gun show, or elsewhere. Reported as 34,807 in 2010.
      • Traced Guns. Over 4 million detail records from all traces since inception. This is a registration record which includes the personal information of the first retail purchaser, along with the identity of the selling dealer.
      • Out of Business Records. Data is manually collected from paper Out-of-Business records (or input from computer records) and entered into the trace system by ATF. These are registration records which include name and address, make, model, serial and caliber of the firearm(s), as well as data from the 4473 form – in digital or image format. In March, 2010, ATF reported receiving several hundred million records since 1968.
      • Theft Guns. Firearms reported as stolen to ATF. Contained 330,000 records in 2010. Contains only thefts from licensed dealers and interstate carriers (optional). Does not have an interface to the FBI’s National Crime Information Center (NCIC) theft data base, where the majority of stolen, lost and missing firearms are reported. See eTrace below.

    Hawaii & the “Rap Back” FBI database12

    • In 2016, Hawaii became the first state to require gun owners’ names to be submitted to the FBI “Rap Back” database. This allows the state to be notified if a gun owner from Hawaii is arrested for a crime anywhere in the US.
    • Visitors to Hawaii packing heat must register and be placed on the list, but they may request to be removed from the database after departure.

     

    eTrace

    • When a gun is recovered at a crime scene, it can be, and usually is, run through a firearms trace with the ATF. This is done through a system called eTrace.
    • eTrace is a digital system that tracks submissions and trace results
    • It is more dynamic and usable because once a gun is in this system, it may be searched by owner name, serial number, etc.
    • However, non-crime scene guns are not in this system

    ATF 2014 Firearms Trace Data10

    The top 10 states with the most recoveries and traces are:

    1. California
    2. Florida
    3. Texas
    4. Illinois
    5. Georgia
    6. North Carolina
    7. Ohio
    8. Pennsylvania
    9. Maryland
    10. New York

    In 2014, the number of firearms recovered and traced = 246,087

    Top Categories of Recovered Firearms

    • Pistol – 131,562
    • Revolver – 43,799
    • Rifle – 38,854
    • Shotgun – 29,970
    • Derringer – 2,197
    • Receiver/Frame – 1,301
    • Machinegun – 717
    • The national possessor age is 36 years old.
    • In 2014, pistols and revolvers accounted for the majority of traced firearms.
    • National time-to-crime average is 10.88 years.

     

    Sources:

    1. https://en.wikipedia.org/wiki/Gun_Control_Act_of_1968
    2. http://www.presidency.ucsb.edu/ws/?pid=29197
    3. https://en.wikipedia.org/wiki/Firearm_Owners_Protection_Act
    4. https://www.atf.gov/rules-and-regulations/national-firearms-act
    5. https://en.wikipedia.org/wiki/Firearm_Owners_Protection_Act
    6. http://www.gq.com/story/inside-federal-bureau-of-way-too-many-guns
    7. https://www.thetrace.org/2016/08/atf-ridiculous-non-searchable-databases-explained/
    8. https://fivethirtyeight.com/features/gun-deaths/
    9. https://www.atf.gov/resource-center/fact-sheet/fact-sheet-national-tracing-center
    10. https://www.atf.gov/resource-center/atf-2014-firearms-trace-data
    11. http://www.npr.org/2016/01/05/462017461/guns-in-america-by-the-numbers
    12. https://news.vice.com/article/hawaii-track-gun-owners-fbi-rap-back-crime-database
    13. http://www.pewresearch.org/fact-tank/2016/01/05/5-facts-about-guns-in-the-united-states/
    14. https://www.reference.com/government-politics/track-owner-gun-its-serial-number-1027e7316a578d2c#
    15. http://www.motherjones.com/politics/2016/09/gun-ownership-america-super-owners
    16. https://www.atf.gov/resource-center/docs/2016-firearms-commerce-united-states/download

    Music

    Gunslinger by The Long Ryders via FreeMusicArchive.org

    Sheer Quantity 3% of gun owners own almost 50% of all civilian guns. These 7.7mm “super owners” own between 8-140 guns (on average 17)15 In 2013, U.S. gun manufacturers built 10,844,792 guns, and we imported an additional 5,539,539; the number dropped slightly in 2014.16 For the Love of Data full false 28:00
    011 – Top 10 Data Predictions for 2017 https://www.fortheloveofdata.com/011-top-10-data-predictions-for-2017-for-the-love-of-data/?utm_source=rss&utm_medium=rss&utm_campaign=011-top-10-data-predictions-for-2017-for-the-love-of-data Fri, 30 Dec 2016 22:12:37 +0000 http://www.fortheloveofdata.com/?p=184 https://www.fortheloveofdata.com/011-top-10-data-predictions-for-2017-for-the-love-of-data/#respond https://www.fortheloveofdata.com/011-top-10-data-predictions-for-2017-for-the-love-of-data/feed/ 0 Happy New Year! Thank you to all listeners and subscribers for your support this past year. 10 – Data borders will break down – logical data lakes and logical data warehouses will grow as companies embrace data virtualization products like Denodo. Data preparation tools, like the new Project Maestro from Tableau, will allow people to […] Happy New Year!
    Thank you to all listeners and subscribers for your support this past year.

    10 – Data borders will break down – logical data lakes and logical data warehouses will grow as companies embrace data virtualization products like Denodo. Data preparation tools, like the new Project Maestro from Tableau, will allow people to seamlessly pull from a) on-premises databases and Excel files; b) cloud repositories like Redshift and BigTable; and c) hosted products like Workday and Salesforce.

    9 – Data Quality and “Refined” data sets will become more important – with the uptick in big data, sensor data, and data lakes, users will have a glut of information at their disposal (some have this already). Automated solutions that assess data quality, as well as specially created intermediate data sets, will become more and more important6. In many data lake architectures and Hadoop-based ecosystems, curated or moderately processed datasets are becoming the norm for widespread usage by the enterprise. Data scientists and power users will continue to harness raw data sets for their explorations, but these refined data sets will be used to reduce heavy lifting and “reinventing the wheel” for many analysts.

    8 – Collaborative BI and analytics will become more mainstream – Sites like data.world and collaborative features in products such as Tableau will be embraced by more users than ever before in 2017. Taking cues from social media, these tools and techniques will produce more living datasets and visualizations with near real-time data as static reporting continues to decline as a percentage of overall reporting. Users will interact with each other and gain economies of scale by not reinventing the wheel when someone else has already done the heavy lifting.

    7 – Internet of Things (IoT) will continue to expand – Currently, most firms use an age- or time-based approach to maintain and replace equipment. Up to 50% of spend using this approach may be wasted, according to ARC Advisory Group3. This study also found that 82% of failures occur randomly. New sensors will be deployed and real-time data will continue to swing upward across many industries. Businesses will be able to use this data to respond to events like power outages as they occur and use predictive analytics and historical information for preventative maintenance. Using this data will allow companies to move from time-based or cyclical check schedules to event-based ones that can detect even small changes in performance that may spell trouble.
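
    The shift from scheduled checks to event-based ones comes down to watching the sensor stream for deviations from recent behavior. Below is a toy sketch of that idea in Python with made-up readings and thresholds; it stands in for the far more sophisticated predictive models vendors actually ship.

```python
from statistics import mean, stdev

def detect_events(readings, window=10, z_threshold=3.0):
    """Flag readings that deviate sharply from the trailing window's average."""
    alerts = []
    for i in range(window, len(readings)):
        recent = readings[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        if sigma and abs(readings[i] - mu) / sigma > z_threshold:
            alerts.append((i, readings[i]))
    return alerts

# Simulated vibration readings from a pump; the spike near the end is the "event".
readings = [0.50, 0.52, 0.49, 0.51, 0.50, 0.53, 0.48, 0.52, 0.51, 0.50,
            0.52, 0.49, 0.51, 0.95, 0.50]
print(detect_events(readings))   # -> [(13, 0.95)]
```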

    6 – Converged Intelligence will improve our lives – the trend for companies to share datasets and provide APIs to their services will enable more collaborative experiences to help customers and differentiate companies from their competitors. Services like IFTTT (If This Then That) will offer more and more connections, largely driven by community contributions. Partnerships like SolarCity, Nest, and the Tesla Powerwall will share data to produce synergies that can save money and reduce energy dependence. People will leverage Internet of Things (IoT) devices [see #7 above] and home automation like SmartThings to make us more comfortable. Whether it is automatically adjusting your lights, TV, and devices when you want to watch a movie or automatically adjusting your thermostat when you leave and arm your alarm, connected living will grow.

    A word of caution: data sharing may be open and driven by users opting-in, but in some instances it will be hidden and used to exploit customers without their knowledge.

    5 – Data breaches will continue – Stakes are getting higher as hackers attempt to sway political campaigns, ransomware is on the rise, and data breaches are increasing. As data becomes more open and shareable, attack vectors are much greater and opportunities are higher. Enterprises need to make sure they are vetting cloud and hosted solutions properly to make sure they are secure, but they also need to realize that cloud providers may be able to provide economies of scale and make data safer than individual organizations can on their own.

    4 – So…Security will have to get more proactive – As hackers start to use IoT devices and continue DDoS attacks, companies need to work together to defend against threats. Tools like Watson for Cyber Security will usher in this new era. We will move from predictive analytics into cognitive analytics to discover threats, identify all exposed assets, and then perform a second-order threat analysis to see what other services may suffer or what may be targeted next. These tasks can be performed by machine clusters faster and more completely than by an army of analysts.

    3 – You’ll continue to hear about blockchain initiatives, but it will be mostly hype in 2017 – According to Gartner, Blockchain is nearing the peak of the hype cycle4. However, I think other items close to the peak, like home automation and IoT will see more adoption than blockchain. IMHO, these others can be adopted on a smaller scale and are more readily available to the general public than blockchain related deployments. Many people are forecasting that blockchain related tech won’t hit mainstream for another 5-10 years5. Nevertheless, the concept and some early uses of it are pretty interesting, such as Smart Contracts. Also, friendly FYI, something that uses a blockchain is not automatically anonymous, as in the case of bitcoin.

    2 – The line between Data Scientist and analyst/programmer will blur even more – analysts and programmers will take special courses here and there to beef up their statistics and data science chops. I think the demand for data scientists will bifurcate in 2017: a subset of organizations will spring for data scientists and the high salaries they command; however, the majority of firms will push for their analysts or tools to do low level data science work. Tools like Tableau and R Studio are making it easier for analysts to dabble in statistical and predictive analytics. Firms, such as New Knowledge, are offering “Data Scientist as a Service”, and tons of online courses, e-books, and knowledge bases have sprung up to spread data science fundamentals to the masses.

    1 – BYOT, Bring Your Own Tool, will continue to gain momentum – Enterprises can no longer place all their eggs in one basket when it comes to a BI or reporting tool. Tools such as Tableau have proven their ability to uproot entrenched stalwarts like IBM Cognos, and traditional BI tools appear stale and financially infeasible compared to a plethora of specialized, cheaper alternatives. Traditional BI tools will still have their place in firms that have enterprise-level agreements and are slow to change, but as more and more users demand features that these tools can’t support, or go out and acquire alternatives through “shadow procurement”, the traditional tools and the expertise around them will erode. It is now more important than ever for IT organizations to focus on architectures that make a wide array of data available to the entire organization regardless of device or access tool of choice. Good governance policies and data czars need to focus on data quality, establishment and maintenance of metadata, and publishing best practices around the types of tools and reports/visualizations that are best for specific scenarios. Firms need to evaluate the benefits of having multiple tools and the flexibility and productivity it gives their employees vs. the supportability and procurement benefits of working with a smaller number of providers.

    Music: Auld Lang Syne by Fresh Nelly, from Free Music Archive.

    Sources:

    1. http://www.tableau.com/resource/top-10-bi-trends-2017
    2. https://electrek.co/2016/02/25/solarcity-tesla-powerwall-nest-hawaii/
    3. https://www.ibm.com/blogs/internet-of-things/as-much-as-half-of-every-dollar-you-spend-on-preventive-maintenance-is-wasted/
    4. http://www.gartner.com/newsroom/id/3412017
    5. https://www.ft.com/content/3bea303c-7a7e-11e6-b837-eb4b4333ee43
    6. http://www.eweek.com/database/slideshows/10-predictions-for-the-data-analytics-market-for-2017.html
    Happy New Year! Thank you to all listeners and subscribers for your support this past year. 10 – Data borders will break down – logical data lakes and logical data warehouses will grow as companies embrace data virtualization products like Denodo. For the Love of Data full false 28:00
    010 For the Love of Thanksgiving – For the Love of Data https://www.fortheloveofdata.com/010-for-the-love-of-thanksgiving-for-the-love-of-data/?utm_source=rss&utm_medium=rss&utm_campaign=010-for-the-love-of-thanksgiving-for-the-love-of-data Fri, 25 Nov 2016 07:40:32 +0000 http://www.fortheloveofdata.com/?p=179 https://www.fortheloveofdata.com/010-for-the-love-of-thanksgiving-for-the-love-of-data/#respond https://www.fortheloveofdata.com/010-for-the-love-of-thanksgiving-for-the-love-of-data/feed/ 0 Holiday Weight Gain Studies Studies are very mixed on whether holiday eating causes weight gain. The Good1 In several studies over the last thirty-one years, subjects gained approximately 3/4 to 2 lbs. during the holiday season However, in one study participants felt they had gained 4x as much weight as they actually gained Two other […] Holiday Weight Gain Studies

    Studies are very mixed on whether holiday eating causes weight gain.

    The Good1

    • In several studies over the last thirty-one years, subjects gained approximately 3/4 to 2 lbs. during the holiday season
    • However, in one study participants felt they had gained 4x as much weight as they actually gained
    • Two other key findings:
      • Although the amount of weight gained over the holidays was small, it represented the majority of the weight gained for the year
      • Weight gained over the holidays typically is not lost the next year (it represents the annual amount of increase for many people).

    The Bad3

    • During “eating holidays”, like Thanksgiving and Christmas, participants consumed 14% more than on normal days
    • Some participants (outliers perhaps?) consumed over 900 calories more on special occasions than normal days
    • Obese individuals indulged at an even higher level during holidays

    Other

    • Children tend to gain more weight over the summer when school is out than during the holidays2

    Theme music for this month’s episode is “Turkey Time” by Monk Turner4.

    Sources:

    1. http://letstalknutrition.com/holiday-weight-gain-separating-fact-from-fiction/
    2. http://www.hit107.com/news/feed/2016/11/study-reveals-the-time-of-year-child-obesity-rates-rise-the-most/
    3. http://acsh.org/news/2015/11/24/does-holiday-feasting-affect-obesity-rates
    4. http://freemusicarchive.org/music/Monk_Turner/Calendar/Monk_Turner_-_Calendar_-_11_Turkey_Time
    Holiday Weight Gain Studies Studies are very mixed on whether holiday eating causes weight gain. For the Love of Data full false 5:52
    009 For the Love of Algorithms – For the Love of Data https://www.fortheloveofdata.com/009-for-the-love-of-algorithms-for-the-love-of-data/?utm_source=rss&utm_medium=rss&utm_campaign=009-for-the-love-of-algorithms-for-the-love-of-data Mon, 31 Oct 2016 03:30:11 +0000 http://www.fortheloveofdata.com/?p=169 https://www.fortheloveofdata.com/009-for-the-love-of-algorithms-for-the-love-of-data/#respond https://www.fortheloveofdata.com/009-for-the-love-of-algorithms-for-the-love-of-data/feed/ 0 Worst Pun Ever: Today, we are talking about Al… Al Gore… Al Gore Rhythms… Algorithms! Definition: a step-by-step procedure for solving a problem or accomplishing some end especially by a computer1 Inputs: Many algorithms use census data or FICO score as one of their prime inputs Plus any custom information you give a website Plus […] Worst Pun Ever: Today, we are talking about Al… Al Gore… Al Gore Rhythms… Algorithms!

    Definition: a step-by-step procedure for solving a problem or accomplishing some end especially by a computer1

    Inputs:

    • Many algorithms use census data or FICO score as one of their prime inputs
    • Plus any custom information you give a website
    • Plus any information they glean about you from other sites (when you visit a site with a Facebook Share button, Facebook can track that you’re there17)
    • Websites are constantly looking at ways to break our anonymity (fingerprinting) so they can track us and serve us more relevant or lucrative ads5.

    Fun Stuff

    • Chess – algorithms are so good that humans haven’t been able to beat a 4-CPU PC since about 20056
    • Rubik’s Cube – the machine record is 0.887 seconds vs. just over 5 seconds for a human7

    • Poker – scientists solved all moves for Heads Up Limit Hold ‘Em – 3.16 x 10^17 moves. You may be able to win individual games, but it is HIGHLY unlikely that you can win over time8

    Machine Learning

    • Can include intentional or unintentional bias.
    • @JonathonMorgan did a post on Medium and a podcast on Partially Derivative about using a machine learning model to find alt-right white supremacists on Twitter and track their degree of radicalization over time. He did this by training a model with their tweets and analyzing their usage of words like “Jewish” vs. more mainstream usage3,4 (a heavily simplified sketch of this kind of text classifier appears below).
    – From Medium / Jonathon Morgan’s Post
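
    For context, the general shape of that kind of model is: train a text classifier on labeled examples, then score new text with it. Below is a heavily simplified sketch using scikit-learn with invented placeholder snippets and labels; it is not Morgan’s actual dataset, features, or model.

```python
# Toy text-classification sketch: score how closely new posts resemble a labeled group.
# The training snippets and labels are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "post full of the group's slang and talking points",
    "another post echoing the same rhetoric and phrases",
    "ordinary post about sports and weekend plans",
    "mainstream post about cooking dinner tonight",
]
train_labels = [1, 1, 0, 0]   # 1 = group of interest, 0 = mainstream baseline

vectorizer = CountVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

new_posts = ["post echoing the same talking points", "post about dinner and sports"]
for post, score in zip(new_posts, model.predict_proba(vectorizer.transform(new_posts))[:, 1]):
    print(f"{score:.2f}  {post}")
```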

    Pricing2:

    • Amazon lists its own results above competitors’, even when its prices are higher once shipping is included for non-Prime customers; however, it claims its algorithm is customer-centric
    • Princeton Review charges between $6,600 and $8,400 for its online course depending on zip code. It charged more in zip codes with higher incomes and in some with higher Asian populations.

    News/Search:

    • Link Analysis, how two entities relate to each other is used by Google’s PageRank, Facebook’s News Feed, and LinkedIn’s job/connection recommendations. It was developed in 1976 and first used by two other search indexes before Google began using it in 1998.13
    • However, algorithms cause sites to cater to information similar to your preexisting views, or for what they think you will find interesting, rather than presenting balanced, holistic content.14
      • Medium recommends articles based on how long it thinks you will read.
      • Some sites tailor related content, content types (video, etc.), and sharing buttons based on where you enter their site from.
      • These choices and filters can lead you into a content bubble that leads you down a path of more and more specific, and sometimes extreme, viewpoints.
    • Facebook uses hundreds of features, or input variables, when assigning a relevancy score to posts you see in your news feed.15 (A toy weighted-scoring sketch appears after this list.)
    • When you have a Facebook account and you visit a page that has a like or share button, Facebook can log your visit and use that to tailor content or ads when you visit their site.16 See here17 for a relatively up-to-date list of features used in Facebook’s newsfeed algorithm (time spent viewing, friend’s posts receive priority, likes/reactions, etc. are all key inputs).
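
    As a toy illustration of how a feed algorithm can fold many input features into a single relevancy score, here is a small weighted-scoring sketch in Python. The features and weights are invented for illustration only and are not Facebook’s actual inputs or model.

```python
# Toy relevancy scoring: combine a handful of made-up features with made-up weights.
WEIGHTS = {
    "is_friend_post": 3.0,
    "likes": 0.5,
    "comments": 0.8,
    "expected_view_seconds": 0.1,
    "hours_old": -0.4,        # older posts decay
}

def relevancy(post):
    return sum(weight * post.get(feature, 0) for feature, weight in WEIGHTS.items())

posts = [
    {"id": "p1", "is_friend_post": 1, "likes": 12, "comments": 3,
     "expected_view_seconds": 20, "hours_old": 2},
    {"id": "p2", "is_friend_post": 0, "likes": 40, "comments": 1,
     "expected_view_seconds": 5, "hours_old": 10},
]

for post in sorted(posts, key=relevancy, reverse=True):
    print(post["id"], round(relevancy(post), 2))   # p2 17.3, then p1 12.6
```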

    Serious Consequences

    • Some algorithms for car insurance weight FICO credit scores more heavily than drunk driving convictions.9,10,11
    • Cathy O’Neil calls them “Weapons of Math Destruction” (WMDs) if they are: widespread, secretive, and have the potential to do great harm9,10
    • Kronos, a small big data HR company hired by large firms to screen applicants, employs a personality test as part of its candidate screening. Some argue that this unfairly excludes applicants from jobs, with no explanation of the reason, in a manner that violates the Americans with Disabilities Act (ADA).9,10,12

    Theme Music: Algorithm of Desire by Measles Mumps Rubella, courtesy of FreeMusicArchive.

    Sources:

    1. http://www.merriam-webster.com/dictionary/algorithm
    2. https://www.propublica.org/article/breaking-the-black-box-when-algorithms-decide-what-you-pay
    3. http://partiallyderivative.com/podcast/2016/09/27/s2e14-the-model-is-racist
    4. https://medium.com/@jonathonmorgan/the-radical-right-and-the-threat-of-violence-f66288ac8c4#.kssqef9jz
    5. http://fivethirtyeight.com/features/internet-tracking-has-moved-beyond-cookies/
    6. http://www.extremetech.com/extreme/196554-a-new-computer-chess-champion-is-crowned-and-the-continued-demise-of-human-grandmasters
    7. http://gizmodo.com/in-just-0-887-seconds-another-machine-has-already-shatt-1758009774
    8. http://bigthink.com/ideafeed/computer-scientists-create-unbeatable-poker-playing-computer
    9. http://fivethirtyeight.com/features/whos-accountable-when-an-algorithm-makes-a-bad-decision/
    10. https://weaponsofmathdestructionbook.com/
    11. https://www.wired.com/2016/10/big-data-algorithms-manipulating-us/
    12. https://www.theguardian.com/science/2016/sep/01/how-algorithms-rule-our-working-lives
    13. https://medium.com/@_marcos_otero/the-real-10-algorithms-that-dominate-our-world-e95fa9f16c04#.8mczwtxzt
    14. http://www.cjr.org/news_literacy/algorithms_filter_bubble.php
    15. http://www.slate.com/articles/technology/cover_story/2016/01/how_facebook_s_news_feed_algorithm_works.html
    16. https://www.technologyreview.com/s/541351/facebooks-like-buttons-will-soon-track-your-web-browsing-to-target-ads/
    17. https://blog.bufferapp.com/facebook-news-feed-algorithm
    Worst Pun Ever: Today, we are talking about Al… Al Gore… Al Gore Rhythms… Algorithms! Definition: a step-by-step procedure for solving a problem or accomplishing some end especially by a computer1 For the Love of Data full false 35:59
    008 For the Love of Politics – For the Love of Data https://www.fortheloveofdata.com/008-for-the-love-of-politics-for-the-love-of-data/?utm_source=rss&utm_medium=rss&utm_campaign=008-for-the-love-of-politics-for-the-love-of-data Wed, 28 Sep 2016 15:27:27 +0000 http://www.fortheloveofdata.com/?p=159 https://www.fortheloveofdata.com/008-for-the-love-of-politics-for-the-love-of-data/#respond https://www.fortheloveofdata.com/008-for-the-love-of-politics-for-the-love-of-data/feed/ 0 History of Data in Politics First off, 538’s podcast, What’s the Point did a great four part series on the data of politics that covered the history of politics from the late 1800s through the primaries. So please check out the above links for more context behind this. A brief history of data in politics […] History of Data in Politics

    First off, 538’s podcast What’s the Point did a great four-part series on the data of politics, covering the period from the late 1800s through the 2016 primaries. Please check out the links in the sources below for more context. A brief history of data in politics shows how the major ways candidates appealed to constituents progressed along this path:

    • Party Elite chose candidates.
    • Direct outreach – Candidates engage voters directly, including things like post-voting parties offering booze to those who voted for a particular candidate.
    • TV – When TV came along, suddenly candidates could reach the majority of voters just by running ads on three networks.
    • Direct Mail – Politicians could use subscriber lists from certain magazines to target specific groups that might be interested in their policies.
    • Micro Targeting – This started around 2004, when data analysis identified target demographics to go after, advertise to, and appeal to.
    • Individual Targeting – Howard Dean (2004) was a pioneer in this effort, coalescing state voter lists together and appending commercial data. This continued in 2008 with Barack Obama where they found that there was still a significant diversity in micro groups. The trend was refined in 2012 where campaigns used individual data to feed into, test, and refine their models.

    However, there are roots back to 1891 when James Clarkson, the RNC chairman, assembled a file that featured the “age, occupation, nativity, residence and all the other facts in each votersʼ life, and had them arranged alphabetically, so that literature could be sent constantly to every voter directly.”10

    The Obama campaigns in 2008 and 2012 hired enormous numbers of staffers: 342 in the 2012 race alone in technology, digital, data, and analytics.

    History of Voting Trends by State7

    All Elections since 1876 (the year Texas A&M was founded, whoop!)

    e008-maps-by-year

    Most Democratic (1932)

    e008-map-democrat

    Most Republican (1972)

    e008-map-republican

    Voter Turnout Rates

    In voter turnout data by country since 2000, the US ranks #159 out of #196 with just over 55% average voter turnout. We can and should do better.

    Rank | Average Voter Turnout (%) | # of Data Points | Min Year | Max Year | Country
    199.8320022011Lao People's Dem. Republic
    299.3420022016Viet Nam
    397.9320032013Rwanda
    496.5120042004Equatorial Guinea
    595.1320032013Cuba
    694.1520012013Australia
    794320032013Malta
    893.8420012015Singapore
    991.3320042013Luxembourg
    1091.2320022008Faroe Islands
    1191320022012Bahamas
    1290.3420032014Belgium
    1389.9420002015Tajikistan
    1489.8420002015Ethiopia
    1589.8520002016Nauru
    1689.7320042014Uruguay
    1787.4320042013Turkmenistan
    1887.2320042014Antigua and Barbuda
    1987.1320042014Uzbekistan
    2086.4520012015Denmark
    2185.1320052013Aruba
    2284.6420022014Bolivia
    2384.5420032013Iceland
    2484.4420012013Liechtenstein
    2584.1420022015Turkey
    2683.5520002016Peru
    2783.3220012007Timor-Leste
    2883.1420022014Sweden
    2982.4420042014Tunisia
    3081.9320062014Cook Islands
    3181.6320042014Guinea-Bissau
    3281.6320022011Seychelles
    3381.5420012016Cyprus
    3480.2420012013Italy
    3580.1220052013Cayman Islands
    3680120022002Tuvalu
    3779.4320022012Sierra Leone
    3879.2320042014Botswana
    3979.1420022013Austria
    4078.6420022014Brazil
    4178.6420002014Mauritius
    4278.4220042014Namibia
    4378.2320042013Malaysia
    4477.9420012013Chile
    4577.9520022012Netherlands
    4677.8220012016Samoa
    4777.7120062006Palestinian Territory, Occupied
    4877.6520022014New Zealand
    4977320032013Monaco
    5076.9120122012Papua New Guinea
    5176.9420012013Norway
    5276.7320042014Indonesia
    5376.6220112015Gibraltar
    5476.6320012014Fiji
    5576.4420012015Guyana
    5676320052014Maldives
    5775.8320042014South Africa
    5875.6420032015Belize
    5975.6320032013Cambodia
    6075.6720012015Argentina
    6175.5420002012Belarus
    6275.5520002016Mongolia
    6375.4520012015Andorra
    6475.2320032013Grenada
    6575.1220082012Angola
    6675120032003Yemen
    6774.8520012013Philippines
    6874.8420022013Germany
    6974.1220052011Liberia
    7074420002012Ghana
    7173.9320022010Sao Tome and Principe
    7273.8320042014Panama
    7373.8320032012Bermuda
    7473.6320012011Nicaragua
    7573.5220102015Myanmar
    7673.5420002015Anguilla
    7773.3520002015Sri Lanka
    7872.8320022013Togo
    7972.7320052015Burundi
    8072.5420012014Montserrat
    8171.9620002016Spain
    8271.4120152015Comoros
    8370.9420022013Ecuador
    8470.7320022013Kenya
    8570.5320012014Bangladesh
    8670.5620002015Greece
    8769.7420002015Saint Kitts and Nevis
    8869.6320062012Montenegro
    8969.6420032015Virgin Islands, British
    9069.5420012012San Marino
    9169.4520022016Kazakhstan
    9268.4620012014Thailand
    9368.2320022013Cameroon
    9467.9420022014Costa Rica
    9567.8220022013Guinea
    9667.5120072007Kiribati
    9767.5320052014Iraq
    9867.2420012015Saint Vincent and The Grenadines
    9966.9220052011Central African Republic
    10066.9620002015Trinidad and Tobago
    10166.8320072013Bhutan
    10266.5420032015Finland
    10366.4620012015Israel
    10466.3420012016Uganda
    10566.2420052014Tonga
    10666.1420022016Ireland
    10766.1420022014Hungary
    10865.9320012013Mauritania
    10965.9320032013Paraguay
    11065.6420002015Suriname
    11165.3420012014Solomon Islands
    11265.1320072015Oman
    11365520012016Taiwan
    11464.7220062011Congo, Democratic Republic of
    11564.4520022016Vanuatu
    11664.4520062013Kuwait
    11764.3320012011Zambia
    11864.2420022015Burkina Faso
    11963.9420032015Guatemala
    12063.5320022010Netherlands Antilles
    12163.4320032013Djibouti
    12263.3120082008Nepal
    12363.2420012015United Kingdom
    12463520022014Latvia
    12563520022014Macedonia, former Yugoslav Republic (1993-)
    12662.6620002015Canada
    12762.6420012016Cape Verde
    12862.6520002015Croatia
    12962.5520012014Moldova, Republic of
    13062.4520002015Kyrgyzstan
    13162.3520002014Slovenia
    13262420032015Estonia
    13361.8320032013Barbados
    13461.7520022014Ukraine
    13561.6420022014Bahrain
    13661.5620002014Japan
    13761.5220002003Yugoslavia, FR/Union of Serbia and Montenegro
    13861.4320042014Malawi
    13961.1420002015Tanzania, United Republic of
    14061.1420022013Czech Republic
    14160.9320042014India
    14260.5520022016Slovakia
    14360.1520022015Portugal
    14459.8320032011Russian Federation
    14559.3320032012Armenia
    14659.3220022013Madagascar
    14759.3420032012Georgia
    14859.2220102015Sudan
    14959.2420032015Benin
    15058.6320022012France
    15157.9420022016Dominican Republic
    15257.9320082016Iran, Islamic Republic of
    15357.8520072016Serbia
    15457.6420002014Dominica
    15557.3520012014Bulgaria
    15657.1420032016Syrian Arab Republic
    15757520002014Bosnia and Herzegovina
    15855.9420012013Honduras
    15955.7820002014United States
    16055.5420002015Venezuela
    16155.3420032013Jordan
    16255.1520002016Korea, Republic of
    16355.1420022016Jamaica
    16454.8320002012Palau
    16554.5220022011Chad
    16653.4320042016Niger
    16753.1420022015Lesotho
    16852.1620002015Mexico
    16951.9420012013Albania
    17051.7220122014Libya
    17151.4420002012Lithuania
    17251.2420002012Romania
    17350.8520002015Azerbaijan
    17448.9320012011Saint Lucia
    17548.6220072013Micronesia, Federated States of
    17648.5320002009Lebanon
    17748.1520012015Poland
    17848220072015Marshall Islands
    17947.8420032015Switzerland
    18047.6220052010Afghanistan
    18146.7320022013Pakistan
    18246.2320012012Senegal
    18345.6320002008Zimbabwe
    18445.3420042014Kosovo
    18544.7320022011Morocco
    18643.7520002015El Salvador
    18743.2320042014Mozambique
    18842.6420022014Colombia
    18941.6320022012Algeria
    19040.5320032015Nigeria
    19139.2320022012Gambia
    19236.5420052015Egypt
    19334.3120112011Gabon
    19434.1220002011Côte d'Ivoire
    19532.2420002015Haiti
    19631.8320022013Mali

    e008-voter-turnout-by-year

    e008-cps-age

    e008-cps-educ

    e008-cps-race

    e008-electorate-demo-race

    • Voters 30 and older vote at much higher rates than younger voters.
    • The more educated you are, the more likely you are to vote.
    • Voters are most commonly black or white; Hispanic turnout has consistently been the lowest since 1984.
    • The white share of the electorate has been declining, but is still an overwhelming 77% of the vote8.

    Sources:

    1. http://fivethirtyeight.com/features/a-history-of-data-in-american-politics-part-1-william-jennings-bryan-to-barack-obama/
    2. http://fivethirtyeight.com/features/a-history-of-data-in-american-politics-part-3-the-2016-primaries/
    3. http://www.fec.gov/pubrec/fe2012/federalelections2012.shtml
    4. http://www.electproject.org/home/voter-turnout/voter-turnout-data
    5. http://www.census.gov/library/visualizations/2016/comm/electorate-profiles/cb16-tps25_voting_texas.html
    6. http://www.presidency.ucsb.edu/elections.php
    7. http://www.fairvote.org/voter_turnout#voter_turnout_101
    8. http://www.idea.int/vt/viewdata.cfm
    9. http://www.electproject.org/home/voter-turnout/demographics
    10. http://41lscp16wiqd3klpnn1z0t2a.wpengine.netdna-cdn.com/wp-content/uploads/2015/05/Kreiss_NiemanFinal.pdf
    11. http://nationbuilder.com/voterfile
    007 For the Love of Olympics – For the Love of Data https://www.fortheloveofdata.com/007-for-the-love-of-olympics-for-the-love-of-data/?utm_source=rss&utm_medium=rss&utm_campaign=007-for-the-love-of-olympics-for-the-love-of-data Thu, 25 Aug 2016 07:30:09 +0000

    Fun Fact: The main riff in NBC’s Olympic theme is from Bugler’s Dream (1958) by Leo Arnaud.


    History

    • Most believe games started in 776 BC as part of a religious festival in Greece to honor Zeus; however, some evidence suggests it could have started as early as the 10th century BC
    • The stadion race was the first event, a 600 foot race. This may have been the only event for the first 13 Olympics
    • They occurred every four years for twelve centuries, until 396 AD; then there was a break in games until 1896

    How to Qualify for the Olympics

    • Individual:
      • For each gender, up to three people per country can attend if they meet the entry standard
      • For each gender, one person per country can attend if no one meets the standard
    • Team: Each country may send one team that meets the entry standard
    • Slightly more complicated criteria for relays and marathon – generally involving your finish in various qualifying events

    Fun Fact: The marathon was not added until 1896 in Athens and was standardized at 26.2 miles in the 1908 London games because that was the distance between Windsor Castle and White City Stadium.


    Cost of the Games

    Many people feel the Olympics are a terrible investment for the host country. Rio’s estimated cost was $3bn, but it is projected to be at least 50% over budget at approximately $4.6bn.

    Metric | Value
    BRL / month | 1,972
    BRL to USD | 0.31
    USD / month | 611.32
    USD / year | 7,335.84
    Estimated cost of games | 4,600,000,000
    Cost in # of yearly salaries | 627,058.39
    Population | 209,567,920
    Cost in yearly salaries as a share of population | 0.002992 (about 0.3%)
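
    If you want to play with these numbers yourself, here is a minimal Python sketch of the same arithmetic, using only the inputs shown in the table above (wage, exchange rate, cost estimate, and population):

```python
# Rough cost-of-the-games arithmetic using the inputs from the table above.
brl_per_month = 1_972                      # average Brazilian monthly wage (BRL)
brl_to_usd = 0.31                          # exchange rate used in the episode
cost_of_games_usd = 4_600_000_000          # projected Rio 2016 cost
population = 209_567_920                   # Brazil's population

usd_per_year = brl_per_month * brl_to_usd * 12        # ~7,335.84 USD per year
yearly_salaries = cost_of_games_usd / usd_per_year    # ~627,058 yearly salaries
share_of_population = yearly_salaries / population    # ~0.003, i.e. about 0.3%

print(f"{yearly_salaries:,.0f} yearly salaries, or "
      f"{share_of_population:.4%} of the population")
```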

    Sochi is the most expensive so far, but Summer games are typically more expensive than Winter.

    Cost line graph Olympic games

    Cost table for historic games


    Fun Fact: The first Winter games were in 1924 (Chamonix).


    Who are the Athletes?

    Country Rankings

    Country | Population | Athletes | Rank by # of Athletes
    United States | 324,118,787 | 563 | 1
    Brazil | 209,567,920 | 483 | 2
    Germany | 80,682,351 | 440 | 3
    Australia | 24,309,330 | 428 | 4
    France | 64,668,129 | 408 | 5
    United Kingdom (Great Britain) | 65,111,143 | 372 | 6
    China | 1,382,323,332 | 352 | 7
    Canada | 36,286,378 | 320 | 8
    Japan | 126,323,715 | 312 | 9
    Spain | 46,064,604 | 312 | 10

    Country | Population | Athletes | Rank by Population
    China | 1,382,323,332 | 352 | 1
    India | 1,326,801,576 | 123 | 2
    United States | 324,118,787 | 563 | 3
    Indonesia | 260,581,100 | 28 | 4
    Brazil | 209,567,920 | 483 | 5
    Pakistan | 192,826,502 | 7 | 6
    Nigeria | 186,987,563 | 77 | 7
    Bangladesh | 162,910,864 | 7 | 8
    Russian Federation | 143,439,832 | 283 | 9
    Mexico | 128,632,004 | 125 | 10

    Country | Population | Athletes | Rank Per Capita
    Republic of the Cook Islands* | 20,948 | 9 | 1
    Palau | 21,501 | 5 | 2
    Nauru | 10,263 | 2 | 3
    San Marino | 31,950 | 5 | 4
    British Virgin Islands* | 30,659 | 4 | 5
    Bermuda* | 61,662 | 8 | 6
    Saint Kitts and Nevis | 56,183 | 7 | 7
    Seychelles | 97,026 | 10 | 8
    Tuvalu | 9,943 | 1 | 9
    Antigua and Barbuda | 92,738 | 9 | 10

    Fun Fact: The flame started at the 1928 Amsterdam games.


    Gender & Age Breakdown

    Men vs. Women Pie Chart

    Gender by Age bar chart

    Oldest / Youngest by Avg. Age

    Who are the Oldest and Youngest of All Time?

    Category | Male | Female
    Oldest Competitor | Oscar Swahn (Sweden), Age 72, 1920, Shooting | Lorna Johnstone (UK), Age 70, 1972, Equestrian
    Oldest Gold Medalist | Oscar Swahn (Sweden), Age 64, 1912, Shooting | Lida "Eliza" Pollock (USA), Age 63, 1904, Team Archery (Bronze)
    Oldest Medalist | Oscar Swahn (Sweden), Age 72, 1920, Shooting (Silver) | Lida "Eliza" Pollock (USA), Age 63, 1904, Archery (Bronze)
    Youngest Gold Medalist | Klaus Zerta (Germany), Age 13, 1960, Rowing | Donna Elizabeth de Varona (USA), Age 13, 1960, Swimming - Team
    Youngest Medalist | Dimitrios Loundras (Greece), Age 10, 1896, Gymnastics - Team (Bronze) | Luigina Giavotti (Italy), Age 11, 1928, Gymnastics - Team (Silver)

    Fun Fact: Boxing and wrestling were added in 708 BC and 688 BC respectively.


    A Look at the Medals

    Summer Medal Values & Rewards

    • Gold: $600 (the gold medal is only about 1% actual gold; the rest is roughly 92.5% silver and 6.16% copper)
    • Silver: $325 (in the silver medal, the gold is replaced by more copper; the rest of the composition is the same as the gold medal)
    • Bronze: $3 (the bronze medal is about 97% copper, 2.5% zinc, and 0.5% tin)

    CNN bar graph - who pays for the gold

    Who are the Big Winners at Rio 2016?

    Total Medals

    • Italy and Canada had a strong showing in total medals, but fell off in gold medals
    • Top 10 controlled almost 60% of total medals
    Country | Total Medals | % of Total Medals | Running Total | Rank
    United States | 121 | 12.4% | 12.4% | 1
    China | 70 | 7.2% | 19.6% | 2
    United Kingdom (Great Britain) | 67 | 6.9% | 26.5% | 3
    Russian Federation | 56 | 5.7% | 32.2% | 4
    Germany | 42 | 4.3% | 36.6% | 5
    France | 42 | 4.3% | 40.9% | 5
    Japan | 41 | 4.2% | 45.1% | 6
    Australia | 29 | 3.0% | 48.0% | 7
    Italy | 28 | 2.9% | 50.9% | 8
    Canada | 22 | 2.3% | 53.2% | 9
    Korea, South | 21 | 2.2% | 55.3% | 10
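
    For the curious, the share and running-total columns can be reproduced with a few lines of Python. This is a sketch, not the original spreadsheet: the 974 grand total is implied by the table's own shares (121 / 0.1242 ≈ 974), and rank ties are ignored.

```python
# Rebuild the share and running-total columns from the raw medal counts.
medal_counts = [("United States", 121), ("China", 70),
                ("United Kingdom (Great Britain)", 67), ("Russian Federation", 56),
                ("Germany", 42), ("France", 42), ("Japan", 41), ("Australia", 29),
                ("Italy", 28), ("Canada", 22), ("Korea, South", 21)]
total_medals = 974   # implied by the table's own shares (121 / 0.1242 = ~974)

running = 0.0
for rank, (country, medals) in enumerate(medal_counts, start=1):
    share = medals / total_medals
    running += share
    print(f"{rank:2d}. {country:30s} {medals:3d}  {share:5.1%}  cumulative {running:5.1%}")
```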

    Fun Fact: If Texas were a country, it would rank 8th for # of medals in the 2016 Summer Olympics.



    Total Gold

    • Brazil and Argentina won many golds, but few others
    • Top 10 controlled 70% of total golds
    Country | Gold | % of Gold Medals | Running Total | Rank
    United States | 46 | 15.0% | 15.0% | 1
    United Kingdom (Great Britain) | 27 | 8.8% | 23.8% | 2
    China | 26 | 8.5% | 32.2% | 3
    Russian Federation | 19 | 6.2% | 38.4% | 4
    Germany | 17 | 5.5% | 44.0% | 5
    Japan | 12 | 3.9% | 47.9% | 6
    France | 10 | 3.3% | 51.1% | 7
    Korea, South | 9 | 2.9% | 54.1% | 8
    Netherlands | 8 | 2.6% | 56.7% | 9
    Australia | 8 | 2.6% | 59.3% | 9
    Hungary | 8 | 2.6% | 61.9% | 9
    Italy | 8 | 2.6% | 64.5% | 9
    Brazil | 7 | 2.3% | 66.8% | 10
    Spain | 7 | 2.3% | 69.1% | 10

    Percentage of Medal Type by Country

    • Six countries won nothing but Gold
    • Fiji and Argentina dominated in Golds as a % of total medals
    Country | Gold | Silver | Bronze | Total Medals
    Puerto Rico* | 100% | 0% | 0% | 1
    Singapore | 100% | 0% | 0% | 1
    Tajikistan | 100% | 0% | 0% | 1
    Kosovo | 100% | 0% | 0% | 1
    Jordan | 100% | 0% | 0% | 1
    Fiji | 100% | 0% | 0% | 1
    Argentina | 75% | 25% | 0% | 4
    Jamaica | 55% | 27% | 18% | 11
    Hungary | 53% | 20% | 27% | 15
    Croatia | 50% | 30% | 20% | 10
    Greece | 50% | 17% | 33% | 6
    Slovakia | 50% | 50% | 0% | 4
    Bahrain | 50% | 50% | 0% | 2
    Vietnam | 50% | 50% | 0% | 2
    Independent Olympic Athletes | 50% | 0% | 50% | 2
    Cote d'Ivoire | 50% | 0% | 50% | 2
    The Bahamas | 50% | 0% | 50% | 2

    Fun Fact: Swimming was added as an event in 1896 (freestyle); backstroke was added in 1904.



    Michael Phelps

    • Ranks 32nd among the 205 currently competing countries in total medals won
    • 28 total medals – 23 gold, 3 silver, 2 bronze
    • 13 individual medals puts him ahead of Leonidas of Rhodes – a sprinter from 152 BC
    • 50 miles swam per week in prep for 2008 Olympics; 12,000 calories consumed each day
    • If Katie Ledecky maintained her current medal pace, she’d be 39 before she tied Phelps
    • He hasn’t won bronze since 2004

     

    Popularity of Events

    Swimming, Track and Field, Gymnastics, and Soccer are the most popular sports for people to watch. 538 did an interesting comparison in the 2012 Olympics to come up with a medal multiplier based on number of events vs. number of viewers. The US, China, and Russia dominate on an adjusted medal count.

     

    See the chart below (again based on London 2012). Sailing, for instance, has a lot of events but not much viewership, so it gets a reduction. Soccer, however, has only a few events but a large number of viewers, so its multiplier is very high. A toy version of this multiplier appears after the chart.

    538 Medal Multiplier
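
    538's exact weighting isn't reproduced here, but the basic idea can be sketched as scaling each sport's medals by the ratio of its viewership share to its event share. The sports and numbers below are made-up placeholders, not 538's data:

```python
# Toy "adjusted medal count": weight each sport's medals by how much viewership
# it draws relative to how many events it offers. Placeholder figures only.
sports = {
    # sport: (number_of_events, share_of_viewership, medals_won)
    "soccer":   (2, 0.20, 1),
    "sailing":  (10, 0.02, 4),
    "swimming": (34, 0.30, 33),
}

total_events = sum(events for events, _, _ in sports.values())

adjusted_total = 0.0
for sport, (events, view_share, medals) in sports.items():
    event_share = events / total_events
    multiplier = view_share / event_share      # >1 for popular, event-light sports
    adjusted_total += medals * multiplier
    print(f"{sport:9s} multiplier = {multiplier:.2f}")

print(f"Adjusted medal count: {adjusted_total:.1f}")
```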

    Growth – Interest in the Olympics, number of events, number of competitors, and costs are all going up. On a per capita basis, it hasn’t been this hard to win a medal since 1896.

    E007-538-gold-per-capita

    Sources

    1. http://www.penn.museum/sites/olympics/olympicorigins.shtml
    2. http://www.npr.org/sections/thetorch/2016/08/11/487838010/what-team-usa-looks-like-a-by-the-numbers-look-at-america-s-olympic-athletes?utm_medium=RSS&utm_campaign=news
    3. https://github.com/flother/rio2016
    4. http://www.npr.org/sections/thetorch/2016/08/14/489832779/if-michael-phelps-were-a-country-where-would-his-gold-medal-tally-rank
    5. http://www.foxsports.com/olympics/gallery/28-incredible-facts-about-michael-phelps-28-olympic-medals-23-golds-count-how-many-081316
    6. http://www.worldometers.info/world-population/population-by-country/
    7. http://olympstats.com/
    8. http://www.topendsports.com/events/summer/oldest-youngest.htm
    9. http://fivethirtyeight.com/features/winning-an-olympic-gold-medal-hasnt-been-this-difficult-since-1896/
    10. http://fivethirtyeight.com/features/which-countries-medal-in-the-sports-that-people-care-about/
    11. http://fivethirtyeight.com/features/hosting-the-olympics-is-a-terrible-investment/
    12. https://arxiv.org/ftp/arxiv/papers/1607/1607.04484.pdf
    13. http://www.chron.com/olympics/article/Where-Texas-would-rank-in-Olympic-medal-count-if-9176024.php
    14. http://www.tradingeconomics.com/brazil/wages
    15. https://www.google.com/?ion=1&espv=2#q=brl%20to%20usd
    16. http://www.totalsportek.com/news/olympic-gold-medal-prize-money/
    17. http://edition.cnn.com/2016/08/19/sport/olympic-rewards-by-country/
    18. https://www.olympic.org/swimming-equipment-and-history
    19. https://en.wikipedia.org/wiki/Athletics_at_the_2016_Summer_Olympics_%E2%80%93_Qualification#Qualifying_standards
    006 For the Love of Cheesecake – For the Love of Data https://www.fortheloveofdata.com/006-for-the-love-of-cheesecake/?utm_source=rss&utm_medium=rss&utm_campaign=006-for-the-love-of-cheesecake Sat, 30 Jul 2016 03:32:07 +0000

    National Cheesecake Day!

    July 30, 2016 is National Cheesecake Day, a likely commercially driven holiday to which I, for one, am happy to fall victim. Adam’s PB Cup Fudge Ripple is one of my favorites and also one of the worst (go figure).

     

    History (#4)

    • Believed to have originated around 2,000 BC in Greece
    • Was served to athletes in first Olympic games as a source of energy
    • The original recipe, documented in 230 AD, was mashed cheese, honey, and flour heated into a mass
    • Around the 18th century, a more modern recipe emerged

     

    Facts

    • Sonya Thomas holds the record for eating 11 pounds of cheesecake in 9 min. (9/26/2004). (#5)
    • The largest cheesecake weighed 6,900 pounds and was made in Lowville, NY on 9/21/2013. The cake measured 2.292 m (7 ft 6.25 in) in diameter and stood 0.787 m (2 ft 7 in) tall. (#6)

    Cheesecake Factory

    The Cheesecake Factory began using IBM big data analytics back in 2013 to analyze consumption and ingredients across all of its locations (#1). The company also had $2.1 billion in revenue in 2015 (#3).

    Cheesecake Factory Nutrition Info (#2)

     

    Sources:

    1. https://www-03.ibm.com/press/us/en/pressrelease/40436.wss
    2. http://www.cheesecakefactorynutrition.com/restaurant-nutrition-chart.php
    3. http://www.statista.com/statistics/321517/revenue-of-the-cheesecake-factory/
    4. http://www.cheesecake.com/History-Of-Cheesecake.asp
    5. http://www.majorleagueeating.com/records.php
    6. http://www.guinnessworldrecords.com/world-records/largest-cheesecake
    005 For the Love of Fireworks – For the Love of Data https://www.fortheloveofdata.com/005-for-the-love-of-fireworks/?utm_source=rss&utm_medium=rss&utm_campaign=005-for-the-love-of-fireworks Tue, 28 Jun 2016 13:11:31 +0000

    News:
    • Big Data falls off the hype cycle: http://www.datasciencecentral.com/profiles/blogs/big-data-falls-off-the-hype-cycle
    • Tableau 10 in beta: http://www.tableau.com/about/blog/2016/4/10-reasons-join-tableau-10-beta-53165 and http://www.tableau.com/coming-soon
    Fireworks!
    NOTE: Overall, statistics are hard to reconcile across sites and even across different reports from the same groups.
     
    2015 Consumption Statistics:
    • Consumption: 260.7 Million lbs. (Consumer), 24.6 million lbs. (Display) (#5 APA)
      • *** The consumer weight of fireworks used is roughly equivalent to the weight of the entire population of Hawaii! ***
    • Revenue: $755 million (Consumer), $340 million (Display) (#5 APA). Focusing on the consumer spending:
      • This is over 100x more than the revenue Katy Perry would have generated with the 7 million U.S. sales of her song “Firework” on iTunes.
      • This is more than all the money we spent at In ‘N Out Burger in 2015.
      • If one person spent the same amount on Roman Candles, it would take them over 1,000 years to use the amount of fireworks we purchase in a year.
     E005_fireworks_table
     
    Are fireworks Dangerous?
    • 67% of fireworks injuries occur around July 4th
    E005_cpsc_injuriesbyage
    • Injuries by Age: This graph makes it seem like young adults are most commonly injured, but when you regroup the data into 20-year bands, the picture changes (a quick regrouping sketch in code follows this section):
      • 0-19 = 47%
      • 5-24 = 49%
      • 10-24 (smaller than 20yr band) = 32%
      • 25-44 = 34%
    • 12,000 fireworks injuries (CPSC) out of 31 million injuries = .04% of injuries are fireworks (#7 APA)
    • 11 Deaths (#9 CPSC)
    E005_apa_consumpinjuries – #8 APA
    • Usage is growing but injuries are falling, according to the American Pyrotechnics Association
      • Injuries per pound consumed are falling as consumption goes up, and injuries per capita are falling as the population grows; in absolute terms, injuries have stayed relatively constant
    E005_consumpinjuries_total
    • APA also contends that fireworks injuries are a small minority of total injuries to kids
    E005_apa_injury_piechart – #10 APA
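
    Here is the quick regrouping sketch promised above. The per-age-group shares are placeholders chosen so the bands reproduce the percentages quoted in the episode; they are not the actual CPSC figures.

```python
# Regroup per-age-group injury shares into wider bands to show how the choice of
# band changes the picture. The per-group values below are placeholders picked to
# reproduce the episode's band totals; they are not the actual CPSC figures.
injury_share_by_age_group = {
    (0, 4): 0.10, (5, 9): 0.12, (10, 14): 0.13, (15, 19): 0.12,
    (20, 24): 0.12, (25, 44): 0.34, (45, 64): 0.05, (65, 99): 0.02,
}

def band_share(lo, hi):
    """Sum the shares of every age group that falls entirely inside [lo, hi]."""
    return sum(share for (a, b), share in injury_share_by_age_group.items()
               if a >= lo and b <= hi)

for lo, hi in [(0, 19), (5, 24), (25, 44)]:
    print(f"Ages {lo}-{hi}: {band_share(lo, hi):.0%} of injuries")
```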
     
    Only three states (DE, MA, NJ) ban fireworks (#8 APA)
     
    Links:
    1. Consumer Products Safety Commission (CPSC) Fireworks Infographic –  http://www.cpsc.gov/PageFiles/150398/Fireworks-Infographic-2015-web.pdf?epslanguage=en
    2. National Fire Protection Agency –  http://www.nfpa.org/public-education/by-topic/outdoors-and-seasonal/fireworks/reports-and-statistics-about-fireworks
    3. Washington State Patrol –  http://www.wsp.wa.gov/fire/statistics.htm
    4. Statistics Brain (Various Sources) –  http://www.statisticbrain.com/firework-statistics/
    5. American Pyrotechnics Association – http://www.americanpyro.com/industry-facts-figures
    6. 2015 US Population –  http://www.usnews.com/opinion/blogs/robert-schlesinger/2014/12/31/us-population-2015-320-million-and-world-population-72-billion
    7. Fireworks injuries in perspective –  http://www.americanpyro.com/assets/docs/FactsandFigures/fireworks%20injuries%20perspecitive.2016.pdf
    8. Fireworks liberalization-  http://www.americanpyro.com/assets/docs/FactsandFigures/consumpvinjuriesliberalizationgraph%201980-2010.pdf
    9. CPSC 2014 Fireworks Report –  http://www.cpsc.gov/en/Media/Documents/Research–Statistics/Injury-Statistics/Fuel-Lighters-and-Fireworks/2014-Fireworks-Annual-Report/?utm_source=rss&utm_medium=rss&utm_campaign=Fuel%2c+Lighters+and+Fireworks+Injury+Statistics
    10. APA Injuries to Children –  http://www.americanpyro.com/assets/docs/FactsandFigures/injuries%20to%20children%20ages%205-18%202016.pdf
    11. US Income –  http://www.deptofnumbers.com/income/us/
    12. NFPA Fireworks Info Sheet – http://www.nfpa.org/~/media/files/research/fact-sheets/fireworksfactsheet.pdf?la=en
    13. APA Fireworks Injures vs. Consumption – http://www.americanpyro.com/assets/docs/FactsandFigures/fireworks%20related%20injuries%20rtable%201976%20-2015.pdf
    14. Katy Perry Firework Wikipedia – https://en.wikipedia.org/wiki/Firework_(song)
    15. In ‘N Out Burger Sales – http://nrn.com/top-100/2015-top-100-restaurant-chain-countdown#slide-43-field_images-136081
    004 The History of Hadoop – For the Love of Data https://www.fortheloveofdata.com/004-history-of-hadoop/?utm_source=rss&utm_medium=rss&utm_campaign=004-history-of-hadoop Wed, 25 May 2016 03:17:54 +0000

    Let me set the stage for you…

    It’s 2003: Chicago just won the Oscar for Best Picture and Grand Theft Auto: Vice City is the top selling video game. Apple iPods still have scroll wheels and iTunes just started selling music for the first time. From a tech standpoint, Windows XP is all the rage as the latest Windows OS, and folks with a lot of money to spend are buying PCs with a Pentium 4 3.0 GHz processor, 512MB of RAM (or maybe up to 2GB max), and an 80GB hard drive. Oracle just released version 10g and Microsoft proponents are still using SQL Server 2000. Internet Explorer 6 dominates the browser wars with about 85% market share, and two-thirds of the US still connect to the internet with a modem.

    (Stats from various Google searches, CNET desktop reviews, and http://www.internetworldstats.com/articles/art030.htm)

    In the years leading up to Hadoop’s inception, Doug Cutting, the first node in the Hadoop cluster, had been working on Lucene, a full text search library, and then began work on indexing web pages with University of Washington graduate student Mike Cafarella. The project was called Apache Nutch, and it was a sub-project of Lucene. They made good progress getting Nutch to work on a single machine, but they reached the processing limits of that one machine and began manually clustering four machines together. The duo started to spend the majority of their time figuring out a way to scale the infrastructure layer for better indexing. In October 2003, Google released their Google File System paper. This paper did not describe exactly what Google did to implement their solution, but it was an excellent blueprint for what Cutting and Cafarella wanted to do. They spent most of the next year (2004) working on their implementation and labeled it the Nutch Distributed File System (NDFS). In this implementation, they made a key decision to replicate each chunk of data on multiple nodes, typically three, for redundancy.
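
    As a rough illustration of that replication decision (not the actual NDFS/HDFS placement code), each chunk of a file can be assigned to several distinct nodes:

```python
# Toy illustration of the replication decision: every chunk of a file is kept on
# several distinct nodes (typically three), so losing one machine loses no data.
# This is a sketch, not the actual NDFS/HDFS placement logic.
import random

def place_chunks(file_size_mb, chunk_size_mb, nodes, replicas=3):
    """Return a mapping of chunk index -> the nodes holding a copy of that chunk."""
    n_chunks = -(-file_size_mb // chunk_size_mb)   # ceiling division
    return {i: random.sample(nodes, replicas) for i in range(n_chunks)}

cluster = [f"node-{i}" for i in range(1, 5)]       # the early four-machine cluster
placement = place_chunks(file_size_mb=256, chunk_size_mb=64, nodes=cluster)
for chunk, holders in placement.items():
    print(f"chunk {chunk}: {holders}")
```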

     After solving for infrastructure redundancy, the team set their sights on improving the computational side and taking advantage of the stable fabric of nodes. Google again provided a spark of inspiration with their MapReduce research paper. The approach provided parallelization, distribution, and fault tolerance; all of these work in conjunction to work through tasks quickly, regardless of hardware failures that might occur along the way.
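
    The canonical way to picture MapReduce is a word count: map each input split to (word, 1) pairs, group by word, and reduce each group by summing. A tiny, single-machine sketch:

```python
# Minimal word-count sketch of the MapReduce pattern: map each input split to
# (word, 1) pairs, group by word, then reduce each group by summing the counts.
from collections import defaultdict
from itertools import chain

def map_split(text):
    return [(word.lower(), 1) for word in text.split()]

def reduce_counts(pairs):
    grouped = defaultdict(int)
    for word, count in pairs:       # the "shuffle": group pairs by key
        grouped[word] += count      # the "reduce": sum each group
    return dict(grouped)

splits = ["the quick brown fox", "the lazy dog", "the fox jumps"]
mapped = chain.from_iterable(map_split(s) for s in splits)   # map phase
print(reduce_counts(mapped))
```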

    In 2006, Cutting went to work for Yahoo, and the storage and compute components of Lucene separated into a sub-project called Hadoop. The name originated from a toy yellow elephant that belonged to Cutting’s son. In April, Hadoop 0.1.0 was released, and it sorted almost 2TB of data in 48 hours. By April of 2007, Yahoo was running two Hadoop clusters of 1,000 machines, and other companies like Facebook and LinkedIn started to use the tool.

    By 2008, Hadoop hit critical mass along several fronts. Yahoo transitioned the search index that drove their website over to Hadoop and contributed Pig to the Apache Software Foundation. Facebook also contributed Hive, bringing SQL atop Hadoop. The product also spawned commercial legs when Cloudera was founded; Cutting joined their ranks the following year.

    In 2011, Hortonworks spun off from Yahoo, and the following year Yahoo’s Hadoop cluster reached 42,000 nodes. Also in 2012, Hadoop contributors began to replace MapReduce with YARN, an offshoot of MapReduce’s resource management and scheduling components. Late in the year, Apache Hadoop 1.0 became generally available. In 2013, Yahoo began running YARN in production and Hadoop 2.2 debuted.

    Fast forward to today and several vast ecosystems exist around Hadoop within different prepackaged distributions. The most popular of these are Cloudera, Hortonworks, and MapR. Below is a snapshot of Hortonworks’ and Cloudera’s packaged components:

    Hortonworks:
    hortonworks
    Cloudera:
     cloudera
    Sources:
    003 The Data of Taxes – For the Love of Data https://www.fortheloveofdata.com/003-the-data-of-taxes/?utm_source=rss&utm_medium=rss&utm_campaign=003-the-data-of-taxes Thu, 31 Mar 2016 22:06:14 +0000

    Huge thanks to @Deepak90Mittal for hanging out with me on this episode!

    News Prologue:

    1. Gartner’s Magic Quadrant for BI is out – overhauled methodology, Oracle is out; Tableau and Microsoft (PowerBI) reign supreme!
    2. SQL Server on Linux! = millions of geeks rejoice and it may spell the end of Windows in the data center.
    3. Excel is the most popular DataViz tool by a longshot, followed by Python, D3, and Tableau

    A) 2016 State Comparison:

    1. What you should drink Where – comparison of per gallon taxes on beer, wine and spirits converted to a per drink equivalency
      1. Beer – Missouri
      2. Wine – Louisiana
      3. Liquor – Missouri
      4. Best overall: Missouri, Wisconsin, California, Texas

    2. Tax Freedom Day (a.k.a., Working for the Man) –
      Interesting way to look at taxes – how long you have to work to cover federal, state, and local taxes for the year.
      E003-TaxFreedomDay
    3. Gas Prices – Texas is pretty low (#42) on gas taxes! PA is #1; NY is #3; CA is #5

    Gas tax rates 2016: http://taxfoundation.org/blog/state-gasoline-tax-rates-2016
    E003-GasTaxMap

    B) Federal Income Tax Stats

    A rough calculation of the rate at which individual tax returns are filed within the US (a quick sketch of the arithmetic follows the table):

    Start of Year: 1/1/2016
    Filing Date: 4/18/2016
    Days Elapsed: 108
    Total Est. Returns (using 2013 count): 138,313,155
    Total filed per day*: 1,280,677
    Total filed per hour*: 53,362
    Total filed per minute*: 889
    Total filed per second*: 15
    * All calculations rounded to the nearest whole number
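
    Here is the promised sketch of that arithmetic; the return count and filing window come from the figures above, and the rest is plain division and rounding:

```python
# Back-of-the-envelope filing-rate math using the figures above.
from datetime import date

total_returns = 138_313_155                                   # 2013 return count
days_elapsed = (date(2016, 4, 18) - date(2016, 1, 1)).days    # 108 days

per_day = total_returns / days_elapsed
per_hour = per_day / 24
per_minute = per_hour / 60
per_second = per_minute / 60

for label, value in [("day", per_day), ("hour", per_hour),
                     ("minute", per_minute), ("second", per_second)]:
    print(f"Returns filed per {label}: {round(value):,}")
```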

    Key Findings from the report (mostly using data from 2013):

    • In 2012, the top 50 percent of all taxpayers (69.2 million filers) paid 97.2 percent of all income taxes while the bottom 50 percent paid the remaining 2.8 percent.
    • The top 1 percent (1.3 million filers) paid a greater share of income taxes (37.8 percent) than the bottom 90 percent (124.5 million filers) combined (30.2 percent).
    • The top 1 percent of taxpayers paid a higher effective income tax rate than any other group, at 27.1 percent, which is over 8 times higher than taxpayers in the bottom 50 percent (3.3 percent).
    002 What Hot Models Look Like – For the Love of Data https://www.fortheloveofdata.com/002-what-hot-models-look-like/?utm_source=rss&utm_medium=rss&utm_campaign=002-what-hot-models-look-like Mon, 29 Feb 2016 07:04:07 +0000

    Summary:
    Hot models…data models that is. A survey of many of the most popular data modeling approaches in the news today. Third Normal Form, Anchor Modeling, Data Vault, Data Lakes, Data Swamps. What do they do well, what do they do badly, and which is the one true data model to rule them all? (Hint: it depends, as usual.)
    Third Normal Form (3NF) (a.k.a. Naomi Sims)
    History: E.F. Codd defined 3NF in 1971 while working at IBM.
    Basic Concept:
    “The Key, the Whole Key, and Nothing but the Key” -Bill Kent
    The gold standard for purist relational database design. A table is in 3NF if it has the following characteristics (a toy decomposition example follows the list):
    1. 1NF – a) Values in a particular field must be atomic and b) a single row cannot have repeating groups of attributes
    2. 2NF – in addition to being in 1NF, all non-key attributes of the table depend on the primary key
    3. 3NF – in addition to being in 2NF, there are no transitive functional dependencies (non-key attributes do not depend on other non-key attributes)
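
    As promised above, a toy illustration of the decomposition (all table and column names here are made up): the customer's city depends on the customer, not on the order, so 3NF splits the flat rows into an orders table and a customers table.

```python
# Hypothetical example: customer_city depends on customer_id, not on the order
# key, so 3NF splits the flat rows into an orders table and a customers table.
flat_orders = [
    {"order_id": 1, "customer_id": 10, "customer_city": "Dallas", "total": 40.0},
    {"order_id": 2, "customer_id": 10, "customer_city": "Dallas", "total": 15.5},
    {"order_id": 3, "customer_id": 11, "customer_city": "Plano",  "total": 22.0},
]

# After decomposition, facts about the customer live once, keyed by customer_id.
customers = {row["customer_id"]: {"customer_city": row["customer_city"]}
             for row in flat_orders}
orders = [{"order_id": r["order_id"], "customer_id": r["customer_id"],
           "total": r["total"]} for r in flat_orders]

print(customers)
print(orders)
```
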
    Pros:
    • A battle-tested, well-understood modeling approach that is extremely useful for transactional (OLTP) applications
    • Easy to insert, update, delete data because of referential integrity
    • Avoids redundancy, requiring less space and fewer points of contact for data changes
    • Many software tools exist to automatically create, reverse engineer, and analyze databases according to 3NF
    • Writing to a 3NF DB is very efficient
    Cons:
    • Reading from a DB in 3NF is not as efficient
    • Not as easily accessed by end-users because of the increased number of joins
    • More difficult to produce analytics (trends, period-to-date aggregations, etc.)
    • Many times even transactional systems are slightly de-normalized from 3NF for performance or audit-ability
    • Some people feel that 3NF is no longer as appropriate in an era of cheap storage, incredibly fast computing, and APIs
     
    3nf
    Source: ewebarchitecture.com
     
    Anchor Modeling (incorporates Sixth Normal Form [6NF]) (a.k.a. Gisele Bundchen)
    History: Created in 2004 in Sweden
    Basic Concepts: Mimics a temporal database (a toy sketch of the four constructs follows the list below)
    • anchors – entities or events
      • Example: A person
    • attributes – properties of anchors
      • Example: A person’s name; attributes can be historized, such as a favorite color that changes over time
    • ties – relationships between anchors
      • Example: Siblings
    • knots – shared properties, such as states or reference tables – combination of an anchor and a single attribute (no history)
      • Example: Gender – only male/female
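
    The toy sketch mentioned above; these class and field names are illustrative only, not part of any anchor-modeling tooling:

```python
# Toy, hypothetical representation of the four anchor-modeling constructs.
from dataclasses import dataclass
from datetime import date

@dataclass
class Anchor:          # an entity or event, e.g. a person
    anchor_id: int

@dataclass
class Attribute:       # a (possibly historized) property of an anchor
    anchor_id: int
    value: str
    valid_from: date

@dataclass
class Tie:             # a relationship between anchors, e.g. siblings
    anchor_ids: tuple

@dataclass
class Knot:            # a small shared reference value with no history
    knot_id: int
    value: str

person = Anchor(anchor_id=1)
name = Attribute(anchor_id=1, value="Robert", valid_from=date(2016, 1, 1))
gender = Knot(knot_id=1, value="male")
siblings = Tie(anchor_ids=(1, 2))
print(person, name, gender, siblings, sep="\n")
```
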
    Pros:
    • Incremental change approach – previous versions of a schema are always encompassed in new changes, so backwards compatibility is always preserved
    • Reduced storage requirements by using knots
    Cons:
    • Many entities are created in the database
    • Joins become very complex; hard for end user to understand model
    • Daunting for new technical resources to come up to speed initially
     
    anchor
    Source: bifuture.blogspot.com
     
     
    Data Vault (DV) (a.k.a. Heidi Klum)
    History: Dan Linstedt started implementing data vaults in 1990 and published the first version of the methodology (DV1.0) in 2000. He published an updated version (DV2.0) in 2013. The methodology is proprietary: Dan maintains a copyright on it and requires anyone who trains others to be Data Vault certified. You can still implement data vaults; you just cannot train others on it without being certified.
    Basic Concept:
    “A single version of the facts (not a single version of the truth)”
    “All the data, all the time” – Dan Linstedt
    The data vault consists of three primary structures and supporting structures such as reference tables and point-in-time bridge tables. The three main structures are:
    1. Hubs – a list of unique business keys that change infrequently with no other descriptive attributes (except for meta data about load times and data sources). A good example of this is a car or a driver.
    2. Links – relationships or transactions between hubs. These only define the link between entities and can easily support many-to-many relationships; again no descriptive attributes on these tables other than a few meta-attributes. An example of this would be a link between cars and their drivers.
    3. Satellites – Satellites may attach to hubs or links and are descriptive attributes about the entity to which they connect. A satellite for a car hub could describe the year, make, model, current value, etc. These often have some sort of effective dating.
    General best practices:
    • Separate attributes from different source systems into their own satellites, at least in a raw data vault. Using this approach it may be common to have a raw data vault that contains source system specific information with all history and attributes maintained and a second downstream business data vault. The business data vault will contain only the relevant attributes, history, or merged data sets that have meaning to the users of that vault.
      • Having a raw vault allows you to preserve all historical data and rebuild the business vault if needs change, without having to go back to source systems and without losing data that is no longer available in the source system.
    • Track all changes to all elements so that your data vault contains a complete history of all changes.
    • Start small with a few sources and grow over time. You don’t have to adopt a big bang approach and you can derive value quickly.
    • It is acceptable to add new satellites when changes occur in the source system. This allows you to iteratively develop your ETL without breaking previous ETL routines already created and tested.
    DV2.0 – DV1.0 was merely the model. DV2.0 is:
    • An updated modeling approach. Key changes include:
      • Numeric IDs are replaced with hash values, created in the staging area, that support better integration with NoSQL repositories
      • Because hashes are used, you can parallelize data loads even further: there is no surrogate ID lookup as long as you have the business key to hash when data arrives. This means you can load hubs, links, and satellites at the same time in some cases (see the sketch after this list)
      • Referential integrity is disabled during loading
    • Recommended architectures around staging areas, marts, virtualization, and NoSQL
    • Additional methodology recommendations around Agile, Six Sigma, CMMI, TQM, etc.
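
    The sketch referenced above: hashing business keys in staging so hub, link, and satellite rows can be built independently. The column layout and the MD5 choice are illustrative; consult the DV2.0 material for the actual standards.

```python
# Sketch of hashing business keys in staging so hub, link, and satellite rows can
# be built independently. The column layout here is illustrative, not the spec.
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys):
    """Deterministic hash of one or more business keys (MD5 is a common choice)."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

load_ts = datetime.now(timezone.utc).isoformat()
car_vin, driver_license = "1FTFW1ET1EKE12345", "TX-9988776"     # made-up keys

hub_car = {"car_hk": hash_key(car_vin), "vin": car_vin,
           "load_ts": load_ts, "record_source": "dmv_feed"}
hub_driver = {"driver_hk": hash_key(driver_license), "license": driver_license,
              "load_ts": load_ts, "record_source": "dmv_feed"}
link_car_driver = {"car_driver_hk": hash_key(car_vin, driver_license),
                   "car_hk": hub_car["car_hk"], "driver_hk": hub_driver["driver_hk"],
                   "load_ts": load_ts, "record_source": "dmv_feed"}
sat_car = {"car_hk": hub_car["car_hk"], "make": "Ford", "model": "F-150",
           "year": 2014, "load_ts": load_ts, "record_source": "dmv_feed"}

print(link_car_driver)
```
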
    Pros:
    • Preserves all data, all the time – this provides the capability for tremendous analysis and responding to changing business needs. The approach allows you to obtain data from multiple sources iteratively and rapidly, preserving backwards compatibility
    • Works extremely well with massively parallel processing (MPP) databases and hardware
    • Can be loaded extremely rapidly, particularly using the DV2.0 modeling approach
    • Lends itself very well to ETL and DW automation/virtualization
    • DV2.0 covers a wide spectrum of modeling needs from staging and marts to methodology
    Cons:
    • The data model can spawn a lot of tables and make queries very complicated very quickly.
    • The raw data mart is really not meant for end users to query/explore directly
    • Iterative additions make the data model more complicated
    • Although storage may be cheap, keeping all changes for all data in all sources can lead to data sprawl. This also makes a pared down information mart almost a necessity.
    • Raw DV data is not cleansed and data from multiple sources are not blended when being stored
     dv
    Data Lake (DL) (a.k.a. Brooklyn Decker)
    History: Term was coined by Pentaho CTO James Dixon in a blog post in 2010 referring to Pentaho’s data architecture approach to storing data in Hadoop.
    Basic Concept: A massive big data repository, typically built on Hadoop or at least HDFS. Key points (a small schema-on-read sketch follows the list):
    1. Schema-less – data is written to the lake in its raw form without cleansing
    2. Ingests different types of data (relational, event-based, documents, etc.) in batch and/or real-time streaming
    3. Automated meta data management – a best practice is to use tools to automatically catalog meta data to track available attributes, last access times, data lineage, and data quality
    4. Typically multiple products are used to load data into and read data from the lake
    5. Rapid ability to ingest new data sources
    6. Typically only a destination; it is usually not a source from which operational systems will source data
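
    The schema-on-read sketch mentioned above; the records and field names are hypothetical:

```python
# Schema-on-read sketch: raw, heterogeneous JSON events land in the "lake"
# untouched; a schema is only applied when a consumer reads them for a question.
# The records and field names are hypothetical.
import json

raw_events = [                       # as-landed records; fields vary by source
    '{"user": "a", "amount": 12.5, "ts": "2016-07-01"}',
    '{"user": "b", "ts": "2016-07-02"}',
    '{"customer": "c", "amount": "7.25"}',
]

def read_with_schema(lines):
    """Apply one consumer's schema at read time, tolerating missing fields."""
    for line in lines:
        rec = json.loads(line)
        yield {"user": rec.get("user") or rec.get("customer"),
               "amount": float(rec.get("amount", 0.0))}

print(sum(r["amount"] for r in read_with_schema(raw_events)))   # 19.75
```
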
    Pros:
    • Useful when you do not know what attributes will be needed or used.
    • Schema on Read – can ingest any type of data and allow different users to assess value during analysis
    • Extremely large scale at low to moderate cost
    • Can and will use a variety of tools/technologies to analyze/visualize/massage data into a useful form
    Cons:
    • Can be seen as a vast wasteland of disorganized data, particularly without good meta data
    • Consumers must understand raw data in various systems to know how to integrate and cleanse it in order to derive meaningful information
    • High likelihood that different consumers will perform very similar operations to retrieve data (i.e., overlap and duplication of effort). Slight differences between groups can lead to time spent reconciling their results
    • Uncleansed data and multiple versions of the same data may possibly lead to duplication if not handled/filtered carefully
    • It isn’t SQL – Some users will have to use more than just SQL to derive useful information from data
      • Offloading ETL can require significant rework of existing processes to move to something like Hive
    • Using multiple tool sets can lead to training and supportability challenges if not governed properly
    • Data curation can be very challenging
     datalake
     
    Data Swamp (DS) (a.k.a. Tyra Banks)
    History: I’m not including a lot of history here, because this is really an extension of a Data Lake (gone bad).
    Basic Concept: A data swamp is a data lake that has been poorly maintained or documented, lacks meta data, or has so much raw data that you don’t know where to start for insights. Or, it could be a combination of several of those points. When you start tracking tons of data from all different sources, but you don’t know who is using what, how to merge data sets, or how to use most of the data in your “data lake”, you’ve really got a data swamp.
    Pros:
    • Hey, you must’ve done something right to get all that data into the repository…?
    • At least you haven’t lost data that you can’t go back and get.
    • If it were easy, everyone would be doing it 🙂
    Cons:
    • You’ve likely spent a lot of time and effort putting in a data lake/HDFS/Hadoop/Hive/etc. and you’re struggling to operate it at scale or to answer the questions you set out to answer.
    • You need meta data to clue users into what is most useful, relevant, or recent
    • You probably need to look into key use cases (low hanging fruit) and start from that point as a place to begin using/resuscitating your repository.
    *** The assignment of model names to each data model was an incredibly (un)scientific process of googling various terms like “most famous supermodel <year>”, “<year> top supermodel”, etc. and teasing out the most likely #1. Feel free to disagree and let me know your vote and how you obtained it.
    001 The Data of Church – For the Love of Data https://www.fortheloveofdata.com/001-the-data-of-church-for-the-love-of-data/?utm_source=rss&utm_medium=rss&utm_campaign=001-the-data-of-church-for-the-love-of-data Fri, 11 Dec 2015 22:04:48 +0000

    Churches have a wealth of data that other organizations could only dream about: a weekly stream of attendees and donors who also participate in a wide variety of activities around the organization. In this episode, I sit down with Glen Brechner, the Executive Director of Chase Oaks Church in Plano. Chase Oaks is one of the top twenty churches in the DFW metroplex and is in the top 20% of megachurches in the US.

    We discuss how they track member participation and donation information, how they consolidate and align data across multiple campuses, and challenges and opportunities they see with data.

    Some of the tools they use include:

    • Excel (doesn’t everybody!)
    • Mortarstone
    • Shelby
    • Arena

    Please leave a comment about the episode and let me know if you have any questions.

    000 Introducing “For the Love of Data with Robert Furr” (and what it means for you) https://www.fortheloveofdata.com/e0/?utm_source=rss&utm_medium=rss&utm_campaign=e0 Mon, 19 Oct 2015 04:53:23 +0000

    Data, Analytics, Business Intelligence… how do I keep track of what is going on in this ever-expanding technology realm? “For the Love of Data” is a monthly podcast covering data, big data, huge data, tiny data, analytics, and business intelligence trends across the industry. Join the discussion, write a review, or give us your feedback on our site.

    This introductory episode covers the podcast’s format, why I want to do it (because I love data!), and who may benefit from listening.
