E024 – Will machine learning kill traditional database indexes?

In this episode my friend Vikas Popuri and I chat about Google’s paper comparing ML models to traditional DB indexes.

Background:

  • Google used learned indexes , machine learning models, to access data and compared these to B-Tree, Hash, and Bloom Filter indices
  • Trained a model using multiple stages where the earlier stages could approximate a location and later stages would work with a subset to improve accuracy. Each stage could choose a different model to advance the search further.
  • FYI, the diagram below looks like a decision tree, but it is not. Each stage/model could have different distributions and could repeat the model used above or below.

  • They achieved access time and space savings across the board, even without using GPUs or TPUs (Tensor Processing Units)
  • “Retraining the model” – the tests were performed on a static data set, so no retraining or index maintenance was required.

Observations / Questions:

  • Used Tensorflow with Python as the front end — apparently a lot of initial overhead with this as a test stack.
  • B-Tree indexes to some extent are a model, especially if they don’t store every key and instead store the first key in a page.
  • The paper made some rudimentary assumptions, such as using a random hash function.
    • What if the data is not static? How long would it take to retrain the model vs. maintain an index?
    • What if data profiling caused you to index certain attributes and not others?
    • What are the best practices with this newer approach
  • The power of being able to use different models at different stages is intriguing. You could also potentially maintain traditional indexes as a backup / failsafe that would upper bound to the performance of a B-Tree.
  • Load times – The folks from Google commented that they could retrain a simple model on a 200M data set in “just [a] few seconds if implemented in C++”
  • Recursive question: do you need an optimizer to optimize the optimization path?
  • Room for improvement:
    • GPUs/TPUs
    • Incorporating common queries into the model to know what questions people are asking

Music

Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

Sources: