- Google used learned indexes (machine learning models) to access data and compared them to B-Tree, Hash, and Bloom filter indices
- Trained the model in multiple stages: earlier stages approximate a key's location, and later stages refine it over a smaller subset to improve accuracy. Each stage could choose a different model type to advance the search further.
- FYI, the diagram below looks like a decision tree, but it is not: each stage/model can see a different key distribution, and a stage can reuse the same model type as the stage above or below it.
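The staged lookup can be sketched as a toy two-stage recursive model index. Everything here (linear models at both stages, the fanout, the local-search window) is an illustrative assumption, not the paper's TensorFlow implementation:

```python
import bisect

def fit_linear(keys, positions):
    """Least-squares fit: position ~ a*key + b."""
    n = len(keys)
    mk = sum(keys) / n
    mp = sum(positions) / n
    denom = sum((k - mk) ** 2 for k in keys) or 1.0
    a = sum((k - mk) * (p - mp) for k, p in zip(keys, positions)) / denom
    return a, mp - a * mk

class TwoStageRMI:
    """Toy sketch: a root model routes each key to one of `fanout`
    leaf models, and the chosen leaf predicts the key's position."""
    def __init__(self, keys, fanout=4):
        self.keys = sorted(keys)
        n = len(self.keys)
        pos = list(range(n))
        self.fanout = fanout
        # Stage 1: one root model predicts which leaf model to use.
        self.root = fit_linear(self.keys, pos)
        # Stage 2: one linear model per partition of the key space.
        buckets = [([], []) for _ in range(fanout)]
        for k, p in zip(self.keys, pos):
            ks, ps = buckets[self._route(k)]
            ks.append(k)
            ps.append(p)
        self.leaves = [fit_linear(ks, ps) if ks else (0.0, 0.0)
                       for ks, ps in buckets]

    def _route(self, key):
        a, b = self.root
        i = int((a * key + b) * self.fanout / len(self.keys))
        return min(max(i, 0), self.fanout - 1)

    def lookup(self, key):
        a, b = self.leaves[self._route(key)]
        guess = min(max(int(a * key + b), 0), len(self.keys) - 1)
        # A short local search around the prediction corrects model error.
        lo, hi = max(0, guess - 8), min(len(self.keys), guess + 9)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return self.keys.index(key)  # slow fallback if the error window missed
```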
- They achieved access-time and space savings across the board, even without using GPUs or TPUs (Tensor Processing Units)
- “Retraining the model” – the tests were performed on a static data set, so no retraining or index maintenance was required.
Observations / Questions:
- Used TensorFlow with Python as the front end; this test stack apparently carries a lot of initial overhead.
- B-Tree indexes are already models to some extent, especially when they don't store every key but only the first key of each page.
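That view can be made concrete: an index that stores only each page's first key is itself a function from key to page. A minimal sketch, with `bisect` standing in for the tree traversal and an assumed page size:

```python
import bisect

PAGE_SIZE = 16  # illustrative; real B-Tree pages are sized to disk blocks

def build_page_index(sorted_keys):
    """Keep only the first key of each page, like a sparse B-Tree level."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), PAGE_SIZE)]

def find_page(page_index, key):
    """Map a key to the page that could contain it: a tiny 'model'
    from key to position, just like a learned index's prediction."""
    return max(bisect.bisect_right(page_index, key) - 1, 0)
```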
- The paper made some simplifying assumptions, such as using a random hash function.
- What if the data is not static? How long would it take to retrain the model vs. maintain an index?
- What if data profiling caused you to index certain attributes and not others?
- What are the best practices with this newer approach?
- The power of being able to use different models at different stages is intriguing. You could also potentially maintain traditional indexes as a backup/failsafe, which would upper-bound lookup cost at B-Tree performance.
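One way to get that kind of bound, sketched here as an assumption rather than taken from the paper, is to record the model's worst-case prediction error at build time and binary-search only within that window, so no lookup can cost more than a search over the error band:

```python
import bisect

class BoundedModelIndex:
    """Illustrative sketch: a linear model plus its recorded max error.
    Search is confined to [pred - err, pred + err], so lookup cost is
    bounded, much like descending a B-Tree of known height."""
    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Fit position ~ a*key + b by least squares.
        mk = sum(sorted_keys) / n
        mp = (n - 1) / 2
        denom = sum((k - mk) ** 2 for k in sorted_keys) or 1.0
        self.a = sum((k - mk) * (p - mp)
                     for p, k in enumerate(sorted_keys)) / denom
        self.b = mp - self.a * mk
        # Record the worst-case prediction error over the data.
        self.err = max(abs(self._predict(k) - p)
                       for p, k in enumerate(sorted_keys))

    def _predict(self, key):
        return min(max(int(self.a * key + self.b), 0), len(self.keys) - 1)

    def lookup(self, key):
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        return bisect.bisect_left(self.keys, key, lo, hi)
```

Because the error bound is measured over the actual data, the window is guaranteed to contain every stored key's true position.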
- Load times – the folks from Google commented that they could retrain a simple model on a 200M-record data set in “just [a] few seconds if implemented in C++”
- Recursive question: do you need an optimizer to optimize the optimization path?
- Room for improvement:
- Incorporating common queries into the model to know what questions people are asking
Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org