Wednesday, July 18, 2018

Database indexes


Few days ago I read an article (https://www.toptal.com/database/database-design-bad-practices), which describes some best practices on database schema design. Among the practices described, I would like to extend original post with more details on indexes.

What is an index?


Index is a special type of data, which does not contain table data, but describes where to find records in the table data-file.

We need to understand that majority of databases today work as a tape-recorder. By tape-recorder I mean, that all records that we insert to the table are appended to the end of table data-file on the disk (we are not considering in-memory databases here).
So, basically index describes at what position a record with specific field value is present in table data-file.

How indexes work?


Usually indexes represent some kind of search tree data-structure (usually B-trees).
When record is inserted to a table, the indexed field value is also inserted to search tree structure, which points to a record address in table data-file.

Why search tree structures are used under the hood?
Search tree structures guarantee almost constant time when you search for a value in the tree. Well, not exactly constant time, in reality it is logarithmic time, however it is a bit slower than constant time but much faster than linear search.

For example, you’ve got a table with 1 million records. To find a single record in such table without using an index you will have to iterate through all records and perform 1 million iterations in the worst case.

However, if we take simplest binary search tree as the index implementation, we would make only log2(1000000) iterations, which approximately equals to 20!

Indexes within database environment.


It should be mentioned, that indexes are not always used by the database engine (DBE).

For example, you have a table with 1 million records, and the table has field “Age”.
For instance, value Age = 29 is present in 999500 records, value Age = 30 is present in 500 records, and you have created an index for Age field.

Then you run a query to retrieve all records where field Age is equal to 29.
With 99% probability, your index will not be used, and DBE will choose to use Full Table Scan (FTS) instead (i.e. iterate through all the records).

Why would it do that way? Didn’t we just say that indexes increase our extraction time?

Let’s take a look at our data first, 99.95% of our records contain Age value which is equal to 29.
So using FTS for extracting 99.95% of table data is more efficient than using index lookup. DBE will simply iterate through all table records and ignore records where Age is not equal to 29.

On the other hand, if you run query to return all records where field Age is equal to 30, then DBE will use index, because index lookup will work faster to extract 0.05% of table data.

The amount of records that should be returned by the query compared to total amount of records in the table is called selectivity. If query shows good selectivity, then index is used, otherwise it is more likely that DBE will iterate through entire table.

Alright, we discovered what selectivity is, but how does the DBE knows how many records are there in table with specific values? Does it pre-scans entire table?
No it does not do any pre-scaning.

Remember we said: when new record is inserted to the table, we also insert new value to index data-structure? When DBE inserts new value to index data-structure it also adds indexed value to index statistics.

Index statistics keeps information regarding how many times certain values are present in the table. Based on such index statistics, DBE chooses the strategy for query execution and data extraction.

Summary.


In this article I tried to give some introductory information on how database indexes work.
Hopefully this will help newcomers to conquer database design science faster.
Use indexes wisely.
Good luck!