Few days ago I read
an article
(https://www.toptal.com/database/database-design-bad-practices),
which describes some best practices on database schema design. Among
the practices described, I would like to extend original
post with more details on indexes.
What is an index?
Index is a special type of
data, which does not contain table data, but describes
where to find records in the table data-file.
We need to understand that
majority of databases today work as a tape-recorder. By tape-recorder
I mean, that all records that we insert to the table are appended to
the end of table data-file on the disk (we are not considering
in-memory databases here).
So, basically index describes
at what position a record with specific field value is present in
table data-file.
How indexes work?
Usually indexes
represent some kind of search tree data-structure (usually B-trees).
When record is
inserted to a table, the indexed field value is also inserted to
search tree structure, which points to a record address in table
data-file.
Why search tree
structures are used under the hood?
Search tree
structures guarantee almost constant time when you search for a value
in the tree. Well, not exactly constant time, in reality it is
logarithmic time, however it is a bit slower than constant time but
much faster than linear search.
For example, you’ve
got a table with 1 million records. To find a single record in such
table without using an index you will have to iterate through all
records and perform 1 million iterations in the worst case.
However, if we take
simplest binary search tree as the index implementation, we would
make only log2(1000000) iterations, which approximately equals to 20!
Indexes within database environment.
It
should be mentioned, that indexes are not always used
by the database engine (DBE).
For
example, you have a table with 1 million records, and the table has
field “Age”.
For
instance, value Age = 29 is present in 999500 records, value Age = 30
is present in 500 records, and you have created an index for Age
field.
Then
you run a query to retrieve all records where
field Age is equal
to 29.
With
99% probability, your index will not be used, and DBE will choose to
use Full Table Scan (FTS)
instead (i.e. iterate through all the records).
Why
would it do that way? Didn’t we just say that indexes
increase our extraction time?
Let’s
take a look at our data first, 99.95% of
our records contain Age value which is equal to 29.
So
using FTS for extracting 99.95% of table data is more efficient than
using index lookup. DBE will simply iterate through all
table records and ignore records where Age is
not equal to
29.
On
the other hand, if you run query to return all records where field
Age is equal to 30, then DBE will use index, because index lookup
will work faster to extract 0.05% of table data.
The
amount of records that should be returned by the query compared to
total amount of records in the
table is called selectivity.
If query shows good selectivity, then index is used, otherwise it is
more likely that DBE will iterate through entire table.
Alright,
we discovered what selectivity is, but how does the DBE knows how
many records are there in table with specific values? Does it
pre-scans entire table?
No
it does not do any pre-scaning.
Remember
we said: when new record is inserted to the table, we also insert new
value to index data-structure? When DBE inserts new value to index
data-structure it also adds indexed value to index statistics.
Index
statistics keeps information regarding how many times certain values
are present in the table. Based on such index
statistics, DBE chooses the strategy for query execution and data
extraction.
Summary.
In
this article I tried to give some introductory information on how
database indexes work.
Hopefully
this will help newcomers to
conquer database
design science faster.
Use indexes
wisely.
Good luck!