Doctrine Indexes


In this video we take a look at one optimisation that you should be using on every project you are a part of. It's super simple, and the benefits are vast.

Indexes.

As our database tables grow, adding an index, or multiple indexes, is pretty much essential, unless you enjoy waiting, and you aren't paying the air conditioning bill for all that extra heat you're forcing the poor CPU to generate ;)

I mention this topic because I have seen numerous projects which have come our of development, moved into production, and then begun grinding to a halt once real user data accumulates.

Shocking, but true.

It's quite easy to see why this happens though. Development environments rarely deal with 1,000,000 row tables. But in production, that's common.

This is usually the first warning sign that if indexes have been missed, other horrors may lie in wait. However, as the project is likely covered with a thorough suite of tests, we have nothing else to worry about, right? Oh... there are no tests?

The Old Phone Book Analogy

Any sufficiently weighty tome should have an index. Need to know where the rules are for the Greater Deamon of Khorne? Well, there's probably only one of those, so it may well be in the Table of Contents, or failing that, find the other Greater Deamons, and it will hopefully be in there. There should only be four, how hard can it be?

What about D. Smith?

There's 36 pages of people with the surname Smith in my area, and 4 of those pages contain Smiths beginning with D.

Yikes.

Now, someone took the time to organise and order those Smiths for us. We have 36 pages to search, and we know - give or take - that 'David' is going to be somewhere in the first 10 pages or so.

What would happen if the phonebook chaps decided to operate on the same principle that most software projects use?

We're going to do something pretty simple. We take an array. And for every name in our area, we add a new entry. Who cares about the order? Not me. That's too much hard work. And besides, no one even asked us to sort it. Pah. I'm off to browse Reddit.

After collecting all those names, we write ourselves a short printer script and dump the whole load out on reams of paper. Then someone delivers the books and two weeks later, you're sat at home and you realise you need to make an urgent call to your friend Tim Jones.

Stick with me, and imagine a (much happier) time when the Smart Phone was a large brick with a twirly cord to an even larger brick. Or the early 1990s as it was better known.

Opening the new phone book, freshly delivered and eager to be used, you realise that the lousy phone book monkey never sorted or ordered the contents. Finding any name now is going to be a crap shoot. All your P's and Q's have been muddled up, Albertson next to Laker, next to Makinson.

Finding Tim is going to take you a long time. You resign to starting at page 1, name 1, and working down the list until you find him. Not the best way to spend your day.

Take Heed The Fun Story

If you don't bother telling Doctrine what columns to index, you are going to be in for a bad time.

In our example above, we would have to wait for MySQL to iterate through all the people in the phone book until it found our result. It couldn't do it quickly because there is no semblence of order.

Your database will obviously do that task much quicker than any human ever could. But what if we have a busy system, and there are many such queries? The bottleneck will show its head at the most inopportune moment, you can be sure.

Primary Key

The concept of an Entity implies that the data it holds is sufficiently interesting to us that we want to store that data when it's not actively in use by the system.

As such, when we create our Doctrine Entities we will add in an id field.

/**
* @var integer
*
* @ORM\Column(name="id", type="integer", nullable=false)
* @ORM\Id
* @ORM\GeneratedValue(strategy="AUTO")
*/
private $id;

By adding @ORM\Id, we are telling Doctrine to make this field our Entity id.

We combine this with an Identifier Generation Strategy - in simplistic terms, a fancy way of saying an automatic number generator (but it need not be just a number), in our case: @ORM\GeneratedValue(strategy="AUTO"), which will then ensure our database - e.g. MySQL - gives us an automatically incremented number when persisting our Entity to the database.

This unique identifier, the id, becomes our Primary Key.

Now, this isn't a lesson on Primary Keys, but rather, I use this as an illustration to say that we will get an Index automatically created for us when doing this.

The problem is, in many cases we will not know the Primary Key when doing our search.

Back to the phone book, we wouldn't know the Primary Key for Tim Jones, and even if we did, it's useless to us anyway, as they don't print the Primary Keys next to the entries.

If we did have direct access to the phone book database, if we knew Tim's Primary Key, we could do something simple like:

$timsPk = 4574211;
$tim = $em->getRepository('AppBundle:Person')->find($timsPk);

But we don't. So we end up doing a query like:

$allKnownTimJones = $em->getRepository('AppBundle:Person')->findBy(array(
  'firstName' => 'Tim',
  'lastName'  => 'Jones',
));

And that returns us all the people matching Tim Jones, and then we use the address to figure out which Tim Jones we actually want.

However, we still haven't added any indexes, so this would still be slow.

Adding Indexes In Doctrine

The seemingly obvious solution here is to add two indexes.

Let's try that:

<?php

// src/AppBundle/Entity/Person.php

use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 * @ORM\Table(name="person", indexes={
 *     @ORM\Index(name="idx_first_name", columns={"firstName"}),
 *     @ORM\Index(name="idx_last_name", columns={"lastName"}),
 * })
 */
class Person
{
    /**
     * @var string
     * @ORM\Column(type="string", length=100, nullable=false)
     */
    private $firstName;

    /**
     * @var string
     * @ORM\Column(type="string", length=100, nullable=false)
     */
    private $lastName;
}

On the surface, this seems good.

We have an index on both fields, so searches should now be much faster, right?

Well, not quite.

This is a great strategy if we regularly search for one field OR the other, e.g.:

$allJones = $em->getRepository('AppBundle:Person')->findBy(array(
  'lastName'  => 'Jones',
));

This will be nice and quick because we now have a pre-prepared index of all the entries in our table, ordered by Last Name, so our database can do a much more efficient search.

But, because we are searching for a combination of two fields, our database will have to take a guess at which index to use out of the two.

This is not so good, as we aren't making the best use of either index.

Instead, we can combine two (or more) fields into one index:

/**
 * @ORM\Entity
 * @ORM\Table(name="person", indexes={
 *     @ORM\Index(name="idx_first_name", columns={"firstName"}),
 *     @ORM\Index(name="idx_last_name", columns={"lastName"}),
 *     @ORM\Index(name="idx_both_names", columns={"firstName", "lastName"}),
 * })
 */
class Person
{

And now, Doctrine will instruct our database to create an index on both fields together, and so our search will be much more efficient, as the resulting matched set of Tims and Jones will be far fewer than before.

Partial Indexes

There's also the very cool Partial Index functionality available - as long as your underlying database supports it (PostgreSQL and SQLite support is in Doctrine DBAL 2.5, which as of Symfony 2.6 is not currently included).

An example of a Partial Index, stolen straight from the Doctrine docs:

<?php
/**
 * @Entity
 * @Table(name="ecommerce_products",indexes={@Index(name="search_idx", columns={"name", "email"}, options={"where": "(((id IS NOT NULL) AND (name IS NULL)) AND (email IS NULL))"})})
 */
class ECommerceProduct
{
}

Last Things Last - Lessons Learned in the Field

One thing I learned whilst recording this video was that using indexes cannot speed up LIKE queries where you are using a variable left side.

That likely makes no sense, so an example:

SELECT * FROM foo WHERE field LIKE "bar%" # will be faster with an index

Whereas:

SELECT * FROM foo WHERE field LIKE "%bar%" # will not be faster with an index

I make this mistake in the video.

I thought I would leave it in, as I thought it was an interesting fact.

I tend not to use LIKE queries so much these days honestly. I used to do a lot of Full Text querying, but have since found Elastic Search to be a heck of a lot better at that task.

Code For This Course

Get the code for this course.

Episodes