Arithmetic Expressions

Arithmetic Expressions inside Doctrine. It's hard to see how that could sound more like an extract from one of those weighty Computer Science tomes, the long-faded insides covered in lazy pencil scribbles.

And yet, when you get past the name, the concept itself does apply to some real world situations. Only you probably wouldn't think to Google for "Doctrine Arithmetic Expression" if you had this particular problem - well, I wouldn't have done anyway.

Throughout this series we have been making use of the Topic and Reply entities to simulate a forum.

In this video, what I would like to show is how to take a combination of some of our Topic properties, and use these as a basis of creating a score on which we can then sort our Topics.

The concept here is pretty simple on the face of it - we count the number of Reply entities per Topic, and then to that we add on the new rating property. When added together, this number becomes our score, which isn't the World's most advanced algorithm I'm the first to admit, but it does illustrate our example sufficiently.

Generating Our Rating

In this example, the rating property is nothing more than a random number between 1 and 5.

To achieve this, I am making use of the excellent Faker Bundle by Will Durand which makes this a cake walk.

All we need to do is update our config_dev.yml to include the new custom_formatter:

# config_dev.yml
bazinga_faker:
    orm: doctrine
    entities:
        AppBundle\Entity\Topic:
            number: 80
            custom_formatters:
                rating: { method: numberBetween, parameters: [1,5] }
                * snip snip *

Now, in your system, the way that you generate your rating or whatever other properties you are using likely will differ significantly, but the end result will be roughly the same.

Counting The Size Of A Doctrine Collection

As mentioned above, aside from the rating, we also want to get the sum / count / total of the number of Reply entities that are related to our Topic.

Now, when I first started with this, I instinctively tried to use the COUNT() function, which promptly blew up in my face.

The problem I ran into here is not particularly obvious from the error message:

[Semantical Error] line 0, col 18 near 'replies) + t.rating': Error: Invalid PathExpression. StateFieldPathExpression or SingleValuedAssociationField expected.

Awesome.

Wtf?

Thank the Lord that gives us at least this bit: replies) + t.rating, otherwise I would have been throwing in the towel.

The fix for this is really simple - we just swap our COUNT for SIZE. This is documented, and it's worth paying attention to the other functions available here as they are exactly the sort of things that normally catch me out.

As such, we need to update our Query Builder to use SIZE, so changing from:

->addSelect(COUNT(t.replies) + t.rating AS HIDDEN score)

to:

->addSelect(SIZE(t.replies) + t.rating AS HIDDEN score)

pretty easy when you know why - so remember, read the docs Luke.

As such, our full query becomes:

$topics = $qb->select('t')
  ->addSelect(SIZE(t.replies) + t.rating AS HIDDEN score)
  ->from('AppBundle:Topic', 't')
  ->orderBy('score', 'DESC')
  ->setMaxResults(20)
  ->getQuery()
  ->getResult()
;

Of course, score is an arbitrary name, you can call it whatever you like. I swap and change in the video to use score or weight. You can call / name it whatever makes sense to your project / domain.

How I Use This In The Real World

I want to preface this by saying that this is not the exact method I use currently. However, it is close to what I do, and my specific example wouldn't make much sense without being shown in a greater context.

I run a number of price comparison websites.

As part of this process, I accumulate product data from numerous sources. Let's say for example - Amazon and Walmart. The numbers of sources are greater than this, but the number doesn't matter so much.

Let's say that we get a list of products from Amazon (if only it were this simple):

# Amazon
Product Id | Product Category | Product Title
12345      | Games - PS3      | Grand Theft Auto 5
12362      | Games - PS3      | Red Dead Redemption
12378      | Games - PS4      | Grand Theft Auto 5 Special Edition

# Walmart
Product Id | Product Category | Product Title
AA221      | Consoles - PS3   | Grand Theft Auto 5
AB009      | Consoles - PS3   | Red Dead Redemption
CD078      | Consoles - XBone | Grand Theft Auto 5 Special Edition

The descriptions would also be very important in this, but the amount of data would get crazy, so let's pretend it's not that important.

We need to figure out - ideally without human intervention - which of these products are the same. The trick is not to match up the right games in the wrong consoles. No point telling Jonny Price Saver that GTA5 is available on the search results when he is after the PS4 version, but we go and show him the PS3 results.

We need a way of putting a weight or a score (or whatever else you want to call it) on key fields.

Some of the fields are useless. We can safely say that the Amazon and Walmart product IDs are pretty much irrelevant to us at this stage.

# Amazon
Product Category | Product Title
Games - PS3      | Grand Theft Auto 5
Games - PS3      | Red Dead Redemption
Games - PS4      | Grand Theft Auto 5 Special Edition

# Walmart
Product Category | Product Title
Consoles - PS3   | Grand Theft Auto 5
Consoles - PS3   | Red Dead Redemption
Consoles - XBone | Grand Theft Auto 5

Adding a modifier / weight to the resulting Product Category would be beneficial to influencing our query outcome.

We pull out:

# Amazon
Product Category | Product Title
Games - PS3      | Grand Theft Auto 5

It's unlikely that Amazon are listing GTA5 twice on PS3, so we can safely(!) say we don't much care about any other Amazon results for this comparison. But we do care about Walmart's products.

If we did a naive search - on Product Title alone - we would find that we had two matches at Walmart. The PS3 version and the XBone version. Let's say our search results show something like this:

Product Category | Product Title       | Match Likelihood
Consoles - PS3   | Grand Theft Auto 5  | 100%
Consoles - PS3   | Red Dead Redemption | 7%
Consoles - XBone | Grand Theft Auto 5  | 100%

We don't want that.

So instead we assign our score to our Category, and include that - even though it is hidden from our end result - and use that to influence our outcome:

Product Category | Product Title       | Match Likelihood
Consoles - PS3   | Grand Theft Auto 5  | 100%
Consoles - PS3   | Red Dead Redemption | 7%
Consoles - XBone | Grand Theft Auto 5  | 57%

Now we can eliminate anything below maybe 80% as being irrelevant to us. Anything above 80% may still trigger a human verification of the matches.

That's the gist of how I use the weight or score in the real world. This is a simplistic view of the situation, and many other factors influence the eventual score outcome, and I don't use Doctrine for this either. But it's nice to know that the functionality exists and is usuable.

For reference, the reason I don't use Doctrine for this is that Elastic Search is about 1000x more effective. I don't have benchmarks immediately available to prove this, but I've used both at various stages for part or all of that implementation, and have the battlescars to know which came out on top.

Because We Can Do It Another Way

Getting back on topic (ahem, no pun intended), this wouldn't be a Symfony / Doctrine video if we didn't have multiple ways of doing the same thing.

Who said this was difficult to learn? :)

We can swap out the Query Builder way of doing things for our straight up DQL equivalent:

$query = $em->createQuery(
    'SELECT
        t,
        SIZE(t.replies) + t.rating AS HIDDEN score
    FROM
        AppBundle:Topic t
    ORDER BY
        score
');

$topics = $query->setMaxResults(20)->getResult();

And ultimately that does exactly the same thing as our previous Query Builder approach. It's just nice to know of both ways. The Doctrine docs tend to show the DQL approach as well.

So there we have it, Arithmetic Expression, a funky sounding 'thing' that may very well be just what you didn't realise you were looking for.

Additional Reading

Javier's talk from Symfony Con Madrid (the talk mentioned in the video)

#	Title	Duration
1	Improving Page Load Time With Doctrine Join	03:03
2	Doctrine Extra Lazy Associations	03:47
3	What's In A Name?	00:58
4	Partial Objects - Holy Premature Optimisation Batman	05:05
5	Keeping Hyrdrated	01:29
6	Doctrine Indexes	04:06
7	Doctrine Cache Money	08:11
8	Please Provide At Least Two (2) Written References	04:19
9	Arithmetic Expressions	10:15