Scraping Reddit Authors


In this video we are going to continue on with the implementation of the Reddit Scraper. Towards the end of the last video we had created a separate class to contain our RedditScraper code, which we had also declared as a Symfony Service. We had also used Guzzle to grab the contents of Reddit's /r/php subreddit, and stored the list of posts as a PHP array.
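As a quick refresher, the service class itself might look something like the following sketch. The constructor injection of Doctrine's EntityManager is assumed to be set up as in the previous video - the exact type hint and property names may differ in your code:

// src/AppBundle/Service/RedditScraper.php

namespace AppBundle\Service;

use Doctrine\ORM\EntityManagerInterface;

class RedditScraper
{
    private $em;

    public function __construct(EntityManagerInterface $em)
    {
        // the EntityManager is injected so we can persist / flush entities below
        $this->em = $em;
    }

    // ... the scrape() method we build in this video lives here ...
}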

What we can now start to do is loop through the individual posts inside the response from Reddit, extract the interesting parts, and use those parts to create individual entities. We already defined a RedditPost entity back in the first video, and just because our data is now real / dynamic does not mean the process is any different from when we manually created entities in video three.

// src/AppBundle/Service/RedditScraper.php

    public function scrape()
    {
        $client = new \GuzzleHttp\Client();

        $response = $client->request('GET', 'https://api.reddit.com/r/php.json');

        # if struggling here, try setting a User Agent
        # https://github.com/reddit/reddit/wiki/API#rules
        # thanks Mike, for the heads up
        // $response = $client->request('GET', 'https://api.reddit.com/r/php.json', ['headers' => ['User-Agent' => 'Test/12345']]);

        $contents = json_decode($response->getBody()->getContents(), true);

        foreach ($contents['data']['children'] as $child) {

            $redditPost = new RedditPost();
            $redditPost->setTitle($child['data']['title']);
            $this->em->persist($redditPost);

        }

        $this->em->flush();
    }

There's a subtle but important step happening here.

Let's say we have received 25 new post entries from our Reddit lookup. We have those entries decoded from JSON, and stored into our PHP array - $contents.

We run a foreach to iterate over this $contents array, using the nested keys of data and children, as returned by Reddit's API.

Inside the foreach we create a new RedditPost() object for each item in the loop.

Grabbing some of the data from the Reddit result (the title), we can configure the entity / object just like any other time we work with objects. There is nothing special about this entity from this point of view - don't let the fact that this is also a Doctrine entity confuse you in any way. From the code's point of view, this is just a plain old PHP object at this stage.

However, because this is an entity - which is to say, it has metadata in the form of annotations to tell Doctrine how to store this object's data to the database - we can pass this object to Doctrine to be stored / persisted:

$this->em->persist($redditPost);

Ok, so far, so good.

But here's the catch.

A persist won't actually do anything until the request to persist is actioned, by way of a flush. From Doctrine's documentation:

It is very important to understand that only EntityManager#flush() ever causes write operations against the database to be executed. Any other methods such as EntityManager#persist($entity) or EntityManager#remove($entity) only notify the UnitOfWork to perform these operations during flush. Not calling EntityManager#flush() will lead to all changes during that request being lost.

What happens is that whenever we tell Doctrine to persist, the operation is added to a queue of tasks to perform.

This queue will only be actioned / flushed when we explicitly tell Doctrine to flush() the changes to persistent storage / the database.

However, flush() is a costly operation. Notice, therefore, that in our code the call to flush() happens only once - outside the foreach loop - whereas persist() is called as many times as there are items in the loop. From the Doctrine documentation:

Do not invoke flush after every change to an entity or every single invocation of persist/remove/merge/... This is an anti-pattern and unnecessarily reduces the performance of your application. Instead, form units of work that operate on your objects and call flush when you are done. While serving a single HTTP request there should be usually no need for invoking flush more than 0-2 times.
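To make that concrete, here is a small, contrived sketch (using a hypothetical $posts array, and an $em variable holding the EntityManager) showing the anti-pattern alongside the approach we are using above:

// Anti-pattern: flushing on every iteration forces a database round trip per entity
foreach ($posts as $postData) {
    $redditPost = new RedditPost();
    $redditPost->setTitle($postData['title']);
    $em->persist($redditPost);
    $em->flush(); // runs the INSERT (and a transaction) every time through the loop
}

// Better: queue everything up with persist(), then flush once at the end
foreach ($posts as $postData) {
    $redditPost = new RedditPost();
    $redditPost->setTitle($postData['title']);
    $em->persist($redditPost); // only noted by the UnitOfWork at this point
}
$em->flush(); // all the INSERTs happen here, in a single unit of work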

Now, forgetting this can cause some interesting and confusing behaviour, as we shall see shortly.

At this stage we should be able to run the scraper and find 26 records in the reddit_post table. Incidentally, 26 results is the default number I seem to always get back from a page 1 search of reddit.
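If you want to quickly confirm the row count without reaching for a database GUI, one option is Doctrine's doctrine:query:sql console command (the reddit_post table name assumes the default naming we have used so far):

php bin/console doctrine:query:sql "SELECT COUNT(*) FROM reddit_post"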

Keeping Track of a Post's Author

As part of the result we get back from Reddit, there are a ton of fields, but an interesting one for us is the author field. We could do a further query of Reddit's API to get information specific to the author, but for our purposes, the author's name is sufficient.

With the name, we want to populate our RedditAuthor entity in much the same way that we did for RedditPost.

    public function scrape()
    {
        $client = new \GuzzleHttp\Client();

        $response = $client->request('GET', 'https://api.reddit.com/r/php.json');

        $contents = json_decode($response->getBody()->getContents(), true);

        foreach ($contents['data']['children'] as $child) {

            $redditPost = new RedditPost();
            $redditPost->setTitle($child['data']['title']);

            $authorName = $child['data']['author'];

            $redditAuthor = new RedditAuthor();
            $redditAuthor->setName($authorName);

            $this->em->persist($redditAuthor);
            $this->em->persist($redditPost);
        }

        $this->em->flush();
    }

This is a very naive implementation that is highly likely to present us with a problem straight away. Depending on when you run the Reddit Scraper, the 26 retrieved posts may well include two or more posts by the same author.

Now interestingly, this is exactly the sort of crazy real world problem that you might not even consider when writing your scraper. It might not even bite you for a few runs of your scraper. But sooner or later, a duplicate will show up and surreptitiously sneak into your database table. Uh-oh, now you have some unwanted duplicates. Bad times.

However, if you run the above code, it will work - assuming Reddit's API is playing ball. I had a few timeouts and such when working through this example. If at first you don't successfully scrape, try, try, try again :/

As soon as you realise that those pesky duplicates are making a big old mess of your data, you will likely want to make your RedditAuthor entity a little more robust. We can add the unique option to the column mapping for RedditAuthor::$name so that only unique values are allowed:

// src/AppBundle/Entity/RedditAuthor.php

class RedditAuthor
{
    // snip

    /**
     * @ORM\Column(type="string", unique=true)
     */
    protected $name;

This will make a change to the underlying database table, so we need to get Doctrine to do a little work here:

php bin/console doctrine:schema:update --force
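As an aside, if you would rather preview the SQL that Doctrine intends to run before committing to it, the --dump-sql flag prints the statements without touching the database:

php bin/console doctrine:schema:update --dump-sql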

Now, whether this command succeeds depends on whether you have already run the scraper. If you have duplicate results in your reddit_author database table then this command will blow up - after all, you are saying you only want uniques, but your existing data contains duplicates.
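If you are not sure whether duplicates have already crept in, one quick way to check (assuming MySQL) is a GROUP BY query run via the console:

php bin/console doctrine:query:sql "SELECT name, COUNT(*) AS c FROM reddit_author GROUP BY name HAVING c > 1"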

To fix this you could drop and recreate your database:

php bin/console doctrine:database:drop --force
php bin/console doctrine:database:create
php bin/console doctrine:schema:update --force

Once you have recreated your database tables, if you re-run the scraper (and assuming duplicate authors exist in the result), you will hit another problem.

Remember how Doctrine queues up any objects to persist until we call flush?

Let's imagine our result set looks like this:

  • Post A : Author 1
  • Post B : Author 2
  • Post C : Author 1

When we run through our loop, we end up with something like:

  • Insert "Author 1" into authors table
  • Insert "Author 2" into authors table
  • Insert "Author 1" into authors table

Oops, we're still trying to insert Author 1 multiple times.

Fortunately, fixing this is straightforward, if a little unintuitive:

// src/AppBundle/Service/RedditScraper.php

    public function scrape()
    {
        $client = new \GuzzleHttp\Client();

        $response = $client->request('GET', 'https://api.reddit.com/r/php.json');

        $contents = json_decode($response->getBody()->getContents(), true);

        foreach ($contents['data']['children'] as $child) {

            $redditPost = new RedditPost();
            $redditPost->setTitle($child['data']['title']);

            $authorName = $child['data']['author'];

            $redditAuthor = $this->em->getRepository('AppBundle:RedditAuthor')->findOneBy([
                'name' => $authorName
            ]); // inefficient, see below. 

            if (!$redditAuthor) {
                $redditAuthor = new RedditAuthor();
                $redditAuthor->setName($authorName);

                $this->em->persist($redditAuthor);
                $this->em->flush();
            }

            $this->em->persist($redditPost);
        }

        $this->em->flush();
    }

First, we check whether a RedditAuthor entity already exists for an author with the provided name, and only if it does not do we create, persist, AND flush a new RedditAuthor entity. It's the flush() that is perhaps the most unintuitive part.

Remember, without flushing, Doctrine won't actually write anything to the database.

By calling persist, we have only added the insert statement to the queue of statements to run on the next flush.

If we didn't flush until the entire loop had finished then - aside from very likely hitting the duplicate error - a new RedditAuthor would be created for every post. A previous entry for the same author would not yet have been written to the database; it would only have been added to the queue of new RedditAuthor entities to create. In other words, there would be nothing in the database table for the findOneBy query to find.

There is an unfortunate side effect to this method though. Doctrine will go ahead and run a new findOneBy query for every single iteration of the loop.

Earlier, we said a flush is costly because lots of activity may happen all at once. Yet keeping all this activity together is more efficient than running lots of small, individual queries in an ad-hoc manner. Unfortunately, when querying by a property other than a record's ID, Doctrine has to make a database lookup. There are definitely ways to improve on this system, but on this small scale, this implementation is sufficient.
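As one illustration of a possible improvement - not something we will use in this series - we could keep a small in-memory map of the authors already seen or created during this run, and only hit the database once per distinct author name. A rough sketch:

    public function scrape()
    {
        // ... Guzzle request and json_decode, exactly as before ...

        $authors = []; // author name => RedditAuthor, built up during this run

        foreach ($contents['data']['children'] as $child) {

            $redditPost = new RedditPost();
            $redditPost->setTitle($child['data']['title']);

            $authorName = $child['data']['author'];

            if (!isset($authors[$authorName])) {
                // only query the database once per distinct author name
                $redditAuthor = $this->em->getRepository('AppBundle:RedditAuthor')->findOneBy([
                    'name' => $authorName
                ]);

                if (!$redditAuthor) {
                    $redditAuthor = new RedditAuthor();
                    $redditAuthor->setName($authorName);
                    $this->em->persist($redditAuthor);
                }

                $authors[$authorName] = $redditAuthor;
            }

            $this->em->persist($redditPost);
        }

        // duplicates are caught in memory, so a single flush at the end is enough
        $this->em->flush();
    }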

Adding Indexes

We now have a Reddit post scraper pulling in real data and populating our Doctrine entities for both RedditPost and RedditAuthor. We haven't yet related the two together - that will happen in the next video.

Before we start working on the relationship, there is one really important thing to cover: Indexes.

Notice, in the above code snippet I highlighted the $redditAuthor lookup as being inefficient.

There are two reasons for this.

Firstly, as mentioned, Doctrine is forced to run a query whenever we search by criteria, which is exactly what we must do to find an existing RedditAuthor. For the moment, let's just accept that we must run this query.

Secondly, when Doctrine gets MySQL to run the real SELECT query, without an index the search will take longer than necessary. This is exactly the sort of problem that you won't catch in development. Arghh, the worst kind of problem - one that only appears in production.

Let me tell you from personal experience: forgetting to add indexes to fields that are searched on is going to give you a bad time. Unless you work in some obscure industry I am yet to experience, it is more than likely that you want your project / Symfony-based website to grow. Growth is good. But if you have a database with hundreds of thousands of rows and no indexes, expect pain.

Thankfully, adding an index in Doctrine is very simple. We just need to add another piece of mapping information to our annotations:

/**
 * @ORM\Entity()
 * @ORM\Table(name="reddit_author", indexes={
 *   @ORM\Index(name="index_author_name", columns={"name"})
 * })
 */
class RedditAuthor
{
    // etc
}

Now, truthfully, an index in this situation is overkill. We only have a few hundred records at most. But still, it is good practice to start thinking about indexes whenever you see or write queries beyond simple lookup by ID / primary key. As mentioned, this is a problem that tends to only rear its ugly head in production, and worse, only as your site starts to grow.

More information on indexes / the @ORM\Index annotation is available in the Doctrine annotations reference documentation.

Code For This Course

Get the code for this course.
