Hands On - Scraping Reddit to Explain Entity Relationships


In this video we are going to start working with Doctrine relationships. For each of our RedditPost entities we will have an associated RedditAuthor entity.

To make this more interesting, we will also build a little Reddit scraper that will grab the first few pages of the subreddit of your choice (/r/php in my case), and then create new entities based on this scraped data.

Talking to the Reddit API will be done using Guzzle, in a very similar manner to the way we got real data from GitHub in the GitHut series. However, we will expand and improve on this in the next video.

We are likely to hit a few challenges along the way. What happens if we scrape two or more RedditPosts from the same RedditAuthor? We don't want duplicates in our database, so we will need to think of a way to prevent this.

One way to remove duplicates may be to collect all the scraped results together into a single array, then use a function such as array_unique to remove any duplicate values. With a bit of thought, this may work fine initially. The problem here would be if we were to re-run the scraper. The second time through, the array wouldn't know of any existing entries in the database - only the data from this current scrape run. Hmmm, not so good.
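To make that limitation concrete, here is a minimal sketch in plain PHP. The author names are made up, and the two arrays simply stand in for two separate scrape runs:

```php
<?php

// Hypothetical author names pulled from one scrape run; note the repeat.
$scrapedAuthors = ['alice', 'bob', 'alice', 'carol'];

// array_unique keeps the first occurrence of each value and preserves keys,
// so array_values is used to re-index the result.
$uniqueAuthors = array_values(array_unique($scrapedAuthors));

// A second, separate run has no memory of the first one.
$secondRunAuthors = array_values(array_unique(['bob', 'dave', 'bob']));

// 'bob' is unique within each run, yet he appears in both runs, so saving
// both sets of results would still insert him twice.
$seenInBothRuns = array_values(array_intersect($uniqueAuthors, $secondRunAuthors));
```

Each run is free of duplicates on its own, but $seenInBothRuns still contains 'bob', which is exactly the re-run problem described above.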

Another way would be to make our RedditAuthor entities unique somehow, ensuring a duplicate cannot enter the underlying database table. However, we are then going to need to run a query against the database for every single RedditAuthor that we scrape. This could become extremely slow as our database table grows in size. To dramatically reduce this slowdown, we will need to add an index to the table, which will make these searches much, much quicker.
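As a rough sketch of what this could look like, Doctrine's column mapping accepts a unique option. The version below is an assumption about where the RedditAuthor entity (introduced further down) might end up, not necessarily what this series settles on:

```php
<?php

// src/AppBundle/Entity/RedditAuthor.php (sketch only)

namespace AppBundle\Entity;

use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 * @ORM\Table(name="reddit_authors")
 */
class RedditAuthor
{
    /**
     * @ORM\Column(type="integer")
     * @ORM\Id
     * @ORM\GeneratedValue(strategy="AUTO")
     */
    protected $id;

    /**
     * unique=true asks Doctrine to generate a unique index on this column.
     * The database then rejects a second author with the same name, and
     * the same index speeds up lookups by name.
     *
     * @ORM\Column(type="string", unique=true)
     */
    protected $name;

    public function getName()
    {
        return $this->name;
    }

    public function setName($name)
    {
        $this->name = $name;

        return $this;
    }
}
```

With a constraint like this in place, the scraper would still need to look up an existing author before creating a new one (for example with the repository's findOneBy(['name' => $name])), otherwise the insert would fail on a duplicate name.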

There are always trade-offs to be made when developing a software system. This is only a small example, purely for demonstration. I haven't made plans for Reddit going offline, or resuming a scrape, or anything like that. Even so, we've already had to make some guesses (ahem, design decisions!) based on the currently available information.

A First Guess At RedditAuthor

We already have our RedditPost entity, so we are all set up to scrape some post information. However, we also want to associate any given RedditPost with the post's author. We will therefore create a new RedditAuthor entity:

<?php

// src/AppBundle/Entity/RedditAuthor.php

namespace AppBundle\Entity;

use Doctrine\Common\Collections\ArrayCollection;
use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 * @ORM\Table(name="reddit_authors")
 */
class RedditAuthor
{
    /**
     * @ORM\Column(type="integer")
     * @ORM\Id
     * @ORM\GeneratedValue(strategy="AUTO")
     */
    protected $id;

    /**
     * @ORM\Column(type="string")
     */
    protected $name;

    /**
     * @return int|null null until the entity has been persisted
     */
    public function getId()
    {
        return $this->id;
    }

    /**
     * @return string|null
     */
    public function getName()
    {
        return $this->name;
    }

    /**
     * @param string $name
     * @return RedditAuthor
     */
    public function setName($name)
    {
        $this->name = $name;

        return $this;
    }
}

This is the first draft of the RedditAuthor entity. Much like the RedditPost entity from the first video in this series, our entity has an auto-incrementing ID, and a simple string property to store the author's name.

This being a valid entity definition, we can go ahead and get Doctrine to update the underlying database for us, adding in the new table, creating the columns, and so on:

php bin/console doctrine:schema:update --force

As a side note, as your project grows you will likely want to manage database schema changes in a more organised fashion. Running doctrine:schema:update --force is potentially dangerous, because it executes the generated SQL directly against your database. Passing --dump-sql instead of --force prints that SQL without running it, and for anything beyond a demo I would advise you to consider a database migration strategy.

Preparing To Scrape Reddit

With a basic entity set up, we now have somewhere we can save off a post's author information. There's no relationship configured yet, or indexing, or anything like that. In other words, this is going to break quite quickly. But let's pretend we don't know that yet, and just continue on.

In a real-world application, we would likely want to create a Symfony Console Command rather than placing the scraping call in a standard controller action. There are a number of reasons for this:

  • a console command can be called from a cron (scheduled) job;
  • the scrape has nothing to display on the front end;
  • triggering a scrape by visiting a URL doesn't make much sense.

And so on. However, we haven't covered console commands yet here on CodeReviewVideos, and as this series is aimed at developers who are new to Symfony, adding in console commands at this stage would only make learning more difficult. But we will get to this in a future series, don't worry!

That said, we are still going to declare a new Symfony Service for our Reddit Scraper. It doesn't matter whether the call to trigger the scrape comes from browsing to a URL, running a console command, or any other way you can think of. The real logic that runs the scrape will be centralised into a single service called reddit_scraper.

Declaring the reddit_scraper service is very similar to how we configured our GitHub scraper service in the previous tutorial series:

# app/config/services.yml

services:
    reddit_scraper:
        class: AppBundle\Service\RedditScraper
        arguments:
            - "@doctrine.orm.default_entity_manager"

There is one difference this time: we are injecting Doctrine's Entity Manager.

With this configuration, each entry under arguments is passed, in order, to the service's constructor, so the entity manager will be available to us as the constructor's first parameter:

<?php

// src/AppBundle/Service/RedditScraper.php

namespace AppBundle\Service;

use Doctrine\ORM\EntityManagerInterface;

class RedditScraper
{
    /**
     * @var EntityManagerInterface
     */
    private $em;

    public function __construct(EntityManagerInterface $em)
    {
        $this->em = $em;
    }
}

Literally any configured service inside your application's container is available to be injected in this way. This includes reddit_scraper. We declare a new service, and could then create another new service, and inject reddit_scraper into that service:

# app/config/services.yml

services:
    reddit_scraper:
        class: AppBundle\Service\RedditScraper
        arguments:
            - "@doctrine.orm.default_entity_manager"

    some_other_service:
        class: DifferentBundle\Service\AnotherService
        arguments:
            - "@reddit_scraper"

Pretty cool.

Notice though, we talked earlier about using Guzzle for making requests to the Reddit API. Yet, we are not injecting Guzzle. This is something we will correct soon, but for now, we will create a new instance of a Guzzle Client in the service as needed.

To work with Guzzle we need to declare it as a dependency of our project. We do this using Composer, which in turn will update the composer.json and composer.lock files. Again, if this is new to you, please do watch the previous series where this is discussed in more depth.

composer require guzzlehttp/guzzle

This will download Guzzle for us, storing the files in your vendor directory.

Once Guzzle is in your project, you can start using it. To do this, we need an instance of the Guzzle Client:

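// src/AppBundle/Service/RedditScraper.php

    public function scrape()
    {
        // Created inline for now; injecting the client comes later.
        $client = new \GuzzleHttp\Client();

        // Fetch the first page of /r/php as JSON.
        $response = $client->request('GET', 'https://api.reddit.com/r/php.json');

        // Passing true decodes JSON objects into associative arrays.
        $data = json_decode($response->getBody()->getContents(), true);

        return $data;
    }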

At this stage, we should now be getting back the first page of results from Reddit's /r/php subreddit, stored as an associative array (or just an array, to you and me) of data.
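To get a feel for the shape of that array, here is a small sketch that decodes a hand-written sample instead of calling Reddit. The nesting (a data key holding children, each with its own data, plus an after cursor that is typically used to request the next page) reflects my understanding of Reddit's listing format, and the values are invented, so do check it against a real response:

```php
<?php

// Trimmed, hand-written sample shaped like a Reddit listing response.
// The values are made up; only the nesting matters here.
$json = <<<'JSON'
{
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",
        "children": [
            {"kind": "t3", "data": {"title": "First post", "author": "alice"}},
            {"kind": "t3", "data": {"title": "Second post", "author": "bob"}},
            {"kind": "t3", "data": {"title": "Third post", "author": "alice"}}
        ]
    }
}
JSON;

// Same decoding step as in scrape(): true gives associative arrays.
$data = json_decode($json, true);

// Group post titles under the name of the author who wrote them.
$titlesByAuthor = [];
foreach ($data['data']['children'] as $child) {
    $post = $child['data'];
    $titlesByAuthor[$post['author']][] = $post['title'];
}

// Cursor pointing at the next page of results.
$after = $data['data']['after'];
```

Notice that alice appears twice in the sample, which is the same duplicate-author situation we discussed at the start.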

In the next video we will start looping through this data to create our RedditPost and RedditAuthor entities, and define relationships based on this information.
