
Kotlin Web Scraping Example

Whenever I pick up a new programming language, sooner or later I get comfortable enough with it to want to tackle my favourite / most common personal programming need: web scraping.

Recently I have been playing with Kotlin quite a lot, and am now at the point where I have written a few simple web scrapers, and have enough hands-on experience to feel like I could document my current process.

For this web scraping example we will make use of only a handful of things that can take us surprisingly far:

In this walkthrough we will scrape HTML table content from the Web Scraper test site.

In order to GET the raw HTML we will make use of the Ktor web client.

And to turn the HTML into actually useful Kotlin objects, we will parse the raw HTML with JSoup, a Java HTML parsing library.

From there we will have a list of Kotlin data classes that we could continue on to do “interesting things” with. One such thing might be to save that information off to a database. For that, I believe a good choice is the Exposed SQL framework (as recommended in several places, the place I found out first about it being in this book), but I don’t have any personal experience with that just yet, so we won’t be covering that here.

Let’s jump right in.

Project Setup

There are basically three GUI-based, next > next > next type steps we need to complete to get our project off and running.

First, using IntelliJ I create a new Kotlin project:

[Screenshot: creating a new Kotlin project in IntelliJ]

After I do this, the PC fans kick in and it sounds as if we start off down the runway.

Whilst electricity is being consumed at an alarming rate, we might as well do the next essential thing, which is to click the link to “upgrade Gradle wrapper to 7.6 version and re-import the project”:

[Screenshot: the “upgrade Gradle wrapper to 7.6 version and re-import the project” link]

As best I can tell, this updates the distributionUrl property inside {root}/gradle/wrapper/gradle-wrapper.properties, and then runs the Gradle task to update the project. Beyond that, it’s basically behind the scenes magic to me.
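For reference, the property in question ends up looking something like this (the exact URL will depend on the version chosen, so treat this as a sketch):

distributionUrl=https\://services.gradle.org/distributions/gradle-7.6-bin.zip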

Finally, we need to run the project to make sure everything is actually working:

[Screenshot: running the new Kotlin project]

Great, we see the expected demo output in the “run” window.

Additional Dependencies

There are three extra dependencies that we will need to make this project work.

Here’s the build.gradle.kts file in full; the extra lines are discussed below:

plugins {
    kotlin("jvm") version "1.8.20"
    application
}

group = "org.example"
version = "1.0-SNAPSHOT"

repositories {
    mavenCentral()
}

val ktorVersion: String by project

dependencies {
    implementation("io.ktor:ktor-client-core:$ktorVersion")
    implementation("io.ktor:ktor-client-cio:$ktorVersion")
    implementation("org.jsoup:jsoup:1.16.1")

    testImplementation(kotlin("test"))
}

tasks.test {
    useJUnitPlatform()
}

kotlin {
    jvmToolchain(11)
}

application {
    mainClass.set("MainKt")
}

The jsoup line — implementation("org.jsoup:jsoup:1.16.1") — is given as a Gradle dependency on the JSoup download page, but is listed without the brackets:

// jsoup HTML parser library @ https://jsoup.org/
implementation 'org.jsoup:jsoup:1.16.1'

In order to use this in our Kotlin version of Gradle, we need the brackets (and double quotes). I believe this is because Gradle by default uses Groovy, whereas when we add the .kts extension the build file becomes a Kotlin script, and so we have to use Kotlin syntax. Nothing like needless extra complication, eh?

The more interesting lines however are for Ktor.

The ktorVersion declaration and the two io.ktor dependency lines are copied from the Ktor docs. As part of those docs you will need to add an entry into gradle.properties:

ktorVersion=2.3.0

Note that I am using camelCase, rather than the suggested snake_case. Why? Well, IntelliJ complained about snake case variable naming, which I found odd considering Ktor is a JetBrains library and this comes from their official docs.

Property name 'ktor_version' should not contain underscores

The val ktorVersion: String by project line uses property delegation to get the value of ktorVersion.

[Screenshot: the ktorVersion property delegate in build.gradle.kts]

The by keyword in Kotlin is used for property delegation.

In this case, by project indicates that the ktorVersion property will be delegated to the Gradle project, which means that the value will be obtained from the Gradle configuration.

I’m not saying I fully understand how this works just yet. But I appreciate, from a high level, what it is doing and why.
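To get a feel for by outside of Gradle, here’s a minimal plain-Kotlin sketch using the standard lazy delegate (nothing to do with Ktor or Gradle, just the same keyword at work):

fun main() {
    // the first read of greeting is delegated to the lazy block;
    // subsequent reads return the cached value
    val greeting: String by lazy {
        println("computing greeting...")
        "Hello, Kotlin"
    }

    println(greeting) // prints "computing greeting..." then "Hello, Kotlin"
    println(greeting) // prints "Hello, Kotlin" only
}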

Once you’ve made these changes, be sure to load your Gradle changes (i.e. update your elephant):

[Screenshot: the “Load Gradle Changes” button in IntelliJ]

OK, project set up and dependencies installed.

Let’s get scraping!

Fetching The Test Page

When setting up a web scraper I don’t particularly like firing requests at the live site. Oftentimes, companies do not appreciate us web scrapers harvesting their lovely hand-crafted web pages.

So it is usually a good idea to browse manually to the website, right click the page, view the page source, and copy / paste their HTML into a text file. If the web page is a little more dynamic and modern, switch that process out for right click > inspect > elements > right click and ‘edit as html’:

[Screenshot: Chrome’s “Edit as HTML” option in the Elements panel]

In this example, however, we have a basic HTML page that doesn’t require that extra complexity.

If you do go down the route of saving the HTML to a local file, you will then need to amend your code to load from the file rather than make a GET request each time you run the scraper. That is actually not a bad thing – it brings more modularity to your code (think: code to an interface, not an implementation).
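As a sketch of that idea (the names here are my own, not part of the final code), the source of the HTML could hide behind a small interface, with one implementation reading from disk and another making the GET request:

import java.io.File

// hypothetical abstraction: the scraper doesn't care where the HTML comes from
interface HtmlSource {
    suspend fun fetch(): String
}

class FileHtmlSource(private val path: String) : HtmlSource {
    override suspend fun fetch(): String = File(path).readText()
}

class HttpHtmlSource(private val url: String) : HtmlSource {
    // uses the same Ktor imports as the client code below
    override suspend fun fetch(): String =
        HttpClient(CIO).use { client -> client.get(url).bodyAsText() }
}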

But because this is a demo page and they expect it to be scraped, we will simply make a new GET request every time we run our app.

Let’s do that now:

import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*

suspend fun main() {
    val url = "https://webscraper.io/test-sites/tables/tables-semantically-correct"
    
    val client = HttpClient(CIO)
    val response: HttpResponse = client.get(url)
    client.close()

    val html = response.bodyAsText()

    println(response.status)
    println(html)
}

To begin, we need to import the necessary classes from the io.ktor.client and io.ktor.client.engine.cio packages. These classes will enable us to work with the Ktor HTTP client and handle HTTP requests effectively.

One important keyword used in this code is suspend before the main function. This keyword indicates that the function is a suspending function, allowing it to be suspended and resumed later without blocking the thread. This is particularly useful when dealing with asynchronous operations, such as making an HTTP request.
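If the suspend modifier on main looks odd, an equivalent entry point (assuming kotlinx-coroutines is on the classpath, which Ktor pulls in transitively) is a regular main wrapped in runBlocking:

import kotlinx.coroutines.runBlocking

// equivalent entry point: runBlocking bridges regular blocking code and
// coroutines, blocking the main thread until the coroutine inside completes
fun main() = runBlocking {
    println("inside a coroutine now")
}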

Inside the main function, we assign a URL string to the url variable. For this example, we’ve set it to https://webscraper.io/test-sites/tables/tables-semantically-correct. This URL represents the target website to which we’ll send the GET request. I’ve only split this out into its own variable as otherwise the code snippet has a horizontal scroll 🤮

To handle the HTTP request, we create an instance of the HTTP client using HttpClient(CIO). Here, the CIO parameter specifies that the Ktor client should use the CIO (Coroutine I/O) engine for handling HTTP requests. I’ve been doing a lot of reading into coroutines, and if we were doing this without Ktor we would want to use Dispatchers.IO, which is the coroutine dispatcher optimised for Input / Output (IO) work, such as network requests.

Next, we use the client.get(url) statement to send the GET request to the specified URL. This method initiates the request and returns an HttpResponse object, which we assign to the response variable. This object contains information about the response, including the HTTP status code, some of which we will print out shortly.

After receiving the response, it’s important to close the HTTP client using the client.close() statement. This ensures that any underlying resources associated with the client are properly released.

To access the content of the response, we call the response.bodyAsText() method. This method retrieves the response body as a text string, which we assign to the html variable. Now we have the HTML content of the response to play around with.

We can display some useful information by printing the response status code using println(response.status). The status property of the HttpResponse object contains the HTTP status code, which provides insights into the outcome of the request.

Finally, we print the HTML content itself using println(html). This allows us to see the actual data retrieved from the remote server.

[Screenshot: output from the initial run]

In your console you should be able to see the HTTP response status code (200 OK here), and then the full HTML of the page.

Initial Refactoring

Although this code works, we should handle any situation in which, for some reason, it doesn’t.

To do this, I will wrap the GET request in a try block and then use finally to ensure the client is always closed down properly:

import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*

suspend fun main() {
    val url = "https://webscraper.io/test-sites/tables/tables-semantically-correct"

    val client = HttpClient(CIO)
    try {
        val response: HttpResponse = client.get(url)
        val html = response.bodyAsText()

        println(response.status)
        println(html)
    } finally {
        client.close()
    }
}

The client needs to be initialised outside of the try block so that it is still in scope in the finally block.

Now the finally block will always be executed, whether we succeed or some exception occurs.

This ensures that resources are released regardless of whether an exception is thrown or not.
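If you would rather react to a failed request than let the exception bubble up, a sketch with an added catch block (my addition, not part of the final code) might look like:

try {
    val response: HttpResponse = client.get(url)
    println(response.status)
    println(response.bodyAsText())
} catch (e: Exception) {
    // network failures, timeouts, and the like all end up here
    println("Request failed: ${e.message}")
} finally {
    // runs whether we succeeded or threw
    client.close()
}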

Scraping The Table Data Using JSoup

On the test page we have a couple of tables. I am really not bothered about the second table. It’s good that it is there, because it allows you to take things further, if you are so inclined.

But for the purposes of this post, I am only going to focus on extracting data from the first table:

[Screenshot: the first table we will be scraping with Kotlin and JSoup]

Somewhere in the returned HTML, we already have this… as text.

Now we need to use JSoup to turn it into objects we can work with.

In order to do that, we need to find a CSS selector we can use to target the table we care about. Looking at the HTML, each of the tables lacks an id (the easiest way to target unique page elements), but they do share a pair of common CSS classes:

[Screenshot: the table’s CSS classes in the page source]

Which gives us either .table or .table-bordered to work with.

A handy tip inside your browser is that you can right click elements and get the CSS selector if you are at all unsure:

[Screenshot: copying a CSS selector from Chrome dev tools]

In the above screenshot we are very specifically targeting the first table, the first row, and the second column. The CSS selector for this would be:

body > div.wrapper > div.container > div > div > div.col-md-9 > div > div.tables-semantically-correct > table:nth-child(3) > tbody > tr:nth-child(1) > td:nth-child(2)

Good luck figuring that out without help from your browser.

For my part, I try to keep my selectors as simple as possible. Often you already know, just by looking, what a good selector would be.

In this case we have already identified that .table or .table-bordered would be good.

The problem with using CSS classes is that zero or more elements can match a given class. With an id there should only ever be zero or one. Well, unless the HTML is malformed. But that’s a more advanced topic.
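To make that concrete, a quick sketch (the id here is hypothetical; the class comes from this page):

// an id should match at most one element, so first() gives us that element or null
val byId = document.select("#some-hypothetical-id").first()

// a class can match many elements; select returns all of them
val tables = document.select(".table")
println(tables.size)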

In this case, we will select the first element that matches .table-bordered. And that’s simply a case of reading the JSoup selector syntax docs:

val html = response.bodyAsText()

// requires: import org.jsoup.Jsoup
val document = Jsoup.parse(html)
val table = document.select(".table-bordered").first()

However, we have just said that a matching element may or may not exist.

Which means table could just as easily be null as some form of JSoup Element.

What If The Element Doesn’t Exist?

Taking into account the potential for the table to be null, we can use the safe call operator (?.) to continue on with another select:

table?.select("tbody tr")?.forEach { row ->
    val cells = row.select("td")

    val id = cells[0].text()
    val firstName = cells[1].text()
    val lastName = cells[2].text()
    val username = cells[3].text()
}

This is one way to extract data from HTML table rows in Kotlin: using the safe call operator and a forEach loop to iterate over the table rows and access the individual table cell data.

The Safe Call operator ?. is used to ensure that the code proceeds only if table is not null.

The select("tbody tr") method is called on the table, selecting all tr elements within the tbody.

The subsequent ?. operator is needed because the result of the chain is itself nullable: if table was null, select was never called, and the whole expression evaluates to null.

The forEach function is then called on the rows, and the code within the lambda expression is executed for each row.

Inside the lambda expression ({ row -> ... }), the code processes each row of the table.

The line val cells = row.select("td") selects all td elements within the current row and assigns them to the cells variable.

To extract specific cell data, the code uses indexing and the text() method.
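If the chained safe calls feel dense, here is an equivalent, more explicit version of the same logic:

// equivalent to table?.select("tbody tr")?.forEach { row -> ... }
val rows = table?.select("tbody tr")
if (rows != null) {
    for (row in rows) {
        val cells = row.select("td")
        // ... process cells as before
    }
}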

Check and Check Again

When scraping, you have to be defensive.

It’s not so bad when we have actually seen the screen. We know the table is valid. We know there are three rows with four columns each.

In the real world, most of the time, you won’t know.

At least, you can’t be absolutely sure.

So what do you do?

You plan for the worst.

Which means you have to check everything, or risk things blowing up. The more pages you scrape (and sometimes we will scrape thousands), the higher the risk of something going wrong.

table?.select("tbody tr")?.forEach { row ->
    val cells = row.select("td")

    if (cells.size != 4) {
        return@forEach
    }

    val id = cells[0].text()
    val firstName = cells[1].text()
    val lastName = cells[2].text()
    val username = cells[3].text()
}

It’s not the best possible check, but it illustrates the point.

We check that the total number of cells found for this row is exactly four.

If not, we return. But note the return@forEach.

This is a labelled return statement. This syntax ensures we return from the current lambda, and continue on with the next. It’s very similar to continue in a language like Java or JavaScript.

If we only used return here, we would return from the entire enclosing function.

You could validate this yourself with a slightly different forEach – this time using forEachIndexed – which makes the process much easier to control:

table?.select("tbody tr")?.forEachIndexed { index, row ->
    val cells = row.select("td")

    // skip only the second row (index 1) to prove the labelled return
    // moves on to the next iteration rather than exiting main
    if (index == 1) {
        return@forEachIndexed
    }

    val id = cells[0].text()
    val firstName = cells[1].text()
    val lastName = cells[2].text()
    val username = cells[3].text()
}

Storing The Scraped Data

For the purposes of this example, we will store the scraped data in a class, and then store each class in a list structure.

That’s good enough for now, but in reality you would likely store your data in a database or similar.

Here’s our data class:

data class UserData(
    val id: Int,
    val firstName: String,
    val lastName: String,
    val username: String
)

And the modified code to create and store this UserData:

val userList = mutableListOf<UserData>()

table?.select("tbody tr")?.forEach { row ->
    // ...

    val id = cells[0].text()
    val firstName = cells[1].text()
    // ...

    userList.add(UserData(id, firstName, lastName, username))
}

for (user in userList) {
    println(user)
}

Pretty straightforward.

We have a data class that specifies the properties we are scraping, and their desired types. That data class is defined outside the main function – a full sample of the code will be shown at the end, if you are at all unclear.

Note now that the UserData class defines id as an Int. Currently this is being scraped from the page as a string.

This is really easy to fix:

val id = cells[0].text().toInt()
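A further defensive tweak (my suggestion, not used in the final code): toInt() throws a NumberFormatException if the cell contents aren’t numeric, whereas toIntOrNull() would let us skip a malformed row instead:

// toIntOrNull() returns null rather than throwing on non-numeric input,
// so a bad row is skipped instead of crashing the scraper
val id = cells[0].text().toIntOrNull() ?: return@forEach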

With the toInt() change in place, we should be able to run the scraper and see the following output:

UserData(id=1, firstName=Mark, lastName=Otto, username=@mdo)
UserData(id=2, firstName=Jacob, lastName=Thornton, username=@fat)
UserData(id=3, firstName=Larry, lastName=the Bird, username=@twitter)

Process finished with exit code 0

Sweet.

Final Refactoring

I’m pretty happy with this code.

There are plenty of ways to improve this. But for a single demo function, it proves we can:

  • GET web pages
  • Parse the HTML to extract interesting content
  • Store that content in a Kotlin friendly format

That’s a pretty good start.

Obviously it’s not production code. It’s code written by someone who is very much still learning their way around Kotlin, and by extension, Java.

However, whilst I might be pretty happy with this, IntelliJ is suggesting an improvement:

try-finally can be replaced with 'use()':

[Screenshot: IntelliJ’s “try-finally can be replaced with ‘use()’” inspection]

Yes, it seems that try / finally is old hat. A better abstraction over manually wrapping our code in try / finally is to use use.

Here’s the change:

val httpClient = HttpClient(CIO)
httpClient.use { client ->
    // ...
}

And I figured I could go one step further by removing the need for the httpClient variable:

HttpClient(CIO).use { client ->
    // ...
}

I can’t remember where I read about use, but I know it relies on the thing being used implementing the Closeable interface.

Basically what’s happening here is that instead of us manually having to manage the client connection, if we use the client then Kotlin will automatically close it for us, whether the block completes successfully or throws.

We can even ctrl + click the use function and see the implementation:

[Screenshot: the use implementation for Closeable]

As best I can see, the use code doesn’t handle the exception itself; it seems to re-throw it. So we would still need to handle that ourselves, at a guess.
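For a rough feel of what use does, here’s a simplified sketch (not the actual stdlib source, which also deals with suppressed exceptions; I’ve named it sketchUse to make that obvious):

import java.io.Closeable

// simplified sketch of a use-style helper for Closeable resources
inline fun <T : Closeable?, R> T.sketchUse(block: (T) -> R): R {
    try {
        return block(this)
    } finally {
        // runs whether block returned normally or threw;
        // any exception from block still propagates to the caller
        this?.close()
    }
}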

But anyway, that’s good enough I feel.

Final Code

Here’s the scraper code we ended up with:

import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import org.jsoup.Jsoup

data class UserData(
    val id: Int,
    val firstName: String,
    val lastName: String,
    val username: String
)

suspend fun main() {
    val url = "https://webscraper.io/test-sites/tables/tables-semantically-correct"

    HttpClient(CIO).use { client ->
        val response: HttpResponse = client.get(url)

        val html = response.bodyAsText()

        val document = Jsoup.parse(html)
        val table = document.select(".table-bordered").first()

        val userList = mutableListOf<UserData>()

        table?.select("tbody tr")?.forEach { row ->
            val cells = row.select("td")

            if (cells.size != 4) {
                return@forEach
            }

            val id = cells[0].text().toInt()
            val firstName = cells[1].text()
            val lastName = cells[2].text()
            val username = cells[3].text()

            userList.add(UserData(id, firstName, lastName, username))
        }

        for (user in userList) {
            println(user)
        }
    }
}

As I said above, there’s lots of ways to improve this. We’re breaking all manner of good coding principles, not least of which here is the Single Responsibility principle.

However, for a first stab at web scraping with Kotlin, this does show how few lines of code are required. I quite like it, honestly.
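As one sketch of a tidier structure (the function names are mine, not from the repo), fetching and parsing could be split into separate responsibilities:

// hypothetical refactor: each function now has a single job
suspend fun fetchHtml(url: String): String =
    HttpClient(CIO).use { client -> client.get(url).bodyAsText() }

fun parseUsers(html: String): List<UserData> =
    Jsoup.parse(html)
        .select(".table-bordered").first()
        ?.select("tbody tr")
        ?.mapNotNull { row ->
            val cells = row.select("td")
            if (cells.size != 4) return@mapNotNull null
            val id = cells[0].text().toIntOrNull() ?: return@mapNotNull null
            UserData(id, cells[1].text(), cells[2].text(), cells[3].text())
        }
        ?: emptyList()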

GitHub Repo

https://github.com/codereviewvideos/kotlin-web-scraping-example
