Whenever I pick up a new programming language, sooner or later I get comfortable enough with it to want to tackle my favourite / most common personal programming need: web scraping.
Recently I have been playing with Kotlin quite a lot, and I am now at the point where I have written a few simple web scrapers and have enough hands-on experience to document my current process.
For this web scraping example we will make use of only a handful of things that can take us surprisingly far. In this walkthrough we will scrape HTML table content from the Web Scraper test site. To GET the raw HTML we will make use of the Ktor web client. And to turn the HTML into actually useful Kotlin objects, we will parse it with JSoup, a Java library.
From there we will have a list of Kotlin data class instances that we could continue on to do “interesting things” with. One such thing might be to save that information off to a database. For that, I believe a good choice is the Exposed SQL framework (as recommended in several places; I first found out about it in this book), but I don’t have any personal experience with it just yet, so we won’t be covering that here.
Let’s jump right in.
Project Setup
There are basically three GUI-based, next > next > next type steps we need to complete to get our project off and running.
First, using IntelliJ I create a new Kotlin project:
After I do this, the PC fans kick in and it sounds as if we start off down the runway.
Whilst electricity is being consumed at an alarming rate, we might as well do the next essential thing, which is to click the link to “upgrade Gradle wrapper to 7.6 version and re-import the project”:
As best I can tell, this updates the distributionUrl property inside {root}/gradle/wrapper/gradle-wrapper.properties, and then runs the Gradle task to update the project. Beyond that, it’s basically behind-the-scenes magic to me.
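For reference, that file is only a few lines long. After the upgrade it should look roughly like this (the exact distribution URL depends on your Gradle version):

distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-7.6-bin.zip
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists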
Finally we need to start the project to actually make sure everything is working:
Great, we see the expected demo output in the “run” window.
Additional Dependencies
There are three extra dependencies that we will need to make this project work.
Here’s the build.gradle.kts file in full. The extra lines are the val ktorVersion declaration and the three new implementation dependencies:
plugins {
    kotlin("jvm") version "1.8.20"
    application
}

group = "org.example"
version = "1.0-SNAPSHOT"

repositories {
    mavenCentral()
}

val ktorVersion: String by project

dependencies {
    implementation("io.ktor:ktor-client-core:$ktorVersion")
    implementation("io.ktor:ktor-client-cio:$ktorVersion")
    implementation("org.jsoup:jsoup:1.16.1")
    testImplementation(kotlin("test"))
}

tasks.test {
    useJUnitPlatform()
}

kotlin {
    jvmToolchain(11)
}

application {
    mainClass.set("MainKt")
}
The jsoup line is given as a Gradle dependency on the JSoup download page, but is listed without the brackets:
// jsoup HTML parser library @ https://jsoup.org/
implementation 'org.jsoup:jsoup:1.16.1'
In order to use this in our Kotlin version of Gradle, we need the brackets. I believe this is because Gradle by default uses Groovy, whereas when we add the .kts extension the build file becomes a Kotlin script, and so we have to use Kotlin syntax. Nothing like needless extra complication, eh?
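So the same line, translated for our Kotlin DSL build file, gains parentheses and double quotes:

// Kotlin DSL (build.gradle.kts) - function call syntax, so parentheses and double quotes
implementation("org.jsoup:jsoup:1.16.1")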
The more interesting lines, however, are for Ktor. The val ktorVersion declaration and the two ktor-client dependencies are copied from the Ktor docs. As part of those docs, you will need to add an entry into gradle.properties:
ktorVersion=2.3.0
Note that I am using camelCase, rather than the suggested snake_case. Why? Well, IntelliJ complained about snake case variable naming, which I found odd considering Ktor is a JetBrains library and this comes from their official docs.
The val ktorVersion: String by project line is using property delegation to get the value of ktorVersion. The by keyword in Kotlin is used for property delegation. In this case, by project indicates that the ktorVersion property will be delegated to the Gradle project, which means that the value will be obtained from the Gradle configuration - in our case, the gradle.properties entry above.
I’m not saying I fully understand how this works just yet. But I appreciate, from a high level, what it is doing and why.
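To get a feel for the same by keyword in a simpler setting, here is a minimal sketch using the standard library’s lazy delegate (nothing Gradle-specific about this one):

// The 'by' keyword hands reads of the property over to the delegate object.
// lazy computes the value once, on first access, then caches it.
val greeting: String by lazy {
    println("Computing greeting...")
    "Hello, Kotlin"
}

fun main() {
    println(greeting) // first access: prints "Computing greeting..." then "Hello, Kotlin"
    println(greeting) // second access: prints "Hello, Kotlin" only - the value is cached
}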
Once you’ve made these changes, update your elephant; that is, be sure to load your Gradle changes:
OK, project setup and dependencies installed.
Let’s get scraping!
Fetching The Test Page
When setting up a web scraper I don’t particularly like firing requests at the live site. Oftentimes, companies do not appreciate us web scrapers harvesting their lovely hand-crafted web pages.
So it is usually a good idea to browse manually to the website, right click the page, view the page source, and copy / paste their HTML into a text file. If the web page is a little more dynamic and modern, switch that process out for right click > inspect > elements > right click and ‘edit as html’:
In this example, however, we have a basic HTML page that doesn’t require that extra complexity.
If you do go down the route of saving the HTML to a local file, you will then need to amend your code to load from the file rather than make a GET request each time you run the scraper. That is actually not a bad thing - it brings more modularity to your code (think: code to an interface, not an implementation).
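As a rough sketch of that idea - HtmlSource and its implementations are names I’ve invented for illustration - the scraper would depend on an interface, with one implementation per source:

import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import java.io.File

// Hypothetical abstraction: where the HTML comes from is an implementation detail.
interface HtmlSource {
    suspend fun fetch(): String
}

// Reads a previously saved copy of the page from disk.
class FileHtmlSource(private val path: String) : HtmlSource {
    override suspend fun fetch(): String = File(path).readText()
}

// Makes a live GET request, as we do in the rest of this post.
class KtorHtmlSource(private val url: String) : HtmlSource {
    override suspend fun fetch(): String =
        HttpClient(CIO).use { client -> client.get(url).bodyAsText() }
}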
But because this is a demo page where they expect us to be scrapers, we will simply make a new GET request every time we run our app.
Let’s do that now:
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*

suspend fun main() {
    val url = "https://webscraper.io/test-sites/tables/tables-semantically-correct"

    val client = HttpClient(CIO)
    val response: HttpResponse = client.get(url)
    client.close()

    val html = response.bodyAsText()

    println(response.status)
    println(html)
}
To begin, we need to import the necessary classes from the io.ktor.client and io.ktor.client.engine.cio packages. These classes will enable us to work with the Ktor HTTP client and handle HTTP requests effectively.
One important keyword used in this code is suspend, just before the main function. This keyword indicates that the function is a suspending function, allowing it to be suspended and resumed later without blocking the thread. This is particularly useful when dealing with asynchronous operations, such as making an HTTP request.
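To make that a little more concrete, here is a minimal sketch (assuming kotlinx-coroutines is on the classpath, which Ktor already pulls in): a suspend function can call other suspend functions directly, but regular code has to enter a coroutine first, for example via runBlocking:

import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// A suspending function: it can pause (here via delay) without blocking the thread.
suspend fun slowGreeting(): String {
    delay(1000) // suspends for one second; the underlying thread is free to do other work
    return "Hello, after a pause"
}

// A regular function needs a coroutine builder such as runBlocking to call it.
fun main() = runBlocking {
    println(slowGreeting())
}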
Inside the main function, we assign a URL string to the url variable. For this example, we’ve set it to https://webscraper.io/test-sites/tables/tables-semantically-correct. This URL represents the target website to which we’ll send the GET request. I’ve only split this out into its own variable as otherwise the code snippet has a horizontal scroll 🤮
To handle the HTTP request, we create an instance of the HTTP client using HttpClient(CIO). Here, the CIO parameter specifies that the Ktor client should use the CIO (Coroutine-based I/O) engine for handling HTTP requests. I’ve been doing a lot of reading into coroutines, and if we were doing this without Ktor we would use Dispatchers.IO, which is the coroutine dispatcher optimised for input / output (IO) work such as network requests.
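As a rough sketch of what that might look like without Ktor - fetchBlocking is a made-up name, and URL(...).readText() is the blocking call being moved onto the IO dispatcher:

import java.net.URL
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Without Ktor: wrap a blocking read in Dispatchers.IO so it runs on a
// thread pool sized for waiting on the network, not for CPU work.
suspend fun fetchBlocking(url: String): String =
    withContext(Dispatchers.IO) {
        URL(url).readText()
    }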
Next, we use the client.get(url) statement to send the GET request to the specified URL. This method initiates the request and returns an HttpResponse object, which we assign to the response variable. This object contains information about the response, including the HTTP status code. We will print some of this out to help take this code further.
After receiving the response, it’s important to close the HTTP client using the client.close() statement. This ensures that any underlying resources associated with the client are properly released. To access the content of the response, we call the response.bodyAsText() method. This method retrieves the response body as a text string, which we assign to the html variable. Now we have the HTML content of the response to play around with.
We can display some useful information by printing the response status code using println(response.status). The status property of the HttpResponse object contains the HTTP status code, which provides insight into the outcome of the request.
Finally, we print the HTML content itself using println(html). This allows us to see the actual data retrieved from the remote server.
In your console you should be able to see the HTTP response status code (200 OK here), and then the full HTML of the page.
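As an aside, if you wanted to branch on that status rather than just print it, a minimal sketch might be (bodyIfOk is a made-up helper name; HttpStatusCode comes from io.ktor.http):

import io.ktor.client.statement.*
import io.ktor.http.*

// Hypothetical helper: only hand back the body when the server responded 200 OK.
suspend fun bodyIfOk(response: HttpResponse): String? =
    if (response.status == HttpStatusCode.OK) response.bodyAsText() else null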
Initial Refactoring
Although this code works, I think we should handle any situation in which it doesn’t, for whatever reason. To do this, I will wrap the get request in a try block and then use finally to ensure the client is always closed down properly:
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*

suspend fun main() {
    val url = "https://webscraper.io/test-sites/tables/tables-semantically-correct"

    val client = HttpClient(CIO)

    try {
        val response: HttpResponse = client.get(url)
        val html = response.bodyAsText()

        println(response.status)
        println(html)
    } finally {
        client.close()
    }
}
The client needs to be initialised outside of the try block, so that it is still in scope inside the finally block. The finally block will always be executed, whether the request succeeds or an exception occurs. This ensures that resources are released regardless of whether an exception is thrown or not.
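And if we wanted to recover from network failures rather than let them crash the program, a catch block could slot in between. A sketch, assuming java.io.IOException covers the failures we care about (Ktor can also throw its own exceptions, such as request timeouts):

try {
    val response: HttpResponse = client.get(url)

    println(response.status)
    println(response.bodyAsText())
} catch (e: java.io.IOException) {
    // Network-level failures: DNS lookup errors, refused connections, resets, etc.
    println("Request failed: ${e.message}")
} finally {
    client.close()
}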
Scraping The Table Data Using JSoup
On the test page we have a couple of tables. I am really not bothered about the second table. It’s good that it is there, because it allows you to take things further, if you are so inclined.
But for the purposes of this post, I am only going to focus on extracting data from the first table:
Somewhere in the returned HTML, we already have this… as text.
Now we need to use JSoup to turn it into objects we can work with.
In order to do that, we need to find a CSS selector we can use to target the table we care about. Looking at the HTML, each of the table elements lacks an id (the easiest way to target unique page elements), but does contain a pair of common CSS classes. That gives us either .table or .table-bordered to go off.
A handy tip inside your browser is that you can right click elements and get the CSS selector if you are at all unsure:
In the above screenshot we are very specifically targeting the first table, the first row, and the second column. The CSS selector for this would be:
body > div.wrapper > div.container > div > div > div.col-md-9 > div > div.tables-semantically-correct > table:nth-child(3) > tbody > tr:nth-child(1) > td:nth-child(2)
Good luck figuring that out without help from your browser.
Personally though, I try to keep my selectors as simple as possible. Often you already know, just by looking, what a good selector would be. In this case we have already identified that .table or .table-bordered would do.
The problem, as such, when using CSS classes is that there can be zero or more matching elements. With an id there should only ever be zero or one. Well, unless the HTML is malformed. But that’s a more advanced topic.
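To see that difference in JSoup terms, here is a tiny self-contained sketch using made-up HTML:

import org.jsoup.Jsoup

fun main() {
    val html = """
        <div id="main">
            <table class="table table-bordered"></table>
            <table class="table"></table>
        </div>
    """

    val doc = Jsoup.parse(html)

    println(doc.select("#main").size)           // 1 - ids should be unique
    println(doc.select(".table").size)          // 2 - classes can repeat
    println(doc.select(".table-bordered").size) // 1 - only the first table has it
}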
In this case, we will select the first element that matches .table-bordered. And that’s simply a case of reading the JSoup selector syntax docs:
val html = response.bodyAsText()
val document = Jsoup.parse(html)
val table = document.select(".table-bordered").first()
However, we have just said that such an element may or may not exist. Which means table could just as easily be null as some form of JSoup Element.
What If The Element Doesn’t Exist?
Taking into account the potential for the table to be null, we can use the safe call operator (?.) to continue on with another select:
table?.select("tbody tr")?.forEach { row ->
    val cells = row.select("td")

    val id = cells[0].text()
    val firstName = cells[1].text()
    val lastName = cells[2].text()
    val username = cells[3].text()
}
This is one way to extract data from HTML table rows in Kotlin: using the safe call operator and a forEach loop to iterate over the table rows and access the individual table cell data.
The safe call operator ?. is used to ensure that the code proceeds only if table is not null. The select("tbody tr") method is called on the table, selecting all tr elements within the tbody. The subsequent ?. keeps the chain null-safe: if table was null, the whole expression short-circuits and the forEach never runs. Otherwise, the forEach function is called on the rows, and the code within the lambda expression is executed for each row.

Inside the lambda expression ({ row -> ... }), the code processes each row of the table. The line val cells = row.select("td") selects all td elements within the current row and assigns them to the cells variable. To extract specific cell data, the code uses indexing and the text() method.
Check and Check Again
When scraping, you have to be defensive.
It’s not so bad when we have actually seen the screen. We know the table is valid. We know there are three rows with four columns each.
In the real world, most of the time, you won’t know.
At least, you can’t be absolutely sure.
So what do you do?
You plan for the worst.
Which means you have to check everything, or risk things blowing up. The more pages you scrape (and sometimes we will scrape thousands), the higher the risk of something going wrong.
table?.select("tbody tr")?.forEach { row ->
    val cells = row.select("td")

    if (cells.size != 4) {
        return@forEach
    }

    val id = cells[0].text()
    val firstName = cells[1].text()
    val lastName = cells[2].text()
    val username = cells[3].text()
}
It’s not the best possible check, but it illustrates the point.
We check that the total number of cells found for this row is exactly four. If not, we return. But note the return@forEach. This is a labelled return statement. This syntax ensures we return from the current lambda only, and continue on with the next iteration. It’s very similar to continue in JavaScript.
If we only used return here, we would return from the entire enclosing function.
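Here’s a tiny standalone sketch of that difference, outside of any scraping code:

fun labelledReturnDemo() {
    listOf(1, 2, 3).forEach {
        if (it == 2) return@forEach // skip just this element
        println(it)                 // prints 1, then 3
    }
    println("done") // still reached

    listOf(1, 2, 3).forEach {
        if (it == 2) return // returns from labelledReturnDemo itself
        println(it)         // prints 1 only
    }
    println("never reached")
}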
You could validate this yourself with a slightly different forEach - this time using forEachIndexed - which makes the process much easier to control. Here we skip the second row (index 1), and can confirm it disappears from the output:
table?.select("tbody tr")?.forEachIndexed { index, row ->
    val cells = row.select("td")

    if (index == 1) {
        return@forEachIndexed
    }

    val id = cells[0].text()
    val firstName = cells[1].text()
    val lastName = cells[2].text()
    val username = cells[3].text()
}
Storing The Scraped Data
For the purposes of this example, we will store the scraped data in a class, and then store each class in a list structure.
That’s good enough for now, but in reality you would likely store your data into some database or similar.
Here’s our data class:
data class UserData(
    val id: Int,
    val firstName: String,
    val lastName: String,
    val username: String
)
And the modified code to create and store this UserData:
val userList = mutableListOf<UserData>()

table?.select("tbody tr")?.forEach { row ->
    // ...

    val id = cells[0].text()
    val firstName = cells[1].text()
    // ...

    userList.add(UserData(id, firstName, lastName, username))
}

for (user in userList) {
    println(user)
}
Pretty straightforward.
We have a data class that specifies the properties we are scraping, and their desired types. That data class is defined outside the main function - a full sample of the code will be shown at the end, if you are at all unclear.
Note now that the UserData class defines id as an Int. Currently this is being scraped from the page as a string. This is really easy to fix:
val id = cells[0].text().toInt()
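Though in keeping with the defensive theme from earlier, toIntOrNull may be the safer choice - toInt throws a NumberFormatException if the cell ever contains something that isn’t a number. A hedged alternative:

// Skip the whole row if the id cell isn't a valid integer.
val id = cells[0].text().toIntOrNull() ?: return@forEach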
With that, we should be able to run the scraper and see the following output:
UserData(id=1, firstName=Mark, lastName=Otto, username=@mdo)
UserData(id=2, firstName=Jacob, lastName=Thornton, username=@fat)
UserData(id=3, firstName=Larry, lastName=the Bird, username=@twitter)
Process finished with exit code 0
Sweet.
Final Refactoring
I’m pretty happy with this code.
There are plenty of ways to improve this, but for a single demo function that proves we can:

- GET web pages
- Parse the HTML to extract interesting content
- Store that content in a Kotlin friendly format

This is a pretty good start.
Obviously it’s not production code. It’s code written by someone who is very much still learning their way around Kotlin, and by extension, Java.
However, whilst I might be pretty happy with this, IntelliJ is suggesting an improvement: try-finally can be replaced with 'use()':
Yes, it seems that try / finally is old hat. A better abstraction over manually wrapping our code in try / finally is to use use.
Here’s the change:
val httpClient = HttpClient(CIO)

httpClient.use { client ->
    // ...
}
And I figured I could go one step further by removing the need for the httpClient variable:
HttpClient(CIO).use { client ->
    // ...
}
I can’t remember where I read about use, but I know it relies on the thing being used implementing the Closeable interface. Basically what’s happening here is that instead of manually having to manage the client connection, if we use the client then Kotlin will automatically handle the close for us, whether the block completes successfully or throws.
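Here’s a minimal sketch with a home-made Closeable, just to see that behaviour in isolation:

import java.io.Closeable

class NoisyResource : Closeable {
    fun doWork() = println("working")
    override fun close() = println("closed") // runs even if doWork() throws
}

fun main() {
    NoisyResource().use { resource ->
        resource.doWork()
    }
    // Output: "working", then "closed"
}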
We can even ctrl + click the use function and see the implementation:
The use code doesn’t handle the exception, as best I can see. It seems to simply re-throw it. So we would still need to handle that ourselves, at a guess.
But anyway, that’s good enough I feel.
Final Code
Here’s the scraper code we ended up with:
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import org.jsoup.Jsoup

data class UserData(
    val id: Int,
    val firstName: String,
    val lastName: String,
    val username: String
)

suspend fun main() {
    val url = "https://webscraper.io/test-sites/tables/tables-semantically-correct"

    HttpClient(CIO).use { client ->
        val response: HttpResponse = client.get(url)
        val html = response.bodyAsText()

        val document = Jsoup.parse(html)
        val table = document.select(".table-bordered").first()

        val userList = mutableListOf<UserData>()

        table?.select("tbody tr")?.forEach { row ->
            val cells = row.select("td")

            if (cells.size != 4) {
                return@forEach
            }

            val id = cells[0].text().toInt()
            val firstName = cells[1].text()
            val lastName = cells[2].text()
            val username = cells[3].text()

            userList.add(UserData(id, firstName, lastName, username))
        }

        for (user in userList) {
            println(user)
        }
    }
}
As I said above, there’s lots of ways to improve this. We’re breaking all manner of good coding principles, not least of which here is the Single Responsibility principle.
However, for a first stab at web scraping with Kotlin, this does show how few lines of code are required. I quite like it, honestly.
GitHub Repo
https://github.com/codereviewvideos/kotlin-web-scraping-example