Run any website for long enough and you will end up with broken links. Even with the best of intentions, once you link out to third-party websites, at some point one (or more) of those links will break. That’s why it’s a good idea to occasionally run your site through a broken link checker.
The following pages contain a write-up of the implementation and test approach.
The Problem We Will Be Solving
Keeping track of which links are working and which are broken is an immensely tedious task. Perhaps that is why so many broken links exist out there in Internet land.
However, computers are generally very good at tasks like this. Given a link, visit it, and report back if the status is not 2xx.
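As a rough sketch of that rule, here is one way the "is the status 2xx?" check could look in TypeScript. The names `isWorking` and `reportBroken` are made up for illustration, and the results are assumed to have already been collected by something else.

```typescript
// Minimal shape of a visited link; only the URL and status matter here.
interface VisitedLink {
  url: string;
  status: number;
}

// A link counts as "working" only when the status is in the 2xx range.
function isWorking(link: VisitedLink): boolean {
  return link.status >= 200 && link.status < 300;
}

// Returns the URLs whose status was not 2xx, keeping the original order.
function reportBroken(links: VisitedLink[]): string[] {
  return links.filter((link) => !isWorking(link)).map((link) => link.url);
}
```

Note that under this rule a redirect (3xx) is reported too, which is why redirects need their own handling.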
Once you can programmatically visit one link you can very easily visit a whole collection of links. And you can do this on a regular, heck, even scheduled basis.
Of course you can take this further still and come up with a way to scrape web pages to find all the links on those pages, and then feed them into your new broken link checker. So the process repeats.
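To give a feel for the scraping step, here is a deliberately naive sketch that pulls `href` values out of anchor tags with a regular expression and resolves them against the page URL. The function name `extractLinks` is hypothetical, and a real scraper would more likely use a proper HTML parser, since a regex will miss unquoted attributes and other edge cases.

```typescript
// Naive link extraction: finds quoted href attributes on <a> tags.
// This is an illustration only, not a robust HTML parser.
function extractLinks(html: string, pageUrl: string): string[] {
  const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gi;
  const found = new Set<string>(); // de-duplicates repeated links
  let match: RegExpExecArray | null;

  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      // Resolve relative links (e.g. "/about") against the page URL.
      found.add(new URL(match[2], pageUrl).href);
    } catch {
      // Skip anything the URL parser rejects.
    }
  }

  return Array.from(found);
}
```

Each URL that comes back can then be fed to the checker, and any internal pages found can be scraped in turn.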
Handling Redirects
However, there are certain foibles to this process. Certain mischievous links that are deceptive. Redirects.
In a world of appeasing our Google overlords, many sites have taken to cloaking links. Some do this with good intentions – vanity URLs. Others do this in a malicious manner. But the problem is the same. A link appears to be from one place but actually takes you somewhere different.
On your PC, if you hover over the link it should read as:
https://codereviewvideos.com/exercism-codereviewvideos-profile
But if you click the link, you will be taken to:
https://exercism.org/profiles/codereviewvideos
I consider this an acceptable form of link cloaking. It is obvious from the URL what this represents, and in using a link this way I can track how many times the link was clicked. That is useful in a variety of situations, one of which is to find out whether the content I am writing is actually useful and being interacted with.
Problems With Broken Link Checking
The problem, as such, with this kind of link is that if you blindly trust the domain (codereviewvideos.com in this case), then you would wrongly believe it is an internal link. In the event you’re trying to be clever and automate the checking of your linked URLs, this may accidentally end up with you queuing up a heck of a lot of unexpected links for visiting.
Trust me, I have made this mistake when building CheckForBrokenLinks.com.
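One way to avoid that trap, sketched below, is to decide "internal or external" from the last URL in the redirect chain rather than from the link as written. The function name `isInternalLink` and the idea of passing the chain as a plain array of URLs are assumptions for the sake of the example.

```typescript
// Decide whether a link is internal by looking at where it actually ends up.
// `visitedUrls` is the redirect chain in order; the last entry is the final destination.
function isInternalLink(visitedUrls: string[], siteHost: string): boolean {
  if (visitedUrls.length === 0) {
    return false;
  }
  const finalUrl = visitedUrls[visitedUrls.length - 1];
  try {
    return new URL(finalUrl).hostname === siteHost;
  } catch {
    // An unparseable final URL cannot be treated as internal.
    return false;
  }
}
```

With the cloaked link from earlier, the chain starts on codereviewvideos.com but ends on exercism.org, so it would be treated as external and its pages would not be queued for scraping.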
This is a long-winded way of me saying that this broken link checker problem interests me. It is a problem I have tackled now in a variety of languages, from PHP to JavaScript, then TypeScript, briefly in Golang and Elixir, and most recently in C#.
Because I understand the problem I am trying to solve, it makes for a good, real world programming problem. The kind I would actually solve. Aka, not a LeetCode type problem.
Expected Broken Link Checker Output
We should always expect to get back an array of at least one element.
The output should work top to bottom, where the links are visited in order. Therefore the last element in the array will be the end result – the “true” link response.
It is easier to show examples here, rather than try to describe with words.
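Before the examples, here is a possible TypeScript description of a single entry in that array, inferred from the fields used in the examples. The names `LinkCheckHop` and `finalResult` are my own labels, not part of any library.

```typescript
// One entry per URL visited while following a link.
interface LinkCheckHop {
  url: string;
  status: number; // HTTP status code, or -1 when no request could be made
  statusText: string;
  ok: boolean; // true only for 2xx responses
  redirected: boolean; // true when this hop pointed somewhere else
  headers: Record<string, string>;
}

// The last element is the end result: the "true" link response.
function finalResult(hops: LinkCheckHop[]): LinkCheckHop {
  if (hops.length === 0) {
    throw new Error("Expected at least one hop");
  }
  return hops[hops.length - 1];
}
```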
Immediately Valid Link, No Redirection
[
  {
    url: 'https://codereviewvideos.com/',
    status: 200,
    statusText: 'OK',
    ok: true,
    redirected: false,
    headers: {
      key: value
    }
  }
]
HTTP to HTTPS
[
  {
    url: 'http://codereviewvideos.com/',
    status: 302,
    statusText: 'Found',
    ok: false,
    redirected: true,
    headers: {
      key: value,
      ...
      location: 'https://codereviewvideos.com/'
    }
  },
  {
    url: 'https://codereviewvideos.com/',
    status: 200,
    statusText: 'OK',
    ok: true,
    redirected: false,
    headers: {
      key: value
    }
  }
]
404 Page
[
  {
    url: 'https://codereviewvideos.com/444444',
    status: 404,
    statusText: 'Not Found',
    ok: false,
    redirected: false,
    headers: {
      key: value
    }
  }
]
Cloaked Link
[
  {
    url: 'https://codereviewvideos.com/typescript-tuple',
    status: 307,
    statusText: 'Temporary Redirect',
    ok: false,
    redirected: true,
    headers: {
      location: 'https://www.typescriptlang.org/play#example/tuples',
    }
  },
  {
    url: 'https://www.typescriptlang.org/play#example/tuples',
    status: 301,
    statusText: 'Moved Permanently',
    ok: false,
    redirected: true,
    headers: {
      location: 'https://www.typescriptlang.org/play/',
    }
  },
  {
    url: 'https://www.typescriptlang.org/play/',
    status: 200,
    statusText: 'OK',
    ok: true,
    redirected: false,
    headers: {
      key: value
    }
  }
]
Scheme We Won’t Be Able To Process
[
  {
    "url": "tel:+-303-499-7111",
    "status": -1,
    "statusText": "Unsupported scheme: tel:",
    "ok": false,
    "redirected": false,
    "headers": {}
  }
]
Bad Input / Broken Links
[
  {
    "url": "bad input",
    "status": -1,
    "statusText": "Invalid URL",
    "ok": false,
    "redirected": false,
    "headers": {}
  }
]
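Both of these failure shapes can be produced before any network request is made. The sketch below is one possible pre-check using the standard `URL` parser; the names `precheck` and `failedHop`, and the choice to support only `http:` and `https:`, are assumptions on my part.

```typescript
interface LinkCheckHop {
  url: string;
  status: number;
  statusText: string;
  ok: boolean;
  redirected: boolean;
  headers: Record<string, string>;
}

// Assumption: only these schemes can actually be visited.
const SUPPORTED_SCHEMES = ["http:", "https:"];

// Builds the "-1" result used when no request could be made.
function failedHop(url: string, statusText: string): LinkCheckHop {
  return { url, status: -1, statusText, ok: false, redirected: false, headers: {} };
}

// Returns a failure entry for unusable input, or null when the URL is safe to visit.
function precheck(input: string): LinkCheckHop | null {
  let parsed: URL;
  try {
    parsed = new URL(input);
  } catch {
    return failedHop(input, "Invalid URL");
  }
  if (!SUPPORTED_SCHEMES.includes(parsed.protocol)) {
    return failedHop(input, `Unsupported scheme: ${parsed.protocol}`);
  }
  return null;
}
```

A `null` return here simply means "carry on and make the request".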
Too Many Redirects
An edge case here, but if we get stuck in a redirect loop, or >=50 redirects occur, then we will fail.
[
  ... 49 previous redirects
  {
    "url": "...",
    "status": -1,
    "statusText": "Too many redirects",
    "ok": false,
    "redirected": true,
    "headers": {}
  }
]
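To tie the examples together, here is a minimal sketch of a redirect-following loop that would produce arrays of this shape. It is not the final implementation. The single-request function `fetchOnce` is a hypothetical dependency passed in so the loop can be exercised without a network, header names are assumed to be lower-cased, and the cap reflects my reading of the example above: 49 recorded redirects followed by one failure entry.

```typescript
interface LinkCheckHop {
  url: string;
  status: number;
  statusText: string;
  ok: boolean;
  redirected: boolean;
  headers: Record<string, string>;
}

// Hypothetical: makes ONE request without following redirects.
type FetchOnce = (url: string) => Promise<{
  status: number;
  statusText: string;
  headers: Record<string, string>; // assumed to use lower-case keys
}>;

const MAX_HOPS = 50;

async function followRedirects(startUrl: string, fetchOnce: FetchOnce): Promise<LinkCheckHop[]> {
  const hops: LinkCheckHop[] = [];
  let currentUrl = startUrl;

  while (true) {
    // After 49 recorded redirects, the 50th entry is the failure.
    if (hops.length >= MAX_HOPS - 1) {
      hops.push({
        url: currentUrl,
        status: -1,
        statusText: "Too many redirects",
        ok: false,
        redirected: true,
        headers: {},
      });
      return hops;
    }

    const res = await fetchOnce(currentUrl);
    const location = res.headers["location"];
    const isRedirect = res.status >= 300 && res.status < 400 && location !== undefined;

    hops.push({
      url: currentUrl,
      status: res.status,
      statusText: res.statusText,
      ok: res.status >= 200 && res.status < 300,
      redirected: isRedirect,
      headers: res.headers,
    });

    if (!isRedirect) {
      return hops; // the last element is the "true" link response
    }
    // Location may be relative, so resolve it against the current URL.
    currentUrl = new URL(location, currentUrl).href;
  }
}
```

Because a redirect loop never reaches a non-redirect response, the same cap handles both the loop case and the genuinely long chain.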
I think that covers all the bases. Hope so.
This is a more in-depth and challenging problem to solve than the Map, Filter, Reduce exercise and as such may require several parts to iterate towards the most useful solution.
I shall be attempting to solve the problem(s) in a tested manner, though maybe not 100% test driven development.