Micro Blog in Java on Exercism

You have identified a gap in the social media market for very very short posts. Now that Twitter allows 280 character posts, people wanting quick social media updates aren’t being served. You decide to create your own social media network.

To make your product noteworthy, you make it extreme and only allow posts of 5 or less characters. Any posts of more than 5 characters should be truncated to 5.

To allow your users to express themselves fully, you allow Emoji and other Unicode.

The task is to truncate input strings to 5 characters.

Text Encodings

Text stored digitally has to be converted to a series of bytes. There are 3 ways to map characters to bytes in common use.

ASCII can encode English language characters. All characters are precisely 1 byte long.
UTF-8 is a Unicode text encoding. Characters take between 1 and 4 bytes.
UTF-16 is a Unicode text encoding. Characters are either 2 or 4 bytes long.

UTF-8 and UTF-16 are both Unicode encodings which means they’re capable of representing a massive range of characters including:

Text in most of the world’s languages and scripts
Historic text
Emoji

UTF-8 and UTF-16 are both variable length encodings, which means that different characters take up different amounts of space.

Consider the letter ‘a’ and the emoji ‘😛’. In UTF-16 the letter takes 2 bytes but the emoji takes 4 bytes.

The trick to this exercise is to use APIs designed around Unicode characters (codepoints) instead of Unicode codeunits.
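To make those byte counts concrete, here is a minimal sketch that encodes each string and counts the resulting bytes. The class and method names (`EncodingSizes`, `byteCount`) are made up for illustration. It uses `UTF_16BE` rather than `UTF_16` because Java's plain UTF-16 encoder prepends a 2-byte byte order mark, which would skew the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the exercise.
class EncodingSizes {
    // Number of bytes needed to store the text in the given encoding.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String letter = "a";
        String emoji = "\uD83D\uDE1B"; // 😛 written as its UTF-16 surrogate pair

        System.out.println(byteCount(letter, StandardCharsets.UTF_8));    // 1
        System.out.println(byteCount(emoji, StandardCharsets.UTF_8));     // 4
        System.out.println(byteCount(letter, StandardCharsets.UTF_16BE)); // 2
        System.out.println(byteCount(emoji, StandardCharsets.UTF_16BE));  // 4
    }
}
```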

Initial Thoughts

In theory this one shouldn’t be that hard. But it’s not the very first real exercise on the Java track. As above, it’s the 10th choice in the Strings section.

I didn’t pick it because it sounded easy. However it is flagged as an Easy puzzle.

I picked it because this is closer to the sorts of problems I tend to solve for my own projects, rather than the more common maths puzzles that many code puzzle sites seem to favour.

The most useful piece of the Readme for this challenge is that final line:

The trick to this exercise is to use APIs designed around Unicode characters (codepoints) instead of Unicode codeunits.
Readme.md for Micro Blog code puzzle in the Java track on Exercism

I don’t know much about Java at all. That was a handy pointer, but I figured Googling would be involved if IntelliJ wasn’t up for helping me.

Test #1: englishLanguageShort

Exercism works by giving you a starting project with several unit tests.

The unit tests arrive failing and it is your job to make all of those tests pass.

I suspect you already know this, or you wouldn’t be reading, so I will stop with the pointless intro and get right into the first test:

import org.junit.Test;

import static org.assertj.core.api.Assertions.assertThat;

public class MicroBlogTest {

    private final MicroBlog microBlog = new MicroBlog();

    @Test
    public void englishLanguageShort() {
        String expected = "Hi";
        assertThat(microBlog.truncate("Hi")).isEqualTo(expected);
    }

OK, so you can be a bit of a smart arse here and solve this problem in the most simple way:

class MicroBlog {
    public String truncate(String input) {
        return "Hi";
    }
}

That’s a pass.

It sounds like the above approach is cheating, but with TDD the best course of action is to do the very least amount to make the test pass, and then possibly look to refactor. That is the least amount of work I can do to make the test pass, and I can’t think of a better implementation… based only on the requirements of the first test.

Test #2: englishLanguageLong

We know that first implementation won’t hold for long.

Here’s the second test:

    @Test
    public void englishLanguageLong() {
        String expected = "Hello";
        assertThat(microBlog.truncate("Hello there")).isEqualTo(expected);
    }

And so our test immediately fails:

Expected :"Hello"
Actual   :"Hi"

Right-o, the real work can begin.

My first thought would have been substring, or whatever Java has in that area. But the Readme pointer about codepoints has me pretty much knowing that won’t be the right approach.

Still, let’s go with the easiest, for now:

class MicroBlog {
    public String truncate(String input) {
        return input.substring(0, 5);
    }
}

This passes for the second test in isolation.

However it now breaks the first test.

> Task :test FAILED
MicroBlogTest > englishLanguageShort FAILED
    java.lang.StringIndexOutOfBoundsException: Range [0, 5) out of bounds for length 2
        at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:55)
        at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:52)
        at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
        at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
        at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
        at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckFromToIndex(Preconditions.java:112)
        at java.base/jdk.internal.util.Preconditions.checkFromToIndex(Preconditions.java:349)
        at java.base/java.lang.String.checkBoundsBeginEnd(String.java:4602)
        at java.base/java.lang.String.substring(String.java:2715)
        at MicroBlog.truncate(MicroBlog.java:3)
        at MicroBlogTest.englishLanguageShort(MicroBlogTest.java:12)
MicroBlogTest > englishLanguageLong PASSED
2 tests completed, 1 failed
FAILURE: Build failed with an exception.

Not the nicest of errors, in my opinion. Lots of information overload.

The crux of the issue is that back in test #1 we have provided an input of length 2 (string: “Hi”) and then asked for the substring(0, 5). The string length is only two, our index position starts at 0 (or “H”), and then goes to 1 (“i”), and then anything after that is out of bounds / doesn’t exist.
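That failure can be reproduced in isolation. The following is a minimal sketch using only the standard library; the class name `SubstringBounds` and its helper are hypothetical and exist only to show when `substring(0, 5)` throws.

```java
// Hypothetical demo class, not part of the exercise.
class SubstringBounds {
    // Returns true if a fixed substring(0, 5) throws for the given input.
    static boolean naiveTruncateThrows(String input) {
        try {
            input.substring(0, 5);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(naiveTruncateThrows("Hi"));          // true: only indexes 0 and 1 exist
        System.out.println(naiveTruncateThrows("Hello there")); // false: 11 chars is plenty
    }
}
```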

Right.

So some kind of conditional is seemingly required whereby if the length of the input string is >= 5, then use 5. Otherwise use the length of the input as the max length?

Let’s try that:

class MicroBlog {
    public String truncate(String input) {
        return input.substring(0, input.length() >= 5 ? 5 : input.length());
    }
}

That passes both of the tests.

But remember, red, green, refactor.

IntelliJ has a suggested refactoring here:

class MicroBlog {
    public String truncate(String input) {
        return input.substring(0, Math.min(input.length(), 5));
    }
}

The Math.min() function takes two arguments and returns the smaller of the two. So, in this expression:

If the length of the input is less than 5 characters, it will return the length of input.
If the length of the input is 5 or more characters, it will return 5.

I’m not finding that the easiest thing to read, and I don’t think the following refactoring improves it that much:

class MicroBlog {
    public String truncate(String input) {
        int usableLength = Math.min(input.length(), 5);
        return input.substring(0, usableLength);
    }
}

Let’s stick with that, for the moment, and keep moving.

Test #3 Through 8 – The False Sense Of Security

Without further work, the following tests also pass:

    @Test
    public void germanLanguageLong_bearCarpet_to_beards() {
        String expected = "Bärte";
        assertThat(microBlog.truncate("Bärteppich")).isEqualTo(expected);
    }

    @Test
    public void bulgarianLanguageShort_good() {
        String expected = "Добър";
        assertThat(microBlog.truncate("Добър")).isEqualTo(expected);
    }

    @Test
    public void greekLanguageShort_health() {
        String expected = "υγειά";
        assertThat(microBlog.truncate("υγειά")).isEqualTo(expected);
    }

    @Test
    public void mathsShort() {
        String expected = "a=πr²";
        assertThat(microBlog.truncate("a=πr²")).isEqualTo(expected);
    }

    @Test
    public void mathsLong() {
        String expected = "∅⊊ℕ⊊ℤ";
        assertThat(microBlog.truncate("∅⊊ℕ⊊ℤ⊊ℚ⊊ℝ⊊ℂ")).isEqualTo(expected);
    }

However, for me at least, programming is quite a humbling experience most of the time.

It’s not too long before…

Test #9: englishAndEmojiShort

Now we get to the challenging part of this exercise.

    @Test
    public void englishAndEmojiShort() {
        String expected = "Fly 🛫";
        assertThat(microBlog.truncate("Fly 🛫")).isEqualTo(expected);
    }

This one is deliberately deceptive.

It’s 5 characters, right?

F
l
y
(space)
🛫

1,2,3,4,5…

What gives?

Well, let’s look at the test output:

Expected :"Fly 🛫"
Actual   :"Fly ?"

It’s interesting that, when printed here, the output of Actual converts the unknown character to a ?

Look at the screenshot of the terminal output and it’s not identical:

Expected :"Fly 🛫"
Actual :"Fly ?"

That unusual symbol, I’m not sure of the correct name for it, represents that the string got truncated unexpectedly.

When I encountered this, I wanted some proof of this difference between a regular English letter, and an emoji. I’m not convinced this is the most scientific method, but this is what I did:

All I’ve done there is put a breakpoint in my code after adding a few variables in.

I can see that the length of the 🛫 is 2, not 1.

OK, so the Readme basically hinted at this.

I can also see that the 🛫 is made up of four bytes, whereas the letter A is only one byte.
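A check along the following lines reproduces those numbers. It is a sketch rather than the exact code behind the breakpoint: the class and method names are illustrative, and it assumes UTF-8 for the byte count, which matches the four-byte figure.

```java
import java.nio.charset.StandardCharsets;

// Illustrative check, not part of the exercise.
class EmojiLengths {
    // Number of Java chars (UTF-16 code units) in the text.
    static int charCount(String text) {
        return text.length();
    }

    // Number of bytes the text needs once encoded as UTF-8.
    static int utf8ByteCount(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String plane = "\uD83D\uDEEB"; // 🛫 written as its UTF-16 surrogate pair

        System.out.println(charCount(plane));     // 2
        System.out.println(utf8ByteCount(plane)); // 4
        System.out.println(charCount("A"));       // 1
        System.out.println(utf8ByteCount("A"));   // 1

        // The two chars are the halves of one surrogate pair, not two characters.
        System.out.println(Character.isHighSurrogate(plane.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(plane.charAt(1)));  // true
    }
}
```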

Quickly recapping the important part of the readme:

ASCII can encode English language characters. All characters are precisely 1 byte long.
UTF-8 is a Unicode text encoding. Characters take between 1 and 4 bytes.
UTF-16 is a Unicode text encoding. Characters are either 2 or 4 bytes long.

Right, OK.

So the gist of this is that a naive substring is cutting off part of the characters that actually make up an emoji.

But how to fix this?

OK, got to be honest, as a Java newb when I looked at these I wasn’t seeing anything that immediately seemed to solve my problem.

The next thing I tried was this:

java index out of bounds on codepoint at with emoji

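Interestingly then, I can see that the 🛫 emoji spans two index positions. Strictly speaking that is one code point stored as two UTF-16 code units (a surrogate pair), rather than two distinct code points.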

Does that help me?

Well, I saw previously that the outcome of:

int thing = "🛫".length(); // equals 2

Can this knowledge be used, somehow?

Get The Amount Of Codepoints, Not The Raw String Length

I confess I fluked the next ‘breakthrough’.

The seemingly obvious part, at this point, was that the string of “Fly 🛫” looked like 5 characters, but was actually six.

Well, not six characters, but rather six chars (UTF-16 code units) making up five code points.

Only two of the codePoints... methods that IntelliJ was surfacing seemed like possible candidates:

codePoints() – returning an IntStream
codePointCount(int beginIndex, int endIndex) – returning an int

With no idea what an IntStream is, I opted for the second one.

beginIndex seemed obvious – 0.

But what about endIndex?

I reverted to basically putting in some stuff, and running the code, and taking a peek using the Debugger to see what was going on.

Actually I was surprised by the result before I ran the code. I expected c3 to report as an out of bounds error inside IntelliJ. But apparently not. It ran, as above, and gave me the magic number of 5.
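It isn't clear from the text exactly which calls were tried, so the following is only a sketch of how `codePointCount` behaves on this input. The class and method names are illustrative. An end index anywhere up to and including `length()` is accepted; only an end index past the end of the string throws an `IndexOutOfBoundsException`.

```java
// Illustrative check, not part of the exercise.
class CodePointCounting {
    // Code points found between index 0 (inclusive) and endIndex (exclusive).
    static int countUpTo(String text, int endIndex) {
        return text.codePointCount(0, endIndex);
    }

    public static void main(String[] args) {
        String input = "Fly \uD83D\uDEEB"; // "Fly 🛫", with the emoji as a surrogate pair

        System.out.println(input.length());      // 6 chars
        System.out.println(countUpTo(input, 6)); // 5 code points in the whole string
        // An end index of 5 splits the emoji, but the lone high surrogate
        // still counts as one code point, so this is 5 as well.
        System.out.println(countUpTo(input, 5)); // 5
    }
}
```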

OK, I can work with this!

Offset By Code Points

Armed with the knowledge above that codePointCount(0, 5) would return 5, I updated the usableLength to use that, instead.

class MicroBlog {
    public String truncate(String input) {
        int usableLength = Math.min(input.codePointCount(0, 5), 5);
        return input.substring(0, usableLength);
    }
}

This looked ugly, but felt like progress.

Red, green, refactor, right?

Expected :"Fly 🛫"
Actual   :"Fly ?"

Wrong.

Still red.

OK, so what had I proved?

englishAndEmojiShort exercism java problem

Seemingly not much.

I had found a ‘better’ way to reach a usableLength of 5.

Arghh.

However, I was now convinced that 5 was the correct number to be using here. That sounds ridiculously obvious in hindsight, and perhaps I could have saved a lot of time by hardcoding the value in there for this test.

But as I said earlier, I frequently find programming to be a very humbling experience 😀

The issue, then, appeared to be back to substring cutting off ‘half’ my emoji’s codepoint.

It took a bit more headscratching, and what felt like a frustratingly long period just sat staring at the available methods in the IDE:

After exhausting the top four, the only one left was offsetByCodePoints, which didn’t really seem to be that understandable to me:

java offset by codepoints method description

It feels very much like one of those docblocks that makes sense if you already know what it does, but to a newb it might as well be in hieroglyphics.

So I just hacked at it 🤣:

And that was the eureka moment for me:

java dump out by codepoints debug output

Indexes 0 through 4 all behave as expected. They are either English characters or a whitespace.

The emoji, however, takes up two chars, so the offset jumps by one extra.

Basically, if we cap input.length() at 5 we get an end index of 5, which lands in the middle of the emoji. Wrong.

If we use the offsetByCodePoints(0, 5) we will get back 6. That’s our four basic, easy to understand characters, and the two required for the emoji. It’s done the work for us.
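The offsets described above can be reproduced with a small loop. This is a sketch, not the original debug code; `OffsetDump` and `endIndexAfter` are hypothetical names.

```java
// Illustrative dump, not part of the exercise.
class OffsetDump {
    // The char index reached after stepping the given number of code points from index 0.
    static int endIndexAfter(String input, int codePoints) {
        return input.offsetByCodePoints(0, codePoints);
    }

    public static void main(String[] args) {
        String input = "Fly \uD83D\uDEEB"; // "Fly 🛫", with the emoji as a surrogate pair

        for (int count = 0; count <= 5; count++) {
            System.out.println(count + " -> " + endIndexAfter(input, count));
        }
        // Prints 0 -> 0, 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4, then 5 -> 6.
    }
}
```

Asking for a sixth code point on this input throws an IndexOutOfBoundsException, because the string only contains five.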

class MicroBlog {
    public String truncate(String input) {
        return input.substring(0, input.offsetByCodePoints(0, 5));
    }
}

That change passes all but the very first test, which now fails.

So, somehow, we need to incorporate that usableLength idea.

offset by codepoints index out of bounds

Bringing back in the usableLength line we had from the first half of this exploration, we get:

class MicroBlog {
    public String truncate(String input) {
        int usableLength = Math.min(input.codePointCount(0, 5), 5);
        return input.substring(0, input.offsetByCodePoints(0, usableLength));
    }
}

And running the full test suite, the only fail gives us the error of:

java.lang.IndexOutOfBoundsException: Range [0, 5) out of bounds for length 2

Which points to that hardcoded 5.

If that usableLength instead uses the dynamic input.length():

int usableLength = Math.min(input.codePointCount(0, input.length()), 5);

Then finally we get to a fully green test suite!
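For reference, here is that one-line change dropped into the full class from the previous step. The `main` method is an addition for quick manual checking and is not part of the exercise; the emoji is written as a `\u` escape pair so the file compiles regardless of source encoding.

```java
class MicroBlog {
    public String truncate(String input) {
        // Count whole code points in the input, capped at the 5 we may keep.
        int usableLength = Math.min(input.codePointCount(0, input.length()), 5);
        // Convert that code point count into a char index that substring understands.
        return input.substring(0, input.offsetByCodePoints(0, usableLength));
    }

    // Demo only: not required by the Exercism tests.
    public static void main(String[] args) {
        MicroBlog microBlog = new MicroBlog();
        System.out.println(microBlog.truncate("Hi"));               // Hi
        System.out.println(microBlog.truncate("Hello there"));      // Hello
        System.out.println(microBlog.truncate("Fly \uD83D\uDEEB")); // Fly 🛫 (all 6 chars kept)
    }
}
```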

Tests #10 – 12

No need to do anything further, the implementation from Test #9 is sufficient to pass the remaining test cases.

I figure these two extra tests may help clear up any remaining confusion:

    @Test
    public void testEmojiCodePointCount() {
        String emoji = "💇";
        int expectedCodePointCount = 1;

        int actualCodePointCount = Character.codePointCount(emoji, 0, emoji.length());

        assertThat(actualCodePointCount).isEqualTo(expectedCodePointCount);
    }

    @Test
    public void testCharacterCount() {
        String emoji = "💇";
        int expectedCharacterCount = 2; // Expecting 2 individual Java 'char' elements

        int actualCharacterCount = emoji.length();

        assertThat(actualCharacterCount).isEqualTo(expectedCharacterCount);
    }

Wrapping Up

An exercise marked as “Easy” and it took me all this work!

Well, in truth, they are only easy if you know the answers. And as a Java newb, this was very much a learning experience for me.

In Java, a String is a sequence of char values, and each char is a single UTF-16 code unit. When dealing with internationalisation and characters from various languages, a single code point may not fit into a single char.

A code point is the number Unicode assigns to a character or symbol. Code points above U+FFFF, which include most emoji, are stored as two chars (a surrogate pair).

By using the length in chars instead of the number of code points, you may run into issues when truncating strings that contain characters needing two chars: the cut can land in the middle of a surrogate pair.

To correctly truncate such strings, count code points with input.codePointCount(0, input.length()), cap that count at 5, and then use input.offsetByCodePoints(0, count) to turn it back into an index that substring can use. That way the cut always lands on a code point boundary, and capping by the real code point count avoids index out of bounds errors on short inputs.
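For comparison, here is a sketch of the codePoints() route that was passed over earlier. It is an alternative technique, not the solution arrived at above: it streams the code points, keeps at most five, and rebuilds a string from them. The class name `StreamTruncate` is hypothetical.

```java
// Alternative sketch using the IntStream returned by codePoints().
class StreamTruncate {
    static String truncate(String input) {
        return input.codePoints()   // one int per code point, surrogate pairs already combined
                .limit(5)           // keep at most five code points
                .collect(StringBuilder::new,
                        StringBuilder::appendCodePoint,
                        StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(truncate("Hi"));          // Hi
        System.out.println(truncate("Hello there")); // Hello
    }
}
```

Because limit(5) simply stops early on short inputs, this version needs no Math.min guard.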
