Here, you can write text and convert it to an image — and back again!
Click on "Encode Text" in the navigation bar at the top of the page to try it.
The text is compressed with Huffman coding and then converted to a PNG that you can send to people. Once they get the image, they can decode it back into your message.
Aside from some pretty niche use cases (like fulfilling my childhood desire to send secret* messages to my friends), this website isn't
especially useful; however, it's a good excuse to explore some very interesting computer science, which I hope to do on this page.
As an added benefit, I find the images the website generates to be quite fun to look at. It's not often you get to see information packed
so densely: for example, the entirety of Shakespeare's The Tragedy of Hamlet neatly fits into a single 191x191 pixel image.
*Although the text is compressed, it isn’t encrypted. Don’t encode any compromising secrets or other sensitive information.
This page is roughly broken up into three parts: Text Compression with Huffman Codes, Canonicalizing Huffman Trees, and Encoding Information in an Image. Throughout the page, I will explain the algorithms I used and the design decisions I made along the way.
Our first goal is to compress the input text into a binary file. Normally, computers use one byte to store each character, following an encoding standard called ASCII. Using 8 bits for every character is useful because the nth character is at the nth byte in the binary; however, this standard can waste a lot of space. For instance, if the input text only had two unique characters, we could've just assigned "0" to the first character and "1" to the second character, thus using 1/8 of the space to encode the same information. While coming across text with only two unique characters is unlikely, our goal is to find an algorithm that can take advantage of the number and frequency of the characters in the text to come up with an encoding scheme whose efficiency is better than 8 bits/character.
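To make that arithmetic concrete, here's a tiny sketch (in Python, which is not necessarily what this site runs) that counts character frequencies and compares the 8-bits-per-character baseline to the one-bit-per-character code from the two-character example.

```python
from collections import Counter

text = "abbaabbbab"              # toy input with only two unique characters
freqs = Counter(text)            # Counter({'b': 6, 'a': 4})

fixed_bits = 8 * len(text)       # one byte per character
two_symbol_bits = 1 * len(text)  # "0" for one character, "1" for the other

print(freqs)
print(f"fixed-width: {fixed_bits} bits, two-symbol code: {two_symbol_bits} bits")
```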
Our first order of business is assigning each character in the text a unique binary representation. Later, we'll replace each character with its
binary representation to create our binary file. Although we could assign each character an equal number of bits, it's more efficient
(storage-wise) to link the most common characters with the shortest binary representations. Huffman coding does just that. In fact, it's
theoretically* the most efficient way (in bits/character) to encode text one character at a time.
*Methods like arithmetic coding can achieve even higher compression ratios by dropping the "symbol-by-symbol" restriction.
To create the tree, we first order the characters by their frequency, greatest to least. We combine the two least frequent characters to form a node that represents both of them (in the animation, this is represented with a node labeled with the concatenation of the two characters). We give this new node a frequency equal to the sum of the frequencies of the two component characters. Lastly, we sort this new node among the rest of the character nodes and repeat the process until there is only one node left. This last node is the entire tree.
After the tree is built, to find the binary representation of a character, we start at the root (top) node and traverse down the tree until we reach the character. At each junction, down and left represents a 0, and down and right represents a 1.
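Here's a minimal sketch of those two steps in Python. It isn't the site's actual implementation, and names like build_tree and assign_codes are just mine, but it follows the same recipe: repeatedly merge the two least frequent nodes (using a heap), then walk the finished tree, appending a 0 for each left edge and a 1 for each right edge.

```python
import heapq
from collections import Counter
from itertools import count

def build_tree(text):
    """Build a Huffman tree; each node is a tuple (symbol, left, right)."""
    freqs = Counter(text)
    tiebreak = count()  # keeps the heap comparable when frequencies are equal
    heap = [(f, next(tiebreak), (ch, None, None)) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two least frequent nodes...
        f2, _, right = heapq.heappop(heap)
        # ...are merged into a parent whose frequency is their sum
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (None, left, right)))
    return heap[0][2]

def assign_codes(node, prefix="", codes=None):
    """Walk the tree: left edges append '0', right edges append '1'."""
    if codes is None:
        codes = {}
    symbol, left, right = node
    if symbol is not None:              # leaf: record the accumulated path
        codes[symbol] = prefix or "0"   # edge case: text with a single unique character
        return codes
    assign_codes(left, prefix + "0", codes)
    assign_codes(right, prefix + "1", codes)
    return codes

tree = build_tree("this is an example of a huffman tree")
print(assign_codes(tree))
```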
At this point, you may be wondering why we can't just count upward in binary, assigning each number to a different character in order of frequency. The issue is that when decoding the binary, there are no delimiters, like spaces or commas, to tell us where one character ends and the next begins. By using a tree like the one we just created, we can start at the root, traverse left or right depending on whether we encounter a 0 or a 1, and when we reach a leaf node (a node with no descendants), we know to stop because we've found the character we were looking for. From there, we can repeat the process and decode the next character starting again from the root.
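As a sketch of that decoding loop (again illustrative Python, not the site's code), the version below matches accumulated bits against a code table instead of walking node objects; because no code is a prefix of another, the first match is always a complete character, which is exactly the stopping rule described above.

```python
def decode(bits, codes):
    """Decode a prefix-free bitstring using a symbol -> code table."""
    lookup = {code: symbol for symbol, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in lookup:            # a full code word can only occur at a leaf,
            out.append(lookup[buffer])  # so the first match is the right character
            buffer = ""
    return "".join(out)

codes = {"a": "0", "n": "11", "b": "10"}  # no code is a prefix of another
print(decode("100110110", codes))          # -> "banana"
```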
While the tree in its current form is exciting and all, we still need to figure out a way to communicate the tree's structure and contents to the recipient of the binary. That way, they can recreate the tree and use it to decode the message. A naïve approach is to just list out each character followed by its respective binary representation. Another approach is to list all the nodes and their relationship to the nodes immediately below them; however, we can do much better space-wise. What we'd like is a standard way to organize the tree so that the decoder can reconstruct the tree with as little information as possible.
Enter canonical Huffman codes. It turns out that if we arrange the tree so that the lengths of the branches never decrease from left to right, we can communicate the tree's structure with only the list of characters and the lengths of their respective binary representations. You know how earlier I said that we can't just count up in binary to create the binary representations because there aren't any delimiters? Well, we can use a strategic variation of that approach to create a canonical Huffman code. Since we already created a non-canonical Huffman code above, we can simply traverse the tree, noting the length of each character's binary representation as we reach it.
Armed with the character bit-lengths, we now have the necessary information to canonicalize. The first step is to sort* the characters by their
bit-length from least to greatest. We then assign the first character a binary representation of as many zeros as its length (e.g. "000" for
a character with a bit-length of 3). From there, we count up by 1 (in binary) for every following character of the same bit-length.
When we get to a character with a higher bit-length, we count up by 1 and then add zeros at the end until the length of the new binary
representation matches the bit-length of that particular character.
*In the actual implementation of this algorithm, I used a
min-heap to avoid sorting at every step.
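Here's roughly what that assignment looks like in Python (a sketch with made-up names, not the site's code): sort by bit-length, start at all zeros, add one for each new character, and shift in trailing zeros whenever the bit-length grows.

```python
def canonical_codes(bit_lengths):
    """bit_lengths: dict mapping each character to its Huffman code length."""
    # Sort by length, then alphabetically to break ties deterministically.
    ordered = sorted(bit_lengths.items(), key=lambda item: (item[1], item[0]))
    codes, code = {}, 0
    prev_length = ordered[0][1]
    for i, (ch, length) in enumerate(ordered):
        if i > 0:
            code += 1                    # count up by 1...
        code <<= (length - prev_length)  # ...then append zeros if the length grew
        codes[ch] = format(code, f"0{length}b")
        prev_length = length
    return codes

# Bit-lengths like those for the word "Example": one 2-bit code, six 3-bit codes
lengths = {"E": 2, "a": 3, "e": 3, "l": 3, "m": 3, "p": 3, "x": 3}
print(canonical_codes(lengths))  # E -> "00", then "010", "011", ..., "111"
```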
Phew! While this algorithm is certainly a mouthful, the intuition behind it is quite simple. By counting up by 1 every time, we ensure that every binary representation is distinct and well-ordered. When we get to a character whose binary representation requires additional bits, we add as many zeros as necessary to meet that requirement. This has the effect of placing the character where it would have gone on the tree if it hadn't required additional bits, then extending the branch (with 0s) to make room for other characters. Crucially, the bit-length of each character's binary representation stays constant; the algorithm just standardizes which specific binary representations are used. Once we have the new representations, it's just a matter of recreating the tree: for each character's binary representation, we start at the root and traverse downward, creating new nodes as needed for every "0" and "1" we encounter.
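A small sketch of that reconstruction step, using nested dictionaries in place of proper node objects (again, illustrative Python rather than the site's implementation):

```python
def rebuild_tree(codes):
    """Insert each code into a nested-dict 'tree': '0' goes left, '1' goes right."""
    root = {}
    for symbol, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})  # create interior nodes as needed
        node[code[-1]] = symbol              # the last bit lands on a leaf
    return root

codes = {"E": "00", "a": "010", "e": "011", "l": "100",
         "m": "101", "p": "110", "x": "111"}
print(rebuild_tree(codes))
```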
Let's see it in action.
The most recent step where we recreated the tree from the bit-lengths is exactly the process used to decode the binary. That's the benefit of canonical Huffman encoding! Instead of sending all the characters and their corresponding binary representations, we just send the list of characters and their bit lengths. We leave the job of reconstructing the tree to the decoder.
The final optimization relates to how we represent the bit-lengths. While we could list the bit-lengths of all the characters in order, it's often more compact to instead record the number of characters for each bit-length. For example, the bit-length representation for the word "Example" is 0,1,6: there are no characters with a bit-length of 1, "E" has a bit-length of 2, and the remaining 6 unique characters have a bit-length of 3.
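If you want to see that counting step spelled out, here's a short Python sketch using the per-character bit-lengths for "Example" from the paragraph above:

```python
from collections import Counter

# Per-character bit-lengths for the unique characters of "Example"
lengths = {"E": 2, "x": 3, "a": 3, "m": 3, "p": 3, "l": 3, "e": 3}

counts = Counter(lengths.values())  # how many characters have each bit-length
max_len = max(counts)
print(",".join(str(counts[n]) for n in range(1, max_len + 1)))  # -> 0,1,6
```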
Now that we've created our glorious canonical Huffman code, we need to figure out a way to send our message to people. Sending the binary as digits in a text file would be disastrous for two reasons. Firstly, it would be extremely boring, and secondly, text files are normally encoded with ASCII, meaning each "0" or "1" would take a whole byte to store a single bit of information, wasting 7 bits for no reason. Of course, we could just send a binary file; however, to view its contents properly you'd need a hex editor, only to be disappointed by a wall of inscrutable bytes. The solution, perhaps unsurprisingly, involves pretty colors.
We can dump the binary into an image with a 24-bit RGB color format. Each pixel has a red, green, and blue brightness value, each ranging from 0 to 255. By breaking up the binary into pixel-sized, 24-bit chunks, we can convert the entire message into an image. But we're not done yet! We also need to include the list of characters and their bit-lengths so that the recipient can decode the binary. We encode the list of characters and the list of bit-lengths as UTF-8 byte arrays, comma-delimiting the bit-lengths.
Now we can put it all together. The first three pixels of the image are devoted to encoding three numbers: the number of padding bits (the zeros added to the end of the binary so that it fills a square image), the number of bytes used to represent the list of characters, and the number of bytes used to represent the list of bit-lengths. The two lists come next, the message binary goes immediately afterward in 24-bit chunks, and the padding bits round out the image.
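To make the layout concrete, here's a hedged Python sketch. The function name, the toy list contents, and the exact way each header number fills a 24-bit pixel are my guesses at the format described above, and for simplicity it only pads to the pixel boundary rather than all the way to a square image.

```python
def encode_image_bits(message_bits, chars_utf8, lengths_utf8):
    """Lay out header + lists + message as a list of (R, G, B) pixels."""
    body = (
        "".join(format(b, "08b") for b in chars_utf8)      # character list bytes
        + "".join(format(b, "08b") for b in lengths_utf8)  # bit-length list bytes
        + message_bits                                      # Huffman-coded text
    )
    total = 3 * 24 + len(body)            # three 24-bit header pixels come first
    padding = (-total) % 24               # zeros needed to complete the last pixel
    header = [padding, len(chars_utf8), len(lengths_utf8)]
    bits = "".join(format(n, "024b") for n in header) + body + "0" * padding
    return [
        tuple(int(bits[i + j:i + j + 8], 2) for j in (0, 8, 16))
        for i in range(0, len(bits), 24)
    ]

pixels = encode_image_bits("0100111", "Eaelmpx".encode(), "0,1,6".encode())
print(len(pixels), pixels)
```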
The decoding process is nearly the same in reverse. We convert the first pixel's binary into base 10 to find the number of padding bits and chop off that many zeros from the end. After that, we use the next two pixels to find out the length of the character and bit-length lists and decode them into their text form. Next, as described in part 2, we use the character and bit-length lists to reconstruct the relevant canonical Huffman tree and then use the tree to decode the rest of the binary back into text.
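And the mirror image of that first decoding step, as a sketch (same caveats: Python, invented names, toy data): read the three 24-bit header numbers, then strip the padding from the tail.

```python
def read_header(pixels):
    """Recover the three header numbers and the remaining bits from a pixel list."""
    bits = "".join(format(channel, "08b") for px in pixels for channel in px)
    padding = int(bits[0:24], 2)         # padding zeros at the very end of the binary
    n_char_bytes = int(bits[24:48], 2)   # byte length of the character list
    n_len_bytes = int(bits[48:72], 2)    # byte length of the bit-length list
    body = bits[72:len(bits) - padding]  # everything between the header and the padding
    return padding, n_char_bytes, n_len_bytes, body

# Toy data: three header pixels (values 1, 2, 3) followed by one data pixel
pixels = [(0, 0, 1), (0, 0, 2), (0, 0, 3), (65, 66, 67)]
print(read_header(pixels))
```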
That's all there is to it.
Congratulations, that was quite a journey. While I don't expect anyone to perfectly remember how canonical Huffman codes work, I hope this example offered some insight that can be applied elsewhere. If you code, I encourage you to investigate other compression methods, like asymmetric numeral systems, that can achieve even higher compression ratios. If you're not a coder but found this page interesting, I encourage you to give coding a shot. There are a lot more algorithms where this one came from.