We have already seen how to store positive and negative integers, and fractional numbers. But our programs won’t only have numbers. Usually, they will also have text.
The representation of text has been one of computing's headaches since its beginnings. As we saw when talking about Bytes and Char, text encoding was a real problem to solve from the very first computers.
Characters or “letters” are a set of symbols we use to communicate. They are a fairly extensive set, including uppercase letters, lowercase letters, punctuation marks, and the digits of the decimal system itself.
Fortunately, this is largely a solved problem today, and you will rarely have to worry about it. Still, it is worth knowing how characters are encoded and how to work with them (you will surely need it at some point).
When talking about numbers, we could play with different representations; in the end, it was just a change of base. But when talking about characters, there is no option but to use a translation table (THIS binary number corresponds to THIS letter).
That was the problem the designers of the first computers ran into: how big does that table need to be? How many binary digits are required?
And thus the ASCII table was born 👇
ASCII Representation
ASCII (American Standard Code for Information Interchange) is an encoding standard dating back to 1963, which assigns a unique integer to each character in the basic set of English characters.
Each ASCII character is represented by a 7-bit numeric value, allowing for a total of 128 different characters.
For example, the character ‘A’ has an ASCII value of 65, which is represented in binary as 01000001.
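If you want to check this yourself, here is a minimal Python sketch (assuming a Python 3 interpreter) using the built-in ord(), format(), and chr() functions to move between the character 'A', its code 65, and its binary form:

```python
# Inspect the code of a character with Python's built-ins
print(ord("A"))                  # 65, the ASCII code of 'A'
print(format(ord("A"), "08b"))   # 01000001, the same value in binary
print(chr(65))                   # 'A', from the number back to the character
```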
Extended ASCII table
The ASCII table was very limited in terms of characters. Fortunately, by then it was already normal in computing for a Byte to be 8 bits. ASCII used only 7 of them, so 128 additional codes were available to expand the table.
The extended ASCII table is an extension of the ASCII standard that raises the number of characters to 256, using codes 128 to 255 for additional characters such as accented letters, special symbols, and characters used in languages other than English (such as Spanish, French, and German, among others).
The extended ASCII table is not a single official standard, but there are several variants that assign different characters to codes 128 to 255. Some of the most common variants are ISO 8859-1 (also known as Latin-1), ISO 8859-15 (Latin-9), Windows-1252, among others.
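A quick way to see how these variants disagree is to encode the euro sign '€' with Python's built-in codecs; this is just a small sketch, using the standard Python codec names for the variants mentioned above:

```python
# The euro sign shows how the 128-255 range differs between variants
print("€".encode("cp1252"))       # b'\x80'  (Windows-1252 places € at 0x80)
print("€".encode("iso8859_15"))   # b'\xa4'  (ISO 8859-15 / Latin-9 places € at 0xA4)

try:
    "€".encode("latin_1")         # ISO 8859-1 (Latin-1) has no € at all
except UnicodeEncodeError as error:
    print("latin-1 cannot encode €:", error)
```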
Unicode
As computing became more global, the ASCII character set (even in its extended version) proved insufficient to represent all the characters used in different languages and writing systems.
To address this limitation, Unicode was developed, an encoding standard that assigns a unique code to each character used in any language in the world.
For example, the character '✓' has the Unicode code point U+2713 (a hexadecimal value), which is represented in binary as 0010 0111 0001 0011.
Unicode uses 16 bits (or more) to represent each character, which allows a much wider set of characters. For compatibility, the first 128 Unicode characters are identical to the ASCII character set.
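You can verify the '✓' example above with another minimal Python sketch, since ord() returns the Unicode code point of a character:

```python
# Unicode code points and ASCII compatibility
print(hex(ord("✓")))   # 0x2713, matches U+2713
print(chr(0x2713))     # ✓, from the code point back to the character
print(ord("A"))        # 65, the first 128 code points match ASCII
```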
Currently, the Unicode table has about 150 thousand encoded characters. That means we have also outgrown 16 bits (which allow at most 65,536 characters). And this is where UTF comes into play.
UTF Encoding
Unicode and UTF (Unicode Transformation Format) are closely related, but they are different concepts:
Unicode: It is a character encoding standard that assigns a unique number to each character in almost all known writing systems in the world, including letters, numbers, symbols, and special characters. For example, the letter “A” has a unique number in Unicode, as does any other character you can imagine.
UTF (Unicode Transformation Format): It is a way of encoding Unicode code points into byte sequences. UTF defines how these Unicode code points are stored in a computer’s memory or transmitted over a network.
There are several variants of UTF, such as UTF-8, UTF-16, and UTF-32, which differ in how they represent Unicode characters as byte sequences.
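The distinction is easy to see in code: the code point of a character never changes, while the byte sequence depends on the UTF chosen. Here is a small Python sketch (the "-le" codec variants are used so no byte order mark is added):

```python
# One code point, several byte sequences depending on the UTF chosen
ch = "ñ"                       # code point U+00F1
print(hex(ord(ch)))            # 0xf1, the Unicode number (independent of encoding)
print(ch.encode("utf-8"))      # b'\xc3\xb1'          (2 bytes in UTF-8)
print(ch.encode("utf-16-le"))  # b'\xf1\x00'          (2 bytes in UTF-16)
print(ch.encode("utf-32-le"))  # b'\xf1\x00\x00\x00'  (4 bytes in UTF-32)
```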
Now, regarding the number of bytes each UTF encoding uses:
UTF-8: It is the most common and widely used. In UTF-8, each Unicode character is represented using 1, 2, 3, or 4 bytes. ASCII characters (the first 128 characters of Unicode) are represented with 1 byte in UTF-8, which means it is compatible with ASCII. Additional Unicode characters use more bytes depending on their range.
UTF-16: Each Unicode character is represented in UTF-16 using 2 or 4 bytes. Unicode characters in the “BMP” (Basic Multilingual Plane) range are represented with 2 bytes, while characters outside the BMP use 4 bytes.
UTF-32: It is the simplest format, as it assigns each Unicode character exactly 4 bytes. This means that any Unicode character, regardless of its range, will be represented with 4 bytes in UTF-32.
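As a quick comparison, this Python sketch counts how many bytes a few sample characters take in each format (again using the "-le" codec variants so no byte order mark is counted):

```python
# Bytes per character in UTF-8, UTF-16, and UTF-32 (no BOM)
for ch in ["A", "ñ", "✓", "😀"]:
    print(
        ch,
        len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes respectively
        len(ch.encode("utf-16-le")),  # 2, 2, 2, 4 bytes (😀 needs a surrogate pair)
        len(ch.encode("utf-32-le")),  # always 4 bytes
    )
```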
In summary, the number of bytes a Unicode character occupies depends on the UTF format being used:
- UTF-8: 1 to 4 bytes per character
- UTF-16: 2 or 4 bytes per character
- UTF-32: Always 4 bytes per character
Therefore, the answer to how many bytes a Unicode character occupies depends on the UTF format used to encode it.
