We have already seen how to store positive and negative integers, and fractional numbers. But our programs do not only handle numbers; generally, they also handle text.
The representation of text has been one of the headaches of computing since its origins. We saw it when talking about Bytes and Char; text encoding was a real problem to solve from the very first computers.
Characters or “letters” are the set of symbols we use to communicate. It is a fairly extensive set, including uppercase letters, lowercase letters, punctuation marks, and the digits of the decimal system.
Fortunately, we have already overcome this issue, and you should almost never have to worry about it too much. However, it is important for you to know how they are encoded and how to work with them (because you will surely need it at some point).
When talking about numbers, we could play with the form of representation; in the end, it was just a change of base. But with characters, there is no choice but to use a translation table (THIS binary number corresponds to THIS letter).
This is the problem the designers of the first computers ran into. They asked: how big does that table need to be? How many binary digits do I need?
And thus the ASCII table emerged 👇
ASCII Representation
ASCII (American Standard Code for Information Interchange) is a coding standard dating back to 1963, which assigns a unique integer to each character in the basic English character set.
Each ASCII character is represented by a 7-bit numerical value, allowing for a total of 128 different characters.
For example, the character ‘A’ has an ASCII value of 65, which is represented in binary as 01000001.
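A quick way to check this mapping is with Python's built-in `ord()` and `chr()` functions (a minimal sketch; the 8-bit width in the binary formatting is just for display):

```python
# ord() gives the numeric code of a character; chr() does the reverse.
code = ord('A')
print(code)                 # 65
print(format(code, '08b'))  # 01000001, the binary form of 65
print(chr(65))              # A
```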
The first 32 characters of the ASCII table (codes 0 to 31) are control characters, which the computer interprets as commands rather than printing them.
Dec | Char |
---|---|
0 | NUL (null) |
1 | SOH (start of heading) |
2 | STX (start of text) |
3 | ETX (end of text) |
4 | EOT (end of transmission) |
5 | ENQ (enquiry) |
6 | ACK (acknowledge) |
7 | BEL (bell) |
8 | BS (backspace) |
9 | TAB (horizontal tab) |
10 | LF (NL line feed, new line) |
11 | VT (vertical tab) |
12 | FF (NP form feed, new page) |
13 | CR (carriage return) |
14 | SO (shift out) |
15 | SI (shift in) |
16 | DLE (data link escape) |
17 | DC1 (device control 1) |
18 | DC2 (device control 2) |
19 | DC3 (device control 3) |
20 | DC4 (device control 4) |
21 | NAK (negative acknowledge) |
22 | SYN (synchronous idle) |
23 | ETB (end of trans. block) |
24 | CAN (cancel) |
25 | EM (end of medium) |
26 | SUB (substitute) |
27 | ESC (escape) |
28 | FS (file separator) |
29 | GS (group separator) |
30 | RS (record separator) |
31 | US (unit separator) |
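Many of these control characters survive today as escape sequences in programming languages. A small Python sketch:

```python
# Common control characters and their ASCII codes
print(ord('\t'))  # 9  -> TAB (horizontal tab)
print(ord('\n'))  # 10 -> LF (line feed)
print(ord('\r'))  # 13 -> CR (carriage return)

# Any code can also be written with a \xNN escape, e.g. ESC (27):
print('\x1b' == chr(27))  # True
```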
The remaining characters are printable letters, digits, and symbols, according to the following tables:
Dec | Char |
---|---|
32 | SPACE |
33 | ! |
34 | " |
35 | # |
36 | $ |
37 | % |
38 | & |
39 | ' |
40 | ( |
41 | ) |
42 | * |
43 | + |
44 | , |
45 | - |
46 | . |
47 | / |
48 | 0 |
49 | 1 |
50 | 2 |
51 | 3 |
52 | 4 |
53 | 5 |
54 | 6 |
55 | 7 |
56 | 8 |
57 | 9 |
58 | : |
59 | ; |
60 | < |
61 | = |
62 | > |
63 | ? |
Dec | Char |
---|---|
64 | @ |
65 | A |
66 | B |
67 | C |
68 | D |
69 | E |
70 | F |
71 | G |
72 | H |
73 | I |
74 | J |
75 | K |
76 | L |
77 | M |
78 | N |
79 | O |
80 | P |
81 | Q |
82 | R |
83 | S |
84 | T |
85 | U |
86 | V |
87 | W |
88 | X |
89 | Y |
90 | Z |
91 | [ |
92 | \ |
93 | ] |
94 | ^ |
95 | _ |
Dec | Char |
---|---|
96 | ` |
97 | a |
98 | b |
99 | c |
100 | d |
101 | e |
102 | f |
103 | g |
104 | h |
105 | i |
106 | j |
107 | k |
108 | l |
109 | m |
110 | n |
111 | o |
112 | p |
113 | q |
114 | r |
115 | s |
116 | t |
117 | u |
118 | v |
119 | w |
120 | x |
121 | y |
122 | z |
123 | { |
124 | | |
125 | } |
126 | ~ |
127 | DEL |
These numerical values can be represented in binary, allowing ASCII characters to be processed efficiently by computers.
Extended ASCII table
The ASCII table was very limited in terms of characters. Fortunately, it was already common in computing for a Byte to be 8 bits. Of these, ASCII only used 7, which left room for another 128 characters to expand it.
The extended ASCII table is an extension of the ASCII standard that increases the number of characters to 256 (the new ones occupying codes 128 to 255). It includes additional characters such as accented letters, special symbols, and characters used in languages other than English (such as Spanish, French, and German, among others).
The extended ASCII table is not a single official standard; instead, there are several variants that assign different characters to the codes from 128 to 255. Some of the most common variants are ISO 8859-1 (also known as Latin-1), ISO 8859-15 (Latin-9), and Windows-1252, among others.
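The fact that the variants disagree above code 127 is easy to see in Python, which ships codecs for these encodings. For example, 'ñ' happens to get the same byte in Latin-1 and Windows-1252, but the euro sign '€' does not exist in Latin-1 at all and lands on different bytes in Windows-1252 and ISO 8859-15:

```python
# 'ñ' shares code 0xF1 in both of these variants
print('ñ'.encode('latin-1'))     # b'\xf1'
print('ñ'.encode('cp1252'))      # b'\xf1'

# '€' is only available in some variants, and at different positions
print('€'.encode('cp1252'))      # b'\x80'
print('€'.encode('iso8859-15'))  # b'\xa4'
```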
Unicode
As computing became more global, the ASCII character set (not even the extended version) proved insufficient to represent all the characters used in different languages and writing systems.
To address this limitation, Unicode was developed, a coding standard that assigns a unique code to each character used in any language in the world.
For example, the character ‘✓’ has a Unicode value of U+2713, which is represented in binary as 0010 0111 0001 0011.
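You can verify this code point from Python, where `ord()` returns the Unicode value of any character, not just ASCII ones:

```python
check = '✓'
print(hex(ord(check)))             # 0x2713
print(f"U+{ord(check):04X}")       # U+2713
print(format(ord(check), '016b'))  # 0010011100010011

# The \uXXXX escape writes a character directly by its code point
print('\u2713')                    # ✓
```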
Unicode uses a representation of 16 bits (or more, for characters outside the Basic Multilingual Plane) for each character, which allows for a much broader set of characters. For compatibility, the first 128 Unicode code points are identical to the ASCII character set.
Currently, the Unicode table has around 150,000 encoded characters. This means we have also outgrown 16 bits (which would only allow 65,536 characters). And this is where UTF comes into play.
Progressively, the Unicode system has been reviewed and expanded to introduce more and more (and more) characters.
Unicode 1.0 (1991): The first official version of Unicode, which included 24,000 characters.
Unicode 1.1 (1993): Added 10,000 additional characters.
Unicode 2.0 (1996): A major revision that added the ability to support bidirectional writing (like Arabic and Hebrew), in addition to adding another 35,000 characters.
Unicode 3.0 (1999): Incorporated a large number of additional characters to support languages such as Chinese, Japanese, and Korean, along with many other symbols and technical characters.
Unicode 3.1 (2001): Introduced minor changes and bug fixes.
Unicode 3.2 (2002): Included improvements in handling bidirectional writings and changes in encoding.
Unicode 4.0 (2003): Brought the total to more than 96,000 characters, including many ideograms for Asian languages.
Unicode 4.1 (2005): Introduced some technical improvements and new standards for encoding.
Unicode 5.0 (2006): Added around 6,000 additional characters, including many mathematical and technical symbols.
Unicode 5.1 (2008): A minor version with some corrections and clarifications.
Unicode 5.2 (2009): Added approximately 800 new characters, including characters for mathematics and minority languages.
Unicode 6.0 (2010): Introduced support for writing in emoji characters, in addition to adding many new characters.
Unicode 6.1 (2012): Added around 7,000 new characters, including many symbols for mathematics and music.
Unicode 6.2 (2012): A minor version whose main addition was the Turkish lira sign.
Unicode 6.3 (2013): Added a handful of bidirectional formatting characters and improved the bidirectional algorithm.
Unicode 7.0 (2014): Introduced around 2,834 new characters, including many for minority languages and symbols.
Unicode 8.0 (2015): Added approximately 7,716 new characters, including support for languages like Cherokee and Meitei Mayek.
Unicode 9.0 (2016): Introduced around 7,500 additional characters, including support for the new emoji standard.
Unicode 10.0 (2017): Added more than 8,500 new characters, including glyphs for Caucasian languages and emoji symbols.
Unicode 11.0 (2018): Introduced around 7,864 new characters, including glyphs for the Sindhi alphabet and additional emojis.
Unicode 12.0 (2019): Added 554 new characters, bringing the total to more than 137,000, including format controls for Egyptian hieroglyphs and many new symbols.
Unicode 12.1 (2019): A minor version with some corrections and improvements.
Unicode 13.0 (2020): Added around 5,930 new characters, including new emojis and symbols.
Unicode 14.0 (2021): Introduced around 5,280 new characters, including new emojis and characters from minority languages.
Unicode 15.0 (2022): Added 17,189 new characters, including new emojis, glyphs for African languages, and technical symbols.
UTF Encoding
Unicode and UTF (Unicode Transformation Format) are closely related but are different concepts:
Unicode: It is a character encoding standard that assigns a unique number to each character in almost all known writing systems in the world, including letters, numbers, symbols, and special characters. For example, the letter “A” has a unique number in Unicode, just like any other character you can imagine.
UTF (Unicode Transformation Format): It is a way of encoding Unicode code points into byte sequences. UTF defines how these Unicode code points are stored in a computer’s memory or transmitted over a network.
There are several variants of UTF, such as UTF-8, UTF-16, and UTF-32, which differ in how they represent Unicode characters as byte sequences.
Now, regarding the number of bytes each UTF encoding uses:
UTF-8: It is the most common and widely used. In UTF-8, each Unicode character is represented using 1, 2, 3, or 4 bytes. ASCII characters (the first 128 characters of Unicode) are represented with 1 byte in UTF-8, which means it is compatible with ASCII. Additional Unicode characters use more bytes depending on their range.
UTF-16: Each Unicode character is represented in UTF-16 using 2 or 4 bytes. Unicode characters in the BMP (Basic Multilingual Plane) are represented with 2 bytes, while characters outside the BMP use 4 bytes (a pair of so-called surrogates).
UTF-32: It is the simplest format, as it assigns each Unicode character exactly 4 bytes. This means that any Unicode character, regardless of its range, will be represented with 4 bytes in UTF-32.
In summary, the number of bytes that a Unicode character occupies depends on the UTF format being used:
- UTF-8: 1 to 4 bytes per character
- UTF-16: 2 or 4 bytes per character
- UTF-32: Always 4 bytes per character
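These sizes can be confirmed with Python's `str.encode()`. A small sketch (the `-le` variants are used so the UTF-16/UTF-32 output does not include a byte-order mark, which would add extra bytes):

```python
# Three characters in three different ranges:
# 'A' (ASCII), '✓' (BMP, U+2713), '😀' (outside the BMP, U+1F600)
for ch in ('A', '✓', '😀'):
    sizes = {enc: len(ch.encode(enc))
             for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
    print(ch, sizes)

# 'A'  -> utf-8: 1, utf-16: 2, utf-32: 4
# '✓'  -> utf-8: 3, utf-16: 2, utf-32: 4
# '😀' -> utf-8: 4, utf-16: 4 (a surrogate pair), utf-32: 4
```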
Therefore, the answer to how many bytes a character occupies depends on the UTF format used to encode it.