This text is not intended as a tutorial for learning the Korean alphabet. Instead, it is written for web designers and other people who need a superficial knowledge of how to encode Korean characters in an HTML document. Basic knowledge of Unicode character representation and HTML character references is necessary to understand this text. If you don’t know about Unicode, character sets, encodings and I18n, please read a tutorial first, e.g., Character Set Issues by A.J. Flavell.
The Hangul are composed of letters (jamo, 자모) in a rather systematic way. The Jamo represent sounds much as Latin letters do. Although there are 11172 different Hangul, their individual appearances need not be memorized; rather, one has to learn the 68 different Jamo shapes and the rules governing the construction of Hangul from Jamo. Moreover, even the Jamo shapes are not arbitrary: for example, the Jamo ㅆ (SS) looks like a duplicated ㅅ (S), and the Jamo ㄵ (NJ) is a juxtaposition of ㄴ (N) and ㅈ (J).
Some Jamo shapes are obviously mnemonic, e.g., ㅁ (M) which symbolizes a closed mouth articulating the sound /m/. It is now fairly established that this mnemonic character applies to all Jamo consonants even if they look rather arbitrary: The key is the shape and position of the tongue when producing the consonant sound.
There are 19 different lead consonants, including the mute consonant. The following table gives these consonants in their canonical order, and their Unicode values. Consonant number 12 is the mute consonant.
The vowels number 21 in Korean. Note that some of these vowels would be classified as diphthongs in other languages; also, some vowels contain a Y as part of the vowel.
The total number of tail consonants is 27; some of them are very rarely used in modern Korean. Most of the tail consonants can also appear as leads; the Jamo for these consonant pairs look very similar. Tail consonant 21 ㅇ (NG) corresponds to the mute lead consonant ㅇ (12). This correspondence is purely graphical and has no deeper meaning; some fonts will distinguish the two characters by a small vertical stroke attached to the NG letter, while others will render them identically as a (squashed) circle.
Actually, Unicode has another range of Jamo characters called Hangul Compatibility Jamo, starting at U+3130. These represent the same letters, but they have no conjoining behaviour as described in the next section. The compatibility Jamo are rather unlikely to appear in a real Korean document, but they can be used if isolated Jamo must be shown in a text, for example in an introduction to the writing system (like the one you are currently reading).
In this document, I use compatibility Jamo almost everywhere, as they tend to render more cleanly. In the tables on the right side,
each Jamo is given twice: first as conjoining Jamo and then as compatibility Jamo, while the hexadecimal codepoint refers to the former.
The differences (if any) can be demonstrated here: Jamo GG ᄁ (head) and ᆩ (tail) and Compatibility Jamo GG ㄲ. What you will
see depends on your operating system, your browser, your default font and even the font size. While the Compatibility Jamo will probably
look all right, the isolated conjoining Jamo may appear identical, smaller (and raised or lowered), overlapping their neighbours, or even empty.
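Since the rendering is unreliable, the three GG characters can also be told apart programmatically. The following is a minimal Python sketch using only the standard `unicodedata` module; the variable names are illustrative assumptions, and the code points are assumed to be the ones shown in the example above.

```python
import unicodedata

# The three GG characters, written as escapes so that the listing does not
# depend on how isolated Jamo happen to render (variable names are illustrative).
lead_gg = "\u1101"    # conjoining Jamo GG, lead (head) position
tail_gg = "\u11a9"    # conjoining Jamo GG, tail position
compat_gg = "\u3132"  # Hangul Compatibility Jamo GG

for ch in (lead_gg, tail_gg, compat_gg):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Compatibility normalization (NFKC) maps the compatibility Jamo to the
# conjoining lead Jamo; canonical normalization (NFC) leaves it untouched.
assert unicodedata.normalize("NFKC", compat_gg) == lead_gg
assert unicodedata.normalize("NFC", compat_gg) == compat_gg
```

Note that the Unicode character names use a different naming scheme (CHOSEONG for leads, JONGSEONG for tails) than the short romanized names used in this text.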
Therefore, a Korean text can be seen as a sequence of Hangul, each of which represents one spoken syllable. In this view, Korean script would be seen as a syllabary comparable to the East African Ge’ez script and, to a lesser degree, the Japanese kana scripts. On the other hand, one could understand Korean writing without reference to Hangul at all; according to this perspective, Korean writing is as alphabetical as the Latin script, but uses complicated typographic rules to determine the placement of any Jamo relative to its predecessor and successor.
The latter view is also reminiscent of the Indic scripts of the Brahmi family, where an arbitrary number of consonant signs plus a vowel are graphically combined into a syllable glyph; the combination rules for Indic scripts are, however, much more involved. It has been argued that the Indic model influenced the construction of the Hangul script via the Tibetan Phagspa script. Phagspa was a short-lived script and is now extinct; it is not the predecessor of the modern Tibetan script.
Canonical equivalence even extends to mixed cases of Hangul HWEO 훠 plus tail Jamo LH ㅀ, see here 훯 for a live example (support for this construction is much worse than for the previous). However, there is neither a canonical nor a compatibility equivalence that would allow you to decompose a complex Jamo like LH into its constituents (L and H); therefore, you cannot represent the HWEOLH syllable by something like Jamos H+WEO+L+H or by Hangul HWEOL plus tail Jamo H.
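The equivalences described here can be checked with Unicode normalization. The following is a minimal Python sketch relying on the standard `unicodedata` module; the helper `nfc` and the variable names are assumptions introduced for illustration only.

```python
import unicodedata

def nfc(text):
    """Canonical composition (Normalization Form C)."""
    return unicodedata.normalize("NFC", text)

hweolh = "\ud6ef"                # precomposed Hangul HWEOLH
jamo_seq = "\u1112\u116f\u11b6"  # conjoining Jamo H + WEO + LH
mixed = "\ud6e0\u11b6"           # Hangul HWEO + tail Jamo LH

# All three representations are canonically equivalent.
assert nfc(jamo_seq) == hweolh
assert nfc(mixed) == hweolh
assert unicodedata.normalize("NFD", hweolh) == jamo_seq

# Splitting the complex tail LH into L + H does NOT give the same syllable:
# normalization yields Hangul HWEOL followed by a separate tail Jamo H.
split_tail = "\u1112\u116f\u11af\u11c2"  # H + WEO + tail L + tail H
assert nfc(split_tail) == "\ud6e8\u11c2"
assert nfc(split_tail) != hweolh
```

Whether a browser or font actually renders the equivalent sequences identically is a separate matter that normalization cannot answer.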
The common way of coding Korean text is to use the precomposed Hangul syllables that do not explicitly reference the underlying Jamo characters. Isolated Jamo are rarely found in Korean texts. The Unicode Standard assigns an individual code point to each Hangul. To calculate the code point of a Hangul from its Jamo components, the following formula may be used:
Code point of Hangul = tail + (vowel−1)*28 + (lead−1)*588 + 44032
In this formula, lead, vowel and tail refer to the small integer numbers given in the above tables (if there is no tail consonant, use the value 0). The Hangul syllabary occupies the Unicode range from AC00 (decimal 44032) to D7A3 (decimal 55203). In UTF-8, each Hangul needs three bytes (the same is also true for the Jamo, which is another reason why they are almost never used for encoding Korean text).
In the other direction, the phonetic value of a Hangul can be calculated from its code point. It is convenient to use the modulo function mod(a,b), which yields the remainder of the integer division a/b, and the integer function int(a), which yields the integer part of a.
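The composition formula translates directly into code. The following is a minimal Python sketch; the function name `compose` and the constant `HANGUL_BASE` are illustrative assumptions, and the numbering of lead, vowel and tail follows the convention of this text (leads and vowels start at 1, tail 0 means no tail).

```python
HANGUL_BASE = 44032  # 0xAC00, code point of the first Hangul

def compose(lead, vowel, tail=0):
    """Return the Hangul built from the Jamo numbers used in this text.

    lead is 1..19, vowel is 1..21, tail is 0..27 (0 = no tail consonant).
    """
    if not (1 <= lead <= 19 and 1 <= vowel <= 21 and 0 <= tail <= 27):
        raise ValueError("Jamo number out of range")
    return chr(tail + (vowel - 1) * 28 + (lead - 1) * 588 + HANGUL_BASE)

han = compose(19, 1, 4)  # lead H (19), vowel A (1), tail N (4)
print(han, ord(han), hex(ord(han)))

# Each Hangul occupies three bytes in UTF-8.
print(len(han.encode("utf-8")))
```

The smallest and largest argument values, `compose(1, 1, 0)` and `compose(19, 21, 27)`, yield the first and last code points of the Hangul range.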
tail = mod (Hangul codepoint − 44032, 28)
vowel = 1 + mod (Hangul codepoint − 44032 − tail, 588) / 28
lead = 1 + int [ (Hangul codepoint − 44032)/588 ]
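The three decomposition formulae can be sketched in Python as follows. The function name `decompose` and the constant `HANGUL_BASE` are assumptions made for this illustration; integer division (`//`) plays the role of int and `%` the role of mod.

```python
HANGUL_BASE = 44032  # 0xAC00, code point of the first Hangul

def decompose(hangul):
    """Return (lead, vowel, tail) numbers for a single precomposed Hangul.

    Numbering as in this text: lead 1..19, vowel 1..21, tail 0..27 (0 = none).
    """
    offset = ord(hangul) - HANGUL_BASE
    if not 0 <= offset < 19 * 21 * 28:
        raise ValueError("not a precomposed Hangul")
    tail = offset % 28
    vowel = 1 + ((offset - tail) % 588) // 28
    lead = 1 + offset // 588
    return lead, vowel, tail

print(decompose("\ud55c"))  # Hangul HAN, code point D55C
```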
To illustrate the formulae, let us consider the writing of the words jamo and hangul in Hangul. The Hangul necessary are called JA, MO, HAN and GEUL in Unicode.
|Hangul||JA 자||MO 모||HAN 한||GEUL 글|
|lead consonant||J ㅈ (13)||M ㅁ (7)||H ㅎ (19)||G ㄱ (1)|
|vowel||A ㅏ (1)||O ㅗ (9)||A ㅏ (1)||EU ㅡ (19)|
|tail consonant||– (0)||– (0)||N ㄴ (4)||L ㄹ (8)|
|Hangul code point (dec)||51088||47784||54620||44544|
|Hangul code point (hex)||C790||BAA8||D55C||AE00|
As an inverse problem, we now analyse the two Korean words 서울 and 평양:
|Hangul||서||울||평||양|
|Hangul code point (hex)||C11C||C6B8||D3C9||C591|
|Hangul code point (dec)||49436||50872||54217||50577|
|Code point − 44032||5404||6840||10185||6545|
|tail consonant||– (0)||L ㄹ (8)||NG ㅇ (21)||NG ㅇ (21)|
|vowel||EO ㅓ (5)||U ㅜ (14)||YEO ㅕ (7)||YA ㅑ (3)|
|lead consonant||S ㅅ (10)||– ㅇ (12)||P ㅍ (18)||– ㅇ (12)|
So the two words actually stand for the capitals of South and North Korea, respectively. These are usually rendered in Latin script as Seoul and Pyeongyang, although other romanizations are possible (e.g., Sŏul and P’yŏngyang or Pyongyang).
Note that Unicode also contains obsolete or archaic Jamo that are absent from standard writing. They might still be used in reproducing historical texts, writing Korean dialects or transcribing Chinese words. Unicode does not offer precomposed hangul with these rare letters; instead, syllables containing them must be coded by jamo letters (if only the tail of a syllable is archaic, then the mixed representation with an open-syllable hangul followed by the archaic tail is also possible). An example is the archaic Z ㅿ appearing in the hangul ZIZ ᅀᅵᇫ or GOZ 고ᇫ or 고ᇫ (unlikely to render correctly).
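The analysis above can be automated. The following is a minimal Python sketch that derives the Unicode name of a Hangul from its code point and compares it with the name reported by the standard `unicodedata` module. The three lists are the short Jamo names in canonical order, written out here from the Unicode Standard; the list and function names are illustrative assumptions.

```python
import unicodedata

# Short Jamo names in canonical order (the empty string stands for the mute
# lead consonant and for a missing tail, respectively).
LEADS = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
         "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
VOWELS = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
TAILS = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
         "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
         "K", "T", "P", "H"]

def syllable_name(hangul):
    """Return the short Unicode name (e.g. 'SEO') of a precomposed Hangul."""
    offset = ord(hangul) - 0xAC00
    if not 0 <= offset < 11172:
        raise ValueError("not a precomposed Hangul")
    # The list indices are zero-based, i.e. one less than the lead and vowel
    # numbers used in the tables of this text.
    return LEADS[offset // 588] + VOWELS[offset % 588 // 28] + TAILS[offset % 28]

for word in ("\uc11c\uc6b8", "\ud3c9\uc591"):
    print(word, " ".join(syllable_name(ch) for ch in word))

# Cross-check against the names built into Python for the whole Hangul range.
assert all(
    unicodedata.name(chr(cp)) == "HANGUL SYLLABLE " + syllable_name(chr(cp))
    for cp in range(0xAC00, 0xD7A4)
)
```

Note that the lead form of ㄹ is named R while its tail form is named L, which is why the two lists differ for that letter.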
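How normalization treats such archaic letters can be sketched in Python with the standard `unicodedata` module. The code points below are assumed to be the ones intended in the ZIZ and GOZ examples (lead Z at U+1140, tail Z at U+11EB); the helper `nfc` and the variable names are illustrative.

```python
import unicodedata

def nfc(text):
    """Canonical composition (Normalization Form C)."""
    return unicodedata.normalize("NFC", text)

ziz = "\u1140\u1175\u11eb"       # lead Z + vowel I + tail Z, all conjoining Jamo
goz_jamo = "\u1100\u1169\u11eb"  # lead G + vowel O + tail Z
goz_mixed = "\uace0\u11eb"       # Hangul GO + archaic tail Z

# With an archaic lead, nothing can be composed: the Jamo sequence stays as is.
assert nfc(ziz) == ziz

# With a modern lead and vowel, only the open syllable GO is composed;
# the archaic tail remains a separate conjoining Jamo.
assert nfc(goz_jamo) == goz_mixed
assert len(goz_mixed) == 2
```

The two GOZ representations are therefore canonically equivalent, even though no single precomposed code point exists for the syllable.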
The formulae in the previous section should enable you to analyze a Hangul encountered in some dark corner, or to construct a Hangul out of its Unicode name. To illustrate the formulae, and to make things easier to use, I offer a Hangul Construction Form that allows you to create Hangul from their components, or to decompose them. You might also construct the Hangul for fun, or to study how the visual appearance depends on the size and shape of the Jamo constituents.
You can select constituent Jamo (by Unicode name) from the dropdown menus, or enter data into the text fields. The text fields will accept either valid Unicode names of the right kind (e.g., "N" in the first, "YE" in the second, "LB" in the third or "BBWEOBS" in the last field) or true Korean characters of the right kind (combining or compatibility Jamo in the first three fields, Hangul in the last). In the course of the calculation, text field contents will be normalized into true Korean characters (actually, Compatibility Jamo in the first three fields), irrespective of the type of input. Input parsing is tolerant of case and blanks, but make sure you delete previous field contents.
Clicking the codepoint will display the current form data permanently in a table, which is useful if you wish to study the Hangul shapes in a series with varying Jamo components. Jamo shown are Compatibility Jamo, but Jamo codepoints refer to true combining Jamo.
romanized), the question arises whether the spoken or the written language should be followed in devising a romanization scheme. Both approaches have their merits, and both are actually used.
The Unicode names of Jamo and Hangul closely follow the Revised Romanization of Korean, which has official status in Korea. The revised system basically maps each Jamo to one letter (or a polygraph) of the Latin alphabet, thus creating a faithful representation of written Korean in Latin script. It does not, however, represent the actual pronunciation very well.
Outside of Korea, the older McCune-Reischauer System continues to be popular. It takes into account certain assimilation phenomena that occur on syllable boundaries, and thus comes closer to the actual pronunciation of Korean. On the other hand, McCune-Reischauer romanizations cannot be constructed trivially from sequences of Jamo. The remainder of this section will describe the procedure briefly.
The vowels are transliterated in a rather straightforward way. As special characters, u and o with the breve accent (ŭ, ŏ) are used to denote short reduced vowels. The diacritics are often omitted when publishing on the web.
The romanization of consonants is significantly more involved. The complication arises from the fact that the tail of any syllable may be assimilated to the lead of the following syllable. Therefore, the tail and the lead of consecutive syllables have no separate pronunciations; instead, both are pronounced as one single unit. The McCune-Reischauer romanization acknowledges this fact by also romanizing them as a unit. The following table gives the transliteration of tail/lead combinations involving the most common Jamo.
The apostrophe is used to disambiguate N’G (sequence of Jamo N + Jamo G) from NG (Jamo NG) and to mark aspirated plosives. This sign is often omitted from documents published on the web.
To illustrate the use of that table, we will romanize the national motto of South Korea, 널리 인간을 이롭게 하라 (“Bring benefit to all people”), using the McCune-Reischauer System.
Note that in the combination GAN-EUL (간을), the second hangul starts with an empty lead, thus the transcription value for the N must be taken from the first column of the table. Likewise, in I-ROB (이롭), the first hangul has no tail which means that the penultimate row applies. This is admittedly complicated, but not unreasonably so.
Note: Some browsers have problems displaying this large table. Therefore, its display is now switched off. You may enable it at your own risk (this can take minutes for some versions of the Chrome browser, though Gecko and Safari need only a few seconds).