You’re probably thinking “Wait a moment, you really wrote a post about this ancient technology in 2012?” Yeah, I know, I use WhatsApp too, and not only because it is based on an Erlang backend. So, if you’re already an expert on the matter, you can stop reading now, I won’t take it personally ;)
A few weeks ago, I read some articles about the cost of SMS when sending texts with accented capitals. One of the headlines was something like “Beware the accents, they are worth 70 characters”. Come again please, a single letter that takes up 70 characters? That must be quite some hype, right? When I owned my old faithful Nokia, I simply used to ignore orthography, writing, for example, è or E’ instead of È and so forth. Now that my Android handset has an almost QWERTY keyboard, I try to write correctly, so I knew it was time to understand what was really going on. Follow me, it will only take some very basic maths.
Texts are encoded according to a specification called GSM 03.38. With it, your mobile may choose between 3 different encodings: 7 bit, 8 bit or 16 bit. The 7 bit one is the default GSM alphabet, we’re not interested in the 8 bit one, while the 16 bit one corresponds to UTF-16. In some documents you may read UCS2 instead of UTF-16; this is simply because UTF-16 is the successor of UCS2, and the two agree on the accented letters we care about here. One more fact, and then you’ll be needing your trusted calculator. An SMS is at most 140 octets long. Hey you nerd, I only want my 160 characters! Don’t worry, here they are:
- 7 bit encoding: 140 * 8 / 7 = 160 characters
- 16 bit encoding: 140 * 8 / 16 = 70 characters
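If you left your trusted calculator at home, the same arithmetic fits in a few lines of Python. This is only a sketch of the division above; the names `SMS_PAYLOAD_OCTETS` and `max_chars` are mine, not something taken from the specification.

```python
# Payload size of a single SMS, as stated above.
SMS_PAYLOAD_OCTETS = 140


def max_chars(bits_per_char: int) -> int:
    """Return how many characters of the given width fit in one SMS."""
    total_bits = SMS_PAYLOAD_OCTETS * 8
    return total_bits // bits_per_char


if __name__ == "__main__":
    for bits in (7, 16):
        print(f"{bits} bit encoding: {max_chars(bits)} characters")
```

Running it prints 160 for the 7 bit encoding and 70 for the 16 bit one.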
So, if you write a character not included in the 7 bit alphabet, like the infamous È, your phone will (probably) silently switch to the 16 bit encoding for the whole message. That explains the reduced room for text: the limit drops from 160 to 70 characters.
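To make the switch concrete, here is a rough sketch of the decision a phone might take. It is an assumption about typical behaviour, not the code of any real handset. It only knows the basic 7 bit alphabet and ignores the extension table (characters such as € or the square brackets, which cost two 7 bit slots each), as well as multi-part messages. The names `GSM7_BASIC`, `fits_gsm7` and `sms_capacity` are made up for this example.

```python
# Basic character set of the GSM 7 bit default alphabet.
# Simplification: the extension table (e.g. the euro sign) is left out.
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)


def fits_gsm7(text: str) -> bool:
    """True if every character belongs to the basic 7 bit alphabet."""
    return all(ch in GSM7_BASIC for ch in text)


def sms_capacity(text: str) -> int:
    """Character limit of a single SMS carrying this text (assumed policy)."""
    return 160 if fits_gsm7(text) else 70


if __name__ == "__main__":
    for sample in ("Perché no?", "È tardi"):
        print(repr(sample), "->", sms_capacity(sample), "characters available")
```

Note that è, é and even É are part of the 7 bit alphabet, so they cost nothing extra; È is not, and that single letter is enough to push the whole message to 16 bit.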
Long story short: will an accented character eat up 70 precious characters? No way :)
Reference: 3GPP TS 23.038, Alphabets and language-specific information