Unicode has overtaken ASCII as the most popular character encoding scheme on the World Wide Web. Also vanquished at almost exactly the same time was the Western European encoding.

Unicode is a character encoding standard that accommodates dozens of languages as well as Roman characters with diacritical marks. ASCII, a tried-and-true, decades-old standard, is limited to 128 characters (256 in its extended 8-bit variants) and has a hard time extending beyond the range of a century-old Remington typewriter.

Mark Davis, Google's senior international software architect, said in a blog post that Unicode vanquished ASCII and Western European within 10 days in December.

"What's more impressive than simply overtaking them is the speed with which this happened," he added, pointing to a graph showing the meteoric rise of Unicode.

Google's a fan of Unicode Web sites. When it processes data from a Web site, it first converts that data into Unicode if it is not already in that form. That improves international search abilities.
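As a rough illustration of what "converting into Unicode first" can look like, here is a minimal Python sketch. It is not Google's pipeline, whose details the article does not describe; the function name `to_unicode`, the `declared_charset` parameter, and the Latin-1 fallback are all assumptions made for this example.

```python
def to_unicode(raw: bytes, declared_charset: str = "utf-8") -> str:
    """Decode raw page bytes into a Unicode string (hypothetical helper).

    Tries the charset the page declares; if that charset is unknown or the
    bytes are not valid in it, falls back to Latin-1, which maps every byte
    to a character and therefore never fails (though it may guess wrong).
    """
    try:
        return raw.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        return raw.decode("latin-1")


# A Western European (ISO-8859-1) page fragment: 0xE9 is "é" in that encoding.
legacy_bytes = b"caf\xe9"
text = to_unicode(legacy_bytes, "iso-8859-1")

# Once the text is Unicode, it can be stored or indexed uniformly, e.g. as UTF-8.
utf8_bytes = text.encode("utf-8")
print(text, utf8_bytes)
```

Real crawlers typically also have to detect the charset when the declaration is missing or wrong; that step is left out here.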

"The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover," he said.

Google just converted to Unicode 5.1, he added, "so people speaking languages such as Malayalam can now search for words containing the new characters."

One disadvantage Unicode has compared with ASCII, though, is that it takes at least twice as much memory to store a Roman alphabet character because Unicode uses more bytes to enumerate its vastly larger range of alphabetic symbols.

Comments

1

Brett Zamir - 02/10/08

Hi,

Thank you for passing on this great bit of news.

The last statement in the article is not correct, however. Unicode only takes twice as much memory for the Roman alphabet if the UTF-16 encoding is being used (an encoding useful primarily for representing East Asian texts), while the chart shows the popular form making headway is UTF-8, which deliberately represents ASCII characters not only with the same amount of memory as ASCII, but even the same nature of encoding--so an ASCII file can serve as a UTF-8 file. However, it is true that accented characters in some European languages may take up more memory than the Latin-1 encoding. Beyond any headaches in migration, there's no justifiable reason not to move to UTF-8, including for memory (thus the popularity), and certainly no reason not to use it in new projects.
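The byte counts described in this comment can be checked with a short Python sketch. The sample strings and the helper name `encoded_sizes` are chosen for illustration only; UTF-16 is measured without a byte-order mark so that the per-character cost is easy to see.

```python
# Sample texts, written with escapes so the exact code points are unambiguous.
samples = {
    "ASCII-only": "Unicode",
    "accented Latin": "caf\u00e9",                              # "café", precomposed é
    "Malayalam": "\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02",  # six code points
}


def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 (no BOM) and Latin-1.

    The Latin-1 entry is None when the text cannot be represented in it.
    """
    sizes = {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }
    try:
        sizes["latin-1"] = len(text.encode("latin-1"))
    except UnicodeEncodeError:
        sizes["latin-1"] = None
    return sizes


for label, text in samples.items():
    print(label, encoded_sizes(text))

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
same_bytes = "Unicode".encode("ascii") == "Unicode".encode("utf-8")
print(same_bytes)
```

For the ASCII-only sample, UTF-8 uses one byte per character and UTF-16 uses two; the accented sample costs one extra byte in UTF-8 compared with Latin-1.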
