Domain Name Club


2012-02-07, 12:50 AM
best-url (site administrator)

The current state of Unicode encoding on the web and on websites

The more widely the web and websites adopt Unicode encoding, the better it is for the development of Chinese (IDN) domain names:

Quote:
Unicode over 60 percent of the web

2/03/2012

Computers store every piece of text using a “character encoding,” which gives a number to each character. For example, the byte 61 (hexadecimal) stands for ‘a’ and 62 stands for ‘b’ in the ASCII encoding, which was launched in 1963. Before the web, computer systems were siloed, and there were hundreds of different encodings. Depending on the encoding, the byte A1 could mean any of ¡, Ё, Ą, Ħ, ‘, ”, or part of any of thousands of multi-byte characters, from æ to 品. If you brought a file from one computer to another, it could come out as gobbledygook.
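
The point about one byte having many meanings can be tried out with a minimal Python sketch that uses only the standard library's codecs. It assumes the byte in question is hex A1, which is where the six listed characters sit in the ISO 8859 family; the helper name `decode_in_all` is made up for illustration.

```python
# Illustrative sketch: one byte, many meanings, depending on the encoding.
# The codec names are Python standard-library codecs; decode_in_all is a
# hypothetical helper, not part of any library.

ENCODINGS = [
    "iso8859_1",   # Western European
    "iso8859_5",   # Cyrillic
    "iso8859_2",   # Central European
    "iso8859_3",   # South European
    "iso8859_7",   # Greek
    "iso8859_13",  # Baltic
]


def decode_in_all(raw: bytes, encodings=ENCODINGS) -> dict:
    """Decode the same bytes under each encoding and collect the results."""
    return {name: raw.decode(name) for name in encodings}


# In ASCII, hex 61 is 'a' and hex 62 is 'b'.
ascii_text = bytes([0x61, 0x62]).decode("ascii")

# Hex A1 is outside ASCII and maps to a different character in each codec.
variants = decode_in_all(bytes([0xA1]))

if __name__ == "__main__":
    print(ascii_text)
    for name, char in variants.items():
        print(f"{name}: {char}")
```

A file written on one system and read with the wrong codec from this list would show the wrong character for every such byte, which is the gobbledygook the quoted post describes.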

Unicode was invented to solve that problem: to encode all human languages, from Chinese (中文) to Russian (русский) to Arabic (العربية), and even emoji symbols; it encodes nearly 75,000 Chinese ideographs alone. In the ASCII encoding, there wasn’t even enough room for all the English punctuation (like curly quotes), while Unicode has room for over a million characters. Unicode was first published in 1991, coincidentally the year the World Wide Web debuted; little did anyone realize at the time that the two would be so important to each other. Today, people can easily share documents on the web, no matter what their language.
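
As a rough illustration of "room for over a million characters", the sketch below lists the Unicode code point and UTF-8 bytes of each character in a string, and computes the size of the code space from Python's `sys.maxunicode`. The function name `describe` and the sample list are assumptions made for this example only.

```python
import sys


def describe(text: str) -> list:
    """Return (character, code point, UTF-8 bytes as hex) for each character."""
    return [(ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex()) for ch in text]


# The three scripts mentioned in the quoted post.
samples = ["中文", "русский", "العربية"]

# sys.maxunicode is the highest code point (0x10FFFF), so the code space
# holds that value plus one, counting from zero.
code_space = sys.maxunicode + 1

if __name__ == "__main__":
    print(f"code points available: {code_space:,}")
    for sample in samples:
        for ch, code_point, utf8_hex in describe(sample):
            print(ch, code_point, utf8_hex)
```

All three scripts fit in the same string type and the same encoding, which is what lets a single page mix them freely.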

Every January, we look at the percentage of the webpages in our index that are in different encodings. Here’s what our data looks like with the latest figures*:

[Graph not reproduced here]



*Your mileage may vary: these figures may vary somewhat from what other search engines find. The graph lumps together encodings by script. We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example. Thanks again to Erik van der Poel for collecting the data.

As you can see, Unicode has experienced an 800 percent increase in “market share” since 2006. Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8). The more documents that are in Unicode, the less likely you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.
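
Two of the claims above can be checked with a short sketch: that ASCII is a subset of most other encodings, and that reading bytes with the wrong encoding produces mojibake. The codec list and variable names are choices made for this example, not anything taken from the quoted post.

```python
# Illustrative sketch using only Python standard-library codecs.

ASCII_TEXT = "Hello, web"
CODECS = ["ascii", "utf-8", "iso8859_1", "iso8859_5", "big5", "gb2312"]

# Pure ASCII text yields the same bytes under every codec in the list,
# which is why ASCII pages can be counted with any of these encodings.
same_bytes = {name: ASCII_TEXT.encode(name) for name in CODECS}
ascii_is_shared = len(set(same_bytes.values())) == 1

# Mojibake: UTF-8 bytes read as if they were ISO 8859-1.
original = "中文"
mojibake = original.encode("utf-8").decode("iso8859_1")

# ISO 8859-1 maps every byte to one character, so this particular mistake
# is reversible by undoing the wrong decode.
recovered = mojibake.encode("iso8859_1").decode("utf-8")

if __name__ == "__main__":
    print(ascii_is_shared)
    print(repr(mojibake))
    print(recovered)
```

Not every wrong decode can be undone this way; a codec that rejects or replaces some bytes loses information, which is one more reason a single shared encoding is helpful.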

We’ve long used Unicode as the internal format for all the text Google searches and processes: any other encoding is first converted to Unicode. Unicode version 6.1 was just released with over 110,000 characters; soon we’ll be updating to that version and to Unicode’s locale data from CLDR 21 (both via ICU). The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover. Without it, our unified index would be nearly impossible. It would be a bit like not being able to convert between the hundreds of currencies in the world; commerce would be, well, difficult. Thanks to Unicode, Google is able to help people find information in almost any language.
http://googleblog.blogspot.com/2012/...nt-of-web.html

Last edited by best-url on 2012-02-07 at 12:58 AM.