DELIVERY & GIFT DETAILS:
Usually ships within 24 hours
Delivery Time and Shipping Rates
Eligible for gift wrap & gift message.
This book offers an in-depth introduction to the Unicode standard and provides tools and techniques for creating globally interoperable software systems, for programmers and for internationalization specialists. The book begins with a structural overview of the standard and a discussion of its history, then looks at the various writing systems represented by Unicode and the challenges associated with each, and presents strategies for implementing various aspects of the standard. Gillam is a senior development engineer. Annotation c. Book News, Inc., Portland, OR
More Reviews and RecommendationsRichard Gillam is a senior development engineer at Trilogy, a leading developer of large-enterprise e-commerce solutions. He is a former member of IBM's Globalization Center of Competency, where he was one of the original designers of the open-source International Components for Unicode and was responsible for several of the international frameworks in the Java Class Libraries. Rich is a former columnist for C++ Report, a regular presenter at the International Unicode Conferences, and a Specialist Member of the Unicode Consortium.
"Rich has a clear, colloquial style that allows him to make even complex Unicode matters understandable. People dealing with Unicode will find this book a valuable resource."
--Dr. Mark Davis, President, The Unicode Consortium
As the software marketplace becomes more global in scope, programmers are recognizing the importance of the Unicode standard for engineering robust software that works across multiple regions, countries, languages, alphabets, and scripts. Unicode Demystified offers an in-depth introduction to the encoding standard and provides the tools and techniques necessary to create today's globally interoperable software systems.
An ideal complement to specifics found in The Unicode Standard, Version 3.0 (Addison-Wesley, 2000), this practical guidebook brings the "big picture" of Unicode into practical focus for the day-to-day programmer and the internationalization specialist alike. Beginning with a structural overview of the standard and a discussion of its heritage and motivations, the book then shifts focus to the various writing systems represented by Unicode--along with the challenges associated with each. From there, the book looks at Unicode in action and presents strategies for implementing various aspects of the standard.
Topics covered include:
With this book as a guide, programmers now have the tools necessary to understand, create, and deploy dynamic software systems across today's increasingly global marketplace.
| Foreword | ||
| Preface | ||
| Pt. I | Unicode in Essence: An Architectural Overview of the Unicode Standard | 1 |
| Ch. 1 | Language, Computers, and Unicode | 3 |
| Ch. 2 | A Brief History of Character Encoding | 25 |
| Ch. 3 | Architecture: Not Just a Pile of Code Charts | 61 |
| Ch. 4 | Combining Character Sequences and Unicode Normalization | 111 |
| Ch. 5 | Character Properties and the Unicode Character Database | 139 |
| Ch. 6 | Unicode Storage and Serialization Formats | 185 |
| Pt. II | Unicode in Depth: A Guided Tour of the Character Repertoire | 211 |
| Ch. 7 | Scripts of Europe | 213 |
| Ch. 8 | Scripts of the Middle East | 255 |
| Ch. 9 | Scripts of India and Southeast Asia | 297 |
| Ch. 10 | Scripts of East Asia | 347 |
| Ch. 11 | Scripts from Other Parts of the World | 405 |
| Ch. 12 | Numbers, Punctuation, Symbols, and Specials | 425 |
| Pt. III | Unicode in Action: Implementing and Using the Unicode Standard | 497 |
| Ch. 13 | Techniques and Data Structures for Handling Unicode Text | 499 |
| Ch. 14 | Conversions and Transformations | 535 |
| Ch. 15 | Searching and Sorting | 595 |
| Ch. 16 | Rendering and Editing | 651 |
| Ch. 17 | Unicode and Other Technologies | 699 |
| Glossary | 721 | |
| Bibliography | 811 | |
| Index | 817 |
The convergence of these two trends means that it's no longer just an English-only market for computer software. More and more, computer software is being used not only by people outside the United States or by people whose first language isn't English, but by people who don't speak English at all. As a result, interest in software internationalization is growing in the software development community.
A lot of things are involved in software internationalization: displaying text in the user's native language (and in different languages depending on the user), accepting input in the user's native language, altering window layouts to accommodate expansion or contraction of text or differences in writing direction, displaying numeric values acording to local customs, indicating events in time according to the local calendar systems, and so on.
This book isn't about any of these things. It's about something more basic, and which underlies most of the issues listed above: representing written language in a computer. There are many different ways to do this; in fact, there are several for just about every language that's been represented in computers. Infact, that's the whole problem. Designing software that's flexible enough to handle data in multiple languages (at least multiple languages that use different writing systems) has traditionally meant not just keeping track of the text, but also keeping track of which encoding scheme is being used to represent it. And if you want to mix text in multiple writing systems, this bookkeeping becomes more and more cumbersome.
The Unicode standard was designed specifically to solve this problem. It aims to be the universal character encoding standard, providing unique, unambiguous representations for every character in virtually every writing system and language in the world. The most recent version of Unicode provides representations for over 90,000 characters.
Unicode has been around for twelve years now and is in its third major revision, adding support for more languages with each revision. It has gained widespread support in the software community and is now supported in a wide variety of operating systems, programming languages, and application programs. Each of the semiannual International Unicode Conferences is better attended than the previous one, and the number of presenters and sessions at the Conferences grows correspondingly.
Representing text isn't as straightforward as it appears at first glance: It's not merely as simple as picking out a bunch of characters and assigning numbers to them. First you have to decide what a "character" is, which isn't as obvious in many writing systems as it is in English. You have to contend with things such as how to represent characters with diacrtical marks applied to them, how to represent clusters of marks that represent syllables, when differently shaped marks on the page are different "characters" and when they're just different ways of writing the same "character," what order to store the characters in when they don't proceed in a straightforward manner from one side of the page to the other (for example, some characters stack on top of each other, or you have two parallel lines of characters, or the reading order of the text on the page zigzags around the line because of differences in natural reading direction), and many similar issues.
The decisions you make on each of these issues for every character affect how various processes, such as comparing strings or analyzing a string for word boundaries, are performed, making them more complicated. In addition, the sheer number of different characters representable using the Unicode standard make many processes on text more complicated.
For all of these reasons, the Unicode standard is a large, complicated affair. Unicode 3.0, the last version published as a book, is 1,040 pages long. Even at this length, many of the explanations are fairly concise and assume the reader already has some degree of familiarity with the problems to be solved. It can be kind of intimidating.
The aim of this book is to provide an easier entree into the world of Unicode. It arranges things in a more pedagogical manner, takes more time to explain the various issues and how they're solved, fills in various pieces of background information, and adds implementation information and information on what Unicode support is already out there. It is this author's hope that this book will be a worthy companion to the standard itself, and will provide the average programmer and the internationalization specialist alike with all the information they need to effectively handle Unicode in their software.
There are a few things you should keep in mind as you go through this book:
All of the examples of text in non-Latin writing systems posed quite a challenge for the production process. The bulk of this manuscript was written on a Power Macintosh G4/450 using Adobe FrameMaker 6 running on MacOS 9. I did the original versions of the various diagrams in Microsoft PowerPoint 98 on the Macintosh. But I discovered that FrameMaker on the Macintosh couldn't handle a lot of the characters I needed to be able to show in the book. I wound up writing the whole thing with little placeholder notes to myself throughout describing what the examples were supposed to look like.
FrameMaker was somewhat compatible with Apple's WorldScript technology, enabling me to do some of the example, but I quickly discovered Acrobat 3, which I was using at the time, wasn't. It crashed when I tried to created PDFs of chapters that included the non-Latin characters. Switching to Windows didn't prove much more fruitful: On both platforms, FrameMaker 6, Adobe Illustrator 9, and Acrobat 3 and 4 were not Unicode compatible. The non-Latin characters would either turn into garbage characters, not show up at all, or show up with very compromised quality.
Late in the process, I decided to switch to the one combination of software and platform I knew would work: Microsoft Office 2000 on Windows 2000, which handles (with varying degrees of difficulty) everything I needed to do. I converted the whole project from FrameMaker to Word and spent a couple of months restoring all the missing examples to the text. (In a few cases where I didn't have suitable fonts at my disposal, or where Word didn't product quite the results I wanted, I either scanned pictures out of books or just left the placeholders in.) The last rounds of revisions were done in Word on either the Mac or on a Windows machine, depending on where I was physically at the time, and all the example text was done on the Windows machine.
Like many in the field, I fell into software internationalization by happenstance. I've always been interested in languagewritten language in particularand (of course) in computers. But my job had never really dealt with this directly.
In the spring of 1995, that changed when I went to work for Taligent. Taligent, you may remember, was the ill-fated joint venture between Apple Computer and IBM (later joined by Hewlett Packard) that was originally formed to create a new operating system for personal computers using state-of-the-art object-oriented technology. The fruit of our labors, CommonPoint, turned out to be too little too late, but it spawned a lot of technologies that found their places in other products.
For a while there, Taligent enjoyed a cachet in the industry as the place where Apple and IBM had sent many of their best and brightest. If you managed to get a job at Taligent, you had "made it."
I almost didn't "make it." I had wanted to work at Taligent for some time and eventually got the chance, but turned in a rather unimpressive interview performance (a couple coworkers kidded me about that for years afterward) and wasn't offered the job. About that same time, a friend of mine did get a job there, and after the person who did get the job I interviewed for turned it down for family reasons, my friend put in a good word for me and I got a second chance.
I probably would have taken almost any job there, but the specific opening was in the text and internationalization group, and thus began my long association with Unicode.
One thing pretty much everybody who ever worked at Taligent will agree on is that working there was a wonderful learning experience: an opportunity, as it were, to "sit at the feet of the masters." Personally, the Taligent experience made me into the programmer I am today. My C++ and OOD skills improved dramatically, I became proficient in Java, and I went from knowing virtually nothing about written language and software internationalization to... well, I'll let you be the judge.
My team was eventually absorbed into IBM, and I enjoyed a little over two years as an IBMer before deciding to move on in early 2000. During my time at Taligent/IBM, I worked on four different sets of Unicode-related text handling routines: the text-editing frameworks in CommonPoint, various text-storage and internationalization frameworks in IBM's Open Class Library, various internationalization facilities in Sun's Java Class Library (which IBM wrote under contract to Sun), and the libraries that eventually came to be known as the International Components for Unicode.
International Components for Unicode, or ICU, began life as an IBM developer-tools package based on the Java internationalization libraries, but has since morphed into an open-source project and taken on a life of its own. It's gaining increasing popularity and showing up in more operating systems and software packages, and it's acquiring a reputation as a great demonstration of how to implement the various features of the Unicode standard. I had the twin privileges of contributing frameworks to Java and ICU and of working alongside those who developed the other frameworks and learning from them. I got to watch the Unicode standard develop, work with some of those who were developing it, occasionally rub shoulders with the others, and occasionally contrbute a tidbit or two to the effort myself. It was a fantastic experience, and I hope that at least some of their expertise rubbed off on me.
For a year and a half, I wrote a column on Java for C++ Report magazine, and for much of that time, I wondered if anyone was actually reading the thing and what they thought of it. I would occasionally get an e-mail from a reader commenting on something I'd written, and I was always grateful, whether the feedback was good or bad, because it meant people were reading the thing and took it seriously enough to let me know what they thought.
I'm hoping there will be more than one edition of this book, and I really want it to be as good as possible. If you read it and find it less than helpful, I hope you won't just throw it on a shelf somewhere and grumble about the money you threw away on it. Please, if this book fails to adequately answer your questions about Unicode, or if it wastes too much time answering questions you don't care about, I want to know. The more specific you can be about just what isn't doing it for you, the better. Please write me at rtgillam@concentric.net with your comments and criticisms.
For that matter, if you like what you see here, I wouldn't mind hearing from you either. God knows, I can use the encouragement.
R. T. G.
As the ecomonies of the world continue to become more connected together, and as the American computer market becomes more and more saturated, computer-related businesses are looking more and more to markets outside the United States to grow their businesses. At the same time, companies in other industries are not only beginning to do the same thing (or, in fact, have been for a long time), but are increasingly turning to computer technology, especially the Internet, to grow their businesses and streamline their operations.
The convergence of these two trends means that it's no longer just an English-only market for computer software. More and more, computer software is being used not only by people outside the United States or by people whose first language isn't English, but by people who don't speak English at all. As a result, interest in software internationalization is growing in the software development community.
A lot of things are involved in software internationalization: displaying text in the user's native language (and in different languages depending on the user), accepting input in the user's native language, altering window layouts to accommodate expansion or contraction of text or differences in writing direction, displaying numeric values acording to local customs, indicating events in time according to the local calendar systems, and so on.
This book isn't about any of these things. It's about something more basic, and which underlies most of the issues listed above: representing written language in a computer. There are many different ways to do this; in fact, there are several for just about every language that's been represented incomputers. In fact, that's the whole problem. Designing software that's flexible enough to handle data in multiple languages (at least multiple languages that use different writing systems) has traditionally meant not just keeping track of the text, but also keeping track of which encoding scheme is being used to represent it. And if you want to mix text in multiple writing systems, this bookkeeping becomes more and more cumbersome.
The Unicode standard was designed specifically to solve this problem. It aims to be the universal character encoding standard, providing unique, unambiguous representations for every character in virtually every writing system and language in the world. The most recent version of Unicode provides representations for over 90,000 characters.
Unicode has been around for twelve years now and is in its third major revision, adding support for more languages with each revision. It has gained widespread support in the software community and is now supported in a wide variety of operating systems, programming languages, and application programs. Each of the semiannual International Unicode Conferences is better attended than the previous one, and the number of presenters and sessions at the Conferences grows correspondingly.
Representing text isn't as straightforward as it appears at first glance: It's not merely as simple as picking out a bunch of characters and assigning numbers to them. First you have to decide what a "character" is, which isn't as obvious in many writing systems as it is in English. You have to contend with things such as how to represent characters with diacrtical marks applied to them, how to represent clusters of marks that represent syllables, when differently shaped marks on the page are different "characters" and when they're just different ways of writing the same "character," what order to store the characters in when they don't proceed in a straightforward manner from one side of the page to the other (for example, some characters stack on top of each other, or you have two parallel lines of characters, or the reading order of the text on the page zigzags around the line because of differences in natural reading direction), and many similar issues.
The decisions you make on each of these issues for every character affect how various processes, such as comparing strings or analyzing a string for word boundaries, are performed, making them more complicated. In addition, the sheer number of different characters representable using the Unicode standard make many processes on text more complicated.
For all of these reasons, the Unicode standard is a large, complicated affair. Unicode 3.0, the last version published as a book, is 1,040 pages long. Even at this length, many of the explanations are fairly concise and assume the reader already has some degree of familiarity with the problems to be solved. It can be kind of intimidating.
The aim of this book is to provide an easier entree into the world of Unicode. It arranges things in a more pedagogical manner, takes more time to explain the various issues and how they're solved, fills in various pieces of background information, and adds implementation information and information on what Unicode support is already out there. It is this author's hope that this book will be a worthy companion to the standard itself, and will provide the average programmer and the internationalization specialist alike with all the information they need to effectively handle Unicode in their software.
There are a few things you should keep in mind as you go through this book:
All of the examples of text in non-Latin writing systems posed quite a challenge for the production process. The bulk of this manuscript was written on a Power Macintosh G4/450 using Adobe FrameMaker 6 running on MacOS 9. I did the original versions of the various diagrams in Microsoft PowerPoint 98 on the Macintosh. But I discovered that FrameMaker on the Macintosh couldn't handle a lot of the characters I needed to be able to show in the book. I wound up writing the whole thing with little placeholder notes to myself throughout describing what the examples were supposed to look like.
FrameMaker was somewhat compatible with Apple's WorldScript technology, enabling me to do some of the example, but I quickly discovered Acrobat 3, which I was using at the time, wasn't. It crashed when I tried to created PDFs of chapters that included the non-Latin characters. Switching to Windows didn't prove much more fruitful: On both platforms, FrameMaker 6, Adobe Illustrator 9, and Acrobat 3 and 4 were not Unicode compatible. The non-Latin characters would either turn into garbage characters, not show up at all, or show up with very compromised quality.
Late in the process, I decided to switch to the one combination of software and platform I knew would work: Microsoft Office 2000 on Windows 2000, which handles (with varying degrees of difficulty) everything I needed to do. I converted the whole project from FrameMaker to Word and spent a couple of months restoring all the missing examples to the text. (In a few cases where I didn't have suitable fonts at my disposal, or where Word didn't product quite the results I wanted, I either scanned pictures out of books or just left the placeholders in.) The last rounds of revisions were done in Word on either the Mac or on a Windows machine, depending on where I was physically at the time, and all the example text was done on the Windows machine.
Like many in the field, I fell into software internationalization by happenstance. I've always been interested in language--written language in particular--and (of course) in computers. But my job had never really dealt with this directly.
In the spring of 1995, that changed when I went to work for Taligent. Taligent, you may remember, was the ill-fated joint venture between Apple Computer and IBM (later joined by Hewlett Packard) that was originally formed to create a new operating system for personal computers using state-of-the-art object-oriented technology. The fruit of our labors, CommonPoint, turned out to be too little too late, but it spawned a lot of technologies that found their places in other products.
For a while there, Taligent enjoyed a cachet in the industry as the place where Apple and IBM had sent many of their best and brightest. If you managed to get a job at Taligent, you had "made it."
I almost didn't "make it." I had wanted to work at Taligent for some time and eventually got the chance, but turned in a rather unimpressive interview performance (a couple coworkers kidded me about that for years afterward) and wasn't offered the job. About that same time, a friend of mine did get a job there, and after the person who did get the job I interviewed for turned it down for family reasons, my friend put in a good word for me and I got a second chance.
I probably would have taken almost any job there, but the specific opening was in the text and internationalization group, and thus began my long association with Unicode.
One thing pretty much everybody who ever worked at Taligent will agree on is that working there was a wonderful learning experience: an opportunity, as it were, to "sit at the feet of the masters." Personally, the Taligent experience made me into the programmer I am today. My C++ and OOD skills improved dramatically, I became proficient in Java, and I went from knowing virtually nothing about written language and software internationalization to... well, I'll let you be the judge.
My team was eventually absorbed into IBM, and I enjoyed a little over two years as an IBMer before deciding to move on in early 2000. During my time at Taligent/IBM, I worked on four different sets of Unicode-related text handling routines: the text-editing frameworks in CommonPoint, various text-storage and internationalization frameworks in IBM's Open Class Library, various internationalization facilities in Sun's Java Class Library (which IBM wrote under contract to Sun), and the libraries that eventually came to be known as the International Components for Unicode.
International Components for Unicode, or ICU, began life as an IBM developer-tools package based on the Java internationalization libraries, but has since morphed into an open-source project and taken on a life of its own. It's gaining increasing popularity and showing up in more operating systems and software packages, and it's acquiring a reputation as a great demonstration of how to implement the various features of the Unicode standard. I had the twin privileges of contributing frameworks to Java and ICU and of working alongside those who developed the other frameworks and learning from them. I got to watch the Unicode standard develop, work with some of those who were developing it, occasionally rub shoulders with the others, and occasionally contrbute a tidbit or two to the effort myself. It was a fantastic experience, and I hope that at least some of their expertise rubbed off on me.
For a year and a half, I wrote a column on Java for C++ Report magazine, and for much of that time, I wondered if anyone was actually reading the thing and what they thought of it. I would occasionally get an e-mail from a reader commenting on something I'd written, and I was always grateful, whether the feedback was good or bad, because it meant people were reading the thing and took it seriously enough to let me know what they thought.
I'm hoping there will be more than one edition of this book, and I really want it to be as good as possible. If you read it and find it less than helpful, I hope you won't just throw it on a shelf somewhere and grumble about the money you threw away on it. Please, if this book fails to adequately answer your questions about Unicode, or if it wastes too much time answering questions you don't care about, I want to know. The more specific you can be about just what isn't doing it for you, the better.
For that matter, if you like what you see here, I wouldn't mind hearing from you either. God knows, I can use the encouragement.
--R. T. G.
loading...
loading...
loading...
Terms of Use, Copyright, and Privacy Policy
© 1997-2010 Barnesandnoble.com llc


