Unicode Made Easy in Visual Studio .NET 2003
by Ryan Geiss - 29 June 2008
What is Unicode?
Try loading a filename into your program with funky foreign characters in it -
it will probably break. You might be able to get away with some of the accented
characters in the Latin alphabet, and a few other unusual symbols, but what about
a filename completely in Arabic, or Vietnamese, or Kanji? Unicode is the answer -
on Windows, it's stored as 16-bit wide characters (rather than the traditional 8-bit
Latin-based 'char') and can encode characters from just about any language on the planet. And,
believe it or not, it's actually very easy to modify your program to work with it.
I did it recently, for the first time (shame on me for waiting so long), and was
very surprised at the ease with which the change was made.
In Visual Studio .NET 2003, writing an application in C/C++, here are the changes
I had to make in order for Unicode to work, so my application could handle strings
(filenames) in any language. Before beginning, realize that the type 'TCHAR'
maps to two different things, depending on your project settings... if Unicode is
turned on, it maps to WCHAR, which is an unsigned short; and if Unicode is
turned off, it maps to char. In [Dev Studio] Unicode, a null-terminated string
is just terminated by a 16-bit zero instead of an 8-bit zero. What's more, the first
128 values in Unicode perfectly match the 128 characters of ASCII, so when you check
if a TCHAR == 'a', it's going to work, either way. It's pleasantly surprising
how interchangeable the code is, once you switch to TCHARs.
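To make that interchangeability concrete, here is a minimal sketch of a helper that compiles unchanged whether TCHAR is a char or a wide character. The name file_part is hypothetical (it is not from any library), and the #else branch is an assumption added only so the sketch also builds on compilers that don't ship <tchar.h>.

```cpp
#include <assert.h>

#ifdef _WIN32
#include <tchar.h>          // real TCHAR and _T() on Windows
#else
typedef wchar_t TCHAR;      // assumption: stand-in where <tchar.h> is missing
#define _T(x) L##x
#endif

// Hypothetical helper: returns a pointer to the file name inside a path,
// i.e. the text after the last '\' or '/'. Nothing in here depends on
// whether a TCHAR is 8 bits or wider.
const TCHAR* file_part(const TCHAR* path)
{
    assert(path != 0);
    const TCHAR* name = path;
    for (const TCHAR* p = path; *p != 0; p++)   // stops at the zero terminator of either width
    {
        if (*p == '\\' || *p == '/')            // plain ASCII literals compare correctly either way
            name = p + 1;
    }
    return name;
}
```

Calling file_part(_T("C:\\dir\\file.txt")) returns a pointer to the "file.txt" part of the same buffer, with or without UNICODE defined.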
Here are the steps:
1. add both of these symbols to your project, as preprocessor definitions:
[Project Settings -> C/C++ -> Preprocessor -> Preprocessor Definitions]
(and don't forget to update both Debug and Release configurations...)
UNICODE
_UNICODE
2. replace all of your 'char' variables with 'TCHAR' (...at least, the ones that
are used for string processing - not the ones where you meant to use BYTE).
3. put a _T() around all string literals; i.e. change all string literals from
"my string" into _T("my string"). This is the most time-consuming part.
However, Visual Studio has a handy and easy macro feature that will let
you burn through this in no time.
4. paste this code into your 'global' header file, that all of your code can see:
#include <tchar.h>
#define my_strcmpi(x,y) _wcsicmp(x, y)
#define my_strcmp(x,y) wcscmp(x, y)
#define my_strrchr(x,y) wcsrchr(x, y)
#define my_strcat(x,y) wcscat(x,y)
#define my_strcpy(x,y) wcscpy(x,y)
#define my_strstr(x,y) wcsstr(x,y)
// for these, be aware that 'n' is the # of CHARACTERS, not bytes!!
#define my_strlen_chars(x) wcslen(x)
#define my_strlen_bytes(x) (wcslen(x)*sizeof(TCHAR))
#define my_strnicmp(x,y,n) _wcsnicmp(x, y, n)
#define my_strncat(x,y,n) wcsncat(x, y, n)
#define my_strncpy(x,y,n) wcsncpy(x, y, n)
#define my_sprintf swprintf
#define my_sscanf swscanf
#define my_fopen _wfopen
#define my_fclose fclose
#define my_fprintf fwprintf
5. go through your code and prefix all of your calls to string-manipulation
functions (in the standard C library) with "my_" to work with the
remapping in the previous step. For example, strrchr() becomes my_strrchr(),
strcmpi() becomes my_strcmpi(), and so on.
6. All of your string manipulation code that used to work on char* 's will
still work just fine - because, when Unicode is enabled, a TCHAR maps to
WCHAR (when off, it maps to char) which is basically an unsigned short.
If you advance a TCHAR* by one (p++), it advances one character -
woohoo. If you check a TCHAR* to see if it's end-of-string (*p==0),
it works - woohoo again. However, there is one place you can
trip up... strlen.
7. Check everywhere you've used strlen(), and see if you were treating the return
value as a number of BYTES, or a number of CHARACTERS. I recommend rewriting
your code to treat it as the number of CHARACTERS. Keep an eye on what you
*do* with the result of strlen(). For example:
char szData[] = "This is a test string.";
char* p = szData;
p += strlen(szData);
strcpy(p, ".. and this is a second string.");
...made Unicode-friendly, using the steps above, it works just fine:
TCHAR szData[] = _T("This is a test string.");
TCHAR* p = szData;
p += my_strlen_chars(szData);
my_strcpy(p, _T(".. and this is a second string."));
However, if we were using strlen to get the number of chars in a string,
and using that as if it were the number of BYTES in the string, THEN we'd
be in trouble. For example:
TCHAR szData[] = _T("This is a test string.");
TCHAR szCopy[256];
int nChars = (int)my_strlen_chars(szData); // this is a number of CHARACTERS...
memcpy(szCopy, szData, nChars+1); // WRONG BAD NO NO NO - memcpy wants BYTES. OOPS - only got half the string!
The above code tries to copy a string, assuming that the number returned
by strlen() can be used as the # of bytes in the string. WRONG. It can't.
So - just double-check wherever you use strlen(), and you will probably be
just fine.
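For contrast, here is a minimal sketch of a copy that keeps the two units straight: lengths are counted in characters, and the count is converted to bytes only at the memcpy call. The helper name copy_tstring is hypothetical, the two strlen macros repeat the idea of the mapping from step 4 so the sketch is self-contained, and the #else branch is an assumption for compilers without <tchar.h>.

```cpp
#include <assert.h>
#include <string.h>   // memcpy, strlen
#include <wchar.h>    // wcslen

#ifdef _WIN32
#include <tchar.h>
#else
typedef wchar_t TCHAR;      // assumption: stand-in where <tchar.h> is missing
#define _T(x) L##x
#define _UNICODE            // the stand-in TCHAR is always wide
#endif

#ifdef _UNICODE
#define my_strlen_chars(x) wcslen(x)
#else
#define my_strlen_chars(x) strlen(x)
#endif
#define my_strlen_bytes(x) (my_strlen_chars(x) * sizeof(TCHAR))

// Hypothetical helper: copies src (terminator included) into dst, which has
// room for dstChars CHARACTERS. Returns false if the string does not fit.
bool copy_tstring(TCHAR* dst, size_t dstChars, const TCHAR* src)
{
    assert(dst != 0 && src != 0);
    size_t nChars = my_strlen_chars(src);            // CHARACTERS, terminator excluded
    if (nChars + 1 > dstChars)
        return false;
    memcpy(dst, src, (nChars + 1) * sizeof(TCHAR));  // BYTES, terminator included
    return true;
}
```

The only place sizeof(TCHAR) appears is at the boundary with a byte-oriented function, which is the habit worth keeping for fread, fwrite and memset as well.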
8. A few other miscellaneous little things you should check:
-change std::string to std::wstring
-check all fread/fwrite calls (when used to read/write strings - check
the # of bytes!)
-check all memcpy / memset / ZeroMemory calls (when used to copy/clear
strings)
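-check all registry calls that read/write REG_SZ strings - the data size you
pass to RegSetValueEx (and get back from RegQueryValueEx) is in BYTES, not
characters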
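A few of the items in this checklist can be sketched in code. This is only an illustration under stated assumptions: the names tstring, clear_tchars and write_tstring are hypothetical, and the #else branch exists just so the sketch builds where <tchar.h> is unavailable.

```cpp
#include <assert.h>
#include <stdio.h>    // FILE, fwrite
#include <string.h>   // memset
#include <string>     // std::basic_string

#ifdef _WIN32
#include <tchar.h>
#else
typedef wchar_t TCHAR;      // assumption: stand-in where <tchar.h> is missing
#define _T(x) L##x
#endif

// Hypothetical alias: the same type as std::string when TCHAR is char,
// and the same type as std::wstring when TCHAR is wchar_t.
typedef std::basic_string<TCHAR> tstring;

// Clears a buffer that holds nChars CHARACTERS; memset wants BYTES.
void clear_tchars(TCHAR* buf, size_t nChars)
{
    assert(buf != 0);
    memset(buf, 0, nChars * sizeof(TCHAR));
}

// Writes a string's characters (no terminator) to an open binary file
// and returns the number of BYTES written.
size_t write_tstring(FILE* f, const tstring& s)
{
    assert(f != 0);
    return fwrite(s.c_str(), sizeof(TCHAR), s.size(), f) * sizeof(TCHAR);
}
```

Using a single alias like tstring keeps the std::string / std::wstring choice in one place, so it follows the UNICODE setting the same way TCHAR does.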
--------------------------------------------------------------------
Again, the great thing about these changes is that now, if you remove the
UNICODE/_UNICODE #defines from your project, voila - you are back to regular
non-unicode, and your code works fine. No code changes at all. :) (Well,
you also have to modify my list of string-manipulation function mappings
above to also remap to the non-unicode case... using #ifdefs, a one-time,
two-minute task.)
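Here is a minimal sketch of what that #ifdef'd mapping might look like. It covers only the functions that exist in the standard C library in both flavors; the Microsoft-specific entries would follow the same pattern (for example _wcsicmp pairs with _stricmp, _wcsnicmp with _strnicmp, and _wfopen with fopen). The helper has_extension is hypothetical and is there only to show the remapped names in use, and the #else branch of the first block is an assumption for compilers without <tchar.h>.

```cpp
#include <assert.h>
#include <string.h>   // strcmp, strlen, ...
#include <wchar.h>    // wcscmp, wcslen, ...

#ifdef _WIN32
#include <tchar.h>
#else
typedef wchar_t TCHAR;      // assumption: stand-in where <tchar.h> is missing
#define _T(x) L##x
#define _UNICODE            // the stand-in TCHAR is always wide
#endif

#ifdef _UNICODE
  #define my_strcmp(x,y)      wcscmp(x, y)
  #define my_strrchr(x,y)     wcsrchr(x, y)
  #define my_strcat(x,y)      wcscat(x, y)
  #define my_strcpy(x,y)      wcscpy(x, y)
  #define my_strstr(x,y)      wcsstr(x, y)
  #define my_strlen_chars(x)  wcslen(x)
  #define my_strncat(x,y,n)   wcsncat(x, y, n)
  #define my_strncpy(x,y,n)   wcsncpy(x, y, n)
#else
  #define my_strcmp(x,y)      strcmp(x, y)
  #define my_strrchr(x,y)     strrchr(x, y)
  #define my_strcat(x,y)      strcat(x, y)
  #define my_strcpy(x,y)      strcpy(x, y)
  #define my_strstr(x,y)      strstr(x, y)
  #define my_strlen_chars(x)  strlen(x)
  #define my_strncat(x,y,n)   strncat(x, y, n)
  #define my_strncpy(x,y,n)   strncpy(x, y, n)
#endif
// 'n' above is always a number of CHARACTERS; this one converts to bytes.
#define my_strlen_bytes(x)    (my_strlen_chars(x) * sizeof(TCHAR))

// Hypothetical helper built only on the remapped names: true if 'name'
// ends in the extension 'ext' (for example _T(".mp3")), case-sensitively.
bool has_extension(const TCHAR* name, const TCHAR* ext)
{
    assert(name != 0 && ext != 0);
    const TCHAR* dot = my_strrchr(name, '.');
    return dot != 0 && my_strcmp(dot, ext) == 0;
}
```

Because the calling code only ever sees the my_ names, flipping the UNICODE/_UNICODE definitions switches every call site at once.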
That's about it for now... hopefully, several billion people will thank
you for taking the time to do this!
This document copyright (c)2008+ Ryan M. Geiss.