Unicode Made Easy in Visual Studio .NET 2003
by Ryan Geiss - 29 June 2008


  What is Unicode?

  Try loading a filename into your program with funky foreign characters in it - 
  it will probably break.  You might be able to get away with some of the accented
  characters in the latin alphabet, and a few other unusual symbols, but what about
  a filename completely in Arabic, or Vietnamese, or Kanji?  Unicode is the answer -
  it's a character set that covers just about any language on the planet.  On
  Windows, each character is stored as a 16-bit wide character (rather than the
  traditional 8-bit latin-based 'char'); a few rare characters take a pair of them.  And,
  believe it or not, it's actually very easy to modify your program to work with it.  
  I did it recently, for the first time (shame on me for waiting so long), and was 
  very surprised at the ease with which the change was made.
  
  Here are the changes I had to make to a C/C++ application in Visual Studio 
  .NET 2003 so that it could handle strings (filenames) in any language.  Before 
  beginning, realize that the type 'TCHAR' maps to two different things, depending 
  on your project settings... if Unicode is turned on, it maps to WCHAR, which is 
  a 16-bit unsigned short; and if Unicode is turned off, it maps to char.  In a 
  Unicode build, a null-terminated string is just terminated by a 16-bit zero 
  instead of an 8-bit zero.  What's more, the first 128 values in Unicode perfectly 
  match the 128 characters of ASCII, so when you check if a TCHAR == 'a', it's 
  going to work, either way.  It's pleasantly surprising how interchangeable the 
  code is, once you switch to TCHARs.
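
  To make the mapping concrete, here is a minimal sketch of the idea behind 
  TCHAR and _T().  The names my_tchar, MY_T, MY_UNICODE and count_char are made 
  up for this illustration (the real definitions live in <tchar.h>), and the 
  sketch uses the standard wchar_t so that it compiles outside Visual Studio, 
  where wchar_t may be wider than 16 bits.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the project-wide Unicode switch.
#define MY_UNICODE 1

#if MY_UNICODE
typedef wchar_t my_tchar;      // wide build: one element per wide character
#define MY_T(x) L##x           // "abc" becomes L"abc"
#else
typedef char my_tchar;         // narrow build: plain 8-bit char
#define MY_T(x) x              // the literal is left alone
#endif

// Counts how many times 'c' occurs in the null-terminated string 's'.
// The same source compiles for both builds.
std::size_t count_char(const my_tchar* s, my_tchar c)
{
    assert(s != 0);
    std::size_t n = 0;
    for (; *s != 0; ++s)       // stops at the zero terminator, 8-bit or wide
        if (*s == c)
            ++n;
    return n;
}
```

  Calling count_char(MY_T("banana"), 'a') gives 3 in either build; note that 
  the plain 'a' compares fine against a wide character.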
  
  Here are the steps:
  
  1. add both of these symbols to your project, as preprocessor definitions:
     [Project Settings -> C/C++ -> Preprocessor -> Preprocessor Definitions]
     (and don't forget to update both Debug and Release configurations...)
        
        UNICODE
        _UNICODE
        
  2. replace all of your 'char' variables with 'TCHAR' (...at least, the ones that 
     are used for string processing - not the ones where you meant to use BYTE).
     
  3. put a _T() around all string literals; i.e. change all string literals from 
     "my string" into _T("my string").  This is the most time-consuming part.
     However, Visual Studio has a handy and easy macro feature that will let
     you burn through this in no time.
     
  4. paste this code into your 'global' header file, that all of your code can see:
  
        #include <tchar.h>
        
        #define my_strcmpi(x,y)    _wcsicmp(x, y)
        #define my_strcmp(x,y)     wcscmp(x, y)
        #define my_strrchr(x,y)    wcsrchr(x, y)
        #define my_strcat(x,y)     wcscat(x,y)
        #define my_strcpy(x,y)     wcscpy(x,y)
        #define my_strstr(x,y)     wcsstr(x,y)
        
        // string length, in CHARACTERS or in BYTES (terminator not counted)
        #define my_strlen_chars(x) wcslen(x)
        #define my_strlen_bytes(x) (wcslen(x)*sizeof(TCHAR))

        // for these, be aware that 'n' is the # of CHARACTERS, not bytes!!
        #define my_strnicmp(x,y,n) _wcsnicmp(x, y, n)
        #define my_strncat(x,y,n)  wcsncat(x, y, n)
        #define my_strncpy(x,y,n)  wcsncpy(x, y, n)
    
        #define my_sprintf         swprintf
        #define my_sscanf          swscanf
        #define my_fopen           _wfopen
        #define my_fclose          fclose
        #define my_fprintf         fwprintf
  
  5. go through your code and prefix all of your calls to string-manipulation 
     functions (in the standard C library) with "my_" to work with the 
     remapping in the previous step.  For example, strrchr() becomes my_strrchr(), 
     strcmpi() becomes my_strcmpi(), and so on.
       
  6. All of your string manipulation code that used to work on char* 's will 
     still work just fine - because a TCHAR is still a simple fixed-size 
     integer type (WCHAR, basically an unsigned short, when Unicode is enabled; 
     char when it's off).  If you advance a TCHAR* by one (p++), it advances 
     one character - woohoo.  If you check a TCHAR* to see if it's 
     end-of-string (*p==0), it works - woohoo again.  However, there is one 
     place you can trip up... strlen.
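
     As a small illustration of that kind of pointer walking, here is a 
     sketch that finds the filename part of a path.  The function name 
     filename_part is made up for this example, and it uses wchar_t directly 
     (instead of TCHAR) so that it compiles without <tchar.h>.

```cpp
#include <cassert>

// Returns a pointer to the part of 'path' after the last '\\' or '/',
// or 'path' itself if it contains no separator.
const wchar_t* filename_part(const wchar_t* path)
{
    assert(path != 0);
    const wchar_t* name = path;
    for (const wchar_t* p = path; *p != 0; ++p)   // ++p moves one character, not one byte
        if (*p == '\\' || *p == '/')              // plain char literals compare fine
            name = p + 1;
    return name;
}
```

     The loop is exactly what you would write for a char*; only the pointer 
     type changed.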
     
  7. Check everywhere you've used strlen(), and see if you were treating the return
     value as a number of BYTES, or a number of CHARACTERS.  I recommend rewriting 
     your code to treat it as the number of CHARACTERS.  Keep an eye on what you
     *do* with the result of strlen().  For example:
     
        char szData[256] = "This is a test string.";   // leave room to append!
        char* p = szData;
        p += strlen(szData);
        strcpy(p, ".. and this is a second string.");

     ...made Unicode-friendly, using the steps above, it works just fine:
     
        TCHAR szData[256] = _T("This is a test string.");   // leave room to append!
        TCHAR* p = szData;
        p += my_strlen_chars(szData);
        my_strcpy(p, _T(".. and this is a second string."));
        
     However, if we were using strlen to get the number of chars in a string,
     and using that as if it were the number of BYTES in the string, THEN we'd 
     be in trouble.  For example:

        TCHAR szData[] = _T("This is a test string.");
        TCHAR szCopy[256];
        size_t nChars = my_strlen_chars(szData);   // number of CHARACTERS
        memcpy(szCopy, szData, nChars+1);   // WRONG BAD NO NO NO - memcpy wants BYTES,
                                            // so this only got half the string!
        
     The above code tries to copy a string, assuming that the number returned
     by strlen() can be used as the # of bytes in the string.  WRONG.  It can't.
     Here, memcpy() needs (nChars+1)*sizeof(TCHAR) bytes.  So - just double-check 
     wherever you use strlen(), and you will probably be just fine.
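
     For comparison, here is a sketch of a copy that gets the byte count 
     right.  The helper name copy_wide_string is hypothetical, and it uses 
     wchar_t directly (instead of TCHAR) so that it compiles without <tchar.h>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Copies the null-terminated wide string 'src' into 'dst' with memcpy.
// 'dstChars' is the capacity of 'dst' in CHARACTERS.
// Returns false (and copies nothing) if 'src' plus its terminator does not fit.
bool copy_wide_string(wchar_t* dst, std::size_t dstChars, const wchar_t* src)
{
    assert(dst != 0 && src != 0);
    std::size_t nChars = std::wcslen(src) + 1;        // characters, incl. terminator
    if (nChars > dstChars)
        return false;
    std::size_t nBytes = nChars * sizeof(wchar_t);    // memcpy counts BYTES
    std::memcpy(dst, src, nBytes);
    return true;
}
```

     Keeping the character count and the byte count in separately named 
     variables makes it much harder to pass the wrong one.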

  8. A few other miscellaneous little things you should check:
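        -change std::string to std::wstring
        -check all fread/fwrite calls (when used to read/write strings - check 
          the # of bytes!)
        -check all memcpy / memset / ZeroMemory calls (when used to copy/clear 
          strings - again, check the # of bytes!)
        -check all registry calls that read/write REG_SZ values - the type 
          stays REG_SZ, but the data size is given in BYTES (including the 
          null terminator), not characters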
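
     As an example of the fread/fwrite check, here is a sketch that writes a 
     wide string to a file and reads it back.  The helper names 
     write_wide_string and read_wide_string are made up for this illustration, 
     and wchar_t is used directly (instead of TCHAR) so that it compiles 
     without <tchar.h>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cwchar>

// Writes 'text' (without its terminator) to 'fp' as raw wide characters.
// Returns true if every character was written.
bool write_wide_string(std::FILE* fp, const wchar_t* text)
{
    assert(fp != 0 && text != 0);
    std::size_t nChars = std::wcslen(text);
    // element size = one character, element count = number of characters,
    // so fwrite moves nChars * sizeof(wchar_t) bytes in total
    return std::fwrite(text, sizeof(wchar_t), nChars, fp) == nChars;
}

// Reads up to bufChars-1 characters from 'fp' into 'buf' and terminates it.
// Returns the number of characters read.
std::size_t read_wide_string(std::FILE* fp, wchar_t* buf, std::size_t bufChars)
{
    assert(fp != 0 && buf != 0 && bufChars > 0);
    std::size_t nChars = std::fread(buf, sizeof(wchar_t), bufChars - 1, fp);
    buf[nChars] = 0;
    return nChars;
}
```

     Passing sizeof(wchar_t) as the element size lets fread/fwrite do the 
     bytes-per-character math, so the rest of the code only ever deals in 
     characters.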

  --------------------------------------------------------------------        
  
  Again, the great thing about these changes is that now, if you remove the 
  UNICODE/_UNICODE #defines from your project, voila - you are back to regular 
  non-unicode, and your code works fine.  No code changes at all.  :)  (Well, 
  you also have to modify my list of string-manipulation function mappings 
  above to also remap to the non-unicode case... using #ifdefs, a one-time, 
  two-minute task.)
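
  That #ifdef remapping might look something like the sketch below, which 
  would replace the #define lines from step 4.  The choice of narrow 
  functions on the #else side is my assumption here (_stricmp and _strnicmp 
  are the Microsoft C runtime names for the case-insensitive compares); 
  adjust the list to whatever your code actually calls.

```cpp
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

#ifdef UNICODE
    // wide-character versions (Unicode build)
    #define my_strcmpi(x,y)    _wcsicmp(x, y)
    #define my_strcmp(x,y)     wcscmp(x, y)
    #define my_strrchr(x,y)    wcsrchr(x, y)
    #define my_strcat(x,y)     wcscat(x, y)
    #define my_strcpy(x,y)     wcscpy(x, y)
    #define my_strstr(x,y)     wcsstr(x, y)
    #define my_strlen_chars(x) wcslen(x)
    #define my_strlen_bytes(x) (wcslen(x)*sizeof(wchar_t))
    #define my_strnicmp(x,y,n) _wcsnicmp(x, y, n)
    #define my_strncat(x,y,n)  wcsncat(x, y, n)
    #define my_strncpy(x,y,n)  wcsncpy(x, y, n)
    #define my_sprintf         swprintf
    #define my_sscanf          swscanf
    #define my_fopen           _wfopen
    #define my_fclose          fclose
    #define my_fprintf         fwprintf
#else
    // 8-bit versions (non-Unicode build)
    #define my_strcmpi(x,y)    _stricmp(x, y)
    #define my_strcmp(x,y)     strcmp(x, y)
    #define my_strrchr(x,y)    strrchr(x, y)
    #define my_strcat(x,y)     strcat(x, y)
    #define my_strcpy(x,y)     strcpy(x, y)
    #define my_strstr(x,y)     strstr(x, y)
    #define my_strlen_chars(x) strlen(x)
    #define my_strlen_bytes(x) (strlen(x)*sizeof(char))
    #define my_strnicmp(x,y,n) _strnicmp(x, y, n)
    #define my_strncat(x,y,n)  strncat(x, y, n)
    #define my_strncpy(x,y,n)  strncpy(x, y, n)
    #define my_sprintf         sprintf
    #define my_sscanf          sscanf
    #define my_fopen           fopen
    #define my_fclose          fclose
    #define my_fprintf         fprintf
#endif
```

  Since these are macros, a name that doesn't exist on your compiler only 
  causes an error where the macro is actually used.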
  
  That's about it for now... hopefully, several billion people will thank 
  you for taking the time to do this!
   
   
   
This document copyright (c)2008+ Ryan M. Geiss.