Jun 4, 2007

Projects that go both ways - Unicode and Not

Getting a project to support Unicode when programming in C++ on Microsoft Windows can be tricky for those who haven't touched the subject. Here I'll break down the basics and go over some simple rules.

When using strings in your code, don't use the type char; instead, use TCHAR. Likewise, pointer types like LPSTR get a T inserted to indicate a TCHAR, so LPSTR turns into LPTSTR (and LPCSTR into LPCTSTR). These types are redefined depending on whether your project is set to use Unicode or not: TCHAR is wchar_t in a Unicode build and char otherwise. String literals of this type should be wrapped in TEXT(). String functions that operate on this type are prefixed with _tcs instead of str, so strlen becomes _tcslen and strcpy becomes _tcscpy.

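The macros you should learn are TEXT, _TEXT, and _T, plus the L prefix. _TEXT, _T, and TEXT give a string literal the appropriate type depending on the project settings/preprocessor definitions. Nothing is converted at run time: when Unicode is defined they simply put an L in front of the literal, so the compiler emits it as a wide (wchar_t) string. _T and _TEXT are literally the same thing, although TEXT is slightly different. TEXT comes from the Windows SDK headers and checks the UNICODE preprocessor definition, while _T and _TEXT come from the C runtime's tchar.h and check _UNICODE. In practice this only matters if you define one and not the other: the Windows API types and the _tcs functions then disagree about what a TCHAR is, and you get type mismatch errors. The Visual Studio character set setting defines both together, so normally you never notice. L is not a macro but a literal prefix built into the language; L"text" is always a wide string, regardless of the project settings.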

For an example, use VS2005 to start a new Win32 Console application with default settings, except check the option to include common headers for ATL. VS2005 sets new projects to use Unicode by default, so it's a good way to observe how things should be done. Once the project is created, we can already see that the main function is actually called _tmain, and that the parameter for the command line arguments is of type _TCHAR *.

Let's enter some code into our main function so that we can see what is happening. Declare a TCHAR array and set it to a string, using the TEXT macro to format the text. Next, print the string using _tprintf. Finally, call MessageBox with the string as either the caption or the text. The body of your main function should look something like below.

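TCHAR myNewString[] = TEXT("Hello There!\n");
// Pass the text as an argument rather than using it as the format string.
_tprintf(TEXT("%s"), myNewString);
// MessageBox resolves to MessageBoxA or MessageBoxW depending on the character set.
MessageBox(NULL, myNewString, TEXT("Some Text Here"), MB_OK);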

Run the project, and you will see the output. It's exactly what you would expect. Now open the project properties and switch the Character Set option (under Configuration Properties, General) between Use Multi-Byte Character Set, Use Unicode Character Set, and Not Set. You'll see that no matter which one you choose, the project compiles and runs the same way.

Place a breakpoint at the MessageBox line and set the project's character set to Unicode. Start debugging the application. When it breaks on the MessageBox line, look at your watches, locals, or autos for the address where the string you created is stored. Open up a memory view (Debug:Windows:Memory:Memory x) and enter the address of the string. You will see that in memory, each character of your string takes up two bytes. You'll also notice that every second byte is 0. This is only because we used ASCII characters: they are the first 128 code points in Unicode, so each one fits in the low byte of its two-byte (UTF-16) value and the high byte stays 0. Characters outside that range would fill both bytes. If you set the project to Not Set or multi-byte, you will notice that the string is a normal one-byte-per-character ASCII string.

Still confused? Yeah, I am a little too, oh well ;-P