PHP - Unicode Conversion
|Posted by jonathanbull, 06-10-2009, 09:39 PM|
I've been struggling with this for what seems like forever now. What I'm trying to do is create a script that converts any form input into the UTF-16 equivalent.
This guy is doing exactly what I'm after, but I can't for the life of me figure out how it's done. For example, if you enter "abc", the output is "0061 0062 0063". If you enter something in, say, Chinese, it can figure it out too.
I've tried iconv, mbstring and countless other things but none seem to give the correct results. Any ideas would be really appreciated!
Thanks in advance guys.
Last edited by jonathanbull; 06-10-2009 at 09:44 PM.
|Posted by BMG_Servers, 06-11-2009, 12:21 AM|
|Posted by mwatkins, 06-11-2009, 01:33 AM|
|Almost an aside, but why UTF-16? Do you have a specific need for this? utf-8 covers a good deal of ground and has some real plusses over -16.
|Posted by aniketh, 06-11-2009, 03:19 AM|
|Does the iconv php extension help? (tried to post link to manual but not enough posts)
|Posted by jonathanbull, 06-11-2009, 10:54 AM|
|The data is going to be sent through an SMS gateway. Unfortunately, as it will often contain characters from scripts such as Arabic and Chinese, the gateway requires that it be submitted in UTF-16 hex form.
Believe me, I wouldn't have chosen it!
|Posted by mwatkins, 06-11-2009, 04:34 PM|
|I'm going to try to help but only half-way. To be honest, lousy support for Unicode data in PHP is one of the major reasons I left PHP behind many years ago.
The example string I'll keep referring to is:
Ç'është Unicode?, in Albanian; 유니코드에 대해? in Korean
Whatever encoding you receive it in from your web app (i.e. a form), you'll need to convert it to UTF-16. The mb functions *ought* to do this; if you are having problems with them, perhaps you aren't getting the data back in the encoding you think you are. Be sure of that before going off elsewhere.
Once you are sure you know what encoding the byte string you have obtained from the user is in, then you can easily use the mb functions to convert to another.
For a Python equivalent (close as I can since Python has native Unicode support) see the function and its matching output "other_encodings" below.
Unicode handling can be vexing, even at the simplest level. I know I initially struggled with this some years ago. It really helps to get a firm grip on what the data we are looking at actually is, since some of the encodings look like ASCII (e.g. utf-8 encoded strings will often look like plain ascii in the absence of non-ASCII characters), and this can lead us down weird rat holes.
Ideally an application should do its internal operations on native unicode strings (assuming your language has such a data type), converting to a specific encoding only for output. To do this you need to decode the input data (which typically will be encoded as utf-8 or one of the Windows encodings, and is best thought of as a "binary string") into a unicode object or string. To decode you need to know what the source encoding is to begin with, and that isn't always simple to find out.
When it is time to issue output (to a file, or an http response) you encode to the character set you wish to use.
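To make the decode / work / encode cycle concrete, here is a minimal Python 3 sketch. It is only an illustration: the variable names are made up, the input bytes are a stand-in for whatever a form would deliver, and it assumes the input really is UTF-8.

```python
# Stand-in for bytes received from a form; the escapes spell "Ç'është Unicode?".
# Assumption: the client really sent UTF-8.
raw = "\u00c7'\u00ebsht\u00eb Unicode?".encode("utf-8")

# 1. Decode at the boundary: bytes -> native Unicode string.
text = raw.decode("utf-8")

# 2. Work on characters, not bytes.
shouted = text.upper()

# 3. Encode only when producing output, in whatever encoding the consumer wants.
utf16_bytes = shouted.encode("utf-16-be")

# Same text, three different sizes: characters, UTF-8 bytes, UTF-16 bytes.
print(len(text), len(raw), len(utf16_bytes))
```

The point is that the character count and the byte counts differ, so anything that measures or slices the data should happen on the decoded string.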
Some web frameworks do all this for you more or less automatically, which is really, really, nice.
You can help force the input encoding by declaring an appropriate character set in your |
|Posted by mwatkins, 06-11-2009, 04:40 PM|
|Sadly some of the characters, notably the XML character entities, are being interpreted by the "code" block rather than output as-is. You'll have to take my word for it, or install Python 3+ and run the script yourself.
Another article worth reading:
Last edited by mwatkins; 06-11-2009 at 04:52 PM.
|Posted by mwatkins, 06-11-2009, 05:20 PM|
|Another bump in the road you may be hitting is the form encoding. I've seen different browser behaviour in the past and this might be introducing another variable into the mix.
Why not try to collect your data using UTF-8 - meaning set the page headers, the form accept-charset attribute, etc. all to UTF-8 - and then in your communication with the gateway send the mb_convert_encoding'd string as UTF-16?
|Posted by jonathanbull, 06-11-2009, 08:47 PM|
|Firstly, thank you all for your replies - I never expected such a helpful response! It truly baffles me that there seems to be no reliable method of conversion for this in PHP whereas other languages like Java can do it in a couple of lines.
Anyway, back to the issue. I've made sure that the page is defined as UTF-8 in the head tags and made sure that the form input is coming in as UTF-8 as well.
Just to reiterate, it's the HEX UTF-16 values that I'm after. I've found bin2hex($str) which works fine for getting the UTF-8 HEX values, just not the UTF-16 ones. I also tried mb_convert_encoding($str, "UTF-16", "UTF-8") but the output always comes out as something crazy like this - a long way away from any HEX values!
Any ideas guys?
Last edited by jonathanbull; 06-11-2009 at 08:51 PM.
|Posted by mwatkins, 06-12-2009, 11:31 AM|
|Unicode handling in PHP is one of the things I loathe about the language and it is in fact one of the main reasons I left (or never returned to) PHP many years ago. The sheer inconsistency of the language is another reason. Having to recompile the interpreter to include basic functionality that a development language should offer to web developers is yet another reason. Performance of the execution model is yet another reason...
Sorry, I don't do that very often. Back on track - I think I can help you understand why your bin2hex($str) isn't working out with UTF-16 values - it is almost certainly because of the BOM, the Byte Order Mark, which you'll find at the start of *most* strings and files composed of UTF-16 encoded data. The BOM indicates whether the bytes of each 16-bit unit are stored in big-endian or little-endian order (historically a consequence of the CPU that wrote them).
To complicate matters further, some languages have support for UCS-2 and UCS-4; UCS defines a universal character set, and UCS-2 and UCS-4 are its fixed-width 2-byte and 4-byte encodings. If you are writing decoding schemes from scratch because your language doesn't have a working implementation then you need to take that into account, too. Sigh.
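A quick way to see the BOM and the byte order for yourself is to hex-dump the same string under the three UTF-16 codec names in Python 3. This is only a sketch for cross-checking; whether PHP's "UTF-16" label behaves like Python's generic "utf-16" codec is something you would need to verify on your own install.

```python
import codecs

text = "abc"

# Generic codec: Python prepends a BOM and uses the platform's byte order.
with_bom = text.encode("utf-16")

# Explicit byte order: no BOM is written.
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

print(with_bom.hex())  # first two bytes are the BOM (fffe or feff)
print(big.hex())       # 006100620063
print(little.hex())    # 610062006300
```

If the hex you get back looks "almost right" but shifted by two bytes, or with each pair of bytes swapped, a BOM or a byte-order mismatch is the likely cause.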
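Where the 2-byte assumption breaks down is with characters outside the Basic Multilingual Plane: UTF-16 encodes those as a surrogate pair (two 16-bit units), which a pure 2-byte scheme cannot represent. A small Python 3 sketch, using two characters picked purely for illustration:

```python
bmp_char = "\uc720"         # the Korean syllable 유, inside the BMP
astral_char = "\U0001F600"  # an emoji, outside the BMP

# Inside the BMP: one 16-bit unit.
print(bmp_char.encode("utf-16-be").hex())     # c720

# Outside the BMP: a surrogate pair, i.e. two 16-bit units.
print(astral_char.encode("utf-16-be").hex())  # d83dde00

# The 4-byte fixed-width form of the same character.
print(astral_char.encode("utf-32-be").hex())  # 0001f600
```

So a hand-rolled converter that emits exactly four hex digits per character will be wrong for such input.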
Back to BOM:
A BOM isn't *typically* found in UTF-8 strings, nor even a certainty in UTF-8 encoded files, although you will see them frequently on Windows-platform originated files.
Editorial comment: I believe UTF-8 was a neat invention designed by practical people. I'm not sure I can say the same about UTF-16.
As UTF-8 encoding is not dependent on the processor, there is no need for a *byte order* marker. However the terminology has grown to cover "marker" of any sort, in a Unicode sense, and *occasionally* you will run into one with UTF-8 data. If found, it simply is there as a marker to denote the encoding type.
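If you suspect a UTF-8 marker is sneaking into your data, this Python 3 sketch shows what it looks like and one way to drop it. The sample bytes are constructed by hand to imitate a file saved by a Windows editor.

```python
import codecs

# Hand-built sample: the UTF-8 marker followed by plain "abc".
data = codecs.BOM_UTF8 + "abc".encode("utf-8")

print(data.hex())                      # efbbbf616263

# The plain utf-8 codec keeps the marker as a leading U+FEFF character.
print(repr(data.decode("utf-8")))      # '\ufeffabc'

# The utf-8-sig codec strips the marker if it is present.
print(repr(data.decode("utf-8-sig")))  # 'abc'
```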
mb_convert_encoding should be doing all of this detection for you, passing back only the encoded data. But it doesn't seem to be and perhaps this bug is why it may not be:
Another thought: Have you checked the default internal coding of PHP? If it isn't UTF-8 (chances are it still isn't) that could easily be causing you problems with non ASCII data.
Perhaps you should verify that the input data is indeed UTF-8.
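One cheap way to do that check outside PHP is a strict decode in Python 3, sketched below. The helper name is made up for this example, and a successful decode only shows the bytes are well-formed UTF-8, not that the sender intended UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


good = "\u00c7".encode("utf-8")    # "Ç" as UTF-8: two bytes
bad = "\u00c7".encode("latin-1")   # "Ç" as iso-8859-1: a single byte 0xC7

print(is_valid_utf8(good), is_valid_utf8(bad))
```

In practice iso-8859-1 text with accented characters almost never passes this test, which makes it a useful first filter.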
If it isn't, try putting this somewhere early in your code or ideally in a much simpler test code module:
That'll change it from what I bet is the default, iso-8859-1. And if indeed that was the default, you were not likely converting UTF-8 data in the first place. Maybe. Who knows with PHP.
Unicode should not be that hard. In the future virtually all programming languages will have unicode strings as a core. I'm surprised PHP still seems far away from this. For an example of just how easy it should be: in Python 3, every "string" is a Unicode object and every byte string has the ability to be decoded into a Unicode string.
So bottom line: you need to strip away the BOM bytes **unless** the service you are communicating with expects them (and they very well may).
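Putting that together, here is a Python 3 sketch that produces the space-separated hex form from the first post, with the BOM left out unless you ask for it. The function name, the big-endian choice, the uppercase digits and the spacing are all assumptions; check what your gateway actually expects.

```python
import codecs


def utf16_hex(text: str, include_bom: bool = False) -> str:
    """Hex-dump text as big-endian UTF-16 code units (hypothetical helper)."""
    data = text.encode("utf-16-be")       # explicit byte order, so no BOM
    if include_bom:
        data = codecs.BOM_UTF16_BE + data
    # Group the bytes into 16-bit units, four hex digits each.
    units = [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]
    return " ".join(units)


print(utf16_hex("abc"))                    # 0061 0062 0063
print(utf16_hex("abc", include_bom=True))  # FEFF 0061 0062 0063
```

You can feed the same input to your PHP code and compare the two outputs to see exactly where they diverge.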
Python might be helpful to you to debug issues - you can double check answers you get from PHP. I've been showing you Python 3.x in this thread, as it has an even cleaner approach to Unicode than it has had for the last decade. If you can tolerate installing Python 3.x I would recommend it for this purpose alone; if not, your system (if it is Unix) probably already has Python 2.5 or 2.6 installed and both will do but the Unicode handling is different in some subtle ways. I'm happy to provide some quick tips on that if you need it.
If you are running on a Unix machine and do elect to put Python 3.1 on, you can avoid overwriting things by downloading and compiling... easy enough:
Note "altinstall" - this will keep Python 3 from overwriting any links made to your system's default Python in /usr/bin/python or /usr/local/bin/python. Either way, it will never overwrite an installation of a different version - you can always have multiple Pythons installed side by side, e.g. 2.4 / 2.5 / 2.6 / 3.0 / 3.1.
Back to PHP: Have you tried avoiding all web input and simply taking a known Unicode character not in the ASCII character set (use one of the escape methods to create it), converting that to UTF-16 and viewing it in the browser? Force your browser to UTF-16 if you must. Keep your tests really, really simple with as few inputs as possible until you work out what is failing. I do understand how maddening this is... what should be dirt simple is taking hours of your time. Been there, done that.
Last edited by mwatkins; 06-12-2009 at 11:36 AM.