PHP: detecting the possible output encoding for a UTF-8 character

Friday, 4 March 2016

I am trying to convert a PHP string from UTF-8 to a required encoding (ISO-8859-2). The problem is that the UTF-8 string contains characters that do not fit in ISO-8859-2, because they were converted to UTF-8 from Windows-1251 (although they look exactly the same as characters native to ISO-8859-2). Those characters are shown as "?" in the output.

If I convert the same string to Windows-1251, those characters appear correctly, but then the characters native to ISO-8859-2 (such as "ä" and "ö") are the ones that go missing.

I read the strings from a MySQL database and need to convert them to a non-Unicode charset before storing them in an SQLite database file, because the program that will use them does not support Unicode.

So my question is: is there a way to find out which non-Unicode encoding can represent a given UTF-8 character? I am currently iterating over the whole UTF-8 string and trying to convert each character one by one, but the Windows-1251 characters are still missing.

The code looks like this:


$string = "various charset input";

// str_split_unicode() is the user-contributed function from the PHP str_split
// manual page; it splits a UTF-8 string into an array of characters.
$str = str_split_unicode($string, 1);

$handler = "";

foreach ($str as $value):
    // "=" binds tighter than "or", so "$x = iconv(...) or '%no%'" never
    // assigns the fallback; test for false explicitly instead.
    $currentChar = @iconv("utf-8", "iso-8859-2", $value);

    if ($currentChar === false):
        $currentChar = @iconv("utf-8", "windows-1251", $value);
    endif;

    if ($currentChar !== false):
        $handler .= $currentChar;
    else:
        // Neither charset can represent the character; keep the UTF-8 bytes.
        $handler .= $value;
    endif;
endforeach;

$string = $handler;

But the question marks are still there.
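
One PHP detail is worth checking in fallbacks of this kind: `=` binds more tightly than `or`, so `$x = f() or "default"` assigns the result of `f()` and then discards the fallback. `iconv()` returns `false` when it cannot convert its input, so a sentinel string written this way is never stored. A minimal sketch of the difference, using a hypothetical `convert_or_fail()` stand-in rather than a real conversion:

```php
<?php
// Hypothetical stand-in for a conversion that fails, as iconv() does
// when a character has no equivalent in the target charset.
function convert_or_fail()
{
    return false;
}

// Parsed as ($viaOr = convert_or_fail()) or "%no%"; the fallback is discarded.
$viaOr = convert_or_fail() or "%no%";

// An explicit check stores the sentinel as intended.
$result = convert_or_fail();
$viaCheck = ($result === false) ? "%no%" : $result;

var_dump($viaOr);    // bool(false)
var_dump($viaCheck); // string(4) "%no%"
```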

Thanks, CertaiN. I edited the function you provided (it may have become less readable, though) so that it converts each character back to an appropriate encoding.

FUNCTION



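    function utf8_to_multicharset($str, $encoding, $htmSupportedOutput = "iso-8859-15") {
        // Split the UTF-8 string into characters and convert a copy to $encoding.
        $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
        $out = $utf8;
        mb_convert_variables($encoding, 'UTF-8', $out);

        // Accept the fallback charsets as an array or a comma-separated string.
        is_array($htmSupportedOutput) or $htmSupportedOutput = explode(",", $htmSupportedOutput);

        // The table and the quote flags are separate arguments.
        $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);

        foreach ($out as $i => &$char) {
            if ($char === '?' && $utf8[$i] !== '?') {
                // Not representable in $encoding: fall back to an HTML entity.
                $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
            } elseif (isset($table[$char])) {
                $char = $table[$char];
            }

            // Try each fallback charset in turn; an entity that a charset cannot
            // represent is left as it is. ENT_NOQUOTES equals the old null flags.
            foreach ($htmSupportedOutput as $o) {
                $char = html_entity_decode($char, ENT_NOQUOTES, $o);
            }
        }
        unset($char); // break the reference left by the foreach

        return implode('', $out);
    }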
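
The inner loop relies on `html_entity_decode()` leaving an entity untouched when the requested charset has no byte for it. A minimal sketch of that behaviour, assuming the charset names shown are accepted by your PHP build (the variable names are illustrative only):

```php
<?php
// U+0438 (Cyrillic "и") written as a numeric entity.
$entity = '&#1080;';

// ISO-8859-15 has no Cyrillic letters, so the entity should stay as it is.
$latin = html_entity_decode($entity, ENT_NOQUOTES, 'ISO-8859-15');

// Windows-1251 does have the letter, so the entity becomes a single byte.
$cyrillic = html_entity_decode($entity, ENT_NOQUOTES, 'cp1251');

var_dump($latin === $entity);
var_dump(bin2hex($cyrillic));
```

Because each character may end up in a different charset, the resulting string mixes encodings, and whatever reads it has to know which bytes belong to which charset.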

Now it checks a list of specified encodings and converts each character to one that supports it, like this:

PHP usage:

    $string = "vatiöus charset иnput";
    $result = utf8_to_multicharset($string, "iso-8859-2", "cp1252,cp1251,koi8r");
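
To see in advance which of the candidate charsets can hold a given character, one option is a round-trip check: convert the character to the candidate and back, then compare with the original. A minimal sketch, assuming the mbstring extension is loaded; `encodings_for_char()` is a hypothetical helper, not part of PHP or of the function above:

```php
<?php
// Hypothetical helper: return the encodings from $candidates that can
// represent the single UTF-8 character $char without loss.
function encodings_for_char(string $char, array $candidates): array
{
    $supported = [];
    foreach ($candidates as $encoding) {
        // mbstring substitutes a placeholder for unrepresentable characters,
        // so a lossy conversion does not survive the trip back to UTF-8.
        $converted = mb_convert_encoding($char, $encoding, 'UTF-8');
        $roundTrip = mb_convert_encoding($converted, 'UTF-8', $encoding);
        if ($roundTrip === $char) {
            $supported[] = $encoding;
        }
    }
    return $supported;
}

$candidates = ['ISO-8859-2', 'Windows-1251'];
print_r(encodings_for_char("\u{00F6}", $candidates)); // "ö"
print_r(encodings_for_char("\u{0438}", $candidates)); // "и"
```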

Answer

Do you need HTML Entity Encoding for them?

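function utf8_to_escaped_another($str, $encoding) {
    // Split into UTF-8 characters and convert a copy to the target encoding.
    $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $out = $utf8;
    mb_convert_variables($encoding, 'UTF-8', $out);
    // The table and the quote flags are separate arguments.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    foreach ($out as $i => &$char) {
        if ($char === '?' && $utf8[$i] !== '?') {
            // mbstring substituted "?": emit an HTML entity instead.
            $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
        } elseif (isset($table[$char])) {
            $char = $table[$char];
        }
    }
    unset($char); // break the reference left by the foreach
    return implode('', $out);
}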

PHP Source Code


function utf8_to_escaped_another($str, $encoding) {
    $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $out = $utf8;
    mb_convert_variables($encoding, 'UTF-8', $out);
    // The table and the quote flags are separate arguments.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    foreach ($out as $i => &$char) {
        if ($char === '?' && $utf8[$i] !== '?') {
            $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
        } elseif (isset($table[$char])) {
            $char = $table[$char];
        }
    }
    unset($char); // break the reference left by the foreach
    return implode('', $out);
}

header('Content-Type: text/html; charset=ISO-8859-2');

$text = <<<EOD
English: Good Morning
Arabic: صباح الخير
Japanese: おはよう
EOD;

echo '<pre>';
echo utf8_to_escaped_another($text, 'ISO-8859-2');
echo '</pre>';

HTML View

English: Good Morning
Arabic: صباح الخير
Japanese: おはよう

HTML Source Code

English: Good Morning
Arabic: &#1589;&#1576;&#1575;&#1581; &#1575;&#1604;&#1582;&#1610;&#1585;
Japanese: &#12362;&#12399;&#12424;&#12358;
