Friday 4 March 2016

PHP: detecting the possible output encoding for a UTF-8 character



I am trying to convert a PHP string from UTF-8 to a required encoding (ISO-8859-2). The problem is that the UTF-8 string contains characters that do not fit in ISO-8859-2, because they were converted to UTF-8 from windows-1251 (although they look exactly as if they were native to ISO-8859-2). Those characters come out as "?" in the output.



If I convert the same string to windows-1251 instead, those characters appear correctly, but then the characters native to ISO-8859-2 (like "ä", "ö", etc.) are the ones that go missing.




I get the strings from a MySQL database and need to convert them to a non-Unicode charset and store them in an SQLite database file, because the program that will use them does not support Unicode.



So, my question is: is there a way to find out which non-Unicode encoding can represent a given UTF-8 character? I am currently iterating over the whole UTF-8 string and trying to convert each character one by one, but the windows-1251 characters are still missing.



The code looks like this:




$string = "various charset input";

// str_split_unicode() is the function from the PHP str_split manual page;
// it splits a UTF-8 string into an array of characters.
$str = str_split_unicode($string, 1);

$handler = "";

foreach ($str as $value):
    // iconv() returns false when a character cannot be converted.
    $currentChar = @iconv("utf-8", "iso-8859-2", $value);

    if ($currentChar === false):
        // Not representable in ISO-8859-2: try windows-1251 instead.
        $currentChar = @iconv("utf-8", "windows-1251", $value);
    endif;

    if ($currentChar !== false):
        $handler .= $currentChar;
    else:
        // Neither encoding worked: keep the original UTF-8 bytes.
        $handler .= $value;
    endif;
endforeach;

$string = $handler;


But the question marks are still there.








Thanks CertaiN, I edited the function you provided (it may have become less readable though) so it converts the character back to an appropriate encoding.



FUNCTION





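function utf8_to_multicharset($str, $encoding, $htmSupportedOutput = "iso-8859-15") {
    // Split the UTF-8 string into an array of single characters.
    $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $out = $utf8;
    // Convert every character to the target encoding; unsupported ones become "?".
    mb_convert_variables($encoding, 'UTF-8', $out);

    // Accept either an array or a comma-separated list of fallback charsets.
    is_array($htmSupportedOutput) or $htmSupportedOutput = explode(",", $htmSupportedOutput);

    // The table type and the quote flags are separate arguments.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);

    foreach ($out as $i => &$char) {
        if ($char === '?' && $utf8[$i] !== '?') {
            // The character was lost in the conversion: fall back to an HTML entity.
            $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
        } elseif (isset($table[$char])) {
            $char = $table[$char];
        }

        // Decode the entity into a fallback charset that can represent it.
        foreach ($htmSupportedOutput as $o):
            // An explicit flags value is required; null is deprecated since PHP 8.1.
            $char = html_entity_decode($char, ENT_NOQUOTES, $o);
        endforeach;
    }
    unset($char); // break the reference left over from the foreach

    return implode('', $out);
}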


It now goes through a list of specified encodings and converts each character that the target encoding cannot hold to an encoding from the list that supports it, like this:





PHP usage:





$string = "vatiöus charset иnput";
$result = utf8_to_multicharset($string,"iso-8859-2","cp1252,cp1251,koi8r");
?>

Answer



Do you need HTML Entity Encoding for them?






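function utf8_to_escaped_another($str, $encoding) {
    // Split the UTF-8 string into an array of single characters.
    $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $out = $utf8;
    // Convert every character to the target encoding; unsupported ones become "?".
    mb_convert_variables($encoding, 'UTF-8', $out);
    // The table type and the quote flags are separate arguments.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    foreach ($out as $i => &$char) {
        if ($char === '?' && $utf8[$i] !== '?') {
            // The character was lost in the conversion: emit an HTML entity instead.
            $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
        } elseif (isset($table[$char])) {
            // Escape HTML special characters such as & < > " '.
            $char = $table[$char];
        }
    }
    unset($char); // break the reference left over from the foreach
    return implode('', $out);
}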




PHP Source Code





// utf8_to_escaped_another() is the function defined above.

header('Content-Type: text/html; charset=ISO-8859-2');

$text = <<<EOD
English: Good Morning
Arabic: صباح الخير
Japanese: おはよう
EOD;

echo '
';
echo utf8_to_escaped_another($text, 'ISO-8859-2');
echo '
';



HTML View



English: Good Morning
Arabic: صباح الخير
Japanese: おはよう


HTML Source Code



English: Good Morning
Arabic: &#1589;&#1576;&#1575;&#1581; &#1575;&#1604;&#1582;&#1610;&#1585;
Japanese: &#12362;&#12399;&#12424;&#12358;
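For the question in the title, one possible approach is to test each candidate encoding with a round trip: convert the character to that encoding, convert it back, and check that it survived. The following is only a sketch, not part of the original answer. The helper name `encodings_supporting_char` is hypothetical, and it assumes the mbstring extension is loaded with its default substitute character.

```php
<?php
// Hypothetical helper (not from the original post): returns the candidate
// encodings that can represent a single UTF-8 character.
function encodings_supporting_char(string $char, array $candidates): array
{
    $supported = [];
    foreach ($candidates as $encoding) {
        // Convert to the candidate encoding and back again. mbstring replaces
        // unrepresentable characters with a substitute (by default "?"), so
        // the round trip only reproduces the input when nothing was lost.
        $encoded = mb_convert_encoding($char, $encoding, 'UTF-8');
        $decoded = mb_convert_encoding($encoded, 'UTF-8', $encoding);
        if ($decoded === $char) {
            $supported[] = $encoding;
        }
    }
    return $supported;
}

$candidates = ['ISO-8859-2', 'Windows-1251'];

// "\u{F6}" is "ö" and "\u{438}" is "и" (the escape syntax needs PHP 7+).
$chars = preg_split('//u', "\u{F6}\u{438}", -1, PREG_SPLIT_NO_EMPTY);
foreach ($chars as $char) {
    echo $char, ': ', implode(', ', encodings_supporting_char($char, $candidates)), "\n";
}
```

Comparing the round-trip result, rather than looking for "?" in the output, also handles a literal question mark in the input correctly. Keep in mind that a program reading a single-byte string interprets all of its bytes in one encoding, so knowing which encoding fits each character does not by itself make a mixed string displayable.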

