I am trying to convert a PHP string from UTF-8 to a required encoding (ISO-8859-2). The problem is that the UTF-8 string contains characters that do not exist in ISO-8859-2; they were converted to UTF-8 from windows-1251 (although they look exactly like characters that are native to ISO-8859-2). Those characters appear as "?" in the output.
If I convert the same string to windows-1251 instead, those characters come through, but then the characters native to ISO-8859-2 (like "ä", "ö", etc.) are the ones that go missing.
I read the strings from a MySQL database and need to convert them to a non-Unicode charset before storing them in an SQLite database file, because the program that will use them does not support Unicode.
So my question is: is there a way to find out which non-Unicode encoding can represent a given UTF-8 character? I am currently iterating over the whole UTF-8 string and trying to convert each character one by one, but the windows-1251 characters are still missing.
The code looks like this:
$string = "various charset input";
$str = str_split_unicode($string, 1); // The function from the PHP str_split manual page; splits a UTF-8 string into an array
$handler = "";
foreach ($str as $value):
    $currentChar = iconv("utf-8", "iso-8859-2", $value) or "%no%";
    if ($currentChar == "%no%"):
        $currentChar = "";
        $currentChar = iconv("utf-8", "windows-1251", $value) or "%no%";
    endif;
    if ($currentChar != "%no%"):
        $handler .= $currentChar;
    else:
        $handler .= $value;
    endif;
endforeach;
$string = $handler;
But the question marks are still there.
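One way to approach the per-character question is a round-trip test: convert the character to a candidate encoding and back, and check whether it survives unchanged. The sketch below assumes the mbstring extension is available; `first_fitting_encoding` is a hypothetical helper name, not something from the code above. It also sidesteps a pitfall in the loop above: `=` binds tighter than `or`, so `$currentChar = iconv(...) or "%no%"` stores the return value of iconv() (`false` on failure) and never the string "%no%".

```php
<?php
// Hypothetical helper (not from the original post): returns the first
// encoding in $candidates that can represent the UTF-8 character $char,
// or null if none of them can.
function first_fitting_encoding(string $char, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        // Unconvertible characters are replaced by mbstring's substitute
        // character, so a lossy conversion will not survive the trip back.
        $converted = mb_convert_encoding($char, $encoding, 'UTF-8');
        if (mb_convert_encoding($converted, 'UTF-8', $encoding) === $char) {
            return $encoding;
        }
    }
    return null;
}

$candidates = ['ISO-8859-2', 'Windows-1251'];
echo first_fitting_encoding("\u{00F6}", $candidates), "\n"; // "ö"
echo first_fitting_encoding("\u{0438}", $candidates), "\n"; // Cyrillic "и"
```

Keep in mind that a string assembled from characters in different single-byte encodings is only meaningful if the consuming program knows which code page applies to which part of the text.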
Thanks CertaiN, I edited the function you provided (it may have become less readable, though) so that it converts each escaped character back into an encoding that supports it.
FUNCTION
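function utf8_to_multicharset($str, $encoding, $htmSupportedOutput = "iso-8859-15") {
    // Split the UTF-8 string into an array of single characters.
    $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $out = $utf8;
    // Convert every element in place; unconvertible characters become "?".
    mb_convert_variables($encoding, 'UTF-8', $out);
    is_array($htmSupportedOutput) or $htmSupportedOutput = explode(",", $htmSupportedOutput);
    // The table and the flags are separate arguments, not one OR-ed value.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    foreach ($out as $i => &$char) {
        if ($char === '?' && $utf8[$i] !== '?') {
            // Not representable in $encoding: escape it as an HTML entity for now.
            $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
        } elseif (isset($table[$char])) {
            $char = $table[$char];
        }
        // Try each fallback charset; the first one that contains the character decodes the entity.
        foreach ($htmSupportedOutput as $o) {
            // ENT_NOQUOTES is what the original null flags evaluated to.
            $char = html_entity_decode($char, ENT_NOQUOTES, $o);
        }
    }
    unset($char); // break the reference left over from the foreach
    return implode('', $out);
}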
Now it takes a list of fallback encodings and converts each character that does not fit the main encoding into the first listed encoding that supports it, like this:
PHP usage:
$string = "vatiöus charset иnput";
$result = utf8_to_multicharset($string, "iso-8859-2", "cp1252,cp1251,koi8r");
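The step that does the real work in this function is `html_entity_decode()` with an explicit target charset. As a small sketch of that behaviour (the variable names here are only illustrative): a numeric entity is turned into a single byte when the charset contains the character, and it should be left untouched when it does not.

```php
<?php
// U+0438 (Cyrillic "и") written as a decimal numeric entity.
$entity = '&#1080;';

// cp1251 contains the character, so the entity becomes one byte.
$cyrillic = html_entity_decode($entity, ENT_QUOTES, 'cp1251');
echo bin2hex($cyrillic), "\n";

// ISO-8859-15 does not contain it, so the entity stays as it is.
$latin = html_entity_decode($entity, ENT_QUOTES, 'ISO-8859-15');
echo $latin, "\n";
```

Because the fallback charsets are tried in order, a character that exists in more than one of them ends up in whichever charset is listed first.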
Answer
Do you need HTML Entity Encoding for them?
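function utf8_to_escaped_another($str, $encoding) {
    // Split the UTF-8 string into an array of single characters.
    $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $out = $utf8;
    // Convert every element in place; unconvertible characters become "?".
    mb_convert_variables($encoding, 'UTF-8', $out);
    // The table and the flags are separate arguments, not one OR-ed value.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    foreach ($out as $i => &$char) {
        if ($char === '?' && $utf8[$i] !== '?') {
            // A "?" that was not in the input marks a lost character: emit an entity instead.
            $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
        } elseif (isset($table[$char])) {
            // Escape HTML special characters such as <, >, & and quotes.
            $char = $table[$char];
        }
    }
    unset($char); // break the reference left over from the foreach
    return implode('', $out);
}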
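If hexadecimal character references are acceptable, mbstring may be able to do the substitution on its own: `mb_substitute_character('entity')` asks the converter to write a character reference instead of "?". This is a different technique from the function above and is shown only as a sketch; `utf8_to_entity_fallback` is a hypothetical name, and unlike the function above it does not escape `<`, `>`, `&` or quotes.

```php
<?php
// Hypothetical helper: converts UTF-8 text to $encoding and lets mbstring
// replace unconvertible characters with hexadecimal character references.
function utf8_to_entity_fallback(string $str, string $encoding): string
{
    $previous = mb_substitute_character(); // remember the current setting
    mb_substitute_character('entity');
    try {
        return mb_convert_encoding($str, $encoding, 'UTF-8');
    } finally {
        mb_substitute_character($previous); // restore it for other callers
    }
}

// "ö" fits ISO-8859-2; Cyrillic "и" does not and is written as a reference.
echo utf8_to_entity_fallback("vari\u{00F6}us \u{0438}nput", 'ISO-8859-2'), "\n";
```

The per-character loop is still the safer choice when the input may already contain literal "&#x...;" sequences or HTML special characters that need escaping.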
PHP Source Code
header('Content-Type: text/html; charset=ISO-8859-2');
$text = <<<EOD
English: Good Morning
Arabic: صباح الخير
Japanese: おはよう
EOD;
echo '<pre>';
echo utf8_to_escaped_another($text, 'ISO-8859-2');
echo '</pre>';
HTML View
English: Good Morning
Arabic: صباح الخير
Japanese: おはよう
HTML Source Code
English: Good Morning
Arabic: &#1589;&#1576;&#1575;&#1581; &#1575;&#1604;&#1582;&#1610;&#1585;
Japanese: &#12362;&#12399;&#12424;&#12358;