Saturday, 3 September 2016

php - Detect encoding and make everything UTF-8



I'm reading out lots of texts from various RSS feeds and inserting them into my database.



Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.




Unfortunately, there are sometimes problems with the encodings of the texts. Example:




  1. The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.


  2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃŸ". Then it is displayed wrongly, of course.


  3. In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.




What can I do to avoid cases 2 and 3?




How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?



Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:




  1. How do I find out what encoding the text uses?

  2. How do I convert it to UTF-8 - whatever the old encoding is?



Would a function like this work?




function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}


I've tested it, but it doesn't work. What's wrong with it?


Answer




If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
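
To see what that garbling looks like at the byte level, here is a minimal sketch. It assumes the mbstring extension is available and uses mb_convert_encoding() from ISO-8859-1, which performs the same conversion as utf8_encode() (deprecated as of PHP 8.2). The sample string is made up for illustration.

```php
<?php
// "Fußball" already encoded as UTF-8: "ß" is the two bytes 0xC3 0x9F.
$utf8 = "Fu\xC3\x9Fball";

// Same conversion utf8_encode() performs: every input byte is treated as
// one ISO-8859-1 character and re-encoded as UTF-8.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // 4675c39f62616c6c
echo bin2hex($double), "\n"; // 4675c383c29f62616c6c

// Converting back once undoes the damage.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8); // bool(true)
```

The two bytes of "ß" have become four, which is why the character shows up as two odd characters instead of one.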



I made a function that addresses all these issues. It's called Encoding::toUTF8().



You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
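
If you only want the basic idea without the library, a rough sketch is to leave valid UTF-8 alone and convert everything else from an assumed legacy encoding. This is not how Encoding::toUTF8() is implemented, and unlike it, this sketch does not handle strings that mix encodings. The function name to_utf8_if_needed() and the ISO-8859-1 default are assumptions for illustration; mbstring is required.

```php
<?php
/**
 * Hypothetical helper: return $text unchanged if it is already valid UTF-8,
 * otherwise convert it from $fallback (assumed to be the legacy encoding).
 */
function to_utf8_if_needed(string $text, string $fallback = 'ISO-8859-1'): string
{
    // Valid UTF-8 (including plain ASCII) is passed through untouched,
    // which avoids the double-encoding problem.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Anything else is assumed to be in the fallback single-byte encoding.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$latin1 = "Fu\xDFball";      // "ß" as the single ISO-8859-1 byte 0xDF
$utf8   = "Fu\xC3\x9Fball";  // "ß" as UTF-8

var_dump(to_utf8_if_needed($latin1) === $utf8); // bool(true)
var_dump(to_utf8_if_needed($utf8) === $utf8);   // bool(true)
```

The check-then-convert order matters: Latin-1 text with accented characters is almost never valid UTF-8 by accident, so testing for valid UTF-8 first is usually more reliable than trying to guess the encoding.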



I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.



Usage:




require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);


Download:




https://github.com/neitanod/forceutf8



I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.



Usage:



require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);



Examples:



echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÂÂÂ©dÃÂÂÂÂ©ration Camerounaise de Football");



will output:



Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
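
For the simplest case, where a whole string was UTF-8 encoded more than once, the repair can be sketched as repeatedly undoing one layer while the result is still valid UTF-8. This is only an illustration of the idea, not the library's code: the function name undo_double_encoding() is made up, it assumes the extra layers went through ISO-8859-1 (real feeds often go through Windows-1252 instead), and it requires mbstring.

```php
<?php
/**
 * Hypothetical helper: strip repeated "treated as ISO-8859-1, re-encoded as
 * UTF-8" layers from a string, up to $maxRounds times.
 */
function undo_double_encoding(string $text, int $maxRounds = 5): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Pure ASCII or invalid UTF-8 cannot be peeled any further.
        if (!preg_match('/[\x80-\xFF]/', $text) || !mb_check_encoding($text, 'UTF-8')) {
            break;
        }
        $decoded = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
        // Accept the step only if the result is still valid UTF-8 and the
        // conversion was lossless (round trip gives back the same bytes).
        if (!mb_check_encoding($decoded, 'UTF-8')
            || mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') !== $text) {
            break;
        }
        $text = $decoded;
    }
    return $text;
}

$clean = "F\xC3\xA9d\xC3\xA9ration"; // "Fédération" as UTF-8

// Simulate a feed that encoded the text three extra times.
$garbled = $clean;
for ($i = 0; $i < 3; $i++) {
    $garbled = mb_convert_encoding($garbled, 'UTF-8', 'ISO-8859-1');
}

var_dump(undo_double_encoding($garbled) === $clean); // bool(true)
```

A text that legitimately contains a sequence like "Ã©" would be "repaired" too, so a heuristic like this is best applied only to input you already know to be garbled.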


I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

