Monday, 31 October 2016

fopen - PHP feof() returning true before the end of file



I have been working on a strange PHP problem the last few days where the feof() function is returning true before the end of a file. Below is a skeleton of my code:



$this->fh = fopen("bigfile.txt", "r");

while(!feof($this->fh))
{
    $dataString = fgets($this->fh);

    if($dataString === false && !feof($this->fh))
    {
        echo "Error reading file besides EOF";
    }
    elseif($dataString === false && feof($this->fh))
    {
        echo "We are at the end of the file.\n";

        //check status of the stream
        $meta = stream_get_meta_data($this->fh);
        var_dump($meta);
    }
    else
    {
        //else all is good, process line read in
    }
}


Through lots of testing I have found that the program works fine on everything except one file:




  • The file is stored on the local drive.

  • This file is around 8 million lines long averaging somewhere around 200-500 characters per line.

  • It has already been cleaned and under close examination with a hex editor, no abnormal characters have been found.

  • The program consistently fails on line 7172714 when it believes it has reached the end of the file (even though it has ~800K lines left).

  • I have tested the program on files that had fewer characters per line but were between 20-30 million lines with no problems.

  • I tried running the code from a comment on http://php.net/manual/en/function.fgets.php just to see if it was something in my code that was causing the issue and the 3rd party code failed on the same line. EDIT: also worth mentioning is that the 3rd party code used fread() instead of fgets().

  • I tried specifying several buffer sizes in the fgets function and none of them made any difference.



The output from the var_dump($meta) is as follows:



array(9) {
  ["wrapper_type"]=>
  string(9) "plainfile"
  ["stream_type"]=>
  string(5) "STDIO"
  ["mode"]=>
  string(1) "r"
  ["unread_bytes"]=>
  int(0)
  ["seekable"]=>
  bool(true)
  ["uri"]=>
  string(65) "full path of file being read"
  ["timed_out"]=>
  bool(false)
  ["blocked"]=>
  bool(true)
  ["eof"]=>
  bool(true)
}


In attempting to find out what is causing feof to return true before the end of the file I have to guess that either:



A) Something is causing the fopen stream to fail and then nothing is able to be read in (causing feof to return true)



B) There is some buffer somewhere that is filling up and causing havoc



C) The PHP gods are angry



I have searched far and wide to see if anyone else was having this issue and cannot find any instances except in C++ where the file was being read in via text mode instead of binary mode and was causing the issue.



UPDATE:
I had my script output a running count of how many times the read function had iterated, with the unique ID of the user for each entry beside it. The script is still failing after line 7172713 out of 7175502, but the unique ID of the last user in the file is showing up on line 7172713. So it seems that, for some reason, lines are being skipped rather than read. All line breaks are present.


Answer



fgets() is, seemingly at random, reading some lines that do have content as empty. The script actually does make it to the end of the file; my test that showed the line numbers being read lagged behind because of the way I did the error checking (and the way the error checking was written in the 3rd party code). Now the real question is what is causing fgets() and fread() to think that a line is empty even though it is not. I will ask that as a separate question, as it is a change in topic. Thank you all for your help!
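For anyone wanting to check the same thing on their own file, here is a minimal sketch (not the script I actually ran) that counts every line fgets() returns and records which ones come back blank. The function name countLines() is made up for illustration, and the loop is driven by the return value of fgets() instead of feof().

```php
<?php
// Hypothetical helper: counts every line fgets() returns and records
// the 1-based numbers of lines that hold nothing but a line break.
function countLines(string $path): array
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $total = 0;
    $blank = [];

    // Loop on the return value of fgets() rather than on feof()
    while (($line = fgets($fh)) !== false) {
        $total++;
        if (rtrim($line, "\r\n") === '') {
            $blank[] = $total;
        }
    }

    // true if the loop stopped because the end of the file was reached
    $reachedEof = feof($fh);
    fclose($fh);

    return ['total' => $total, 'blank' => $blank, 'eof' => $reachedEof];
}
```

Calling countLines("bigfile.txt") and comparing 'total' with the expected number of records shows whether the reader really stopped early or whether blank lines threw the count off.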



Also, just so no one is left hanging: the 3rd party code did not work because it relies on every line having at least a line break. When fgets and fread return an empty string, the script has nothing to tell it that the line ever existed, so it keeps trying to execute past the end of the file. Below is the slightly modified 3rd party script, which I still consider excellent based on its execution speed.



The original script can be found in the comments here: http://php.net/manual/en/function.fgets.php and I take absolutely no credit for it.




//File to be opened
$file = "/path/to/file.ext";
//Open file (DON'T USE a+ pointer will be wrong!)
$fp = fopen($file, 'r');
//Read 16meg chunks
$read = 16777216;
//\n Marker
$part = 0;

while(!feof($fp))
{
    $rbuf = fread($fp, $read);
    //Walk backwards from the end of the chunk to the last \n
    for($i=$read;$i > 0 || $n == chr(10);$i--)
    {
        $n=substr($rbuf, $i, 1);
        if($n == chr(10))break;
        //If we are at the end of the file, just grab the rest and stop loop
        elseif(feof($fp))
        {
            $i = $read;
            $buf = substr($rbuf, 0, $i+1);
            echo "\n";
            break;
        }
    }
    //This is the buffer we want to do stuff with, maybe throw to a function?
    $buf = substr($rbuf, 0, $i+1);

    //output the chunk we just read and mark where it stopped with two line breaks
    echo $buf . "\n\n";

    //Point marker back to last \n point
    $part = ftell($fp)-($read-($i+1));
    fseek($fp, $part);
}
fclose($fp);


UPDATE: After hours more of searching, analyzing, hair pulling, etc., it seems that the culprit was an uncaught bad character, in this case a ½ character (hex value BD). The script that generated the file I was reading used stream_get_line() to read each line in from its original source. It was then supposed to remove all bad characters (it appears that my regex was not up to par), use str_getcsv() to convert the content to an array, do some processing, and write to a new file (the one I was trying to read). Somewhere in this process, probably in str_getcsv(), the ½ character caused a blank line to be written instead of the data. Several thousand of these were scattered throughout the file (wherever the ½ symbol appeared). This made the file appear to be the correct length, but EOF was reached too quickly when counting input against a known number of lines.

I want to thank everyone who helped me with this problem, and I am very sorry that the real cause had nothing to do with my question. However, if it hadn't been for everyone's suggestions and questions I would not have looked in the right places.
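A stray byte like BD is easy to miss by eye, so a small scan of the original source can help. The following is only a sketch under the assumption that the file is supposed to contain plain ASCII; it would also flag legitimate multi-byte UTF-8 text. The function name findSuspectBytes() is hypothetical.

```php
<?php
// Hypothetical helper: returns [line number => list of hex byte values]
// for every line that contains a byte other than tab, CR, LF or
// printable ASCII. Assumes the file is meant to be plain ASCII.
function findSuspectBytes(string $path): array
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $suspect = [];
    $lineNo = 0;

    while (($line = fgets($fh)) !== false) {
        $lineNo++;
        // No /u modifier, so the pattern is matched byte by byte
        if (preg_match_all('/[^\x09\x0A\x0D\x20-\x7E]/', $line, $m) > 0) {
            $suspect[$lineNo] = array_map('bin2hex', $m[0]);
        }
    }

    fclose($fh);
    return $suspect;
}
```

Running this over the source before the str_getcsv() step would list the lines that need cleaning, with the offending bytes shown as hex (a BD byte shows up as "bd").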



Lesson learned from this experience: when EOF is reached too quickly, the best place to look is for instances of double line breaks. When writing a script that reads from a formatted file, a good practice is to check for these. Below is my original code modified to do just that:



$this->fh = fopen("bigfile.txt", "r");

while(!feof($this->fh))
{
    $dataString = fgets($this->fh);

    //strict comparison: fgets() returns false at EOF, and false == "" is true
    if($dataString === "\n" || $dataString === "\r\n" || $dataString === "")
    {
        throw new Exception("Empty line found.");
    }

    if($dataString === false && !feof($this->fh))
    {
        echo "Error reading file besides EOF";
    }
    elseif($dataString === false && feof($this->fh))
    {
        echo "We are at the end of the file.\n";

        //check status of the stream
        $meta = stream_get_meta_data($this->fh);
        var_dump($meta);
    }
    else
    {
        //else all is good, process line read in
    }
}
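One detail worth keeping in mind with a check like this: fgets() returns false at the end of the file, and in PHP a loose comparison of false with "" is true, while a strict comparison (===) tells the two apart. A minimal sketch of a blank-line test that treats false as "not a line" could look like this; isBlankLine() is a made-up name, not part of PHP.

```php
<?php
// Hypothetical helper: true only for a line that was actually read but
// holds nothing except its line break. A false return from fgets()
// (EOF or read error) is deliberately not counted as a blank line.
function isBlankLine($line): bool
{
    return $line === "\n" || $line === "\r\n" || $line === "";
}
```

With a helper like this, the empty-line exception is only thrown for lines that exist in the file, and the EOF branch is still reached normally.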
