Saturday 23 April 2016

Parsing info from a table without headers, using PHP, DOM and cURL



I need to parse data from a table that I scrape from a different website using PHP.
The table looks like this:
This table is generated by JavaScript.
The first tr holds the td cells that contain the headers, while the remaining rows hold the data I need to parse.
I've been struggling with this for a while. I found an answer on this website that helped a little, but it reads the table using the td and th ids, while my table has no ids on its rows or cells.
I'm using cURL to fetch the table HTML from another website and load it into DOM like this:




include_once('/simple_dom/simple_html_dom.php');
//step1
$cSession = curl_init();
//step2
$tmpfname = dirname(__FILE__).'/cookie.txt';
curl_setopt($cSession, CURLOPT_COOKIEJAR, $tmpfname);
curl_setopt($cSession, CURLOPT_COOKIEFILE, $tmpfname);
curl_setopt($cSession,CURLOPT_URL,"http://anonymusurlbecauseofprivacyreasons?somegetters");
curl_setopt($cSession,CURLOPT_RETURNTRANSFER,true);

curl_setopt($cSession, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($cSession,CURLOPT_HEADER, false);
curl_setopt ($cSession, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($cSession, CURLOPT_CAINFO, dirname(__FILE__)."/cacert.pem");
curl_setopt($cSession,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

$result=curl_exec($cSession);
if ($result === FALSE) {
    // Report the error on the session handle that was actually used
    echo "cURL Error: " . curl_error($cSession);
}
curl_close($cSession);
// load the fetched HTML into a DOM document (@ suppresses parser warnings)
$dom = new DomDocument;
@$dom->loadHtml($result);
$xpath = new DomXPath($dom);
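
The loading step can also be written without the @ operator. The following is only a sketch: the $html string is a made-up stand-in for $result, and it uses `//tr` rather than `/tr` so that rows wrapped in a tbody would match as well.

```php
<?php
// Hypothetical stand-in for the string returned by curl_exec().
$html = '<table id="IWGRD"><tr><td>Dag</td><td>Datum</td></tr>'
      . '<tr><td>Di</td><td>12-11-2013</td></tr></table>';

// Buffer libxml warnings instead of silencing them with @.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Keep whatever the parser complained about, then restore the old setting.
$warnings = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// "//tr" also matches rows that sit inside a <tbody>.
$rowCount = $xpath->query('//table[@id="IWGRD"]//tr')->length;

echo "Loaded: " . ($loaded ? 'yes' : 'no') . ", rows found: $rowCount\n";
```

The buffered $warnings array can be inspected or logged when the scraped markup turns out to be badly broken.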


So far, so good.

But now comes the part I can't figure out.
To read out the data, I copied and adapted the code from this thread (How to parse this table and extract data from it?), but I can't get it to work.



// collect data
foreach ($xpath->query('//table[@id="IWGRD"]/tr') as $node) {
    $rowData = array();
    foreach ($xpath->query('td', $node) as $cell) {
        $rowcleaned = str_replace("\xc2\xa0","", $cell->textContent);
        $rowData[] = $rowcleaned;
    }
}
print_r($rowData);


This gives me the following output:
Array ( [0] => [1] => [2] => 7 - 8 [3] => S0.20 [4] => SPHdeBruin [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => Bewegingsagogiek )



That is the correct output for the last row, but I need all the rows.
So the output I need is every row (the only ones I don't need are the top rows).
Something like:
array[1] = ([0] => Mon [1] => 11-11-2013 [2] => 7 - 8 [3] => S0.20 [4] => SPHdeBruin [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => Bewegingsagogiek)
Array[2] = ([0] => Mon [1] => 11-11-2013 [2] => 8 - 9 [3] => S0.20 [4] => name [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => randomresult)
That way I can put the info in variables and pass it on to an app.
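
As a rough illustration of that last step, one row can be unpacked into named variables. This is a sketch only: the variable names are hypothetical translations of the Dutch column headers, JSON is an assumed hand-off format for the app, and the short list syntax needs PHP 7.1 or later.

```php
<?php
// One row in the shape asked for above (values copied from the question).
$row = ['Mon', '11-11-2013', '7 - 8', 'S0.20', 'SPHdeBruin',
        'SWSP17KBOOV13', 'MAV1SP09,MAV1SP10', 'Bewegingsagogiek'];

// Hypothetical names following the column order:
// Dag, Datum, Lesuur, Lokaal, Docent(en), Vak, Groep(en), Toelichting.
[$day, $date, $lessonHours, $room, $teachers, $subject, $groups, $note] = $row;

// Assumption: the app accepts JSON; the keys are illustrative, not prescribed.
$payload = json_encode([
    'day'         => $day,
    'date'        => $date,
    'lessonHours' => $lessonHours,
    'room'        => $room,
    'teachers'    => $teachers,
    'subject'     => $subject,
    'groups'      => $groups,
    'note'        => $note,
]);

echo $payload, "\n";
```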



Does anyone know how to do this? I've been working on this for hours because I have no experience with cURL or DOM whatsoever.
Any help is much appreciated! :)


Answer



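It seems like you're not collecting every row as you go along: $rowData is reset at the start of each outer iteration and only printed after the loop has finished, so only the last row survives. Append each row to an outer array instead: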



$tableData = array();

foreach ($xpath->query('//table[@id="IWGRD"]/tr') as $node) {
    $rowData = array();
    foreach ($xpath->query('td', $node) as $cell) {
        $rowcleaned = str_replace("\xc2\xa0","", $cell->textContent);
        $rowData[] = $rowcleaned;
    }
    $tableData[] = $rowData;
}

print_r($tableData);
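
Since the first tr holds the header cells, the same loop can skip that row and key each remaining row by header name. This is a minimal sketch, assuming the first row always contains the headers; the $html sample and the cellTexts() helper are made up for illustration and are not part of the original code.

```php
<?php
// Made-up sample mirroring the question's table: no ids on rows or cells.
$html = '<table id="IWGRD">'
      . '<tr><td>Dag</td><td>Datum</td><td>Lesuur</td></tr>'
      . '<tr><td>&nbsp;Di&nbsp;</td><td>12-11-2013</td><td>&nbsp;5 - 6&nbsp;</td></tr>'
      . '<tr><td>Di</td><td>12-11-2013</td><td>7 - 8</td></tr>'
      . '</table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Hypothetical helper: cleaned text of every td in one row.
function cellTexts(DOMXPath $xpath, DOMNode $row): array
{
    $texts = [];
    foreach ($xpath->query('td', $row) as $cell) {
        // Strip UTF-8 non-breaking spaces, then surrounding whitespace.
        $texts[] = trim(str_replace("\xc2\xa0", '', $cell->textContent));
    }
    return $texts;
}

$rows = $xpath->query('//table[@id="IWGRD"]//tr');
$tableData = [];

if ($rows->length > 0) {
    // Row 0 is assumed to be the header row.
    $headers = cellTexts($xpath, $rows->item(0));

    for ($i = 1; $i < $rows->length; $i++) {
        $cells = cellTexts($xpath, $rows->item($i));
        // Only combine rows that have one cell per header.
        if (count($cells) === count($headers)) {
            $tableData[] = array_combine($headers, $cells);
        }
    }
}

print_r($tableData);
```

Each entry can then be read by column name, for example $tableData[0]['Datum'], instead of by numeric index.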


Table from the question (header row and one sample data row):

Dag | Datum      | Lesuur | Lokaal | Docent(en) | Vak                         | Groep(en) | Toelichting
Di  | 12-11-2013 | 5 - 6  | B2.33  | LKH02      | SWSP14SLB1V13_SWSP15PRA1V13 | MAV1SP10  | SLB major 1 / praktijkleren
