Monday, 12 September 2016

Closing mongodb connection in node.js after inserting many documents



This question has been asked before, but the answer the OP accepted did not address my particular needs.



closing mongodb connection in node.js while inserting lot of data



I have a utility script that adds a lot of records to multiple collections. Really it is just an importer that uses byline to read the VERY LARGE text files and then inserts the data into a collection:




var MongoClient = require("mongodb").MongoClient;
var fs = require("fs");
var byline = require("byline");

var inStream = fs.createReadStream("data.txt", { encoding: "utf8" });
var byLineStream = byline.createStream(inStream);

MongoClient.connect("mongodb://localhost:27017/test", { native_parser: true }, function(err, db) {
    var collection = db.collection("Data");

    db.dropCollection("Data", function(err, result) {
        byLineStream.on("data", function(line) {
            var o = parseLineToObject(line);
            collection.insert(o);
        });
    });
});


The suggested answer was to push all the data into an array, perform a single insert, and close the database in its callback. That is not a good answer for me, as the files I am working with are very large and buffering them whole would consume large amounts of memory.




Another solution, presented for a similar question, was to use the async package to build an array of functions and then run them in parallel. Another bust, but at least it doesn't create one huge single insert.
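For reference, here is roughly what that parallel approach looks like. This is a runnable sketch with a tiny stand-in for async.parallel and a stubbed collection (so it needs no MongoDB or extra packages); the real suggestion used the async package against a real driver collection. It also shows why it's a bust for large files: every task must be held in memory at once.

```javascript
// Minimal stand-in for async.parallel: run all tasks, call finalCb
// once every task has called its completion callback.
function runParallel(tasks, finalCb) {
    var remaining = tasks.length;
    tasks.forEach(function(task) {
        task(function() {
            if (--remaining === 0) finalCb();
        });
    });
}

// Stubbed "collection": its insert calls back asynchronously,
// like the MongoDB driver would.
var inserted = [];
var collection = {
    insert: function(doc, cb) {
        setImmediate(function() {
            inserted.push(doc);
            cb(null);
        });
    }
};

// One task per line -- the whole task array lives in memory.
var lines = ["a", "b", "c"];
var tasks = lines.map(function(line) {
    return function(done) {
        collection.insert({ value: line }, done);
    };
});

runParallel(tasks, function() {
    // with the real driver, this is where db.close() would go
    console.log("all inserted:", inserted.length); // → all inserted: 3
});
```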



So the question: how do I close the MongoDB connection once all the inserts have completed, so that my script exits instead of hanging?



I should add that I have tried the counting method, where I increment a counter in the "data" handler and decrement it in each insert callback. It doesn't work because at some point the insert callbacks fire faster than new lines arrive, causing the counter to hit 0 while lines are still being read, and thus closing the db too early.


Answer



You should set a flag when all lines have been read:



var readAllLines = false;


byLineStream.on("end", function() {
readAllLines = true;
});


Next, you check that flag after inserting each record. However, you also need to keep track of how many lines have been read and how many have been inserted, so that you only close the database once every line has been inserted (even if the callbacks complete out of order).



Putting everything together:




db.dropCollection("Data", function(err, result) {
    var lineCount = 0;
    var readAllLines = false;

    byLineStream.on("end", function() {
        readAllLines = true;
    });

    byLineStream.on("data", function(line) {
        lineCount++;

        var o = parseLineToObject(line);
        collection.insert(o, { w : 1 }, function() {
            if (--lineCount === 0 && readAllLines) {
                // we've read and inserted all lines
                db.close();
            }
        });
    });
});



However, I do believe that passing a callback to insert ('safe mode') is slower than your current solution, where you call insert but don't wait for its result. To speed things up, instead of writing each line separately, you can buffer a number of lines and insert them in one statement.



Something similar to this (without the line counting):



var buffer = [];

byLineStream.on("data", function(line) {
    buffer.push(parseLineToObject(line));
    if (buffer.length >= 100) {
        // fire-and-forget batch insert, as in the original script
        collection.insert(buffer, { w : 1 });
        buffer = [];
    }
});

byLineStream.on("end", function() {
    // flush whatever is left in the buffer (if anything), then close;
    // this handler fires after the last "data" event, so checking a
    // readAllLines flag inside the "data" handler would never trigger
    var done = function() { db.close(); };
    if (buffer.length > 0) {
        collection.insert(buffer, { w : 1 }, done);
    } else {
        done();
    }
});
