Saturday, 27 May 2017

Why is splitting a string slower in C++ than Python?



I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to python's split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude more, as was the case for yesterday's lesson).



Python Code:




#!/usr/bin/env python
from __future__ import print_function
import time
import sys

count = 0
start_time = time.time()
dummy = None


for line in sys.stdin:
dummy = line.split()
count += 1

delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
lps = int(count/delta_sec)
print(" Crunch Speed: {0}".format(lps))
else:

print('')


C++ Code:



#include                                                               
#include
#include
#include
#include


using namespace std;

void split1(vector &tokens, const string &str,
const string &delimiters = " ") {
// Skip delimiters at beginning
string::size_type lastPos = str.find_first_not_of(delimiters, 0);

// Find first non-delimiter
string::size_type pos = str.find_first_of(delimiters, lastPos);


while (string::npos != pos || string::npos != lastPos) {
// Found a token, add it to the vector
tokens.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters
lastPos = str.find_first_not_of(delimiters, pos);
// Find next non-delimiter
pos = str.find_first_of(delimiters, lastPos);
}
}


void split2(vector &tokens, const string &str, char delim=' ') {
stringstream ss(str); //convert string to stream
string item;
while(getline(ss, item, delim)) {
tokens.push_back(item); //add token to vector
}
}

int main() {

string input_line;
vector spline;
long count = 0;
int sec, lps;
time_t start = time(NULL);

cin.sync_with_stdio(false); //disable synchronous IO

while(cin) {
getline(cin, input_line);

spline.clear(); //empty the vector for the next line to parse

//I'm trying one of the two implementations, per compilation, obviously:
// split1(spline, input_line);
split2(spline, input_line);

count++;
};

count--; //subtract for final over-read

sec = (int) time(NULL) - start;
cerr << "C++ : Saw " << count << " lines in " << sec << " seconds." ;
if (sec > 0) {
lps = count / sec;
cerr << " Crunch speed: " << lps << endl;
} else
cerr << endl;
return 0;

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp



Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and is able to merge multiple tokens as well as handle numerous tokens (it comes from here). The second (split2) uses getline to read the string as a stream, doesn't merge delimiters, and only supports a single delimeter character (that one was posted by several StackOverflow users in answers to string splitting questions).



I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I'm testing with a 20M line text file with three space-separated columns that each look similar to this: "foo.bar 127.0.0.1 home.foo.bar"



Results:



$ /usr/bin/time cat test_lines_double | ./split.py
15.61 real 0.01 user 0.38 sys

Python: Saw 20000000 lines in 15 seconds. Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
23.50 real 0.01 user 0.46 sys
C++ : Saw 20000000 lines in 23 seconds. Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
44.69 real 0.02 user 0.62 sys
C++ : Saw 20000000 lines in 45 seconds. Crunch speed: 444444


What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like python's split), is thread safe (so no strtok), and whose performance is at least on par with python?




Edit 1 / Partial Solution?:



I tried making it a more fair comparison by having python reset the dummy list and append to it each time, as C++ does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now:



for line in sys.stdin:
dummy = []
dummy += line.split()
count += 1



The performance of python is now about the same as the split1 C++ implementation.



/usr/bin/time cat test_lines_double | ./split5.py
22.61 real 0.01 user 0.40 sys
Python: Saw 20000000 lines in 22 seconds. Crunch Speed: 909090


I still am surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), that these C++ implementations would not be faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string splitting speed.)




Thanks to all for your help.



Final Edit/Solution:



Please see Alf's accepted answer. Since python deals with strings strictly by reference and STL strings are often copied, performance is better with vanilla python implementations. For comparison, I compiled and ran my data through Alf's code, and here is the performance on the same machine as all the other runs, essentially identical to the naive python implementation (though faster than the python implementation that resets/appends the list, as shown in the above edit):



$ /usr/bin/time cat test_lines_double | ./split6
15.09 real 0.01 user 0.45 sys
C++ : Saw 20000000 lines in 15 seconds. Crunch speed: 1333333



My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case.



One of the lessons here from this issue and yesterday's stdin line reading issue (linked above) are that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.



Thanks again to all for your suggestions!


Answer



As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity.



If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).




The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code -- and on my machine this is of course faster than Python, since Python's string handling is implemented in C which is a subset of C++ (he he):



#include                                                               
#include
#include
#include
#include

using namespace std;


class StringRef
{
private:
char const* begin_;
int size_;

public:
int size() const { return size_; }
char const* begin() const { return begin_; }

char const* end() const { return begin_ + size_; }

StringRef( char const* const begin, int const size )
: begin_( begin )
, size_( size )
{}
};

vector split3( string const& str, char delimiter = ' ' )
{

vector result;

enum State { inSpace, inToken };

State state = inSpace;
char const* pTokenBegin = 0; // Init to satisfy compiler.
for( auto it = str.begin(); it != str.end(); ++it )
{
State const newState = (*it == delimiter? inSpace : inToken);
if( newState != state )

{
switch( newState )
{
case inSpace:
result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
break;
case inToken:
pTokenBegin = &*it;
}
}

state = newState;
}
if( state == inToken )
{
result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) );
}
return result;
}

int main() {

string input_line;
vector spline;
long count = 0;
int sec, lps;
time_t start = time(NULL);

cin.sync_with_stdio(false); //disable synchronous IO

while(cin) {
getline(cin, input_line);

//spline.clear(); //empty the vector for the next line to parse

//I'm trying one of the two implementations, per compilation, obviously:
// split1(spline, input_line);
//split2(spline, input_line);

vector const v = split3( input_line );
count++;
};


count--; //subtract for final over-read
sec = (int) time(NULL) - start;
cerr << "C++ : Saw " << count << " lines in " << sec << " seconds." ;
if (sec > 0) {
lps = count / sec;
cerr << " Crunch speed: " << lps << endl;
} else
cerr << endl;
return 0;
}


//compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x


Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won't significantly affect the speed.


No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?

Those brackets declare an empty, inline constructor. In that case, with them, the constructor does exist, it merely does nothing more than t...