The naive and obvious version
Seen surprisingly often in various avatars, it is a C mode of thinking goes something like this:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
vector<string> split(string s) | |
{ | |
vector<string> sRet; | |
size_t i = 0; | |
do | |
{ | |
string sWord; | |
// Find beginning of a word | |
while(i < s.size() && s[i] == ' ') ++i; | |
// Accumulate chars until end of a word or the input | |
while(i < s.size() && s[i] != ' ') | |
{ | |
sWord += s[i]; | |
i++; | |
} | |
// If we have a word save it | |
if(!sWord.empty()) | |
sRet.push_back(sWord); | |
} | |
while(i < s.size()); | |
return sRet; | |
} | |
Very verbose and tedious!
Of course on topcoder they usually write tersely equivalent code on a single line #defined as a macro or an inline function.
Can we do better?
Here is a more C++ like version
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
vector<string> split2(string s) | |
{ | |
vector<string> sRet; | |
istringstream iss(s); | |
string sWord; | |
while(iss >> sWord) | |
sRet.push_back(sWord); | |
return sRet; | |
} | |
Still smells like C
How about
Think in functional terms, get the loop mentality out of the way - The STL is your friend :
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
vector<string> split3(string s) | |
{ | |
istringstream iss(s); | |
istream_iterator<string> begi(iss), endi; | |
return vector<string>(begi, endi); | |
} | |
Nice and neat!
Can this be a one-liner?
Let's try putting everything together...
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
vector<int> v(istream_iterator<int>(istringstream(s)), istream_iterator<int>()); | |
Unfortunately this is not correct. The temporary istringstream dies after the istream iterator is constructed, so bad things will happen. The constructor of the istream_iterator takes a istream& rather than a const istream& otherwise it would be workable - See this
Actually should have been obvious...
You cannot read stuff out of a const reference to a stream since it changes state when you do.
What about other delimiters?
C++ streams are tied to locales, so they read stuff based on the definition of isspace() for the current locale. There seems to be no way to manufacture our own locales for splitting strings on things other than whitespace, so we need a different approach:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
vector<string> split(const string& s, string seps=" ") | |
{ | |
vector<string> ret; | |
for(auto ps = &s[0], pd = ps, pe = ps + int(s.size()); ps < pe; ps = pd + 1) | |
{ | |
ret.emplace_back(string(ps, pd = find_first_of(ps, pe, begin(seps), end(seps)))); | |
} | |
return ret; | |
} | |
Neat huh?
If that looks like Greek and Latin to you, don't worry, it looks that way to me too, three days after writing it.
Let's analyze what the heck is going on here...
The intent is as follows - we want to maintain two pointers/iterators - one points to the start of a word, and the other points to a separator (or end of string)
The for loop starts by initializing ps(pointer to start), pe(pointer to end-of-string).
Since we use auto we have to assign something to pd(pointer to delimiter), so might as well assign ps to it. We are going to loop until ps reaches pe.
Next we place the next available token into the result vector - emplace_back() is a function that can push an element onto a vector without copying the value passed into it - it uses move semantics to achieve this. Whenever you push a temporary object onto a vector, use emplace_back()
Remember never to reuse the emplaced object again, or else...
OK, but what the hell is "the next available token" mentioned above?
It is a string consisting of an iterator range - the start of the range is obviously ps. The end of the range is the next delimiter in the string, or the end of the string if no such delimiter exists.
Well that's exactly what find_first_of() does - returns an iterator in a range that has any one of the values you give it (here the characters in the string seps). If not found, it returns the end of the range, which is what we want.
So all we need to do is set ps to the point after the delimiter, so we can grab the next token - this is done in the increment part of the for loop. If it goes past the end, were done. Assigning pd on the fly inside the expression is a neat way to avoid one extra line of code. Remember that everything in C/C++ except control statements return a value (In fact and BTW, in GNU C, there is an extension that lets every statement be a value).
Not too bad at all! I can't think of a more neat version than this. I have some vague idea about using adjacent_find(), any_of() and so on, but I suspect it won't be as neat as this.
One thing you might notice is that I use pointers instead of iterators. The point is that pointers are the best kind of iterator, and if you know that the underlying object is contiguous, one might as well use a pointer than an iterator (the compiler will actually generate almost the same code for an iterator in this case). There is nothing evil about pointers, as long as you use them sparingly in a type safe way.
Here's how it looks in the brief form using the macros from my topcoder template:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
vstr split(cstref s, st seps=" ") | |
{ | |
vstr ret; | |
for(auto ps = pbeg(s), pd = ps, pe = ps + sz(s); ps < pe; ps = pd + 1) | |
ret.eb(st(ps, pd = find_first_of(ps, pe, all(seps)))); | |
return ret; | |
} | |
Try writing a similar split in your favorite language for kicks!
No comments:
Post a Comment