Saturday, August 23, 2014

my first C/C++ parser! :D


Last week I used C/C++ for a "real-life" problem for the first time.


The task was a relatively simple parse: a set of about 100 files, each with about 15 million lines of untidy data containing 10 different message types that needed to be parsed, refactored into a tidy format, and stored in a separate file per message type.

I built the program and benchmarked it on an i7 (using a single core for the program) with 8 GB of RAM and a regular SATA disk (no SSD), running Fedora 20.

Previous attempts to solve the problem in Python had been unsuccessful, taking up to 30 minutes to parse one of the files (thanks in part to Python sucking up all my RAM and slowing to a halt...).

So, I got a sample file from the server onto my local machine and set out to get the task done with C/C++. To measure my progress while debugging, I made a smaller file with the first 1,000,000 lines of the sample file.
I started by building the parser using Boost to split each line into a vector of token=value fields, plus a set of functions to process each message type.
On the first iteration, I would scan the fields to find the one that tells me which type of message it is, and dispatch the line to the right function. Once in the function, I would iterate over all the fields with an if-else structure, putting each value into a variable matching its token. Then, using Boost again to assign the variables back into a pre-formatted string, I would feed that string, along with the name of the message, to a writer function that opened the target file, wrote to it, and closed it again.
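Something like this minimal sketch shows the idea -- the line format, separator, and field names here are all made up, since I can't post the real data:

    #include <boost/algorithm/string.hpp>
    #include <boost/format.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        // Hypothetical input line: pipe-separated token=value fields.
        std::string line = "type=ORDER|id=42|price=19.99";

        // First split: the line into fields.
        std::vector<std::string> fields;
        boost::split(fields, line, boost::is_any_of("|"));

        std::string id, price;
        for (const std::string& field : fields) {
            // Second split: each field into its token=value pair.
            std::vector<std::string> kv;
            boost::split(kv, field, boost::is_any_of("="));
            if (kv.size() != 2) continue;
            // The real parser had an if-else tree like this per message
            // type, with one branch per token.
            if (kv[0] == "id")         id = kv[1];
            else if (kv[0] == "price") price = kv[1];
        }

        // Assign the variables back into a pre-formatted output string.
        std::cout << boost::format("%1%,%2%\n") % id % price;
        return 0;
    }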
Run time for the million-line file using this approach was ~23 seconds. Run time on a full file was still too slow, way past 5 minutes per file...

Luckily I have an experienced C and ASM developer sitting next to me :D so it was easy to find out where I was wasting a lot of those 23 seconds and start optimizing this thing with his help and his lectures (and a lot of StackExchange and Google!) :D

First optimization -> open and close files once! 
Turns out that opening a file and closing it again is a time-consuming call into the kernel and the OS... that can suck up a lot of time! So I changed the processing functions to return the formatted string instead of writing it into the file themselves. I moved the opening of the files to the main function and made a vector of ofstreams, one per output file. That way the files are opened once, before reading all the lines in the input file, and closed once at the end.
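A minimal sketch of the idea, with made-up message type names (the streams are wrapped in unique_ptr here, since older libstdc++ versions can't move an ofstream into a vector):

    #include <fstream>
    #include <memory>
    #include <string>
    #include <vector>

    int main() {
        // Hypothetical message type names; the real job had 10 of them.
        const std::vector<std::string> types = {"ORDER", "TRADE", "QUOTE"};

        // Open every output file once, up front.
        std::vector<std::unique_ptr<std::ofstream>> outs;
        for (const std::string& t : types)
            outs.push_back(std::unique_ptr<std::ofstream>(
                new std::ofstream(t + ".out")));

        // ... main parsing loop runs here: classify each line into an
        // index i, then:  *outs[i] << formatted_line << '\n';

        // All streams flush and close exactly once, when the vector
        // goes out of scope at the end of main.
        return 0;
    }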
Result... run time on the million-line file went down to around 13 seconds, and the full file ran in a bit over 2 minutes. Not bad, but not quite there yet.

Second optimization -> no more vectors!
The second optimization was a little trickier to get my head around... and it fixed several issues at once.
My goal was to remove the loop inside the functions used to process each line. Inside that loop I was processing the line already split into fields, then splitting each field into token=value pairs and classifying them with an if-else tree into various variables.
This approach is inefficient on several fronts:
1. splitting the line into fields costs a heap allocation and a copy for every field.
2. splitting each field into token=value pairs costs a few more vector operations and copies.
3. the loop inside the line-processing functions feels a little O(n²): for every field, you walk the whole if-else tree of tokens.
4. assembling the variables back into the formatted string was a bit of an unnecessary step...

So instead of splitting into vectors, I left the lines as strings and processed them using strstr() -- I thought about using the string.find() method instead but decided to go with the C approach. In the main function I grab each line and strstr it, looking for the token that classifies it, and dispatch it to the right processing function through an if-else tree (little change here, except for not using Boost to split into tokens). Once inside the processing function, I use strstr again to find each of the tokens (no for loop!) and its value, using a function that finds the pointers in the string for the end of the token and the end of the value. Then I append the text to the output string and move on to the next token.
The function returns a string that is then written to the corresponding output file.
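A rough sketch of that token-finding helper -- again with made-up tokens, and a '|' separator standing in for the real format (this get_value is hypothetical, not the exact function from the parser):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: find "token=" in line and copy the value
       that follows it into out. Values are assumed to end at '|' or
       at the end of the line. Returns 1 on success, 0 if missing.
       (A real version would also check that the match starts at the
       beginning of a field, so "id=" can't match inside "uid=".) */
    static int get_value(const char *line, const char *token,
                         char *out, size_t outsz) {
        const char *p = strstr(line, token);
        if (!p) return 0;
        p += strlen(token);                /* jump past "token=" */
        const char *end = strchr(p, '|');  /* value ends at separator */
        size_t len = end ? (size_t)(end - p) : strlen(p);
        if (len >= outsz) len = outsz - 1;
        memcpy(out, p, len);
        out[len] = '\0';
        return 1;
    }

    int main(void) {
        const char *line = "type=ORDER|id=42|price=19.99";
        char id[32], price[32];
        /* one strstr per token -- no splitting, no inner loop over fields */
        if (get_value(line, "id=", id, sizeof id) &&
            get_value(line, "price=", price, sizeof price))
            printf("%s,%s\n", id, price);
        return 0;
    }

No vectors, no splitting, no per-field heap allocations -- just pointer walks over the line buffer.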
Total time on the million-line file -> ~3.8 seconds. The full file ran in about 45 seconds. This was Friday night around 8:30 pm... so it was good enough for the week! :D

Overall it was a great learning experience. Seeing the difference in performance from minimizing kernel calls and removing an inner for loop was a bit of an eye-opener. It was also a very good first experience using C/C++ to solve a real problem. The next big optimization would be to go multi-threaded across all the cores, something I've done in Python before but uncharted waters for me on the C/C++ side of things.
:)  
