Tuesday, September 9, 2014

Parser to DB

I continue working on my C/C++ parser with the goal of parsing the data straight into my PostgreSQL db.
After some minor complications with the interactions between Code::Blocks, my compiler, PostgreSQL's official library and my OS, I managed to set up a connection to the DB and we are moving full speed ahead.

We are going to be working with chapter 31 of the official PostgreSQL documentation, which deals with the libpq C library.
Our long term goal remains:

  • processing close to 100 files of untidy data, usually (but not always) in the form "token=value," 
  • each file contains around 15 million lines of data 
  • we have a small sample file of 1 million lines for testing. 

After optimizing the parser in the last post so that it splits the messages and writes them to files, we got it down to ~3.8 seconds for the small file. (Keep in mind that this time is only for transforming the file into a tidy format without writing it to the DB; the files would still need to be written to the DB after this conversion happens.)

Enough background let's talk changes! :D

We need to store 3 of the message types in the db: the Time Of Sale, Level 1 and Level 2 messages. Our db is split into daily partitions to keep things fast, so the parser must first create a partition for the date being read from the files, then open a transaction, write all the messages and close the transaction. It's a simple enough plan.
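To make the plan concrete, here is a minimal sketch of that flow through libpq. The table names (time_of_sale, tos_YYYY_MM_DD) and the helper function are hypothetical, and since the post doesn't cover how our partitions are actually built, the child table below just illustrates the 9.3-style inheritance approach:

```cpp
#include <libpq-fe.h>
#include <cstdio>
#include <string>

// Run a single SQL command and report any error.
static bool run(PGconn *conn, const std::string &sql)
{
    PGresult *res = PQexec(conn, sql.c_str());
    bool ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    if (!ok)
        std::fprintf(stderr, "%s", PQerrorMessage(conn));
    PQclear(res);
    return ok;
}

// Hypothetical table names; one partition per trading day.
void write_day(PGconn *conn, const std::string &day)   // e.g. day = "2014_09_09"
{
    // 9.3-style partitioning: the child table inherits from the parent.
    run(conn, "CREATE TABLE IF NOT EXISTS tos_" + day + " () INHERITS (time_of_sale);");

    run(conn, "BEGIN;");
    // ... parse the file and INSERT each TOS / Level 1 / Level 2 message here ...
    run(conn, "COMMIT;");
}
```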

Initial approach. 
On the first iteration I built a new function to parse the Time Of Sale messages into an SQL INSERT string with the values from the file and execute it against the db using PQexec, with some error checking. The result is effective (it writes the data to the database) but it is not efficient: it takes 56.74 seconds on the 1 million line file! Optimizations are needed, very badly.
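In rough strokes it looked something like this (the column names and the tos_record struct are invented for illustration; the real messages have many more fields):

```cpp
#include <libpq-fe.h>
#include <cstdio>
#include <string>

// Invented record type standing in for a parsed Time Of Sale message.
struct tos_record { std::string symbol, time, price, size; };

// First attempt: build a full INSERT for every line and send it with PQexec.
bool insert_tos(PGconn *conn, const tos_record &r)
{
    std::string sql =
        "INSERT INTO time_of_sale(symbol, ts, price, size) VALUES ('" +
        r.symbol + "','" + r.time + "'," + r.price + "," + r.size + ");";

    PGresult *res = PQexec(conn, sql.c_str());
    bool ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    if (!ok)
        std::fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
    PQclear(res);
    return ok;   // one full parse/plan/round-trip per line -- this is the slow part
}
```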

First optimization: prepare the statement on the db.
The idea behind this optimization is that every time the DB gets a message saying INSERT INTO MYTABLE(field1, field2, ... fieldn) VALUES ('1', 2, ... 'n'); it must:

  1. Parse the message 
  2. Plan the execution of the instruction. 
  3. Convert the data from string into the right formats. 
  4. Check for restrictions. 
  5. Write into the table. 
On top of this, the network overhead of sending the first part of the message over and over again, a few million times, adds up to more than a few bytes. 
By preparing the statement on the DB, we are able to keep the bulk of the message in the db and send only the values for every line.
It also allows the server to parse and plan the statement only once and then execute it each time without the overhead of steps 1 and 2, avoiding the repetition of those steps at least a million times. 

As a first step in this direction I use the SQL PREPARE and EXECUTE statements through PQexec. This approach reduces the write time for the 1 million line file to 48.73 seconds, a 14% reduction. Not bad, but there is still room for improvement. 
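A sketch of that variant, reusing the same invented table and columns as above; the statement is prepared once per connection and then only an EXECUTE with the values crosses the wire for each line:

```cpp
// Prepare the INSERT once per connection...
void prepare_tos(PGconn *conn)
{
    PGresult *res = PQexec(conn,
        "PREPARE tos_ins (text, timestamptz, numeric, integer) AS "
        "INSERT INTO time_of_sale(symbol, ts, price, size) VALUES ($1, $2, $3, $4);");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        std::fprintf(stderr, "PREPARE failed: %s", PQerrorMessage(conn));
    PQclear(res);
}

// ...then each line only sends an EXECUTE with its values.
bool insert_tos_prepared(PGconn *conn, const tos_record &r)
{
    std::string sql = "EXECUTE tos_ins('" + r.symbol + "','" + r.time + "'," +
                      r.price + "," + r.size + ");";
    PGresult *res = PQexec(conn, sql.c_str());
    bool ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    return ok;
}
```

(libpq also offers PQprepare and PQexecPrepared for the same purpose; the plain SQL PREPARE/EXECUTE route shown here just keeps everything going through PQexec.)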

As a test, I had the parser write a file with all the INSERT statements, and then had PostgreSQL process that file. The result was a much faster time to write the million lines to the db (24 seconds, without even preparing the statement!). This result showed me that I still had bottlenecks on the C side. So I started asking around, and it seems that the cause of the difference is that in order to write each line to the DB from the C code I must make up to 3 kernel calls: since PostgreSQL is being reached through a network protocol, the kernel needs to read each message, allocate space for it, and then send the message. Reading from a file into the db, on the other hand, is similar to reading a file during a parse operation and is therefore faster. To optimize around this issue, we are going to accumulate the write messages into a much larger buffer before sending each batch to PostgreSQL.
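Something along these lines, reusing the prepared statement from before (the buffer size and function names are mine, for illustration):

```cpp
#include <libpq-fe.h>
#include <cstdio>
#include <string>

static const std::size_t BUF_LIMIT = 1024 * 1024;   // flush once we pass ~1 MB
static std::string buffer;

// Send everything accumulated so far in a single PQexec call.
void flush_buffer(PGconn *conn)
{
    if (buffer.empty()) return;
    PGresult *res = PQexec(conn, buffer.c_str());    // many EXECUTEs, one round trip
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        std::fprintf(stderr, "flush failed: %s", PQerrorMessage(conn));
    PQclear(res);
    buffer.clear();
}

// Instead of sending each EXECUTE immediately, append it to the buffer.
void buffered_insert(PGconn *conn, const std::string &execute_stmt)
{
    buffer += execute_stmt;                          // e.g. "EXECUTE tos_ins(...);"
    if (buffer.size() >= BUF_LIMIT)
        flush_buffer(conn);
}
// Remember one final flush_buffer() after the last line, before the COMMIT.
```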
With a buffer of 1024*1024 characters, the system takes 10.47 seconds to parse the TOS entries from the 1 million line file and write them to the DB. 

With this done, I consider the parser ready enough to go into production, and I'll move on to new challenges! :D



Thursday, August 28, 2014

background and intro

I started writing software in the fall of 2012, taking Toronto U's Learn to Program through Coursera, followed by several other MOOCs in Machine Learning, Neural Networks, Algorithms, Computational Investments, Computational Econometrics, Data Analysis, etc... By July 2013 I had quit my old job as the Manager of Trade Operations and Support for DayTradeTheWorld.com to build a new company with 4 partners and start working towards setting up our own hedge fund.

With this change I became a full-time developer, writing code every day of the week from morning to evening, mostly in Python. After 6 months I started learning SQL through Stanford's Introduction to Databases MOOC, and for the better part of the first half of 2014 I have been working mostly on our PostgreSQL database.

Around April we decided to move many of our DB procedures to C in order to optimize their run time. At that point I took CS50 at edX to learn the ropes of C, and later I started reading The C++ Programming Language and C++ In Action: Industrial Strength Programming. My intention is to learn C & C++ using these resources and by studying the C code that my partner has already developed for our system.
In this blog I intend to cover the process of learning C/C++ and moving all of the components that I built for our hedge fund in Python over the last year into this new language.

Codeblocks, PostgreSQL and C

I started working on the next steps for my parser. I want it to write the data straight into my db (which uses PostgreSQL). It's a simple enough plan that should require just a few small adjustments to my parser... in theory. :)

I started with an obvious and misleading Google search for C++ and PostgreSQL... which led me to the libpq++ library... which looked fine until I noticed that it is only documented up to PostgreSQL version 7... not a good sign given that I'm using version 9.3. That led to another search, which landed me on the current official library, libpq (libpq-fe.h), which is a C library but is also the official way to interact with Postgres from C/C++.

So I wrote a small program to simply read a table from the DB and print it out to the terminal, based on this example: http://www.postgresql.org/docs/8.3/static/libpq-example.html
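The test program ends up being just a few lines; a minimal sketch of it, with a made-up connection string and table name standing in for my own setup:

```cpp
#include <libpq-fe.h>
#include <cstdio>
#include <cstdlib>

int main(void)
{
    // Placeholder connection string; adjust dbname/user for your setup.
    PGconn *conn = PQconnectdb("dbname=markets user=postgres");
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "Connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    // "symbols" is a hypothetical table name, just to have something to print.
    PGresult *res = PQexec(conn, "SELECT * FROM symbols LIMIT 10;");
    if (PQresultStatus(res) == PGRES_TUPLES_OK) {
        for (int i = 0; i < PQntuples(res); i++) {
            for (int j = 0; j < PQnfields(res); j++)
                std::printf("%s\t", PQgetvalue(res, i, j));
            std::printf("\n");
        }
    } else {
        std::fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
    }

    PQclear(res);
    PQfinish(conn);
    return EXIT_SUCCESS;
}
```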

I made a few modifications to it so that it would read from my table and use my db... and everything looked fine... except that I kept getting an "undefined reference to PQ..." every time I tried to compile it. 


This led to a few hours of back and forth with Google and Stack Exchange... during which I made sure I had installed the correct libraries, found that unlike some other Linux distros, Fedora puts libpq-fe.h right in /usr/include/, and that everything should work fine once I add the -lpq flag to the compiler. So I went to Code::Blocks' compiler settings, other options, and added -lpq as a flag... which didn't work. :( After a few more back and forths (and re-checking everything one last time) I went into the terminal and compiled from there using "g++ main.cpp -lpq", which worked just fine... So I asked around and got a very educational lesson on Makefiles, and on why IDEs suck, from my coworkers. I learned that I should have been putting that -lpq flag in the Linker settings for Code::Blocks... so, after adding pq to Link libraries, I am able to read from the DB (Yippee!!!) and print the table to the terminal. Which means that I'm one step closer to writing my parser directly into the db. 

Saturday, August 23, 2014

my first C/C++ parser! :D


Last week I used C/C++ for a "real-life" problem for the first time.


The task was a relatively simple parse of a set of about 100 files, each with about 15 million lines of untidy data, where there are 10 different message types that need to be parsed, reshaped into a tidy format, and stored in a separate file for each message type.

I built the program and benchmarked it on an i7 (using a single core for the program) with 8 GB of RAM and a regular SATA disk (no SSD), running Fedora 20. 

Previous attempts to solve the problem in Python had been unsuccessful, taking up to 30 minutes to parse one of the files (thanks in part to Python sucking up all my RAM and slowing to a halt...).

So I copied a sample file from the server to my local machine and set out to get the task done with C/C++. In order to measure my progress during debugging, I made a smaller file with the first 1,000,000 lines from the sample file.
I started by building the parser using boost to split each line into a vector of token=value fields, and a set of functions to process each message type. 
On the first iteration, I would scan the fields to find the one that tells me which type of message it is, and send the line to the right function. Once in the function I would iterate over all the fields, using an if-else structure to put each value into a variable matching its token, and output a formatted string; using boost again to assign the variables back into a pre-formatted string, I would then feed that string, along with the name of the message, to a writer function that would open the target file, write to it and close it again. 
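A rough sketch of what one of those processing functions looked like (the token names and output layout are invented for illustration; the real messages have far more fields):

```cpp
#include <boost/algorithm/string.hpp>
#include <boost/format.hpp>
#include <string>
#include <vector>

// First-iteration style: the line arrives already split into "token=value" fields.
std::string process_tos(const std::vector<std::string> &fields)
{
    std::string symbol, price, size;
    for (const std::string &f : fields) {              // inner loop over every field
        std::size_t eq = f.find('=');
        if (eq == std::string::npos) continue;
        std::string token = f.substr(0, eq);
        std::string value = f.substr(eq + 1);
        if (token == "sym")      symbol = value;        // invented token names
        else if (token == "px")  price  = value;
        else if (token == "qty") size   = value;
        // ... and so on for every token in the message ...
    }
    return (boost::format("%s,%s,%s\n") % symbol % price % size).str();
}

// In main, each raw line was split with boost before dispatching.
std::string handle_line(const std::string &line)
{
    std::vector<std::string> fields;
    boost::split(fields, line, boost::is_any_of(","));  // "token=value" pieces
    // ...inspect the message-type field and dispatch to the right function...
    return process_tos(fields);
}
```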
Run time for the million line file using this approach was ~23 seconds. The full file was still running too slow, way past 5 minutes per file... 

Luckily I have an experienced C and ASM developer sitting next to me :D so it was easy to find out where I was wasting a lot of those 23 seconds and start optimizing this thing with his help and his lectures (and a lot of Stack Exchange and Google!) :D 

First optimization -> open and close files once! 
Turns out that opening a file and closing it again is a time-consuming call to the kernel and the OS... that can suck up a lot of time! So I changed the processing functions to return the formatted string instead of writing it into the file themselves. I moved the opening of the files to the main function and made a vector of ofstreams, one for each output file, so the files are opened once before reading all the lines in the input file and closed once at the end.
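A sketch of the change (the file names are placeholders):

```cpp
#include <fstream>
#include <string>
#include <vector>

int main()
{
    // One output stream per message type, opened exactly once.
    const std::vector<std::string> names = {"tos.csv", "level1.csv", "level2.csv"};
    std::vector<std::ofstream> outputs;
    for (const std::string &n : names)
        outputs.emplace_back(n);

    std::ifstream input("sample_1M.txt");
    std::string line;
    while (std::getline(input, line)) {
        // int type = classify(line);        // pick the message type
        // outputs[type] << process(line);   // processing functions now return the string
    }
    // The streams close themselves when they go out of scope -- once, at the end.
    return 0;
}
```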
Result... run time on the million line file went down to around 13 seconds, and the full file was running in a bit over 2 minutes. Not bad, but not quite there yet.

Second optimization -> no more vectors.
The second optimization was a little trickier to get my head around... and it fixed several issues at once... 
My goal was to remove the loop inside the functions used to process each line. Inside that loop I was taking the line already split into fields, splitting each field into token=value pairs, and classifying them with an if-else tree into various variables. 
This approach is inefficient on several fronts: 
1. Splitting the line into fields costs a round of allocations (and the occasional kernel call). 
2. Splitting each field into token=value pairs costs a few more vector operations. 
3. The loop inside the line-processing functions feels a little O(n²). 
4. Assembling the variables back into the formatted string was a bit of an unnecessary step. 

So instead of splitting into vectors I left the lines as strings and processed them using strstr() -- I thought about using the string::find() method instead but decided to go with the C approach. In the main function I grab each line and strstr it, looking for the token that classifies it into one of the processing functions via an if-else tree (little change here except for not using boost to split into tokens). Once inside the processing function, I use strstr again to find each of the tokens (no for loop!) and extract the value with a helper that finds the pointers in the string for the end of the token and the end of the value. Then I append the text to the output string and move to the next token. 
The function returns a string that is then written to the corresponding output file.
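A sketch of the strstr() version, again with invented token names:

```cpp
#include <cstring>
#include <string>

// Find "token" in the raw line and return the value that follows it,
// up to the next ',' delimiter (or the end of the line).
static std::string get_value(const char *line, const char *token)
{
    const char *start = std::strstr(line, token);    // e.g. token = "px="
    if (!start) return "";
    start += std::strlen(token);                     // skip past "token="
    const char *end = std::strchr(start, ',');       // value ends at the delimiter
    return end ? std::string(start, end - start) : std::string(start);
}

// No inner loop: jump straight to each known token and build the output string.
std::string process_tos(const char *line)
{
    std::string out;
    out += get_value(line, "sym=");  out += ',';
    out += get_value(line, "px=");   out += ',';
    out += get_value(line, "qty=");  out += '\n';
    return out;   // the caller writes this to the TOS output file
}
```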
Total time on the million line file -> ~3.8 seconds. The full file ran in about 45 seconds. This was Friday night around 8:30 pm... so it was good enough for the week! :D

Overall it was a great learning experience. Seeing the difference in performance from minimizing kernel calls and removing an inner for loop was a bit of an eye opener, and it was a very good first experience using C/C++ to solve a real problem. The next big optimization would be to run multi-threaded on all cores, something I've done in Python before but uncharted waters for me on the C/C++ side of things. 
:)