Tuesday, September 9, 2014

Parser to DB

I continue working on my C/C++ parser with the goal of parsing the data straight into my PostgreSQL db.
After some minor complications with interactions between Code::Blocks, my compiler, PostgreSQL's official library, and my OS, I managed to set up a connection to the DB and we are moving full speed ahead.
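For the curious, the connection setup itself boils down to a few libpq calls. A minimal sketch, along these lines (the connection string is a placeholder, not my actual configuration):

    /* Minimal libpq connection sketch; connection parameters are placeholders. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    PGconn *connect_db(void)
    {
        /* Hypothetical connection string for illustration only. */
        PGconn *conn = PQconnectdb("host=localhost dbname=marketdata user=parser");

        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "Connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            exit(EXIT_FAILURE);
        }
        return conn;
    }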

We are going to be working with chapter 31 of the official PostgreSQL documentation, which deals with the libpq C library.
Our long-term goal remains:

  • processing close to 100 files of untidy data, usually (but not always) in the form "token=value,"
  • each file contains around 15 million lines of data 
  • we have a small sample file of 1 million lines for testing. 

After optimizing the parser in the last post so that it splits the messages and writes them to files, we got it down to ~3.8 seconds for the small file. (Keep in mind that this time is only for transforming the file into a tidy format without writing it to the DB; the files would still need to be written to the DB after this conversion happens.)

Enough background, let's talk changes! :D

We need to store 3 of the messages into the db: the Time Of Sale, Level 1 and Level 2 messages. Our db is split into daily partitions to keep things fast, so the parser must first create a partition for the date being read from the files, then open a transaction, write all the messages, and close the transaction. It's a simple enough plan.
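Roughly, the per-day flow looks like the sketch below. The table names are placeholders and the real partitioning logic is more involved (in this PostgreSQL version there is no declarative partitioning, so partitions are plain child tables created through inheritance):

    /* Sketch of the per-day flow; tos_YYYYMMDD and time_of_sale are hypothetical names. */
    #include <stdio.h>
    #include <libpq-fe.h>

    /* Run one statement and complain if it fails; 'conn' is an open PGconn. */
    static int run_sql(PGconn *conn, const char *sql)
    {
        PGresult *res = PQexec(conn, sql);
        int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
        if (!ok)
            fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
        PQclear(res);
        return ok;
    }

    void write_day(PGconn *conn, const char *yyyymmdd /* e.g. "20140909" */)
    {
        char sql[512];

        /* Create the daily partition as a child table via inheritance. */
        snprintf(sql, sizeof(sql),
                 "CREATE TABLE IF NOT EXISTS tos_%s () INHERITS (time_of_sale);",
                 yyyymmdd);
        run_sql(conn, sql);

        run_sql(conn, "BEGIN;");
        /* ... parse the file and write each Time Of Sale / Level 1 / Level 2
         *     message here ... */
        run_sql(conn, "COMMIT;");
    }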

Initial approach. 
On the first iteration I built a new function that parses the TimeOfSale messages into an SQL INSERT string with the values from the file and executes it against the db using PQexec, with some error checking. The result is effective (it writes the data to the database) but not efficient: it takes 56.74 seconds on the 1 million line file! Optimizations are needed, very badly.
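The shape of that first version is roughly the following; the column names are placeholders for the real schema, and production code should also escape the string values (e.g. with PQescapeStringConn):

    /* Rough shape of the first version: one PQexec per parsed line.
     * Table and column names are placeholders. */
    #include <stdio.h>
    #include <libpq-fe.h>

    void insert_tos(PGconn *conn, const char *symbol, const char *ts,
                    double price, long size)
    {
        char sql[1024];

        /* NOTE: no escaping shown; real code should use PQescapeStringConn. */
        snprintf(sql, sizeof(sql),
                 "INSERT INTO time_of_sale(symbol, ts, price, size) "
                 "VALUES ('%s', '%s', %f, %ld);",
                 symbol, ts, price, size);

        PGresult *res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }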

First optimization: prepare the statement on the db.
The idea behind this optimization is that every time the DB gets a message saying INSERT INTO MYTABLE(field1, field2, ... fieldn) VALUES ('1', 2, ... 'n'); it must:

  1. Parse the message.
  2. Plan the execution of the instruction.
  3. Convert the data from strings into the right formats.
  4. Check the constraints.
  5. Write into the table.

On top of this, the network overhead of sending the first part of the message over and over, a few million times, adds up to more than a few bytes.
By preparing the statement on the DB, we are able to keep the bulk of the message in the db and send only the values for every line.
It also allows the server to parse and plan the message only once, and then execute it each time without the overhead of steps 1 and 2, avoiding the repetition of those two steps at least a million times.

As a first step in this direction, I use the SQL PREPARE and EXECUTE statements through PQexec. This approach reduces the write time to 48.73 seconds for the 1 million line file, roughly a 14% reduction. Not bad, but there is still room for improvement.
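In sketch form it looks something like this: PREPARE once per connection, then EXECUTE per parsed line, all through PQexec. The statement name and columns are placeholders (libpq also offers PQprepare/PQexecPrepared, which avoid building the EXECUTE string by hand, but that is not what I'm doing here):

    /* PREPARE once, EXECUTE per line, all through PQexec. Placeholder names. */
    #include <stdio.h>
    #include <libpq-fe.h>

    void prepare_tos(PGconn *conn)
    {
        PGresult *res = PQexec(conn,
            "PREPARE tos_ins(text, timestamp, numeric, bigint) AS "
            "INSERT INTO time_of_sale(symbol, ts, price, size) "
            "VALUES ($1, $2, $3, $4);");
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "PREPARE failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }

    void execute_tos(PGconn *conn, const char *symbol, const char *ts,
                     double price, long size)
    {
        char sql[512];
        snprintf(sql, sizeof(sql),
                 "EXECUTE tos_ins('%s', '%s', %f, %ld);",
                 symbol, ts, price, size);

        PGresult *res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "EXECUTE failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }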

As a test, I had the parser write a file with all the INSERT statements and then had PostgreSQL process that file. The result was a much faster time to write the million lines to the db (24 seconds, without even preparing the statement!). This result showed me that I still have bottlenecks on the C side. So I started asking around, and it seems the cause of the difference is that writing each line to the DB from the C code can take up to 3 kernel calls: since PostgreSQL is reached through a network protocol, the kernel needs to read each message, allocate space for it, and then send it. Reading a file into the db, on the other hand, is similar to reading a file during a parse operation and is therefore faster. To work around this, we are going to accumulate the write messages into a much larger buffer and send them to PostgreSQL in batches.
With a buffer of 1024*1024 characters, the system takes 10.47 seconds to parse the TOS entries from the 1 million line file and write them to the DB.
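A sketch of the buffering idea, reusing the placeholder statement name from before: append each EXECUTE to a ~1 MB buffer and hand the whole batch to a single PQexec call, which accepts multiple semicolon-separated commands and returns the result of the last one. The real code also flushes the buffer at end of file:

    /* Buffered variant: batch many EXECUTE statements into one PQexec call. */
    #include <stdio.h>
    #include <string.h>
    #include <libpq-fe.h>

    #define BUF_SIZE (1024 * 1024)

    static char   buf[BUF_SIZE];
    static size_t buf_len = 0;

    static void flush_buffer(PGconn *conn)
    {
        if (buf_len == 0)
            return;
        PGresult *res = PQexec(conn, buf);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "batch failed: %s", PQerrorMessage(conn));
        PQclear(res);
        buf_len = 0;
        buf[0] = '\0';
    }

    void buffered_execute(PGconn *conn, const char *symbol, const char *ts,
                          double price, long size)
    {
        char stmt[512];
        int n = snprintf(stmt, sizeof(stmt),
                         "EXECUTE tos_ins('%s', '%s', %f, %ld);",
                         symbol, ts, price, size);

        /* Flush first if this statement would not fit in the buffer. */
        if (buf_len + (size_t)n + 1 > BUF_SIZE)
            flush_buffer(conn);

        /* Append the statement (including its null terminator). */
        memcpy(buf + buf_len, stmt, (size_t)n + 1);
        buf_len += (size_t)n;
    }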

With this done, I consider the parser ready enough to go into production, and I'll move on to new challenges! :D