Difficulties in writing a csv parser

Definitions

The reason

After fiddling around in awk for a while, I needed to use a csv file for data. This file, however, was BIG. After looking around I found tools to do the job, but I needed to speed up the parsing. The parser I used was a module in the awk-libs written by e36freak, and although it did its job, could it be sped up? The parser was written using a state machine, but this is awk: what if we used a regex instead?

The csv format

We all know csv, we've all used it; to this day it is the quickest "database" format.

The simple csv

field1,field2

Pretty simple, no? A simple FS="," is sufficient to handle this in awk; no need for a state machine. But what if you have a field with a , in it? That would mess up the parsing terribly.

Solution: FS=","
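To make both the easy case and the failure concrete, a quick sketch (POSIX awk assumed):

```shell
# Plain csv splits cleanly with FS=","
printf 'field1,field2\n' | awk -F, '{ print $1 "|" $2 }'
# prints field1|field2

# ...but a comma inside a quoted field gets split too
printf '%s\n' 'field1,"field,2"' | awk -F, '{ print NF }'
# prints 3 (we wanted 2 fields)
```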

The stringy csv

(Be aware: this is my recollection of writing the csv.awk unit; most of these solutions could fail, and I am unwilling to test them.)

field1,"field,2"

With the addition of strings, a simple split() cannot be done. An FPAT could be used, but it does not quite fit. Can we avoid a state machine? Yes! Using something equivalent to lua's string.gmatch.

Solution: /^([^",]|"([^"]|"")*")*,/
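A sketch of how this regex peels off one field in plain awk: match() sets RSTART and RLENGTH, and the quotes stay attached to the field (unquoting is a separate step, omitted here).

```shell
printf '%s\n' 'field1,"field,2"' | awk '{
    # match the first field plus its trailing comma
    if (match($0, /^([^",]|"([^"]|"")*")*,/))
        print substr($0, RSTART, RLENGTH - 1)  # drop the comma
}'
# prints field1
```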

In reality the entire solution is more like

    while (match(csvrecord, /^([^",]|"([^"]|"")*")*,/)) {
        $(++NF) = substr(csvrecord, RSTART, RLENGTH - 1)
        csvrecord = substr(csvrecord, RSTART + RLENGTH)
    }
    $(++NF) = csvrecord
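Put together as a runnable sketch (quotes are left on the fields; the real csv.awk module presumably also strips quotes and undoes "" escapes, which is omitted here):

```shell
printf '%s\n' 'field1,"field,2",field3' | awk '{
    csvrecord = $0
    NF = 0                        # rebuild the fields ourselves
    while (match(csvrecord, /^([^",]|"([^"]|"")*")*,/)) {
        $(++NF) = substr(csvrecord, RSTART, RLENGTH - 1)
        csvrecord = substr(csvrecord, RSTART + RLENGTH)
    }
    $(++NF) = csvrecord           # whatever is left is the last field
    print NF "|" $2
}'
# prints 3|"field,2"
```

Note that the embedded comma no longer splits the quoted field: the regex consumes "field,2" as a single quoted unit before looking for the field-terminating comma.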