After fiddling around in awk for a while, I needed to use a CSV file for data. This file, however, was BIG. After looking around I found tools for the job, but I needed to speed up the parsing. The parser I used was a module from e36freak's awk-libs, and although it worked, could it be sped up? That parser was written as a state machine, but this is awk: what if we used a regex instead?
We all know CSV, we've all used it; to this day it is the quickest "database" format.
field1,field2
Pretty simple, no? A simple FS="," is sufficient to handle this in awk; no need for a state machine. But what if a field contains a ,? That would mess up the parsing terribly.
Solution: FS=","
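For that plain case a one-liner really is enough. (This demo is mine, not from the original library; the field names are just for illustration.)

```shell
# Naive CSV: no quoting, so FS="," splits cleanly.
printf 'field1,field2\n' | awk -F, '{ print $1; print $2 }'
# prints:
# field1
# field2
```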
(Be aware: this is my recollection of writing the csv.awk unit; most of these solutions could fail, and I am unwilling to test them.)
field1,"field,2"
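To see the mess concretely, here is my own quick demo of what naive comma-splitting does to that record: the quoted field gets cut in two.

```shell
# Naive FS="," splitting ignores the quotes entirely,
# so "field,2" is broken across two fields.
printf '%s\n' 'field1,"field,2"' | awk -F, '{ print NF; print $2 }'
# prints:
# 3
# "field
```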
With the addition of quoted strings, a simple split() cannot be done. FPAT could be used, but it doesn't quite fit. Can we still avoid a state machine? Yes! With something equivalent to Lua's string.gmatch.
Solution: /^([^",]|"([^"]|"")*")*,/
In reality, the entire solution is more like:
while (match(csvrecord, /^([^",]|"([^"]|"")*")*,/)) {
    # everything up to (but not including) the trailing comma is one field
    $(++NF) = substr(csvrecord, RSTART, RLENGTH - 1)
    # drop the consumed field and its comma, then go again
    csvrecord = substr(csvrecord, RSTART + RLENGTH)
}
# no more commas outside quotes: whatever remains is the last field
$(++NF) = csvrecord
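Stitched into a complete, runnable program, the loop looks something like this. This is my own sketch, not code from csv.awk: I collect fields into an array instead of $(++NF) so it runs standalone on any awk, and note the quotes are left on the field (unquoting would be a separate step).

```shell
printf '%s\n' 'field1,"field,2",field3' | awk '
{
    csvrecord = $0
    n = 0
    # Peel off one field (plus its trailing comma) per iteration;
    # the regex is anchored at ^, so RSTART is always 1.
    while (match(csvrecord, /^([^",]|"([^"]|"")*")*,/)) {
        field[++n] = substr(csvrecord, RSTART, RLENGTH - 1)
        csvrecord = substr(csvrecord, RSTART + RLENGTH)
    }
    field[++n] = csvrecord      # the remainder is the last field
    for (i = 1; i <= n; i++) print i ": " field[i]
}'
# prints:
# 1: field1
# 2: "field,2"
# 3: field3
```

The embedded comma survives because the quoted-string alternative in the regex swallows it before the trailing , can match.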