2015년 4월 29일 수요일

Linux/Unix text parsing with awk for spreadsheet data

I track some basic personal health data in a spreadsheet, and track things like time to bed, wakeup time, mealtimes, etc. Simple analysis (average, mean avg. deviation, etc.) is easy to do using built-in spreadsheet functions like AVERAGE().

For conditional calculations, however, you will have to start using IF() statements, which can get complicated if you have several conditions you want to check for.

Consider the following Tab-separated data from a spreadsheet covering the month of September 2014:

Bed Wake DoW Condition nap (hrs) Rest (hrs)
9/2/2014 0:45:00 9/2/2014 5:50:00 1 5.08
9/3/2014 0:10:00 9/3/2014 5:50:00 2 5.67
9/4/2014 1:00:00 9/4/2014 5:50:00 3 4.83
9/5/2014 2:00:00 9/5/2014 11:30:00 4 pretty hung over today 9.50
9/6/2014 2:00:00 9/6/2014 9:45:00 5 1.5 9.25
9/7/2014 2:00:00 9/7/2014 9:45:00 6 7.75
9/8/2014 3:00:00 9/8/2014 10:40:00 0 Chuseok day 7.67
9/9/2014 2:00:00 9/9/2014 11:50:00 1 9.83
9/10/2014 0:30:00 9/10/2014 5:50:00 2 came into the office to study; caught a cold 5.33
9/10/2014 23:30:00 9/11/2014 5:50:00 3 cold sore appears on lip 6.33
9/11/2014 23:15:00 9/12/2014 6:20:00 4 7.08
9/13/2014 1:30:00 9/13/2014 11:00:00 5 9.50
9/14/2014 1:30:00 9/14/2014 11:00:00 6 9.50
9/15/2014 1:00:00 9/15/2014 5:50:00 0 common cold has moved to the chest; phlegm comes out 4.83
9/16/2014 0:00:00 9/16/2014 5:00:00 1 woke up 40 minutes early b/c of bad cough, condition is better than it was on Monday 5.00
9/17/2014 0:00:00 9/17/2014 5:50:00 2 5.83
9/18/2014 0:00:00 9/18/2014 6:00:00 3 6.00
9/19/2014 0:00:00 9/19/2014 5:50:00 4 5.83
9/19/2014 23:00:00 9/20/2014 9:30:00 5 10.50
9/21/2014 0:30:00 9/21/2014 10:00:00 6 9.50
9/22/2014 0:00:00 9/22/2014 5:55:00 0 5.92
9/23/2014 0:00:00 9/23/2014 9:00:00 1 9.00
9/24/2014 12:30:00 9/24/2014 17:30:00 2 5.00
9/24/2014 23:40:00 9/25/2014 5:50:00 3 6.17
9/25/2014 23:30:00 9/26/2014 5:50:00 4 6.33
9/27/2014 1:30:00 9/27/2014 10:00:00 5 1 9.50
9/28/2014 0:30:00 9/28/2014 10:00:00 6 9.50
9/29/2014 0:30:00 9/29/2014 5:55:00 0 5.42
9/29/2014 23:40:00 9/30/2014 5:55:00 1 6.25
9/30/2014 23:50:00 10/1/2014 5:55:00 2 6.08 

The 3rd field, DoW (Day Of Week), takes values from 0 to 6, with 0 being Monday and 6 being Sunday. Getting an average value for hours slept Mon~Sun is trivial, as I can simply use AVERAGE() on the 6th column which is the field Rest.

But what if I want to find the average number of hours slept on the weekends (when DoW is 5 or 6)? Doing it the spreadsheet way would require an IF() statement checking if the 3rd field, DoW, is either 5 or 6 and then taking the average of the values in the 6th field, Rest, in the case that the IF conditions are satisfied.

Using Linux/UNIX text parsing tools is simpler, in my opinion. First I will copy the above TSV data into a text file named sept2014.txt

I will now print all lines to stdout (or I could redirect output to a file with > filename) satisfying the condition that the 3rd field contains a 5 or a 6.

$ cat sept2014.txt | awk -F'\t' '$3 == "5" || $3 == "6"'

The -F flag above designates the field separator character, which in the case above is TAB denoted by \t (single-quoted for a string literal). The default field separator in awk is non-TAB whitespace, so if fields are separated by spaces, there is no reason to explicitly state the field separator.

$N where N is some natural number, denotes the field number. $3 == "5" simply checks if the 3rd field has the value 5, while $3 checks if the 3rd field has the value 6.

The output of the above one-liner above is:

9/6/2014 2:00:00 9/6/2014 9:45:00 5 1.5 9.25
9/7/2014 2:00:00 9/7/2014 9:45:00 6 7.75
9/13/2014 1:30:00 9/13/2014 11:00:00 5 9.50
9/14/2014 1:30:00 9/14/2014 11:00:00 6 9.50
9/19/2014 23:00:00 9/20/2014 9:30:00 5 10.50
9/21/2014 0:30:00 9/21/2014 10:00:00 6 9.50
9/27/2014 1:30:00 9/27/2014 10:00:00 5 1 9.50
9/28/2014 0:30:00 9/28/2014 10:00:00 6 9.50

As you can see, only rows corresponding to DoW 5 or 6 (Saturday or Sunday) are printed. I can now copy and paste this data into a new sheet in the existing spreadsheet and calculate the average hours of sleep for the weekends. I think that sometimes quick text parsing with Linux/Unix text utilities is much faster than trying to write your own spreadsheet macro or create a multiply-nested spreadsheet formula.

댓글 없음:

댓글 쓰기