Jun's Pocket Plane: Linux/Unix text parsing with awk for spreadsheet data

Wednesday, April 29, 2015

Linux/Unix text parsing with awk for spreadsheet data

I track some basic personal health data in a spreadsheet: bedtime, wake-up time, mealtimes, and so on. Simple analysis (average, mean absolute deviation, etc.) is easy to do using built-in spreadsheet functions like AVERAGE().

For conditional calculations, however, you will have to start using IF() statements, which can get complicated if you have several conditions you want to check for.

Consider the following Tab-separated data from a spreadsheet covering the month of September 2014:

Bed Wake DoW Condition nap (hrs) Rest (hrs)
9/2/2014 0:45:00 9/2/2014 5:50:00 1 5.08
9/3/2014 0:10:00 9/3/2014 5:50:00 2 5.67
9/4/2014 1:00:00 9/4/2014 5:50:00 3 4.83
9/5/2014 2:00:00 9/5/2014 11:30:00 4 pretty hung over today 9.50
9/6/2014 2:00:00 9/6/2014 9:45:00 5 1.5 9.25
9/7/2014 2:00:00 9/7/2014 9:45:00 6 7.75
9/8/2014 3:00:00 9/8/2014 10:40:00 0 Chuseok day 7.67
9/9/2014 2:00:00 9/9/2014 11:50:00 1 9.83
9/10/2014 0:30:00 9/10/2014 5:50:00 2 came into the office to study; caught a cold 5.33
9/10/2014 23:30:00 9/11/2014 5:50:00 3 cold sore appears on lip 6.33
9/11/2014 23:15:00 9/12/2014 6:20:00 4 7.08
9/13/2014 1:30:00 9/13/2014 11:00:00 5 9.50
9/14/2014 1:30:00 9/14/2014 11:00:00 6 9.50
9/15/2014 1:00:00 9/15/2014 5:50:00 0 common cold has moved to the chest; phlegm comes out 4.83
9/16/2014 0:00:00 9/16/2014 5:00:00 1 woke up 40 minutes early b/c of bad cough, condition is better than it was on Monday 5.00
9/17/2014 0:00:00 9/17/2014 5:50:00 2 5.83
9/18/2014 0:00:00 9/18/2014 6:00:00 3 6.00
9/19/2014 0:00:00 9/19/2014 5:50:00 4 5.83
9/19/2014 23:00:00 9/20/2014 9:30:00 5 10.50
9/21/2014 0:30:00 9/21/2014 10:00:00 6 9.50
9/22/2014 0:00:00 9/22/2014 5:55:00 0 5.92
9/23/2014 0:00:00 9/23/2014 9:00:00 1 9.00
9/24/2014 12:30:00 9/24/2014 17:30:00 2 5.00
9/24/2014 23:40:00 9/25/2014 5:50:00 3 6.17
9/25/2014 23:30:00 9/26/2014 5:50:00 4 6.33
9/27/2014 1:30:00 9/27/2014 10:00:00 5 1 9.50
9/28/2014 0:30:00 9/28/2014 10:00:00 6 9.50
9/29/2014 0:30:00 9/29/2014 5:55:00 0 5.42
9/29/2014 23:40:00 9/30/2014 5:55:00 1 6.25
9/30/2014 23:50:00 10/1/2014 5:55:00 2 6.08

The 3rd field, DoW (Day of Week), takes values from 0 to 6, with 0 being Monday and 6 being Sunday. Getting the average hours slept across all days, Monday through Sunday, is trivial: I can simply use AVERAGE() on the 6th column, which is the Rest field.

But what if I want to find the average number of hours slept on the weekends (when DoW is 5 or 6)? Doing it the spreadsheet way would require an IF() statement checking if the 3rd field, DoW, is either 5 or 6 and then taking the average of the values in the 6th field, Rest, in the case that the IF conditions are satisfied.

Using Linux/UNIX text parsing tools is simpler, in my opinion. First I will copy the above TSV data into a text file named sept2014.txt.

I will now print to stdout (or I could redirect output to a file with > filename) every line whose 3rd field is exactly 5 or 6.

$ awk -F'\t' '$3 == "5" || $3 == "6"' sept2014.txt

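The -F flag above designates the field separator, which in this case is TAB, written \t (single-quoted so that the shell passes it to awk unchanged). By default awk splits fields on runs of whitespace, which includes both spaces and tabs, so if no field contains a space and no field is empty, there is no need to state the field separator explicitly. Here it must be stated, because the Bed, Wake, and Condition fields contain spaces and some cells are empty.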
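To see why the separator matters for this data, here is a minimal sketch using the 9/7/2014 row. It assumes the empty Condition and nap cells are exported as empty tab-separated fields; the variable names are only for illustration.

```shell
# One sample row: Bed, Wake, DoW, Condition (empty), nap (empty), Rest
row=$(printf '9/7/2014 2:00:00\t9/7/2014 9:45:00\t6\t\t\t7.75')

# Default splitting: any run of spaces or tabs separates fields,
# so date and time are split apart and $3 is the wake-up date.
default_f3=$(printf '%s\n' "$row" | awk '{ print $3 }')

# Tab splitting: each spreadsheet cell is one field, empty cells included.
tab_f3=$(printf '%s\n' "$row" | awk -F'\t' '{ print $3 }')
tab_nf=$(printf '%s\n' "$row" | awk -F'\t' '{ print NF }')

echo "default \$3: $default_f3"
echo "tab     \$3: $tab_f3"
echo "tab     NF: $tab_nf"
```

With the default separator, $3 is the wake-up date rather than DoW; with -F'\t', $3 is the DoW value and the row has six fields.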

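$N, where N is a positive integer, denotes the Nth field. $3 == "5" checks whether the 3rd field has the value 5, while $3 == "6" checks whether it has the value 6; || is logical OR. Because no action follows the condition, awk applies its default action and prints each matching line unchanged.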
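The same condition can be written in other ways. The sketch below uses three rows copied from the September data (DoW 4, 5, and 6), assuming empty cells are kept as empty tab-separated fields. It compares the explicit test with an anchored regular expression and also shows a pattern followed by an action. The temporary file and variable names are hypothetical.

```shell
# Three rows copied from the September data (DoW 4, 5, and 6)
demo=$(mktemp)
printf '9/11/2014 23:15:00\t9/12/2014 6:20:00\t4\t\t\t7.08\n'  > "$demo"
printf '9/13/2014 1:30:00\t9/13/2014 11:00:00\t5\t\t\t9.50\n' >> "$demo"
printf '9/14/2014 1:30:00\t9/14/2014 11:00:00\t6\t\t\t9.50\n' >> "$demo"

# Pattern only: awk prints each matching line unchanged
explicit=$(awk -F'\t' '$3 == "5" || $3 == "6"' "$demo")

# Same test as a regular expression anchored to the whole field
regex=$(awk -F'\t' '$3 ~ /^[56]$/' "$demo")

# Pattern plus action: print only the DoW and Rest columns
cols=$(awk -F'\t' '$3 == "5" || $3 == "6" { print $3, $6 }' "$demo")

echo "$cols"
rm -f "$demo"
```

I would avoid a numeric shortcut such as $3 >= 5 without also skipping the header row, since the header text "DoW" is compared as a string and would match.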

The output of the one-liner above is:

9/6/2014 2:00:00 9/6/2014 9:45:00 5 1.5 9.25
9/7/2014 2:00:00 9/7/2014 9:45:00 6 7.75
9/13/2014 1:30:00 9/13/2014 11:00:00 5 9.50
9/14/2014 1:30:00 9/14/2014 11:00:00 6 9.50
9/19/2014 23:00:00 9/20/2014 9:30:00 5 10.50
9/21/2014 0:30:00 9/21/2014 10:00:00 6 9.50
9/27/2014 1:30:00 9/27/2014 10:00:00 5 1 9.50
9/28/2014 0:30:00 9/28/2014 10:00:00 6 9.50

As you can see, only rows corresponding to DoW 5 or 6 (Saturday or Sunday) are printed. I can now copy and paste this data into a new sheet in the existing spreadsheet and calculate the average hours of sleep for the weekends. I think that sometimes quick text parsing with Linux/Unix text utilities is much faster than trying to write your own spreadsheet macro or create a multiply-nested spreadsheet formula.
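The copy-and-paste step can also be skipped by letting awk compute the average itself. This is a minimal sketch, not the method used above. It rebuilds a small TSV file containing the header, the eight weekend rows, and two weekday rows from the data, and it assumes the 1.5 and 1 values belong to the nap column, so that Rest is always the 6th field. The file and variable names are hypothetical.

```shell
data=$(mktemp)

# printf reuses the format for every group of six arguments (one row each)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  'Bed' 'Wake' 'DoW' 'Condition' 'nap (hrs)' 'Rest (hrs)' \
  '9/5/2014 2:00:00'   '9/5/2014 11:30:00'  4 'pretty hung over today' '' 9.50 \
  '9/6/2014 2:00:00'   '9/6/2014 9:45:00'   5 '' 1.5 9.25 \
  '9/7/2014 2:00:00'   '9/7/2014 9:45:00'   6 '' ''  7.75 \
  '9/8/2014 3:00:00'   '9/8/2014 10:40:00'  0 'Chuseok day' '' 7.67 \
  '9/13/2014 1:30:00'  '9/13/2014 11:00:00' 5 '' ''  9.50 \
  '9/14/2014 1:30:00'  '9/14/2014 11:00:00' 6 '' ''  9.50 \
  '9/19/2014 23:00:00' '9/20/2014 9:30:00'  5 '' ''  10.50 \
  '9/21/2014 0:30:00'  '9/21/2014 10:00:00' 6 '' ''  9.50 \
  '9/27/2014 1:30:00'  '9/27/2014 10:00:00' 5 '' 1   9.50 \
  '9/28/2014 0:30:00'  '9/28/2014 10:00:00' 6 '' ''  9.50 \
  > "$data"

avg=$(awk -F'\t' '
  $3 == "5" || $3 == "6" { sum += $6; n++ }    # accumulate weekend Rest hours
  END { if (n > 0) printf "%.3f\n", sum / n }  # skip division when nothing matched
' "$data")

echo "weekend average: $avg hours"
rm -f "$data"
```

For the eight weekend rows the Rest values sum to 75.00 hours, so this prints an average of 9.375. Running the same awk program against sept2014.txt should give the same figure, since those are the only rows with DoW 5 or 6.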
