Saturday, April 30, 2016

Returning the last or second-to-last text field with awk

If you pipe '/'-separated text to awk -F "/" '{print $5}'
it will return the 5th field of each line, using / as the field separator character.
But if the input has a variable number of fields, '{print $5}' might not return the last one; imagine a line with 6 or 7 fields.
One way to deal with this is to use the awk variable NF (number of fields): since $n refers to field number n, $NF refers to the last field. The relevant man page content is listed below:

The variable NF is set to the total number of fields in the input record.

To return the last field in a string with field separator '/':

awk -F "/" '{print $NF}'
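
As a quick check of the behaviour described above, here is a minimal sketch that wraps the command in a shell function. The name last_field and the sample paths are made up for illustration; only awk -F and $NF come from the text.

```shell
# last_field is a hypothetical helper name, not part of awk.
last_field() {
    # -F "/" sets the field separator; $NF is the field whose index is NF,
    # that is, the last field of the line.
    printf '%s\n' "$1" | awk -F "/" '{print $NF}'
}

last_field "/usr/local/share/doc"   # prints: doc
last_field "a/b/c/d/e/f/g"          # prints: g
```

The same command works no matter how many fields a line has, which is the point of using $NF instead of a fixed index such as $5.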
A coworker wanted to return the name of the last sub-directory in some path. He was using a command similar to ls -d */ to list the sub-directories.
For example, here are the directories in my $HOME folder on my work laptop:

[archjun@pinkS310 ~]$ ls -d */
bin/        Downloads/         kolla/       'SpiderOak Hive/'  'VirtualBox VMs/'
Desktop/    Dropbox/           MyMachines/  SpiderOak_Hive/
Documents/  Images/            ot353/       tmp/
dotfiles/   jun-vagrantfiles/  playground/  txt2regex/
 

Two of these directory names contain spaces, and ls shows them in single quotes.

Note that each directory is followed by a trailing '/' character. Thus in this case using awk -F "/" '{print $NF}' would print only empty lines, because the field after the final '/' is empty:

[archjun@pinkS310 ~]$ ls -d */ | awk -F "/" '{print $NF}'


However, it is possible to return the next-to-last field by specifying the awk variable $(NF-1):

[archjun@pinkS310 ~]$ ls -d */ | awk -F "/" '{print $(NF-1)}'
bin
Desktop
Documents
dotfiles
Downloads
Dropbox
Images
jun-vagrantfiles
kolla
MyMachines
ot353
playground
SpiderOak Hive
SpiderOak_Hive
tmp
txt2regex
VirtualBox VMs
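
If the input might or might not carry a trailing '/', the two cases can be combined in one awk program. This is only a sketch: last_component is a hypothetical helper name, and it handles a single trailing slash, not several in a row.

```shell
# last_component is a hypothetical helper name used for illustration.
last_component() {
    printf '%s\n' "$1" | awk -F "/" '{
        # If the record ends in "/", the last field is empty, so step back one.
        if ($NF == "" && NF > 1) print $(NF-1); else print $NF
    }'
}

last_component "SpiderOak Hive/"   # prints: SpiderOak Hive
last_component "/usr/bin/sort"     # prints: sort
```

Because awk reads each line as one record and only splits on "/", the space in "SpiderOak Hive" is left alone.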
From Stack Overflow, I also learned of a coreutils utility named basename, which "strips directory and suffix from filenames".
Examples from the basename man page:

       basename /usr/bin/sort
              -> "sort"

       basename include/stdio.h .h
              -> "stdio"

       basename -s .h include/stdio.h
              -> "stdio"

       basename -a any/str1 any/str2
              -> "str1" followed by "str2"

The -a or --multiple option flag supports multiple arguments, but it is only available in newer versions of GNU coreutils (-a and -s were added in 8.16, released in 2012).
I found that on older machines running CentOS 6.X, the version of coreutils was 8.4 (2010). That version doesn't support the -a and -s options shown above.
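
On machines where basename lacks -a, a plain loop over the arguments should give the same result, since the one-operand form of basename is the traditional one. A minimal sketch, with basenames as a hypothetical helper name:

```shell
# basenames is a hypothetical helper: print the basename of every argument,
# one per line, without relying on the -a option.
basenames() {
    for p in "$@"; do
        basename "$p"
    done
}

basenames any/str1 any/str2
# prints:
# str1
# str2
```

Quoting "$@" and "$p" keeps each argument intact, so a path containing spaces is still handed to basename as a single operand.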

The following invocation of basename is almost equivalent to ls -d */ | awk -F "/" '{print $(NF-1)}' with one big catch:

[archjun@pinkS310 ~]$ basename -a $(ls -d */)
bin
Desktop
Documents
dotfiles
Downloads
Dropbox
Images
jun-vagrantfiles
kolla
MyMachines
ot353
playground
SpiderOak
Hive
SpiderOak_Hive
tmp
txt2regex
VirtualBox
VMs

The big drawback of this invocation is that directory names containing spaces are split on whitespace. This is not basename's doing: the shell word-splits the unquoted output of $(ls -d */) before basename ever sees it.
"SpiderOak Hive" and "VirtualBox VMs" each end up on two separate lines above, although each should be treated as a single name including the space.
Even double-quoting the command substitution doesn't help: "$(ls -d */)" prevents word splitting, but then the entire listing is passed to basename as one argument.
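
The effect of quoting can be seen by counting the arguments the shell actually passes. This is a small illustration, assuming bash or another POSIX shell with the default IFS; count_args and the listing variable are made up for the demo.

```shell
# count_args is a hypothetical helper that prints how many arguments it got.
count_args() { echo "$#"; }

# Two names separated by a newline, as ls prints them when not on a terminal.
listing='SpiderOak Hive/
VirtualBox VMs/'

count_args $listing      # unquoted: split on spaces and newlines, prints 4
count_args "$listing"    # quoted: one argument with an embedded newline, prints 1
```

Neither form yields the two arguments that are actually wanted, which is why quoting alone cannot fix the command above.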
If you look above at the interactive shell output of ls -d */ you will notice that directories with spaces in their names are single-quoted. But what happens when you store the output of ls -d */ into a variable?

[archjun@pinkS310 ~]$ dirsvar=$(ls -d */)
[archjun@pinkS310 ~]$ echo $dirsvar
bin/ Desktop/ Documents/ dotfiles/ Downloads/ Dropbox/ Images/ jun-vagrantfiles/ kolla/ MyMachines/ ot353/ playground/ SpiderOak Hive/ SpiderOak_Hive/ tmp/ txt2regex/ VirtualBox VMs/
[archjun@pinkS310 ~]$ for i in $dirsvar; do echo $i; done
bin/
Desktop/
Documents/
dotfiles/
Downloads/
Dropbox/
Images/
jun-vagrantfiles/
kolla/
MyMachines/
ot353/
playground/
SpiderOak
Hive/
SpiderOak_Hive/
tmp/
txt2regex/
VirtualBox
VMs/

When the output of ls -d */ is stored in a variable, the single-quoting on SpiderOak Hive and VirtualBox VMs disappears, because ls only adds those quotes when it is writing to a terminal, not to a pipe or a command substitution. The unquoted $dirsvar is then word-split on whitespace by the shell. I will try to look for a workaround, but for the time being, it is safer to use awk for the use case of returning the last directory in a path.
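
One candidate workaround, offered as a sketch rather than something tested on the machines above, is to skip ls entirely and let the shell expand the */ glob itself. Each match then arrives as a single word, spaces included. The helper name list_subdirs and the demo directory names are assumptions made for illustration.

```shell
# list_subdirs is a hypothetical helper: print the name of each
# sub-directory of "$1", one per line, without the trailing slash.
list_subdirs() (
    cd "$1" || exit 1
    for d in */; do
        [ -d "$d" ] || continue    # skip the literal "*/" when nothing matches
        printf '%s\n' "${d%/}"     # ${d%/} strips one trailing slash
    done
)

# Throwaway demo directory so the example does not depend on $HOME.
demo=$(mktemp -d)
mkdir "$demo/bin" "$demo/SpiderOak Hive" "$demo/VirtualBox VMs"
list_subdirs "$demo"
rm -r "$demo"
```

The function body runs in a subshell, so the cd does not affect the calling shell. On systems whose basename supports -a, running basename -a */ in the target directory should behave similarly, because the shell passes each glob match to basename as one argument.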