column extracts columns from standard input or named files. By default, input columns are separated by whitespace, but it is also possible to specify columns by character counts, or by specific punctuation delimiters.
As a simple example, the invocation
column 1 3 5 inputfile
would print columns 1, 3, and 5 of the file named inputfile. In other words, it would do the same thing as the simple awk(1) script
awk '{print $1, $3, $5}' inputfile
If no filenames are given, column reads from standard input. Also, a filename of ``-'' indicates standard input.
Because multiple column numbers are entered as separate arguments, there is an ambiguity if an input filename has a name which looks like a number. To resolve the ambiguity, use an alternative pathname for the file which does not begin with a digit. The simplest way to do so is to precede a numeric filename with ``./''.
column can work with several definitions of what a ``column'' is. Input columns separated by whitespace or other delimiter characters are referred to as ``floating'' columns. Input columns specified by character counts are referred to as ``fixed'' columns. Furthermore, floating columns can be delimited in two different ways. Sometimes, particularly when columns are delimited by whitespace, multiple adjacent instances of the delimiter character(s) should count for just one column separation. Other times, when columns are delimited by punctuation characters such as commas, colons, or vertical bars, multiple adjacent instances of the delimiter character should imply the presence of one or more empty columns. (column can handle both of these situations.)
column uses dynamically-allocated memory for input lines and column descriptors, and can therefore be used on input lines with thousands of characters and hundreds of columns (or more).
By default, input columns are floating and are separated by whitespace, that is, by one or more spaces or tabs. In general, floating input columns are defined by two kinds of delimiters: ``exact'' delimiters and ``any'' delimiters. Multiple adjacent instances of an ``exact'' delimiter indicate multiple (empty) columns, while multiple adjacent instances of the ``any'' delimiters indicate a single column division. The default, whitespace-separating behavior is therefore achieved by using an ``any'' delimiter set consisting of the space and tab characters, and an ``exact'' delimiter set which is empty. To select a specific ``exact'' character (or characters), use -e. To select a different set of ``any'' characters, use -a.
Any leading instances of the ``any'' characters on an input line are ignored; they do not indicate the presence of an initial empty column. In fact, there are never any empty columns when only ``any'' characters are used; the only way to achieve empty floating columns is by using leading, trailing, or adjacent ``exact'' characters.
``Exact'' and ``any'' characters may be used simultaneously: for example, using -e to select a comma as the ``exact'' column separator, while leaving the ``any'' delimiter set as the default whitespace, would mean that whitespace at the beginning or end of a comma-separated column would be stripped, and would not appear in the column contents. (Stated another way, though comma is the ``real'' column separator, whitespace surrounding commas is not significant and is not taken to be part of either column. Stated yet another way, input columns would be assumed to be separated by exactly one comma, and zero or more spaces or tabs.) To disable the default ``any'' delimiter characters (that is, to arrange that all input whitespace does appear explicitly in input columns), use -a with an empty argument:
-a ''
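The interaction of ``exact'' and ``any'' delimiters can be sketched in Python. This is only an illustration of the rules as described above, not the program's actual implementation; the function name split_floating and its parameters are hypothetical. It assumes that a run of ``any'' characters with no ``exact'' character in it still divides columns, and that an empty line yields no columns when no ``exact'' set is in use.

```python
import re

def split_floating(line, exact="", any_chars=" \t"):
    """Hypothetical helper: split one line into floating columns.

    exact     -- characters where every single instance separates columns
    any_chars -- characters where a whole run counts as one separation
    """
    line = line.rstrip("\n")
    if any_chars:
        # Leading and trailing "any" characters never create empty columns.
        line = line.strip(any_chars)
    if line == "" and not exact:
        return []  # assumption: an empty line has no columns
    any_cls = "[" + re.escape(any_chars) + "]" if any_chars else None
    exact_cls = "[" + re.escape(exact) + "]" if exact else None
    alternatives = []
    if exact_cls and any_cls:
        # One exact delimiter, with optional "any" characters around it.
        alternatives.append(f"{any_cls}*{exact_cls}{any_cls}*")
    elif exact_cls:
        alternatives.append(exact_cls)
    if any_cls:
        # A run of "any" characters on its own is a single separation.
        alternatives.append(f"{any_cls}+")
    if not alternatives:
        return [line]  # no delimiters at all: the line is one column
    return re.split("|".join(alternatives), line)

print(split_floating("  a   b\tc "))                 # default whitespace rules
print(split_floating("a , b,,c", exact=","))         # like -e ,
print(split_floating("root:x:0:0::/root:/bin/sh",
                     exact=":", any_chars=""))       # like -e : -a ''
```

With a comma as the ``exact'' delimiter, the two adjacent commas in the second call produce an empty column, while the spaces around the first comma are discarded.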
It is also possible to specify, with the -q option, that the input consists of floating columns where some column data may contain whitespace or delimiter characters, protected by quotes. (See the examples below.)
Fixed input columns are defined using the -fi option. One -fi option describes one input column; in general, many -fi options will be used to describe the complete input format. The -fi options do not select input columns for printing; they only describe the input columns. The columns to be selected and printed must be requested using numeric arguments, just as for floating columns.
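As a rough Python sketch of what a set of -fi descriptions means, each m-n range can be read as a 1-based, inclusive slice of the input line. The function name split_fixed is hypothetical, and the sketch assumes that whitespace inside a fixed column is kept as-is; the program's real behavior in that respect is not stated here.

```python
def split_fixed(line, ranges):
    """Hypothetical helper: cut a line into fixed columns.

    ranges -- list of (first, last) character positions, 1-based and
              inclusive, one pair per -fi option.
    """
    line = line.rstrip("\n")
    # Python slices are 0-based and exclude the end, hence first - 1.
    # A range that runs past the end of the line simply yields less text.
    return [line[first - 1:last] for first, last in ranges]

# Equivalent in spirit to: -fi 1-5 -fi 6-10 -fi 11-20 -fi 21-50
layout = [(1, 5), (6, 10), (11, 20), (21, 50)]
columns = split_fixed("AAAAABBBBBCCCCCCCCCCDDDD", layout)
print(columns[1], columns[3])   # the second and fourth columns
```

The ranges only describe the input; choosing which columns to print is a separate step, as it is for floating columns.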
When selecting output columns, several notations may be used. The most basic output column selectors are individual numeric arguments, as in
column 1 3 5 file
Column numbers can also be separated by commas:
column 1,3,5 file
The notation m-n specifies a range of columns:
column 2-4 file
These notations may be combined reasonably arbitrarily:
column 1,3 5-7 9,11-13 file
Columns can also be counted from the right edge (that is, from the end of the line). The dollar sign $ is a marker indicating the last column, and the notation $-n indicates columns counted from the right. So $ indicates the last column, $-1 indicates the next-to-last column, $-2 indicates the third-to-last column, etc.
Right-based columns are counted on a line-by-line basis, so the invocation
column '$-1' $
on a file containing the lines
a b c
d e f g
h i
would result in the output
b c
f g
h i
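The selector notations (single numbers, comma lists, m-n ranges, and $ or $-n counted from the right) can be sketched in Python as follows. The names resolve and select are hypothetical and this is not the program's own code. The sketch assumes whitespace-separated input, output columns joined by a single space, and that column numbers falling outside a line are silently skipped; it also leaves $ unsupported inside an m-n range.

```python
def resolve(selectors, ncols):
    """Hypothetical helper: expand selector arguments into 1-based
    column numbers for a line that has ncols columns."""
    numbers = []
    for arg in selectors:
        for part in arg.split(","):
            if part.startswith("$"):
                # "$" is the last column, "$-n" is n columns before it.
                offset = int(part[2:]) if part != "$" else 0
                numbers.append(ncols - offset)
            elif "-" in part:
                low, high = part.split("-")
                numbers.extend(range(int(low), int(high) + 1))
            else:
                numbers.append(int(part))
    return numbers

def select(line, selectors):
    """Apply the selectors to one whitespace-separated line."""
    cols = line.split()
    # Right-based selectors are resolved per line, using its own width.
    wanted = resolve(selectors, len(cols))
    return " ".join(cols[i - 1] for i in wanted if 1 <= i <= len(cols))

for text in ["a b c", "d e f g", "h i"]:
    print(select(text, ["$-1", "$"]))
```

Because the column count is taken from each line separately, lines of different widths each contribute their own last two columns.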
Rather than specifying columns by number, it is possible to specify them by name, if the input file is self-describing by having as its first line a header denoting the column names. The -n option selects an output column by name; multiple -n options are used to select multiple columns. For example, given the input file
a b c d
1 2 3 4
5 6 7 8
9 10 11 12
the invocation
column -n b -n d
would select
b d
2 4
6 8
10 12
When the first line is being used as a header, its columns are determined using the same rules as for the remaining ``data'' lines. The header line is processed--columns selected from it and printed--just as for the remaining ``data'' lines, so the first line ends up being a self-describing header for the output, as shown in the example above.
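Selection by name amounts to looking the names up in the first line and then treating every line, the header included, the same way. A minimal Python sketch, assuming whitespace-separated input and single-space output (select_by_name is a hypothetical name, not part of the program):

```python
def select_by_name(lines, names):
    """Hypothetical helper: pick columns by header name.

    lines -- all input lines, the first one being the header
    names -- the names that would follow each -n option
    """
    header = lines[0].split()
    # Translate each requested name into a 0-based column index.
    indexes = [header.index(name) for name in names]
    result = []
    for line in lines:  # the header line is processed like any data line
        cols = line.split()
        result.append(" ".join(cols[i] for i in indexes if i < len(cols)))
    return result

table = ["a b c d", "1 2 3 4", "5 6 7 8", "9 10 11 12"]
for row in select_by_name(table, ["b", "d"]):
    print(row)
```

Since the header passes through the same selection, the first output line names the columns that follow it.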
When columns are requested by name using -n, and when simultaneously a comment character is requested using -c, the first line is taken as the column definition line even if it is commented. (Furthermore, if the first, column-definition line is commented, any whitespace between the comment character and the first column name is ignored. That is, if the comment character is #, the first lines ``a b c'', ``#a b c'', and ``# a b c'' would all be treated identically, and would describe a file with three columns named ``a'', ``b'', and ``c''.)
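The treatment of a commented header line can be illustrated with a short Python sketch. The function header_names is hypothetical, and the sketch assumes a single comment character and whitespace-separated names.

```python
def header_names(first_line, comment="#"):
    """Hypothetical helper: read column names from the first line,
    whether or not that line is commented."""
    text = first_line.strip()
    if comment and text.startswith(comment):
        # Drop the comment character; split() below then ignores any
        # whitespace between it and the first name.
        text = text[len(comment):]
    return text.split()

for line in ["a b c", "#a b c", "# a b c"]:
    print(header_names(line))
```

All three spellings of the header produce the same three names.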
For convenience when requesting many columns by name, the -N option requests that all names appearing on the command line be treated as column names (as if with -n), at the cost of constraining the input to be read from the standard input, rather than a named file.
It is possible to define output columns which should appear at fixed character positions, or which are delimited by specific strings. These output column specifications are made by appending additional information to the selectors which request the columns. For any number m on the invocation command line which requests that column m be selected and printed, the following notations may be used:
m:n            print column m beginning at output character position n
m:n1,n2        print column m in output character positions n1 through n2
m:str1,str2    print column m prefixed with str1 and suffixed with str2
To describe a number of similar output columns, the above notations may be combined with the m-n column selection notation. Furthermore, it is also possible to specify a group of disjoint output columns, separated by commas, to which a single output column description notation is attached. See the examples below.
Finally, it is possible to generate arbitrarily-formatted output lines, using the -fmt fmtstr option. This option dispenses with all the other output column specification mechanisms (and, for that matter, it provides its own input column selection mechanism as well). The fmtstr is a skeleton template describing each output line, and in which the notation $n is replaced by the contents of column n. See the example below.
Select columns 1, 3, and 5, with columns separated by arbitrary whitespace:
column 1 3 5
Select columns 1, 3, and 5, with input columns separated by commas and optional whitespace:
column -e , 1 3 5
Select columns 1, 3, and 5, with input columns separated by commas and optional whitespace, but with quotes protecting whitespace or commas which should appear in the columns themselves:
column -q -e , 1 3 5
(This is essentially ``CSV'' format.)
Select columns 1, 3, and 5, with input columns separated by commas and optional whitespace, with quotes protecting whitespace or commas which should appear in the columns themselves, and with the output columns protected by quotes if necessary:
column -q -qo -e , 1 3 5
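The following Python sketch shows one way quote-protected columns could be split and re-quoted. It is an illustration under assumptions, not the program's implementation: the names split_quoted and quote_if_needed are hypothetical, the quote character is assumed to be the double quote, and a column is assumed to need output quoting when it contains a delimiter character. Like the simple quoting the program supports, it makes no attempt to handle doubled quotes inside a column.

```python
def split_quoted(line, exact=",", any_chars=" \t", quote='"'):
    """Hypothetical helper: split a line into columns, letting quotes
    protect delimiter characters inside a column."""
    cols, cur = [], []
    started = False      # current column has data or an opening quote
    gap = False          # unquoted "any" characters seen after data
    after_exact = False  # the last separator was an exact delimiter
    in_quote = False
    for ch in line.rstrip("\n"):
        if in_quote:
            if ch == quote:
                in_quote = False
            else:
                cur.append(ch)       # protected: delimiters are plain data
        elif ch in exact:
            cols.append("".join(cur))
            cur, started, gap, after_exact = [], False, False, True
        elif ch in any_chars:
            gap = started            # leading "any" characters are ignored
        else:
            if gap:                  # "any" run between data divides columns
                cols.append("".join(cur))
                cur = []
            gap, started, after_exact = False, True, False
            if ch == quote:
                in_quote = True      # the quote itself is not column data
            else:
                cur.append(ch)
    if started or after_exact:
        cols.append("".join(cur))
    return cols

def quote_if_needed(col, exact=",", any_chars=" \t", quote='"'):
    """Assumption: quote on output only when the data holds a delimiter."""
    if any(c in exact or c in any_chars for c in col):
        return quote + col + quote
    return col

cols = split_quoted('"Smith, John" , 42, "New York"')
print(cols)
print(",".join(quote_if_needed(c) for c in cols))
```

The comma inside the first quoted column stays part of the data, and only the columns that contain a delimiter are quoted again on output.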
Select columns 1, 3, and 5, with input columns separated by colons and without stripping any whitespace:
column -e : -a '' 1 3 5
(This would be useful for parsing UNIX passwd files or related files.)
Select columns 1, 3, and 5, with input columns separated by tabs and without stripping any whitespace:
column -e '	' -a '' 1 3 5
(The character between the single quotes following the -e option is a single tab. This is essentially ``TDF'' format.)
Define input columns running from character positions 1-5, 6-10, 11-20, and 21-50, and print the second and fourth columns:
column -fi 1-5 -fi 6-10 -fi 11-20 -fi 21-50 2 4
Print the first and last columns:
column 1 $
Print the first two and last two columns:
column 1 2 '$-1' $
Print columns 1, 5, and 10 through 20:
column 1 5 10-20
Print all but columns 1, 5, and 10 through 20:
column -v 1 5 10-20
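Inverting a selection is the complement of the requested column numbers with respect to the columns actually present on a line. A small Python sketch (invert is a hypothetical name; it assumes the selectors have already been expanded into plain numbers):

```python
def invert(selected, ncols):
    """Hypothetical helper: all 1-based column numbers of a line with
    ncols columns, except the selected ones, in input order."""
    excluded = set(selected)
    return [i for i in range(1, ncols + 1) if i not in excluded]

# Columns 1, 5, and 10 through 20 excluded from a 22-column line.
print(invert([1, 5] + list(range(10, 21)), 22))
```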
Print columns 1 and 3 (whitespace delimited) from file a, followed by 2 and 4 from file b:
column -m 1 3 a 2 4 b
Print columns 1 and 3 from file a, with column 2 from file b interspersed (that is, print column 1 from file a, followed by column 2 from file b, followed by column 3 from file a again):
column -m 1 a 2 b 3 a
Print columns 1 and 3 from standard input, with column 2 from file b interspersed:
column -m 1 - 2 b 3 -
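The line-by-line merging done under -m can be pictured with the following Python sketch. It works on lists of lines already held in memory rather than on files, and the name merge is hypothetical. It assumes whitespace-separated input, single-space output, and that merging stops at the end of the shortest input; how the program itself treats inputs of unequal length is not stated here.

```python
def merge(groups):
    """Hypothetical helper: merge selected columns from several inputs.

    groups -- list of (column_numbers, lines) pairs in output order;
              the same list of lines may appear in more than one pair.
    """
    merged = []
    # Take the n-th line of every input together.
    for rows in zip(*(lines for _, lines in groups)):
        picked = []
        for (numbers, _), row in zip(groups, rows):
            cols = row.split()
            picked.extend(cols[n - 1] for n in numbers if n <= len(cols))
        merged.append(" ".join(picked))
    return merged

a = ["a1 a2 a3", "A1 A2 A3"]
b = ["b1 b2 b3 b4", "B1 B2 B3 B4"]
print(merge([([1, 3], a), ([2, 4], b)]))      # like: -m 1 3 a 2 4 b
print(merge([([1], a), ([2], b), ([3], a)]))  # like: -m 1 a 2 b 3 a
```

In the second call the same input is named twice, so each of its lines supplies columns at two different places in the output line.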
Select column 1 and print it beginning at output position 1, and column 3 beginning at output position 10:
column 1:1 3:10
Select input columns 1 and 3, printing them in output columns in positions 1-9 and 11-20:
column 1:1,9 3:11,20
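Placing columns at fixed output positions can be sketched in Python by padding the output line with spaces up to each starting position. The name place is hypothetical. The sketch assumes that data longer than an n1,n2 range is truncated to fit, and that a column simply follows the previous one if the line has already passed its starting position; the program may behave differently in both cases.

```python
def place(items):
    """Hypothetical helper: build one output line.

    items -- list of (text, start, end) tuples; start and end are
             1-based output positions, and end may be None.
    """
    line = ""
    for text, start, end in items:
        # Pad with spaces so the next character lands at position start.
        line = line.ljust(start - 1)
        if end is not None:
            text = text[: end - start + 1]  # assumption: truncate to fit
        line += text
    return line

print(place([("abc", 1, None), ("xyz", 10, None)]))   # like 1:1 3:10
print(place([("alpha", 1, 9), ("beta", 11, 20)]))     # like 1:1,9 3:11,20
```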
Select input columns 1 and 2, suffixing the first output column with a comma and a space and the second one with a period:
column '1:,, ' 2:,.
Select input columns 1 and 2, enclosing the first output column in parentheses (that is, prefixing it with '(' and suffixing it with ')') and enclosing the second one in square brackets:
column '1:(,)' '2:[,]'
Select input columns 1, 3, 5, 7, and 9, suffixing all but the last with a comma and a space:
column '1,3,5,7:,, ' 9
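One plausible way to read the text after the colon in an output column argument is sketched below in Python. This is a guess at the parsing rules based on the examples, not the program's actual parser, and the name parse_arg is hypothetical. It assumes the text after the colon is split at its first comma, and is treated as output positions when both halves are numeric and as a prefix and suffix otherwise.

```python
def parse_arg(arg):
    """Hypothetical helper: split an argument such as '1,3:,, ' into
    its column selector and its output description (or None)."""
    selector, colon, spec = arg.partition(":")
    if not colon:
        return selector, None            # plain selector, e.g. '9'
    # Split at the FIRST comma only, so a suffix may contain commas.
    first, _, rest = spec.partition(",")
    if first.isdigit() and (rest == "" or rest.isdigit()):
        return selector, ("position", int(first), int(rest) if rest else None)
    return selector, ("affix", first, rest)

for arg in ["1:1", "3:11,20", "1,3,5,7:,, ", "1:(,)", "9"]:
    print(repr(arg), "->", parse_arg(arg))
```

Under these assumptions a prefix can never begin with a comma, and purely numeric strings are always taken as positions rather than as a prefix or suffix.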
Select columns named ``a'' and ``b'', under the assumption that the first line in the file is a header containing the column names:
column -n a -n b
Select columns named ``a'', ``b'' and ``c'', a bit more conveniently, but with the additional proviso that the input must appear on stdin:
column -N a b c
Print a bunch of lines like ``Now is the time for all good men to come to the aid of their party'', with key words taken from the input (i.e. Mad Libs style):
column -fmt 'Now is the time for all $1 $2 to come to the aid of their $3.'
With the input
good men party
little babies playpen
true hackers codebase
tall giraffes savannah
this would print
Now is the time for all good men to come to the aid of their party.
Now is the time for all little babies to come to the aid of their playpen.
Now is the time for all true hackers to come to the aid of their codebase.
Now is the time for all tall giraffes to come to the aid of their savannah.
(Note that single quotes around the fmt are typically required in this situation, to protect the $'s in fmt from interpretation by the shell.)
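The substitution performed by a format string can be sketched in Python with a regular expression that replaces each $n by column n. The name format_line is hypothetical; the sketch assumes whitespace-separated input and that a reference to a column the line does not have expands to nothing.

```python
import re

def format_line(fmt, line):
    """Hypothetical helper: fill a $n template from one input line."""
    cols = line.split()

    def column_text(match):
        n = int(match.group(1))
        # Assumption: a missing column is replaced by an empty string.
        return cols[n - 1] if 1 <= n <= len(cols) else ""

    return re.sub(r"\$(\d+)", column_text, fmt)

template = "Now is the time for all $1 $2 to come to the aid of their $3."
for text in ["good men party", "tall giraffes savannah"]:
    print(format_line(template, text))
```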
Under -m, any -e, -a, and -fi flags apply across all input files; there's no way to provide different column specification for different input files.
The fixed input column specification mechanism -fi m-n, the fixed output column specification mechanism m:n1,n2, and the output prefix/suffix mechanism m:str1,str2 are all pretty dreadfully cumbersome to use and don't really carry their own weight. (To be honest, I put these features in out of a misguided sense of completeness, and I hardly ever use them myself. For formatted output, -fmt fmtstr is much more convenient.)
There is no way to have a comma as an output column prefix. There is no way to have output column prefix or suffix strings which are numeric.
The input quoting mechanism (-q) works properly only for simple quotes strictly surrounding the column data; it does not handle internal quotes (e.g. doubled, as in CSV files) or shell-style partial quoting and implicit concatenation (e.g. something like "a b"c).
The m-n column-selection notation does not work if either m or n involves a $.
I wrote this program because (a) I didn't have access to awk(1) at the time (I was stranded in a godforsaken MS-DOS environment), and (b) I was working with files with lines hundreds of columns and thousands of characters long, so avoiding built-in limits was a must.
This documentation corresponds to version 2.6 of the program.
See
http://www.eskimo.com/~scs/src/#column
for possible updates.
Steve Summit, scs@eskimo.com