COLUMN(1)

NAME

column - extract columns from file or stream

SYNOPSIS

column [ options ] columns [ files ]

DESCRIPTION

column extracts columns from standard input or named files. By default, input columns are separated by whitespace, but it is also possible to specify columns by character counts, or by specific punctuation delimiters.

As a simple example, the invocation

	column 1 3 5 inputfile

would print columns 1, 3, and 5 of the file named inputfile. In other words, apart from separating the output columns with tabs rather than spaces, it would do the same thing as the simple awk(1) script

	awk '{print $1, $3, $5}' inputfile

If no filenames are given, column reads from standard input. Also, a filename of ``-'' indicates standard input.

Because multiple column numbers are entered as separate arguments, there is an ambiguity if an input filename has a name which looks like a number. To resolve the ambiguity, use an alternative pathname for the file which does not begin with a digit. The simplest way to do so is to precede a numeric filename with ``./''.

column can work with several definitions of what a ``column'' is. Input columns separated by whitespace or other delimiter characters are referred to as ``floating'' columns. Input columns specified by character counts are referred to as ``fixed'' columns. Furthermore, floating columns can be delimited in two different ways. Sometimes, particularly when columns are delimited by whitespace, multiple adjacent instances of the delimiter character(s) should count for just one column separation. Other times, when columns are delimited by punctuation characters such as commas, colons, or vertical bars, multiple adjacent instances of the delimiter character should imply the presence of one or more empty columns. (column can handle both of these situations.)

column uses dynamically-allocated memory for input lines and column descriptors, and can therefore be used on input lines with thousands of characters and hundreds of columns (or more).

OPTIONS

-a chrs: Specify delimiter characters which separate floating input columns. Any number of these characters may appear between columns; that is, a run of several of these characters does not indicate multiple columns. By default, column behaves as if the -a option had been used to select space and tab as column separator characters.
-c chr: Set input file comment character. Lines beginning with the comment character are passed through verbatim; column extraction on those lines is not performed.
-e chrs: Specify a delimiter character (or characters) which separates floating input columns exactly. Exactly one instance of such a character appears between each pair of columns; that is, adjacent delimiter characters indicate an empty column. The -e option is useful when working with files containing values separated by commas, colons, vertical bars, etc.
-fi m-n: Define a fixed input column running from character position m to character position n. (Note that -fi merely defines an input column; it does not select it for printing.)
-fmt fmtstr: Specify an output format fmtstr in which the notations $1, $2, $3 (etc.) are interpolated as columns 1, 2, 3, and so on. (See further description under OUTPUT COLUMN SPECIFICATION below.)
-m: Permit multiple interspersed files and columns: additional column selectors following the first input filename on the command line request a different set of columns to be selected from an upcoming filename. (See examples below.)
-n name: Select column by name (where input column names are described by the first line in the file).
-N: Select many columns by name--all names on the command line are treated as column names, as if requested with -n. The input must therefore appear on the standard input. (No files will be opened, since no filenames can be specified.)
-p: Preserve input column separators: each output column is followed by (and therefore separated from the next output column by) whatever set of delimiter characters followed it in the input. (By default, output columns are separated by tabs.)
-q: Look for quotes around floating input columns, and do not recognize whitespace or other delimiters between quotes.
-qo: Put quotes around output columns if necessary to protect column contents which might otherwise be interpreted as delimiters.
-v: Invert; print all columns except those explicitly selected.
-?,-h: Print a brief help message.

INPUT COLUMN SPECIFICATION

By default, input columns are floating and are separated by whitespace, that is, by one or more spaces or tabs. In general, floating input columns are defined by two kinds of delimiters: ``exact'' delimiters and ``any'' delimiters. Multiple adjacent instances of an ``exact'' delimiter indicate multiple (empty) columns, while multiple adjacent instances of the ``any'' delimiters indicate a single column division. The default, whitespace-separating behavior is therefore achieved by using an ``any'' delimiter set consisting of the space and tab characters, and an ``exact'' delimiter set which is empty. To select a specific ``exact'' character (or characters), use -e. To select a different set of ``any'' characters, use -a.

Any leading instances of the ``any'' characters on an input line are ignored; they do not indicate the presence of an initial empty column. In fact, there are never any empty columns when only ``any'' characters are used; the only way to achieve empty floating columns is by using leading, trailing, or adjacent ``exact'' characters.

``Exact'' and ``any'' characters may be used simultaneously: for example, using -e to select a comma as the ``exact'' column separator, while leaving the ``any'' delimiter set as the default whitespace, would mean that whitespace at the beginning or end of a comma-separated column would be stripped, and would not appear in the column contents. (Stated another way, though comma is the ``real'' column separator, whitespace surrounding commas is not significant and is not taken to be part of either column. Stated yet another way, input columns would be assumed to be separated by exactly one comma, and zero or more spaces or tabs.) To disable the default ``any'' delimiter characters (that is, to arrange that all input whitespace does appear explicitly in input columns), use -a with an empty argument:

	-a ''
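
The interaction between ``any'' and ``exact'' delimiters can be sketched in a few lines of Python. This is only an illustration of the rules described above, not column's actual source; the function name split_floating and its parameters are invented for the example, and the treatment of ``any'' characters in the middle of a column when ``exact'' delimiters are in use (they are kept) is an assumption.

```python
def split_floating(line, exact="", any_chars=" \t"):
    """Split one input line into floating columns.

    exact     -- delimiter characters in the sense of -e
    any_chars -- delimiter characters in the sense of -a
    """
    line = line.rstrip("\n")
    if not exact:
        # Only "any" delimiters: a run of them is one boundary, and
        # leading or trailing runs never create empty columns.
        columns, current = [], ""
        for ch in line:
            if ch in any_chars:
                if current:
                    columns.append(current)
                current = ""
            else:
                current += ch
        if current:
            columns.append(current)
        return columns
    # With "exact" delimiters: every occurrence is a boundary, so adjacent
    # ones yield empty columns; surrounding "any" characters are stripped.
    columns, current = [], ""
    for ch in line:
        if ch in exact:
            columns.append(current.strip(any_chars))
            current = ""
        else:
            current += ch
    columns.append(current.strip(any_chars))
    return columns


print(split_floating("  a   b\tc"))               # ['a', 'b', 'c']
print(split_floating("a, b,,d", exact=","))       # ['a', 'b', '', 'd']
print(split_floating(" a : b", ":", any_chars=""))  # [' a ', ' b']
```

The last call corresponds to combining -e : with -a '': whitespace is no longer stripped and shows up in the column contents.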

It is also possible to specify, with the -q option, that the input consists of floating columns where some column data may contain whitespace or delimiter characters, protected by quotes. (See the examples below.)

Fixed input columns are defined using the -fi option. One -fi option describes one input column; in general, many -fi options will be used to describe the complete input format. The -fi options do not select input columns for printing; they only describe the input columns. The columns to be selected and printed must be requested using numeric arguments, just as for floating columns.
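
As a rough model of what a set of -fi definitions means, the Python sketch below cuts a line at character positions. It assumes that positions are counted from 1 and that both ends of m-n are inclusive, which is what the 1-5, 6-10 example under EXAMPLES suggests; the names parse_fixed and cut_fixed are invented here, and whether column trims padding from fixed columns is not modelled.

```python
def parse_fixed(spec):
    """Turn an 'm-n' string into a (start, end) pair of 1-based positions."""
    start, end = spec.split("-")
    return int(start), int(end)


def cut_fixed(line, specs):
    """Return the text of every fixed column described by specs."""
    line = line.rstrip("\n")
    # Convert 1-based inclusive positions to a Python slice.
    return [line[m - 1:n] for m, n in map(parse_fixed, specs)]


# Defining the columns and selecting them are separate steps.
defined = cut_fixed("ABCDE12345xyz", ["1-5", "6-10"])
selected = [defined[i - 1] for i in (2,)]
print(selected)  # ['12345']
```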

COLUMN SELECTION

When selecting output columns, several notations may be used. The most basic output column selectors are individual numeric arguments, as in

	column 1 3 5 file

Column numbers can also be separated by commas:

	column 1,3,5 file

The notation m-n specifies a range of columns:

	column 2-4 file

These notations may be combined reasonably arbitrarily:

	column 1,3 5-7 9,11-13 file
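
To make the combined notation concrete, here is a small Python sketch that expands such arguments into a flat list of column numbers. It handles only plain numbers, commas, and m-n ranges; the function name expand_selectors is made up for the illustration and says nothing about how column itself parses its arguments.

```python
def expand_selectors(args):
    """Expand arguments such as ['1,3', '5-7'] into [1, 3, 5, 6, 7]."""
    columns = []
    for arg in args:
        for part in arg.split(","):
            if "-" in part:
                low, high = part.split("-")
                columns.extend(range(int(low), int(high) + 1))
            else:
                columns.append(int(part))
    return columns


print(expand_selectors(["1,3", "5-7", "9,11-13"]))
# [1, 3, 5, 6, 7, 9, 11, 12, 13]
```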

Columns can also be counted from the right edge (that is, from the end of the line). The dollar sign $ is a marker indicating the last column, and the notation $-n indicates columns counted from the right. So $ indicates the last column, $-1 indicates the next-to-last column, $-2 indicates the third-to-last column, etc.

Right-based columns are counted on a line-by-line basis, so the invocation

	column '$-1' $

on a file containing the lines

	a b c
	d e f g
	h i

would result in the output

	b c
	f g
	h i
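
The line-by-line resolution of right-based selectors can be sketched in Python as follows. The helper names resolve and select are invented for this example, and printing an empty column for an out-of-range request is an assumption rather than documented behavior.

```python
def resolve(selector, ncols):
    """Map '$', '$-n', or a plain number to a 1-based column number."""
    if selector == "$":
        return ncols
    if selector.startswith("$-"):
        return ncols - int(selector[2:])
    return int(selector)


def select(line, selectors):
    """Pick whitespace-delimited columns from one line, tab-separated."""
    fields = line.split()
    picked = []
    for sel in selectors:
        index = resolve(sel, len(fields))  # resolved afresh for each line
        # Out-of-range requests yield an empty column in this sketch.
        picked.append(fields[index - 1] if 1 <= index <= len(fields) else "")
    return "\t".join(picked)


lines = ["a b c", "d e f g", "h i"]
output = [select(line, ["$-1", "$"]) for line in lines]
print(output)  # ['b\tc', 'f\tg', 'h\ti']
```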

Rather than specifying columns by number, it is possible to specify them by name, if the input file is self-describing, that is, if its first line is a header giving the column names. The -n option selects an output column by name; multiple -n options are used to select multiple columns. For example, given an input file whose header line names columns including b and d, the invocation

	column -n b -n d

would select the columns named b and d.

When the first line is being used as a header, its columns are determined using the same rules as for the remaining ``data'' lines. The header line is also processed (columns are selected from it and printed) just as for the remaining ``data'' lines, so the first line of the output ends up being a self-describing header for the output.

When columns are requested by name using -n, and when simultaneously a comment character is requested using -c, the first line is taken as the column definition line even if it is commented. (Furthermore, if the first, column-definition line is commented, any whitespace between the comment character and the first column name is ignored. That is, if the comment character is #, the first lines ``a b c'', ``#a b c'', and ``# a b c'' would all be treated identically, and would describe a file with three columns named ``a'', ``b'', and ``c''.)
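
The header rule just described can be modelled with a short Python sketch. It assumes whitespace-delimited names and a single comment character; header_names and columns_by_name are illustrative names only, and an unknown column name simply raises ValueError here, which need not match what column does.

```python
def header_names(first_line, comment_char=None):
    """Return the column names described by the first line of a file."""
    text = first_line.rstrip("\n")
    if comment_char and text.startswith(comment_char):
        # Drop the comment character and any whitespace right after it.
        text = text[len(comment_char):].lstrip(" \t")
    return text.split()


def columns_by_name(first_line, wanted, comment_char=None):
    """Translate requested names into 1-based column numbers."""
    names = header_names(first_line, comment_char)
    return [names.index(name) + 1 for name in wanted]


for header in ("a b c", "#a b c", "# a b c"):
    print(header_names(header, "#"))  # ['a', 'b', 'c'] each time
```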

For convenience when requesting many columns by name, the -N option requests that all names appearing on the command line be treated as column names (as if with -n), at the cost of constraining the input to be read from the standard input, rather than a named file.

OUTPUT COLUMN SPECIFICATION

It is possible to control the way columns, once selected, are printed. By default, they are simply separated by tab characters. The -p option requests that they be separated by whatever delimiters separated them in the input. The -qo option requests that output columns be quoted, if necessary, to prevent delimiter characters in the column data being output from being interpreted as column delimiters. (That is, -qo prepares column's output to be parsed by some other program which understands quoted columns.)
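
The difference between the default tab-separated output and -p can be pictured with the following Python sketch, which is restricted to whitespace-delimited input. It is a guess at the observable behavior based on the description above, not a transcription of the program; the names split_with_separators and render are invented.

```python
import re


def split_with_separators(line):
    """Return (column, following_delimiters) pairs for one line."""
    return re.findall(r"(\S+)([ \t]*)", line.rstrip("\n"))


def render(line, wanted, preserve=False):
    """Print the wanted 1-based columns, optionally keeping input separators."""
    pairs = split_with_separators(line)
    if not preserve:
        # Default: selected columns are joined with tabs.
        return "\t".join(pairs[i - 1][0] for i in wanted)
    # As with -p: each column is followed by whatever followed it in the input.
    return "".join(pairs[i - 1][0] + pairs[i - 1][1] for i in wanted)


print(repr(render("a  b   c", [1, 3])))                 # 'a\tc'
print(repr(render("a  b   c", [1, 3], preserve=True)))  # 'a  c'
```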

It is possible to define output columns which should appear at fixed character positions, or which are delimited by specific strings. These output column specifications are made by appending additional information to the selectors which request the columns. For any number m on the invocation command line which requests that column m be selected and printed, the following notations may be used:

m:n: The output column should begin at character position n.
m:,n: The output column should end at character position n (i.e. right justified).
m:n1,n2: The output column should begin at character position n1 and end at character position n2 (with the column data being truncated if it's too big to fit).
m:str: The output column should be prefixed with str.
m:str1,str2: The output column should be prefixed with str1 and suffixed with str2.
m:,str: The output column should be suffixed with str.

To describe a number of similar output columns, the above notations may be combined with the m-n column selection notation. Furthermore, it is also possible to specify a group of disjoint output columns, separated by commas, to which a single output column description notation is attached. See the examples below.
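
One way to read the six notations is that the text after the colon is split at its first comma, and that all-numeric pieces mean positions while anything else means prefix and suffix strings. The Python sketch below implements that reading. It is an assumption chosen because it is consistent with the limitations listed under BUGS (no comma prefix, no numeric prefix or suffix), not a description of column's real parser; parse_output_spec and the kind labels are invented.

```python
def parse_output_spec(arg):
    """Split 'm:left,right' into (selector, kind, left, right)."""
    selector, colon, spec = arg.partition(":")
    if not colon:
        return selector, "plain", None, None
    left, comma, right = spec.partition(",")  # the first comma always splits
    if not comma:
        right = None
    pieces = (left, right)
    if all(p is None or p == "" or p.isdecimal() for p in pieces):
        # Numeric pieces are character positions: begin and/or end.
        begin = int(left) if left else None
        end = int(right) if right else None
        return selector, "position", begin, end
    # Anything else is a prefix and/or suffix string.
    return selector, "affix", left, right or ""


print(parse_output_spec("3:10"))         # ('3', 'position', 10, None)
print(parse_output_spec("1:1,9"))        # ('1', 'position', 1, 9)
print(parse_output_spec("1:(,)"))        # ('1', 'affix', '(', ')')
print(parse_output_spec("1,3,5,7:,, "))  # ('1,3,5,7', 'affix', '', ', ')
```

Under this reading a prefix can never begin with a comma, because the first comma is always taken as the separator between the two pieces.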

Finally, it is possible to generate arbitrarily-formatted output lines, using the -fmt fmtstr option. This option dispenses with all the other output column specification mechanisms (and, for that matter, it provides its own input column selection mechanism as well). The fmtstr is a skeleton template describing each output line, and in which the notation $n is replaced by the contents of column n. See the example below.
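
The $n interpolation can be approximated in Python with a regular-expression substitution. This sketch assumes whitespace-delimited input and replaces a reference to a missing column with an empty string, which is a guess; format_line is an invented name. The sample data is the first input line of the -fmt example under EXAMPLES.

```python
import re


def format_line(fmt, line):
    """Fill the $1, $2, ... placeholders in fmt from the columns of line."""
    fields = line.split()

    def substitute(match):
        index = int(match.group(1))
        # Missing columns become empty strings in this sketch.
        return fields[index - 1] if 1 <= index <= len(fields) else ""

    return re.sub(r"\$(\d+)", substitute, fmt)


fmt = "Now is the time for all $1 $2 to come to the aid of their $3."
print(format_line(fmt, "good\tmen\tparty"))
# Now is the time for all good men to come to the aid of their party.
```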

EXAMPLES

Select columns 1, 3, and 5, with columns separated by arbitrary whitespace:

	column 1 3 5

Select columns 1, 3, and 5, with input columns separated by commas and optional whitespace:

	column -e , 1 3 5

Select columns 1, 3, and 5, with input columns separated by commas and optional whitespace, but with quotes protecting whitespace or commas which should appear in the columns themselves:

	column -q -e , 1 3 5

(This is essentially ``CSV'' format.)
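
A simplified Python model of what -q adds to comma-separated input is shown below. The manual does not say which quote characters are recognized, so the double quote used here is an assumption, as are the name split_quoted and its parameters. Like the real -q (see BUGS), the sketch makes no attempt to handle doubled quotes inside a column.

```python
def split_quoted(line, exact=",", any_chars=" \t", quote='"'):
    """Split on exact delimiters, ignoring delimiters inside quotes."""
    columns = []
    current = []  # (character, was_inside_quotes) pairs
    in_quotes = False

    def finish():
        cells = current[:]
        # Strip unquoted padding from both ends; quoted padding is kept.
        while cells and cells[0][0] in any_chars and not cells[0][1]:
            cells.pop(0)
        while cells and cells[-1][0] in any_chars and not cells[-1][1]:
            cells.pop()
        columns.append("".join(ch for ch, _ in cells))

    for ch in line.rstrip("\n"):
        if ch == quote:
            in_quotes = not in_quotes  # the quote marks themselves are dropped
        elif ch in exact and not in_quotes:
            finish()
            current.clear()
        else:
            current.append((ch, in_quotes))
    finish()
    return columns


print(split_quoted('1, "Smith, John" , " x "'))
# ['1', 'Smith, John', ' x ']
```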

Select columns 1, 3, and 5, with input columns separated by commas and optional whitespace, with quotes protecting whitespace or commas which should appear in the columns themselves, and with the output columns protected by quotes if necessary:

	column -q -qo -e , 1 3 5

Select columns 1, 3, and 5, with input columns separated by colons and without stripping any whitespace:

	column -e : -a '' 1 3 5

(This would be useful for parsing UNIX passwd files or related files.)

Select columns 1, 3, and 5, with input columns separated by tabs and without stripping any whitespace:

	column -e '	' -a '' 1 3 5

(The character between the single quotes following the -e option is a single tab. This is essentially ``TDF'' format.)

Define input columns running from character positions 1-5, 6-10, 11-20, and 21-50, and print the second and fourth columns:

	column -fi 1-5 -fi 6-10 -fi 11-20 -fi 21-50 2 4

Print the first and last columns:

	column 1 $

Print the first two and last two columns:

	column 1 2 '$-1' $

Print columns 1, 5, and 10 through 20:

	column 1 5 10-20

Print all but columns 1, 5, and 10 through 20:

	column -v 1 5 10-20
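
In terms of column numbers, -v amounts to taking the complement of the selection on each line. The Python sketch below assumes the remaining columns are printed in their input order; invert_selection is an invented name.

```python
def invert_selection(selected, ncols):
    """Return every column number from 1 to ncols that was not selected."""
    chosen = set(selected)
    return [i for i in range(1, ncols + 1) if i not in chosen]


# Columns 1, 5, and 10 through 20 removed from a 22-column line.
print(invert_selection([1, 5] + list(range(10, 21)), 22))
# [2, 3, 4, 6, 7, 8, 9, 21, 22]
```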

Print columns 1 and 3 (whitespace delimited) from file a, followed by 2 and 4 from file b:

	column -m 1 3 a 2 4 b

Print columns 1 and 3 from file a, with column 2 from file b interspersed (that is, print column 1 from file a, followed by column 2 from file b, followed by column 3 from file a again):

	column -m 1 a 2 b 3 a

Print columns 1 and 3 from standard input, with column 2 from file b interspersed:

	column -m 1 - 2 b 3 -

Select column 1 and print it beginning at output position 1, and column 3 beginning at output position 10:

	column 1:1 3:10

Select input columns 1 and 3, printing them in output columns in positions 1-9 and 11-20:

	column 1:1,9 3:11,20

Select input columns 1 and 2, suffixing the first output column with a comma and a space and the second one with a period:

	column '1:,, ' 2:,.

Select input columns 1 and 2, enclosing the first output column in parentheses (that is, prefixing it with '(' and suffixing it with ')') and enclosing the second one in square brackets:

	column '1:(,)' '2:[,]'

Select input columns 1, 3, 5, 7, and 9, suffixing all but the last with a comma and a space:

	column '1,3,5,7:,, ' 9

Select columns named ``a'' and ``b'', under the assumption that the first line in the file is a header containing the column names:

	column -n a -n b

Select columns named ``a'', ``b'' and ``c'', a bit more conveniently, but with the additional proviso that the input must appear on stdin:

	column -N a b c

Print a bunch of lines like ``Now is the time for all good men to come to the aid of their party'', with key words taken from the input (i.e. Mad Libs style):

	column -fmt 'Now is the time for all $1 $2 to come to the aid of their $3.'

With the input

	good	men	party
	little	babies	playpen
	true	hackers	codebase
	tall	giraffes	savannah

this would print

	Now is the time for all good men to come to the aid of their party.
	Now is the time for all little babies to come to the aid of their playpen.
	Now is the time for all true hackers to come to the aid of their codebase.
	Now is the time for all tall giraffes to come to the aid of their savannah.

(Note that single quotes around the fmt are typically required in this situation, to protect the $'s in fmt from interpretation by the shell.)

BUGS

Under -m, any -e, -a, and -fi flags apply across all input files; there's no way to provide different column specifications for different input files.

The fixed input column specification mechanism -fi m-n, the fixed output column specification mechanism m:n1,n2, and the output prefix/suffix mechanism m:str1,str2 are all pretty dreadfully cumbersome to use and don't really carry their own weight. (To be honest, I put these features in out of a misguided sense of completeness, and I hardly ever use them myself. For formatted output, -fmt fmtstr is much more convenient.)

There is no way to have a comma as an output column prefix. There is no way to have output column prefix or suffix strings which are numeric.

The input quoting mechanism (-q) works properly only for simple quotes strictly surrounding the column data; it does not handle internal quotes (e.g. doubled, as in CSV files) or shell-style partial quoting and implicit concatenation (e.g. something like "a b"c).

The m-n column-selection notation does not work if either m or n involves a $.

HISTORY

I wrote this program because (a) I didn't have access to awk(1) at the time (I was stranded in a godforsaken MS-DOS environment), and (b) I was working with files with lines hundreds of columns and thousands of characters long, so avoiding built-in limits was a must.

This documentation corresponds to version 2.6 of the program.
See http://www.eskimo.com/~scs/src/#column for possible updates.

AUTHOR

Steve Summit, scs@eskimo.com