COLUMN(1)

NAME

column - extract columns from a file

SYNOPSIS

column [ options ] columns [ files ]

DESCRIPTION

column extracts columns from standard input or named files. By default, columns are separated by whitespace, but it is also possible to specify columns by character counts, or by specific punctuation delimiters.

As a simple example, the invocation

	column 1 3 5 inputfile
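would print columns 1, 3, and 5 of the file named inputfile. (In other words, it would do much the same thing as the simple awk(1) script
	awk '{print $1, $3, $5}' inputfile
except that awk separates its output columns with spaces, while column separates them with tabs by default.)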

If no filenames are given, column reads from standard input. Also, a filename of ``-'' indicates standard input.

Because multiple column numbers are entered as separate arguments, there is an ambiguity if an input filename has a name which looks like a number. To resolve the ambiguity, use an alternative pathname for the file which does not begin with a digit. The simplest way to do so is to precede a numeric filename with ``./''.

column can work with several definitions of what a ``column'' is. Input columns separated by whitespace or other delimiter characters are referred to as ``floating'' columns. Input columns specified by character counts are referred to as ``fixed'' columns. Furthermore, floating columns can be delimited in two different ways. Sometimes, particularly when columns are delimited by whitespace, multiple adjacent instances of the delimiter character(s) should count for just one column separation. Other times, when columns are delimited by punctuation characters such as commas, colons, or vertical bars, multiple adjacent instances of the delimiter character should imply the presence of an empty column. (column can handle both of these situations.)

column uses dynamically-allocated memory for input lines and column descriptors, and can therefore be used on input lines with thousands of characters and hundreds of columns (or more).

OPTIONS

-a chrs
Specify characters which separate floating input columns. Any number of these characters may appear between columns; that is, multiple adjacent delimiter characters do not indicate multiple columns. By default, column behaves as if the -a option had been used to select space and tab as column separator characters.
-c chr
Set input file comment character. Lines beginning with the comment character are passed through verbatim; column extraction is not performed.
-e chrs
Specify a character (or characters) which separates floating input columns exactly. Exactly one instance of such a character appears between each pair of columns; that is, multiple adjacent instances indicate multiple (empty) columns. The -e option is useful when working with files containing values separated by commas, colons, vertical bars, etc.
-fi m-n
Define a fixed input column running from character positions m to n. (Note that -fi merely defines an input column; it does not select it for printing.)
-m
Permit multiple interspersed files and columns: additional column selectors following the first input filename on the command line request a different set of columns to be selected from an upcoming filename. (See examples below.)
-n name
Select column by name (where input column names are described by the first line in the file).
-N
Select many columns by name--all names on the command line are treated as column names, as if requested with -n. The input must therefore appear on the standard input. (No files will be opened, since no filenames can be specified.)
-p
Preserve input column separators: each output column is followed by (and therefore separated from the next output column by) whatever set of delimiters (``exact'' or ``any'') followed it in the input. (By default, output columns are separated by tabs.)
-q
Look for quotes around floating input columns, and do not recognize whitespace or other delimiters between quotes.
-qo
Put quotes around output columns if necessary to protect column contents which might otherwise be interpreted as delimiters.
-v
Invert; print all columns except those explicitly selected.
-?,-h
Print a brief help message.

INPUT COLUMN SPECIFICATION

By default, input columns are floating and are separated by whitespace, that is, by one or more spaces or tabs. In general, floating input columns are defined by two kinds of delimiters: ``exact'' delimiters and ``any'' delimiters. Multiple adjacent instances of an ``exact'' delimiter indicate multiple (empty) columns, while multiple adjacent instances of the ``any'' delimiters indicate a single column division. The default, whitespace-separating behavior is therefore achieved by using an ``any'' delimiter set consisting of the space and tab characters, and an ``exact'' delimiter set which is empty. To select a specific ``exact'' character (or characters), use -e. To select a different set of ``any'' characters, use -a.

Any leading instances of the ``any'' characters on an input line are ignored; they do not indicate the presence of an initial empty column. In fact, there are never any empty columns when only ``any'' characters are used; the only way to achieve empty floating columns is by using leading, trailing, or adjacent ``exact'' characters.

``Exact'' and ``any'' characters may be used simultaneously: for example, using -e to select a comma as the ``exact'' column separator, while leaving the ``any'' delimiter set as the default whitespace, would mean that whitespace at the beginning or end of a comma-separated column would be stripped, and would not appear in the column contents. (Stated another way, though comma is the ``real'' column separator, whitespace surrounding commas is not significant and is not taken to be part of either column. Stated yet another way, input columns would be assumed to be separated by exactly one comma, and zero or more spaces or tabs.) To disable the default ``any'' delimiter characters (that is, to arrange that all input whitespace does appear explicitly in input columns), use -a with an empty argument:

	-a ''
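
The interaction of the two delimiter sets can be modeled with a short Python sketch. This is only an illustration of the rules described above, not column's actual implementation; the function name split_floating and its parameters are hypothetical, and the treatment of corner cases (for instance, unquoted whitespace in the middle of a comma-separated column) is an assumption.

```python
def split_floating(line, exact="", any_chars=" \t"):
    """Split one line into floating columns (illustrative model only)."""
    columns = []
    cur = None            # text of the column being built, or None if none is open
    soft_closed = False   # the previous column was ended by an "any" delimiter
    need_column = False   # an "exact" delimiter still awaits its right-hand column

    for ch in line:
        if ch in exact:
            if cur is not None:
                columns.append(cur)      # the delimiter ends the open column
                cur = None
            elif not soft_closed:
                columns.append("")       # leading or adjacent "exact": empty column
            soft_closed = False
            need_column = True
        elif ch in any_chars:
            if cur is not None:
                columns.append(cur)      # a run of "any" characters ends a column once
                cur = None
                soft_closed = True
        else:
            cur = ch if cur is None else cur + ch
            soft_closed = False
            need_column = False

    if cur is not None:
        columns.append(cur)
    elif need_column:
        columns.append("")               # trailing "exact" delimiter: empty column
    return columns


print(split_floating("  a   b\tc "))                       # default whitespace rules
print(split_floating("a , ,b", exact=","))                 # like -e ,
print(split_floating("a: b :c", exact=":", any_chars=""))  # like -e : -a ''
```

With the default arguments the model never yields an empty column; empty columns arise only from leading, trailing, or adjacent ``exact'' characters, as described above.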

It is also possible to specify, with the -q option, that the input consists of floating columns where some column data may contain whitespace or delimiter characters, protected by quotes. (See the examples below.)

Fixed input columns are defined using the -fi option. One -fi option describes one input column; in general, many -fi options will be used to describe the complete input format. The -fi options do not select input columns for printing; they only describe the input columns. The columns to be selected and printed must be requested using numeric arguments, just as for floating columns.
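
As a rough illustration of how a series of -fi definitions carves up a line, here is a minimal Python sketch. The helper names parse_range and split_fixed are hypothetical, positions are taken to be 1-based and inclusive, and the sketch leaves any padding inside a column untouched, since whether column strips it is not stated here.

```python
def parse_range(spec):
    """Turn an 'm-n' string into a (start, end) pair of 1-based positions."""
    start, end = spec.split("-")
    return int(start), int(end)


def split_fixed(line, specs):
    """Cut a line into fixed columns, one per 'm-n' definition."""
    columns = []
    for spec in specs:
        start, end = parse_range(spec)
        # Python slices are 0-based and half-open, hence start - 1.
        columns.append(line[start - 1:end])
    return columns


line = "AB123Widget    2024-01-15"
columns = split_fixed(line, ["1-5", "6-15", "16-25"])

# Defining the columns selects nothing; selection is a separate step,
# here picking columns 1 and 3 by number.
print([columns[n - 1] for n in (1, 3)])
```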

COLUMN SELECTION

When selecting output columns, two additional notations may be used. First, the notation m-n specifies a range of columns. Second, the dollar sign $ is a marker indicating the last column, and the notation $-n indicates columns counted from the right. So $ indicates the last column, $-1 indicates the next-to-last column, $-2 indicates the third-to-last column, etc. (When invoking column from a shell, quote arguments of the form $-n, since most shells expand $- themselves.)

Right-based columns are counted on a line-by-line basis, so the invocation

	column '$-1' $
on a file containing the lines
	a b c
	d e f g
	h i
would result in the output
	b c
	f g
	h i
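
The line-by-line resolution of right-based selectors can be sketched in Python as follows. The function names are hypothetical, columns are split on whitespace only, and silently skipping a selector that falls outside the line is an assumption, not documented behavior.

```python
def resolve_selector(selector, ncols):
    """Map 'N', '$', or '$-N' to a 1-based column number for a line with ncols columns."""
    if selector == "$":
        return ncols
    if selector.startswith("$-"):
        return ncols - int(selector[2:])
    return int(selector)


def select_columns(line, selectors):
    """Pick the requested columns from one whitespace-separated line."""
    fields = line.split()
    picked = []
    for selector in selectors:
        number = resolve_selector(selector, len(fields))
        if 1 <= number <= len(fields):   # assumption: out-of-range selectors are skipped
            picked.append(fields[number - 1])
    return "\t".join(picked)             # output columns are tab-separated by default


for line in ["a b c", "d e f g", "h i"]:
    print(select_columns(line, ["$-1", "$"]))
```

Because the column count is recomputed for every line, the same pair of selectors picks columns 2 and 3 of the first line but columns 3 and 4 of the second.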

Rather than specifying columns by number, it is possible to specify them by name, if the input file is self-describing by having as its first line a header denoting the column names. The -n option selects an output column by name; multiple -n options are used to select multiple columns. For example, given the input file

	a b c d
	1 2 3 4
	5 6 7 8
	9 10 11 12
the invocation
	column -n b -n d
would select
	b d
	2 4
	6 8
	10 12

When columns are requested by name using -n, and when simultaneously a comment character is requested using -c, the first line is taken as the column definition line even if it is commented. (Furthermore, if the first, column-definition line is commented, any whitespace between the comment character and the first column name is ignored. That is, if the comment character is #, the first lines ``a b c'', ``#a b c'', and ``# a b c'' would all be treated identically, and would describe a file with three columns named ``a'', ``b'', and ``c''.)
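
The header handling just described might be modeled like this in Python. The names parse_header and select_by_name are hypothetical; the sketch splits on whitespace only and emits the selected header names as the first output row, matching the uncommented example above (how a commented header line is printed is not covered here).

```python
def parse_header(line, comment_char=None):
    """Return the column names from the first line, ignoring a leading comment character."""
    if comment_char and line.startswith(comment_char):
        line = line[len(comment_char):]
    return line.split()          # split() also discards whitespace after the comment character


def select_by_name(lines, names, comment_char=None):
    """Select the named columns from every line, header included."""
    header = parse_header(lines[0], comment_char)
    indexes = [header.index(name) for name in names]   # raises ValueError for an unknown name
    rows = [[header[i] for i in indexes]]
    for line in lines[1:]:
        fields = line.split()
        rows.append([fields[i] for i in indexes])
    return rows


lines = ["a b c d", "1 2 3 4", "5 6 7 8", "9 10 11 12"]
for row in select_by_name(lines, ["b", "d"]):
    print("\t".join(row))
```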

For convenience when requesting many columns by name, the -N option requests that all names appearing on the command line be treated as column names (as if with -n), at the cost of constraining the input to be read from the standard input, rather than a named file.

OUTPUT COLUMN SPECIFICATION

It is possible to control the way columns, once selected, are printed. By default, they are separated by tab characters. The -p option requests that they be separated by whatever delimiters separated them in the input. The -qo option requests that output columns be quoted, if necessary, to prevent delimiter characters in the column data being output from being interpreted as column delimiters. (That is, -qo prepares column's output to be parsed by some other program which understands quoted columns.)

Finally, it is possible to define output columns which should appear at fixed character positions, or which are delimited by specific strings. These output column specifications are made by appending additional information to the numbers which request the columns. For any number m on the invocation command line which requests that column m be selected and printed, the following notations may be used:

m:n
The output column should begin at character position n.
m:,n
The output column should end at character position n (that is, it is right-justified).
m:n1,n2
The output column should begin at character position n1 and end at character position n2 (with the column data being truncated if it is too long to fit).
m:str
The output column should be prefixed with str.
m:str1,str2
The output column should be prefixed with str1 and suffixed with str2.
m:,str
The output column should be suffixed with str.
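
One plausible way to tell the six notations apart is sketched below in Python. The function parse_output_spec and the dictionary keys are hypothetical, and the rule used (treat the description as positions when every non-empty part is numeric, and as strings otherwise) is an assumption, though it is consistent with the limitations listed under BUGS.

```python
def parse_output_spec(arg):
    """Parse 'm', 'm:n', 'm:,n', 'm:n1,n2', 'm:str', 'm:str1,str2', or 'm:,str'."""
    selector, colon, rest = arg.partition(":")
    spec = {"selector": selector, "start": None, "end": None,
            "prefix": "", "suffix": ""}
    if not colon:
        return spec                      # plain column request, no description

    # Only the first comma separates the two halves, so a suffix may
    # itself contain commas but a prefix may not.
    first, _, second = rest.partition(",")
    parts = [part for part in (first, second) if part]
    if parts and all(part.isdecimal() for part in parts):
        spec["start"] = int(first) if first else None
        spec["end"] = int(second) if second else None
    else:
        spec["prefix"], spec["suffix"] = first, second
    return spec


for arg in ["3:10", "4:,20", "1:1,9", "1:(,)", "1:,, ", "2:,."]:
    print(arg, parse_output_spec(arg))
```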

To describe a number of similar output columns, the above notations may be combined with the m-n column selection notation. Furthermore, it is also possible to specify a group of disjoint output columns, separated by commas, to which a single output column description notation is attached. See the examples below.
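
A minimal Python sketch of how a range or a comma-separated group might be expanded before the shared description is applied. The function expand_columns is hypothetical, it handles plain numbers only, and accepting a mix of ranges and single numbers in one group is an assumption.

```python
def expand_columns(selector):
    """Expand '10-20' or '1,3,5,7' into a list of column numbers."""
    numbers = []
    for piece in selector.split(","):
        if "-" in piece:
            low, high = piece.split("-")
            numbers.extend(range(int(low), int(high) + 1))   # ranges are inclusive
        else:
            numbers.append(int(piece))
    return numbers


# The column group ends at the first colon; everything after it is the
# description shared by every column in the group.
selector, _, description = "1,3,5,7:,, ".partition(":")
print(expand_columns(selector), repr(description))
print(expand_columns("10-20"))
```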

EXAMPLES

Select columns 1, 3, and 5, with columns separated by arbitrary whitespace:

	column 1 3 5

Select columns 1, 3, and 5, with input columns separated by commas and optional whitespace:

	column -e , 1 3 5

Select columns 1, 3, and 5, with input columns separated by commas and optional whitespace, but with quotes protecting whitespace or commas which should appear in the columns themselves:

	column -q -e , 1 3 5
(This is essentially ``CSV'' format.)

Select columns 1, 3, and 5, with input columns separated by commas and optional whitespace, with quotes protecting whitespace or commas which should appear in the columns themselves, and with the output columns protected by quotes if necessary:

	column -q -qo -e , 1 3 5
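
The effect of input and output quoting can be approximated with the Python sketch below. It is deliberately simplified: the function names are hypothetical, the double quote is assumed to be the quote character, the ``any'' whitespace delimiters are ignored for brevity, and (like the limitation noted under BUGS) quotes inside column data are not handled.

```python
def split_quoted(line, delim=",", quote='"'):
    """Split on delim, except where it appears between quotes."""
    columns, cur, in_quotes = [], [], False
    for ch in line:
        if ch == quote:
            in_quotes = not in_quotes    # quotes toggle protection and are dropped
        elif ch == delim and not in_quotes:
            columns.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    columns.append("".join(cur))
    return columns


def quote_if_needed(value, delims=", \t", quote='"'):
    """Quote an output column only if it contains a delimiter character."""
    if any(ch in delims for ch in value):
        return quote + value + quote
    return value


fields = split_quoted('"Smith, John",42,plain')
print(fields)
print(",".join(quote_if_needed(field) for field in fields))
```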

Select columns 1, 3, and 5, with input columns separated by colons and without stripping any whitespace:

	column -e : -a '' 1 3 5
(This would be useful for parsing UNIX passwd files or related files.)

Select columns 1, 3, and 5, with input columns separated by tabs and without stripping any whitespace:

	column -e '	' -a '' 1 3 5
(The character between the single quotes following the -e option is a single tab. This is essentially ``TDF'' format.)

Define input columns running from character positions 1-5, 6-10, 11-20, and 21-50, and print the second and fourth columns:

	column -fi 1-5 -fi 6-10 -fi 11-20 -fi 21-50 2 4

Print the first and last columns:

	column 1 $

Print the first two and last two columns:

	column 1 2 '$-1' $

Print columns 1, 5, and 10 through 20:

	column 1 5 10-20

Print all but columns 1, 5, and 10 through 20:

	column -v 1 5 10-20
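
Inversion amounts to taking the complement of the selected set for each line, as in this small Python sketch. The function invert_selection is hypothetical, and the 22-column line is made up for the illustration.

```python
def invert_selection(selected, ncols):
    """Return every column number from 1 to ncols that was not selected."""
    chosen = set(selected)
    return [number for number in range(1, ncols + 1) if number not in chosen]


# Columns 1, 5, and 10 through 20 are excluded from a 22-column line.
selected = [1, 5] + list(range(10, 21))
print(invert_selection(selected, 22))
```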

Print columns 1 and 3 (whitespace delimited) from file a, followed by 2 and 4 from file b:

	column -m 1 3 a 2 4 b

Print columns 1 and 3 from file a, with column 2 from file b interspersed (that is, print column 1 from file a, followed by column 2 from file b, followed by column 3 from file a again):

	column -m 1 a 2 b 3 a

Print columns 1 and 3 from standard input, with column 2 from file b interspersed:

	column -m 1 - 2 b 3 -
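
The way -m pairs selectors with filenames can be pictured with the Python sketch below. It is a simplified model under several assumptions: the names group_arguments and intersperse are hypothetical, files are represented as in-memory lists of lines, only plain numeric selectors are extracted, output stops at the shortest input, and the handling of standard input is not modeled.

```python
import re

# Anything shaped like N, $, N-M, or $-N counts as a column selector;
# every other argument is taken to be a filename.
SELECTOR = re.compile(r"^(\d+|\$)(-\d+)?$")


def group_arguments(args):
    """Pair each run of column selectors with the filename that follows it."""
    groups, pending = [], []
    for arg in args:
        if SELECTOR.match(arg):
            pending.append(arg)
        else:
            groups.append((arg, pending))
            pending = []
    return groups


def intersperse(groups, files):
    """Build output lines by reading all inputs in step, one line at a time."""
    per_group = []
    for name, selectors in groups:
        rows = []
        for line in files[name]:
            fields = line.split()
            rows.append([fields[int(selector) - 1] for selector in selectors])
        per_group.append(rows)
    # zip() stops at the shortest input (an assumption, not documented behavior).
    return ["\t".join(sum(parts, [])) for parts in zip(*per_group)]


files = {"a": ["a1 a2 a3", "A1 A2 A3"], "b": ["b1 b2", "B1 B2"]}
groups = group_arguments("1 a 2 b 3 a".split())
print(groups)
print(intersperse(groups, files))
```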

Select column 1 and print it beginning at output position 1, and column 3 beginning at output position 10:

	column 1:1 3:10

Select input columns 1 and 3, printing them in output columns in positions 1-9 and 11-20:

	column 1:1,9 3:11,20

Select input columns 1 and 2, suffixing the first output column with a comma and a space and the second one with a period:

	column '1:,, ' 2:,.

Select input columns 1 and 2, enclosing the first output column in parentheses (that is, prefixing it with '(' and suffixing it with ')') and enclosing the second one in square brackets:

	column '1:(,)' '2:[,]'

Select input columns 1, 3, 5, 7, and 9, suffixing all but the last with a comma and a space:

	column '1,3,5,7:,, ' 9

Select columns named ``a'' and ``b'', under the assumption that the first line in the file is a header containing the column names:

	column -n a -n b

Select columns named ``a'', ``b'' and ``c'', a bit more conveniently, but with the additional proviso that the input must appear on stdin:

	column -N a b c

BUGS

Under -m, any -fi flags apply across all input files; there's no way to define different fixed columns for different input files.

There is no way to have a comma as an output column prefix. There is no way to have output column prefix or suffix strings which are numeric.

The input quoting mechanism (-q) works properly only for simple quotes strictly surrounding the column data; it does not handle internal quotes (e.g. doubled, as in CSV files) or shell-style partial quoting and implicit concatenation (e.g. something like "a b"c).

The m-n column-selection notation does not work if either m or n involves a $.

SEE ALSO

cut(1), paste(1), awk(1), line

HISTORY

I wrote this program because (a) I didn't have access to awk(1) at the time (I was stranded in a godforsaken MS-DOS environment), and (b) I was working with files with lines hundreds of columns and thousands of characters long, so avoiding built-in limits was a must.

See http://www.eskimo.com/~scs/src/#column for possible updates.

AUTHOR

Steve Summit, scs@eskimo.com