med is a filter which reads a stream consisting of columns of data, performs mathematical (or perhaps string or other) operations on selected values, and writes out a stream of columns of modified data. The data is normally numeric (specifically, double-precision floating-point), but some versions of the program are able to additionally manipulate plain integers, strings, arbitrary-precision integers (``bigints''), and/or date/timestamps.
One column of output is generated for each expression present on the command line. The input is taken from either the named files, the file named by -if, or (if neither of these appear) the standard input.
The expression syntax is an amalgam of FORTRAN and C. The +, -, *, /, % (modulus) and ** (exponentiation) operators are supported (with the customary associativity and precedence), as well as unary -, and parentheses for grouping. Relational and logical operators are also supported, both FORTRAN-style .gt., .ge., .lt., .le., .eq., .ne., .and., .or., and .not., and C-style >, >=, <, <=, ==, !=, &&, ||, and !. The C bitwise operators ~, &, |, ^, <<, and >> are also supported.
The following built-in math functions are supported:
    acos                     ln (natural log)
    asin                     log10 (common log)
    atan2 (two arguments)    sinh
Some versions of the program additionally support these C-like string-handling functions:
    strcat  strlen  strstr  substr

strcat(s1, s2) returns a new string which is the concatenation of s1 and s2. strlen(str) returns the length of str. strstr(s1, s2) returns the 1-based position within s1 of the substring s2 (if any, or 0 if not). substr(str, m, n) returns the n-character substring of str beginning at position m (1-based). (There is no strcmp function, because the relational operators-- .lt., >=, etc.-- do work on strings.)
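The 1-based conventions of these functions are easy to get wrong for anyone used to C's 0-based indexing. The following Python sketch models the documented behavior only; the function bodies are illustrative stand-ins, not med's actual implementation.

```python
# Hypothetical Python stand-ins for the documented string functions.
# They model the 1-based conventions described above, nothing more.

def strcat(s1, s2):
    # Concatenation of s1 and s2 as a new string.
    return s1 + s2

def strlen(s):
    # Length of the string in characters.
    return len(s)

def strstr(s1, s2):
    # str.find returns -1 when s2 is absent, so adding 1 yields
    # 0 for "not found" and a 1-based position otherwise.
    return s1.find(s2) + 1

def substr(s, m, n):
    # n characters starting at 1-based position m.
    return s[m - 1 : m - 1 + n]
```

For instance, strstr("hello", "ll") is 3 under this model, and substr("hello", 2, 3) is "ell".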
Some versions of the program additionally support these regular expression (``regexp'') functions:
    strmatch  strsubst

strmatch(str, pat) returns ``true'' (nonzero) if the string str is matched by the regular expression pat. strsubst(str, pat, rep) returns a copy of str with the first occurrence of the regular expression pat (if any) replaced by rep (including & and \digit substitution). (Details of the regular expressions supported, including nuances such as whether \( \) or ( ) are used for substring matches, may vary depending on the underlying regular expression library in use.)
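As a rough model of these two functions, the sketch below uses Python's re module. It is an approximation under stated assumptions: Python's regular expression dialect uses ( ) for groups and \g<0> rather than & for the whole match, so it will not agree in every detail with whatever library a given build of med uses.

```python
import re

def strmatch(s, pat):
    # Nonzero ("true") if pat matches anywhere in s, else zero.
    return 1 if re.search(pat, s) else 0

def strsubst(s, pat, rep):
    # count=1 limits the replacement to the first occurrence only,
    # mirroring the documented behavior. \1, \2, ... in rep refer
    # to parenthesized groups in pat.
    return re.sub(pat, rep, s, count=1)
```

Under this model, strsubst("a-b-c", "-", "+") yields "a+b-c": only the first hyphen is replaced.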
Access to input data, as well as a few useful constants, is through identifiers (``variables''). The following identifiers may be present in expressions:
(It is also possible to use arbitrary names for the columns, rather than c1, c2, c3, etc.; see the -N option below.)
An input column can be passed through to the output unchanged with a trivial expression such as ``c1''. (Also, even in versions without full string handling, as a special case, the input data in a column which is used only with such an expression is not required to be numeric, and is passed through verbatim even if it is alphabetic.) The input may contain more columns than are called for by the expressions; unused input columns are silently discarded. Blank lines, as well as those beginning with a comment character (by default, `#', but settable with -cc) are passed through to the output unchanged.
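The line-handling rules in this paragraph can be summarized in a few lines of Python. This is a sketch only: process and evaluate are hypothetical names, and whitespace splitting stands in for med's default column separation.

```python
def process(lines, evaluate, comment_char="#"):
    """Model of the per-line rules: pass blank and comment lines
    through unchanged, evaluate everything else."""
    out = []
    for line in lines:
        if line.strip() == "" or line.startswith(comment_char):
            out.append(line)            # passed through verbatim
        else:
            cols = line.split()         # extra columns are simply unused
            out.append(evaluate(cols))
    return out

# Equivalent in spirit to the single expression c1+c2:
result = process(
    ["# header", "1 2 extra", "", "3 4 extra"],
    lambda c: "%g" % (float(c[0]) + float(c[1])),
)
```

Here result is ["# header", "3", "", "7"]: the comment and the blank line survive, and the third column is silently dropped.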
Limited control flow is provided via a conditional pseudofunction, ``if''. The value of the expression

    if(e1, e2, e3)

is e2 if e1 is nonzero (namely, ``true''), or e3 if e1 is zero (``false'').
Normally, med reads one line, evaluates the requested expressions for the data on that line, prints one line containing the results, and continues on to the next line. However, a few special-purpose function-like operators modify this behavior. There are six of these ``summarizing'' functions:
    min  max  sum  product  mean  stdev

When a ``summarizing'' function is used, a value is accumulated, which is printed out only after a file's input lines have all been read. Line-by-line output is suppressed. (See the examples below.) When summarizing functions are used with multiple input files, one line of output is generated for each file, individually summarizing each file's data. (See also the -b option below.)
The arguments of the ``summarizing'' functions can be arbitrarily complex expressions, and the results can be further operated upon before printing. (That is, compound expressions such as max(c1+c2) and max(c1)+max(c2) are permitted.)
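The two compound forms mentioned above are not interchangeable. A small Python illustration, using a made-up two-line input, shows why:

```python
# Hypothetical two-column input: each tuple is one line (c1, c2).
rows = [(1, 5), (8, 2)]

# max(c1+c2): take the sum on each line, then the largest sum.
max_of_sum = max(c1 + c2 for c1, c2 in rows)

# max(c1)+max(c2): take each column's maximum, then add them.
sum_of_max = max(c1 for c1, _ in rows) + max(c2 for _, c2 in rows)
```

For this input max_of_sum is 10 while sum_of_max is 13, because the two column maxima occur on different lines.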
Actually, the min() and max() functions exist in two different forms. With a single argument, they operate over multiple input lines, in ``summarizing'' mode, as described above. However, when invoked with two or more arguments, they compute the minimum or maximum of those arguments, immediately (more or less as in FORTRAN).
Output values are printed in an appropriate format--usually printf(3)'s %g format for numeric data. It is also possible to override the default, and specify a format explicitly, either by (a) using the -fmt option described below, or (b) suffixing an expression with an at sign `@' and a printf(3) (or other) format specifier, which will be applied to that expression's output only.
The -b (``bunch tallies'') option indicates that the summarizing functions should generate output after each group of numbers in the input stream. Groups of input numbers are separated by one or more blank lines. After printing the result from each group, all accumulated counts are zeroed before processing the next group.
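The grouping behavior can be modeled as follows. This Python sketch imitates sum(c1) under -b for a single column; bunch_sums is a hypothetical name, and the sketch models only the totals, not how med lays out its output.

```python
def bunch_sums(lines):
    """Sum the first column of each group, where groups are
    separated by one or more blank lines."""
    totals = []
    current = 0.0
    seen = False                # have we read data in this group?
    for line in lines:
        if line.strip():
            current += float(line.split()[0])
            seen = True
        elif seen:
            totals.append(current)
            current, seen = 0.0, False   # zero the tally for the next group
    if seen:
        totals.append(current)           # final group has no trailing blank
    return totals
```

With the input lines 1, 2, two blank lines, then 3, 4, 5, this returns [3.0, 12.0]: consecutive blank lines count as a single separator.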
The -cc (``comment character'') option indicates that the next argument is to be taken as the (single) character introducing comments in data and expression files. By default, the comment character is `#'.
The -ef (``expression file'') option indicates that the next argument is the name of a file out of which expressions are to be read. The file is assumed to contain one expression per line. Comments may appear in the expression file on lines beginning with `#' (or the comment character set with -cc). Expressions within expression files are immune from any unwanted interactions or restrictions imposed by the shell, and may therefore contain whitespace and `*' (which is presumed to be the multiplication operator, rather than a wildcard filename) with impunity. Also, use of -ef supersedes med's attempt to parse expressions from the command line at all.
The -fmt (``format'') option indicates that the next argument is a printf(3)-style format string with which numeric output should be printed. (The default is %g.) (In versions of the program that support multiple datatypes, other format specifiers may be possible, such as strftime(3)-style for date/timestamps.) (See also the `@' notation discussed above.)
In versions of the program that support multiple datatypes, two options permit adjustment of some assumptions surrounding integral and floating-point types. The -fp option indicates that all input numbers (even those without explicit decimal points) should be treated as floating-point. The -i option indicates that division of integers should be truncating, integer division. (By default, division always generates floating-point results, if appropriate. That is, in the absence of -i, 1/2 is 0.5.)
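The effect of -i can be pictured with a short Python model. The sketch assumes that ``truncating'' means rounding toward zero, as in C; divide and integer_mode are hypothetical names used only for illustration.

```python
import math

def divide(a, b, integer_mode=False):
    # With integer_mode (modeling -i), two integer operands give a
    # quotient truncated toward zero; otherwise the result is
    # ordinary floating-point division.
    if integer_mode and isinstance(a, int) and isinstance(b, int):
        return math.trunc(a / b)
    return a / b
```

So divide(1, 2) is 0.5, while divide(1, 2, integer_mode=True) is 0.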
The -if (``input file'') option indicates that the next argument is the name of a file from which input data will be read. (Input filenames may also appear as arguments on the command line. If neither -if nor standalone input file arguments appear, input is read from the standard input.)
The -n (``annotate output'') option indicates that, when ``summarizing'' functions are being used and multiple files are being read, each output line should be preceded by the originating file name.
The -N option indicates that the first nonblank line of the input is a header giving names for the columns. These are therefore the names used in the output expressions, rather than c1, c2, c3, etc.
The -t option indicates that the next argument is to be taken as the (single) character separating columns in the input. By default, input columns are separated by arbitrary whitespace. (For those who are more used to awk(1) than sort(1), the -F option is also accepted with the same meaning.)
A brief summary of the invocation syntax and accepted options may be requested with -help. The -version option prints the program's version number.
1. From a stream of two columns, print four columns consisting of the sum, difference, product, and quotient of the two input columns:

    med c1+c2 c1-c2 'c1*c2' c1/c2

With this invocation, the input

    1 2
    3 4
    5 6
    7 5
    3 1

would produce

    3 -1 2 0.5
    7 -1 12 0.75
    11 -1 30 0.833333
    12 2 35 1.4
    4 2 3 3
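For readers who want to check the arithmetic, the same table can be reproduced in Python. The sketch uses %g, the default output format described earlier, and joins columns with single spaces as an assumption about the output separator.

```python
# The five input lines of the example, as (c1, c2) pairs.
rows = [(1, 2), (3, 4), (5, 6), (7, 5), (3, 1)]

# One function per expression on the med command line.
exprs = [
    lambda c1, c2: c1 + c2,
    lambda c1, c2: c1 - c2,
    lambda c1, c2: c1 * c2,
    lambda c1, c2: c1 / c2,
]

# One output line per input line, one column per expression.
lines = [" ".join("%g" % f(c1, c2) for f in exprs) for c1, c2 in rows]
```

The third line comes out as "11 -1 30 0.833333", showing %g's six significant digits at work.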
2. Print the mean and standard deviation of a series of numbers:

    med 'mean(c1)' 'stdev(c1)'

The input

    1
    2
    3
    4

would produce the single line

    2.5 1.29099
3. Compute the mean and standard deviation of a column of numbers the ``hard way'':

    med 'sum(c1)/n' 'sqrt((sum(c1**2)-sum(c1)**2/n)/(n-1))'

This example would generate the same output as the previous one. (In fact, the built-in mean and stdev functions are implemented internally with exactly these latter expressions.)
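The ``hard way'' expressions translate directly into Python, which makes it easy to confirm the formula by hand. The variable names here are illustrative; only the two formulas come from the example above.

```python
import math

data = [1, 2, 3, 4]
n = len(data)

total = sum(data)                     # sum(c1)
total_sq = sum(x ** 2 for x in data)  # sum(c1**2)

mean = total / n
# Sample standard deviation from running totals, with n-1 in the
# denominator, exactly as in the med expression.
stdev = math.sqrt((total_sq - total ** 2 / n) / (n - 1))

result = "%g %g" % (mean, stdev)
```

With this input, total is 10 and total_sq is 30, so the variance is (30 - 25) / 3 and result is "2.5 1.29099".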
4. Print the maximum of column 1, the maximum of column 2, and the maximum of column 3:

    med 'max(c1)' 'max(c2)' 'max(c3)'

The input

    1 5 2
    8 9 6
    4 7 3

would produce the line

    8 9 6
5. Print (on each line) the maximum of columns 1, 2, and 3:

    med 'max(c1, c2, c3)'

The input of the previous example would generate

    5
    9
    7
6. Print the maximum of the maximum of columns 1, 2, and 3:

    med 'max(max(c1), max(c2), max(c3))'

The input of the previous example would generate the single number 9.
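Examples 4 through 6 differ only in the direction in which the maximum is taken. A Python rendering of the same three computations, on the same 3x3 input, may make the distinction clearer:

```python
# The three input lines, as (c1, c2, c3) tuples.
rows = [(1, 5, 2), (8, 9, 6), (4, 7, 3)]

# Example 4: single-argument max, summarizing down each column.
column_max = [max(col) for col in zip(*rows)]

# Example 5: multi-argument max, computed across each line.
row_max = [max(row) for row in rows]

# Example 6: multi-argument max applied to the column summaries.
overall_max = max(column_max)
```

These give [8, 9, 6], [5, 9, 7], and 9 respectively, matching the outputs shown in the examples.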
Not all possible floating-point errors are handled gracefully.
C and FORTRAN differ in the naming of logarithmic functions. This program uses ln for natural log, and log10 for ``common'' or base-10 log. (Plain log is also accepted, and implements common log, following FORTRAN. In C, log() is a natural log.)
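For anyone translating expressions between med and a C-family language, the naming can be summarized as a lookup table. The Python mapping below is purely illustrative; MED_LOGS is a hypothetical name.

```python
import math

# med function name -> equivalent Python function.
MED_LOGS = {
    "ln": math.log,        # natural log
    "log10": math.log10,   # common (base-10) log
    "log": math.log10,     # plain log is common log, as in FORTRAN
}
```

Note the trap: med's log corresponds to Python's (and C's) log10, not to their log.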
It's probably a bad idea to have min() and max() do two such very different things depending simply on whether they're invoked with single or multiple arguments.
In versions of the program that handle multiple datatypes (e.g. bigints, date/timestamps, etc.), if operands of different types are mixed in the same expression, not all of the appropriate implicit type conversions are yet supported.
In versions of the program that handle multiple datatypes, when the -fmt option or @ notation is used to select an output format, and depending on the actual type of the result, it is not always possible to use the requested format exactly.
Because singleton column-selection expressions such as ``c1'' are special-cased, causing the input to be passed through to the output unchanged, an attempt to use the -fmt option to simply change the base or format of a column of numbers may fail. This problem can be worked around by using a do-nothing expression such as ``c1+0''.
In versions of the program that handle multiple datatypes, constants within expressions can be problematical if the syntax of a constant of an exotic datatype is baroque. For example, when working with date/timestamps, the attempted expression

    c1 + 1:00

is an unparseable syntax error. (Although the expression evaluation machinery knows about date/timestamps, the lexical analyzer does not.) To work around this difficulty, an experimental new quoting mechanism has been introduced: a pair of backquotes (also known as grave accents) indicates a constant which is to be interpreted as an extended datatype. The above example could be successfully rendered as

    c1 + `1:00`

(Note, however, that backquotes are special to the shell and are thus another reason to quote med's expressions.) As mentioned, this backquote mechanism is experimental and may not persist in eventual evolutions of the program.
Since the input is read only once, an expression like

    c1/max(c1)

which attempts to apply a summarizing function to each line of input, will not work.
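What such an expression is reaching for requires two passes over the data: one to find the maximum, and a second to divide by it. A minimal Python sketch of that idea, outside med entirely, might look like this (normalize_by_max is a hypothetical name):

```python
def normalize_by_max(values):
    peak = max(values)                  # first pass: find the maximum
    return [v / peak for v in values]   # second pass: scale each value

scaled = normalize_by_max([1, 8, 4])
```

Here scaled is [0.125, 1.0, 0.5]. The maximum is not known until the whole column has been seen, which is exactly what a single-pass filter cannot provide.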
The standard deviation function, since it is implemented in terms of running totals of x and x**2 (rather than directly from the definition), can suffer numerical accuracy problems.
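The accuracy problem shows up when the values are large relative to their spread, because the two running totals are then nearly equal and their difference loses most of its significant digits. The Python sketch below contrasts the running-totals formula with Welford's one-pass update, which is offered here only as a well-known alternative and is not something med is documented to use.

```python
def variance_running_totals(xs):
    # Same formula as the stdev expression: totals of x and x**2.
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)

def variance_welford(xs):
    # Welford's method: update the mean and the sum of squared
    # deviations (m2) incrementally, avoiding the large cancellation.
    mean = 0.0
    m2 = 0.0
    for k, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)
    return m2 / (len(xs) - 1)

# Values with a large offset and a small spread; the true sample
# variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
```

On this data variance_welford returns 30, while variance_running_totals is off by far more than rounding noise, since the sums involved are near 4e18 where adjacent doubles are hundreds apart.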
Since med's datatypes are implicit and their handling currently somewhat approximate, it is not at all clear what the wordsize for the bitwise ~ operator should be. (For this reason, bitwise complement of bigints is not yet supported, although you can achieve the same effect by XORing with an all-1's mask of the desired size.)
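The XOR workaround is straightforward to state precisely. In Python, whose integers are also arbitrary-precision, it looks like this (complement and bits are illustrative names):

```python
def complement(x, bits):
    # Build a mask of `bits` one-bits, then flip exactly those bits.
    mask = (1 << bits) - 1
    return x ^ mask
```

For example, complement(0b1010, 4) is 0b0101. The caller chooses the word size explicitly, which sidesteps the question of what the ``natural'' size should be.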
The input is limited to 20 columns. There is also a limit on the number of embedded constants in expressions (33-36 distinct values).
When using -N (and, for that matter, when not using -N) there is no way to specify headings of any kind on the output columns.
Specifying expressions on the command line is awkward. Unless they are quoted, individual expressions must be devoid of whitespace, and operators such as `*' can cause difficulties with shell wildcarding. (It is safest, therefore, to routinely enclose each expression in single quotes.) Placing expressions in a separate file, and using -ef, avoids these difficulties, at the cost of the inconvenience of that separate file.
The lexical distinction between option flags, expressions, and input filenames is subtle if not downright ambiguous. It is safer to use -ef, or -if, or shell input redirection (that is, using <), rather than mixing expressions and filenames on the command line. (In the presence of command line expressions, an attempted filename argument will definitely not work if the name is numeric, or otherwise looks like an expression.)
The command line option parser is even more baroque than usual. Single-character options cannot be combined (e.g. -bn is not a substitute for -b -n).
There are no user-definable variables or functions.
It could be argued that this program is not sufficiently more useful than awk(1) to warrant its existence.
Steve Summit, firstname.lastname@example.org
med was written at a time when I was enduring the privations of an old MS-DOS environment, and didn't even have the option of using awk.
This documentation corresponds to version 2.9 of the program. See http://www.eskimo.com/~scs/src/#med for possible updates.