World Wide Web CGI (Common Gateway Interface) Programming in C

When the contents of a web page are not static, but rather are generated on-the-fly as the page is fetched, a program runs to generate the contents. When that program requires input from the client who is actually fetching the page (input such as the selections made when filling out a form) that input is propagated to the program via the Common Gateway Interface, or CGI.

Programs which generate web pages on-the-fly (commonly called ``CGI programs'') can be written in any language. Perl, an interpreted or scripting language, is currently the language of choice for writing CGI programs (also called ``CGI scripts'' if they're written in a scripting language), but it's eminently feasible, not difficult, and possibly advantageous to write them in C, which the rest of this handout will describe how to do.

The basic operation of a CGI program is quite simple. Since its job is to generate a web page on the fly, that's just what it must do, by ``printing'' text to its standard output. When we're writing a CGI program in C, therefore, we'll be calling printf a lot, to generate the text we want the virtual page (that is, the one we're building) to contain. Some of the printf calls will print constant or ``boilerplate'' text; for example, the first few lines of almost any C-written CGI program will be along the lines of

	printf("Content-Type: text/html\n\n");
	printf("<html>\n");
	printf("<head>\n");

The <html> and <head> tags are the ones that appear at the beginning of any HTML page. The first line, the Content-Type: line, informs the receiving browser that the page is encoded in HTML. For static pages, that line is automatically added by the web server (often as a function of the file name, i.e. if it ends in .html or perhaps .htm), which is why you never see it in static HTML files. But since CGI programs are generating pages on the fly, they must include that line explicitly. (The content type is almost invariably text/html, unless you're getting really fancy and generating images on the fly. Notice that the Content-Type: line is followed by two newlines, creating a blank line which separates this ``header'' from the HTML document proper.)

Of course, not all of the printf calls will print constant text (or else the resulting page will essentially be static, and needn't have been generated by a CGI program at all). Therefore, many of the printf calls will typically insert some variable text into the output. For example, if title is a string variable containing a title which has been generated for the generated page, one of the next lines in the CGI program might be

	printf("<title>%s</title>\n", title);
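
To see the header, the blank line, and a variable title working together, here is a minimal sketch that formats the start of a page into a buffer. The function name format_page_start is made up for this illustration, and formatting into a buffer (rather than calling printf directly) is just one way to arrange things.

```c
#include <stdio.h>

/* Hypothetical helper: format the CGI header and the opening HTML
   for a page with the given title into buf.  Returns the number of
   characters that were (or would have been) written, as snprintf does. */
int format_page_start(char *buf, size_t size, const char *title)
{
	return snprintf(buf, size,
		"Content-Type: text/html\n\n"	/* header, then the blank line */
		"<html>\n"
		"<head>\n"
		"<title>%s</title>\n"
		"</head>\n",
		title);
}
```

A CGI program could call format_page_start(page, sizeof page, title) and then write page to its standard output with fputs.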

Many of the decisions controlling the output of a particular run of a CGI program will of course be made based on selections made by the requesting user (perhaps by having filled out a form). The most important thing to understand about CGI programming (in fact, the very aspect of CGI programming which gives it its name, that is, the aspect which the CGI specification specifies) is the set of mechanisms by which the user's choices and other information are made available to a CGI program.

One mechanism is by means of the environment. The environment is a set of variables maintained, not by any one program, but by the operating system itself, on behalf of a user or process. For example, on MS-DOS and Unix systems, the PATH variable contains a list of directories to be searched for executable programs. On Unix systems, other common environment variables are HOME (the user's home directory), TERM (the user's terminal type) and MAIL (the user's electronic mailbox).

Environment variables are completely separate from the variables used in programs written in typical programming languages, such as C. When a C program wishes to access an environment variable, it does not simply declare a C variable of the same name, and hope that the value will somehow be magically linked into the program. Instead, a C program must call the library function getenv to fetch environment variables. The getenv function accepts a string which is the name of an environment variable to be fetched, and returns a string which is the contents of the variable. (Environment variables are always strings.) If the requested environment variable does not exist, getenv returns a null pointer instead.

For example, here is a scrap of code which would print out a user's terminal type (presumably on a Unix system):

	char *termtype;
	termtype = getenv("TERM");
	if(termtype != NULL)
		printf("your terminal type is \"%s\"\n", termtype);
	else	printf("no terminal type defined\n");

Two things to be careful of when calling getenv are

As already mentioned, it returns a null pointer if the requested variable doesn't exist, and this null pointer return must always be checked for; and
You shouldn't modify (scribble on) the string it returns, that is, you should treat it as a constant string.
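
One way to respect both cautions at once is to wrap getenv in a small function that checks for the null pointer and hands back a private copy. This is only a sketch; the name getenvcopy and the idea of a fallback string are assumptions made for the example, not part of any standard library.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'ed copy of the named environment
   variable, or of fallback if the variable is not set.  Returns NULL
   only if malloc fails.  The caller owns (and may modify) the copy. */
char *getenvcopy(const char *name, const char *fallback)
{
	const char *val = getenv(name);
	char *copy;

	if(val == NULL)
		val = fallback;		/* variable doesn't exist */
	copy = malloc(strlen(val) + 1);
	if(copy != NULL)
		strcpy(copy, val);
	return copy;
}
```

Since the caller gets its own copy, it can scribble on the result freely, and should pass it to free when done.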

A web daemon (a.k.a. HTTP server or httpd) which supports CGI passes a host of environment variables to a CGI program. Many of them are somewhat esoteric, and this handout will not describe them all. (See the reference at the end.) The ones which are most likely to be of use to ordinary CGI programs are:

REQUEST_METHOD The specific request that the client used to fetch the page (that is, to invoke the CGI program): usually GET or POST.
PATH_INFO Any extra path information that appeared after the program name in the URL that invoked this CGI program.
PATH_TRANSLATED The path information from PATH_INFO, with any ``virtual-to-physical mapping'' performed on it.
SCRIPT_NAME The actual name of the CGI program that is running, as a URL, in case the generated page needs to reference itself.
QUERY_STRING The query being performed by the user. In the case of form processing, the QUERY_STRING variable contains encoded information about each field filled in by the user. (We'll have much more to say about QUERY_STRING in a bit.)
REMOTE_HOST The hostname of the client (if known).
CONTENT_TYPE The HTTP/MIME content type of the attached information, for queries (such as POST) with attached information.
CONTENT_LENGTH The size of the attached information, if any.
HTTP_USER_AGENT An indication of the name and version of the client browser in use.
HTTP_ACCEPT The specification by the client browser of which document types it will accept.

For simple uses of the above-listed variables, it suffices to call getenv and then inspect the returned string (if any). For example, you could generate different output depending on which browser a user is using with code like this:

	char *browser = getenv("HTTP_USER_AGENT");
	if(browser != NULL && strstr(browser, "Mozilla") != NULL)
		printf("I see you're using Netscape, just like everyone else.\n");
	else	printf("Congratulations on your daring choice of browser!\n");

(However, this is not a terribly good example. Building knowledge of specific browsers into web pages and servers is a risky proposition, as there will always be more browsers out there than you've heard of. Doing so also flies in the face of the notion of interoperability, which is one of the very foundations of the Internet. If the protocols are well designed, and if all clients and servers implement them properly, no specific server should ever need to know what kind of client it's talking to, or vice versa.)

Environment variables are not the only input mechanism available in CGI. It's also possible for input to arrive via the standard input, to be read with getchar or the like. In fact, in production CGI programming, the standard input is heavily used, because it's where the user's query information arrives when the retrieval method (that is, REQUEST_METHOD) is POST. Because it allows essentially unlimited-size requests, POST is strongly recommended as the request method for HTML forms. It will be a bit easier, however, to describe and demonstrate the processing of query information if we restrict ourselves to the environment variables (specifically, QUERY_STRING), so our examples will use the request method of GET (which causes request information to be sent as a query string rather than on standard input).
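
Although the examples in this handout stick to GET, reading POST data is not hard. The sketch below reads exactly CONTENT_LENGTH characters from a stream. The function name readpostdata and the 64K sanity limit are assumptions made for this illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the attached information of a POST request.
   lenstr is the value of CONTENT_LENGTH (may be NULL); in is normally stdin.
   Returns a malloc'ed, 0-terminated string, or NULL on any error. */
char *readpostdata(FILE *in, const char *lenstr)
{
	long len;
	char *buf;

	if(lenstr == NULL)
		return NULL;
	len = strtol(lenstr, NULL, 10);
	if(len < 0 || len > 65536L)	/* arbitrary limit for this sketch */
		return NULL;
	buf = malloc((size_t)len + 1);
	if(buf == NULL)
		return NULL;
	if(fread(buf, 1, (size_t)len, in) != (size_t)len)
		{
		free(buf);
		return NULL;
		}
	buf[len] = '\0';
	return buf;
}
```

A program would typically call readpostdata(stdin, getenv("CONTENT_LENGTH")). For an ordinary HTML form, the string that comes back is encoded the same way as QUERY_STRING.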

For simple queries (such as <ISINDEX> searches), the QUERY_STRING environment variable simply contains the query string (although it will have been encoded; see below). For more complicated queries, however, and in particular for queries resulting from filled-out and submitted HTML forms, the value of the QUERY_STRING variable is a complicated string, with substructure, containing potentially many pieces of information. The basic syntax of the QUERY_STRING string in this case is

	name=value&name=value&name=value

The string is a series of name=value pairs, separated by ampersand characters. Each name is the name attribute from one of the input elements in the form; the value is either the text typed by the user (for an <input> element with a type attribute of text or password, or for a <textarea> element), or one of the <option> values from a <select> element, or the value attribute of a selected element of type checkbox or radio (or ``on'' for selected checkbox elements without value attributes).

What if some text typed by the user (that is, one of the values) happened to contain an = or & character? That would certainly screw up the syntax of the QUERY_STRING string. To guard against this problem, and to keep the query string a single string, it is encoded in two ways:

All spaces are replaced by + signs.
All = and & characters, and any other special characters (such as + and % characters) are replaced by a percent sign (%) followed by a two-digit hexadecimal number representing the character's value in ASCII. For example, = is represented by %3d, & is represented by %26, + is represented by %2b, % is represented by %25, and ~ is often represented by %7e.
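
A CGI program normally only has to undo these encodings, but it can be easier to understand them by seeing them applied. Here is a sketch of an encoder. The name escstring is made up, and the choice of which characters to pass through unchanged (letters, digits, and - _ .) is an assumption; real browsers differ somewhat.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: encode src into dest (of size destsize) using the
   two rules above.  Returns dest, or NULL if dest is too small. */
char *escstring(const char *src, char *dest, size_t destsize)
{
	size_t n = 0;

	if(destsize == 0)
		return NULL;
	for(; *src != '\0'; src++)
		{
		unsigned char c = (unsigned char)*src;
		if(n + 4 > destsize)	/* worst case: 3 characters plus the \0 */
			return NULL;
		if(c == ' ')
			dest[n++] = '+';
		else if(isalnum(c) || c == '-' || c == '_' || c == '.')
			dest[n++] = (char)c;
		else	n += (size_t)sprintf(dest + n, "%%%02x", (unsigned)c);
		}
	dest[n] = '\0';
	return dest;
}
```

On an ASCII system, encoding the string "a b=c&d" this way gives "a+b%3dc%26d".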

A CGI program must arrange to decode (that is, undo) these encodings before attempting to make further use of the information in the QUERY_STRING variable.

We're going to write a function to centralize the parsing of the QUERY_STRING variable, for three reasons:

Splitting the string into name/value pairs is a bit of a nuisance, and is best isolated from the calling code;
Decoding the + and % encodings is a bit of a nuisance, and is best isolated from the calling code; and
If we do it right, we can arrange that improving our programs later to use POST instead of GET (that is, to retrieve query information from the standard input rather than the QUERY_STRING variable) will require rewriting only this one function, not the rest of the program.

The function we'll write is very simple; its prototype is

	char *cgigetval(char *);

You hand it the name of the value you're looking for, and it either returns the (properly decoded) value, or a null pointer if no value by that name could be found. (This might happen if the program or the calling form were improperly written, or if the user hadn't typed anything or selected anything in the requested input element.) The strings returned by cgigetval are in dynamically-allocated memory, that is, cgigetval calls malloc to get storage for each string it returns, and returns a pointer. This means that

the caller doesn't have to worry about providing a buffer or anything;
the caller can modify the string (as long as no attempt is made to extend it); but
the caller is responsible for freeing the memory, if necessary.

(In some programs, the last requirement could be a significant problem; it might be a real nuisance for the caller to have to remember to free the returned pointers, and the program might run out of memory if callers forgot. In this case, however, it's not likely that we'll run out of memory in any case--CGI programs are and should be short-lived--and the advantages of having cgigetval return dynamically-allocated memory will be real conveniences.)

Before presenting the innards of the cgigetval function, let's look at how we'll use it in a (contrived, but working and illustrative) real example.

Our example CGI program will be called by this web page:

<html>
<head>
<title>CGI test page</title>
</head>
<body>
<form action="/cgi-bin/sillycgi" method="get">
<p>
This page demonstrates simple form and CGI processing.
<p>
Enter a line of text here:
<input type="text" name="textfield">
<p>
Select a transformation to apply to the text:
<br>
<input type="radio" name="edittype" value="reverse">
Reverse
<br>
<input type="radio" name="edittype" value="upper">
Upper-case
<br>
<input type="radio" name="edittype" value="lower">
Lower-case
<p>
Press this button to see the result:
<input type="submit">
<p>
(Or press this button to clear this form:
<input type="reset">)
</form>
</body>
</html>

The bulk of this page is a form, with one text input area, one set of three radio buttons, a submit button, and a clear button. The user is supposed to type some text, select a radio button representing one of three transformations to be applied to the text, and finally press the ``Submit'' button. Notice that the four <input> tags for the text field and the radio buttons all have name attributes; these are the names we'll retrieve the values by. (The names of the three radio button elements are the same; this is how the browser knows to implement the one-of-n functionality on the group of three.)

The form header is

	<form action="/cgi-bin/sillycgi" method="get">

If you try this form out, you may have to modify the URL in the action attribute, depending on how CGI programs are accessed under your server. The method attribute is specified as ``get'', which means that our CGI program will receive the user's form inputs in the QUERY_STRING variable.

Here is the CGI program itself. It's rather small and simple, because cgigetval does most of the hard work, which is as it should be.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

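extern char *cgigetval(char *);
extern void upperstring(char *);
extern void lowerstring(char *);
extern void reverse(char *);	/* from assignment 4; signature assumed */

int main(void)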
{
char *browser;
char *edittype;
char *text;

printf("Content-Type: text/html\n\n");

printf("<html>\n");
printf("<head>\n");
printf("<title>CGI test result</title>\n");
printf("</head>\n");
printf("<body>\n");

browser = getenv("HTTP_USER_AGENT");

printf("<p>\n");
printf("Hello, ");
if(browser != NULL && strstr(browser, "Lynx") != NULL)
	printf("Lynx");
else if(browser != NULL && strstr(browser, "Mosaic") != NULL)
	printf("Mosaic");
else if(browser != NULL && strstr(browser, "Mozilla") != NULL)
	printf("Netscape");
else	printf("unknown browser");
printf(" user!\n");

printf("<p>\n");

printf("Here is your modified text:\n");
printf("<br>\n");

text = cgigetval("textfield");
edittype = cgigetval("edittype");

if(text == NULL)
	{
	printf("You didn't enter any text!\n");
	}
else if(edittype != NULL && strcmp(edittype, "reverse") == 0)
	{
	reverse(text);
	printf("%s\n", text);
	}
else if(edittype != NULL && strcmp(edittype, "upper") == 0)
	{
	upperstring(text);
	printf("%s\n", text);
	}
else if(edittype != NULL && strcmp(edittype, "lower") == 0)
	{
	lowerstring(text);
	printf("%s\n", text);
	}
else	{
	printf("You didn't select a transformation!\n");
	}

printf("</body>\n");
printf("</html>\n");

return 0;
}

Just for fun, we inspect HTTP_USER_AGENT and print a slightly different greeting depending on which browser the user seems to be using. Then, we call cgigetval twice, to retrieve the user's text and radio button selections. Depending on the radio button selection, we call one of three different functions to transform the text. (As we mentioned, we're allowed to modify the strings returned by cgigetval, in this case, text.) We also check for either the textfield or edittype values coming back as null pointers, indicating that the user neglected to enter them.
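
One thing the program above does not do is protect itself against text containing HTML markup: whatever the user types is printed into the page as-is, so a < or & in the input would be interpreted by the browser. A program meant for real use would want to escape such characters first. Here is a sketch of one way to do it; the function name htmlescape is hypothetical, and it is not part of the example program.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'ed copy of s in which the
   characters <, >, & and " are replaced by HTML entities.
   Returns NULL if malloc fails; the caller frees the result. */
char *htmlescape(const char *s)
{
	/* worst case: every character becomes the 6-character &quot; */
	char *ret = malloc(6 * strlen(s) + 1);
	char *d = ret;

	if(ret == NULL)
		return NULL;
	for(; *s != '\0'; s++)
		{
		const char *rep = NULL;
		switch(*s)
			{
			case '<': rep = "&lt;"; break;
			case '>': rep = "&gt;"; break;
			case '&': rep = "&amp;"; break;
			case '"': rep = "&quot;"; break;
			}
		if(rep != NULL)
			{
			strcpy(d, rep);
			d += strlen(rep);
			}
		else	*d++ = *s;
		}
	*d = '\0';
	return ret;
}
```

The main program could then print htmlescape(text) instead of text itself, after checking the result for a null pointer.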

Reversing a string in-place simply involves calling the reverse function from assignment 4. Converting strings to upper- or lower case is also easy; here are functions to do it:

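#include <ctype.h>

void upperstring(char *str)
{
char *p;

for(p = str; *p != '\0'; p++)
	{
	/* cast: the <ctype.h> functions require non-negative values */
	if(islower((unsigned char)*p))
		*p = toupper((unsigned char)*p);
	}
}

void lowerstring(char *str)
{
char *p;

for(p = str; *p != '\0'; p++)
	{
	if(isupper((unsigned char)*p))
		*p = tolower((unsigned char)*p);
	}
}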

The isupper, islower, toupper, and tolower functions are all in the standard C library, and are declared in the header file <ctype.h>. They operate on characters, with the obvious results. (Actually, on a modern, ANSI-compatible system, the calls to isupper and islower are not required, but they don't hurt much. Older systems required them.)
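
The reverse function called by the main program comes from assignment 4 and isn't reproduced in this handout. In case you don't have yours handy, here is a minimal sketch of an in-place version; the signature shown is an assumption about what the main program expects.

```c
#include <string.h>

/* A sketch of an in-place string reverse; the name and signature are
   assumed to match what the main program calls. */
void reverse(char *str)
{
	size_t i, j;
	size_t len = strlen(str);

	if(len < 2)
		return;		/* nothing to swap */
	for(i = 0, j = len - 1; i < j; i++, j--)
		{
		char tmp = str[i];
		str[i] = str[j];
		str[j] = tmp;
		}
}
```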

Finally, here is the code for cgigetval. It is somewhat long and involved, partly because it has a significant amount of work to do, partly because of some stubborn details of the sort that sometimes crop up in real-world programs.

#include <stdlib.h>
#include <string.h>
#include <ctype.h>

char *unescstring(char *, int, char *, int);

char *cgigetval(char *fieldname)
{
int fnamelen;
char *p, *p2, *p3;
int len1, len2;
static char *querystring = NULL;
if(querystring == NULL)
	{
	querystring = getenv("QUERY_STRING");
	if(querystring == NULL)
		return NULL;
	}

if(fieldname == NULL)
	return NULL;

fnamelen = strlen(fieldname);

for(p = querystring; *p != '\0';)
	{
	p2 = strchr(p, '=');
	p3 = strchr(p, '&');
	if(p3 != NULL)
		len2 = p3 - p;
	else	len2 = strlen(p);

	if(p2 == NULL || (p3 != NULL && p2 > p3))
		{
		/* no = present in this field; skip it and its trailing & */
		p += len2;
		if(*p == '&')
			p++;
		continue;
		}
	len1 = p2 - p;

	if(len1 == fnamelen && strncmp(fieldname, p, len1) == 0)
		{
		/* found it */
		int retlen = len2 - len1 - 1;
		char *retbuf = malloc(retlen + 1);
		if(retbuf == NULL)
			return NULL;
		unescstring(p2 + 1, retlen, retbuf, retlen+1);
		return retbuf;
		}

	p += len2;
	if(*p == '&')
		p++;
	}

/* never found it */

return NULL;
}

cgigetval presents a perfect example of a topic which we mentioned in passing but never gave any realistic examples of: local static variables. cgigetval's job is to parse the value of the QUERY_STRING environment variable, but the value of that variable will remain constant over one run of a program calling cgigetval. Therefore, we only need to call getenv once, the first time cgigetval is called. Thereafter, we can continue to use the previous value. cgigetval maintains the value of QUERY_STRING in a local static variable, querystring. This variable is initialized to NULL, so the first time cgigetval is called, it notices that querystring is null, and calls getenv to properly initialize it. The whole point of a local static variable is that it retains its value from call to call, so on subsequent calls to cgigetval, querystring will not be null, and there will be no need to call getenv again.
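
If local static variables are still a bit hazy, here is about the smallest example possible, stripped of everything having to do with CGI. (The function callcount is made up for this illustration.)

```c
/* Hypothetical example: count how many times the function has been
   called.  ncalls is initialized once, before the program starts,
   and keeps its value between calls. */
int callcount(void)
{
	static int ncalls = 0;

	ncalls++;
	return ncalls;
}
```

The first call returns 1, the second 2, and so on. Without the keyword static, ncalls would be created afresh on each call, and the function would always return 1.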

The main loop in cgigetval loops over each name/value pair in the query string, one pair at a time. The standard library strchr function (declared in <string.h>) is used to locate the next = and & characters in the string. It would be easier to overwrite the & and/or = characters with '\0', to isolate the name and value substrings as true 0-terminated strings, but we can't modify querystring (which was obtained from getenv, and which we'll be using again next time we're called). Instead, we compute the lengths of the entire name/value pair we're looking at and the name within that pair (in len2 and len1 respectively), and use strncmp to compare each name against the name we're looking for. (strncmp is a standard library function like strcmp, except that it only compares the first n characters.)
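
The comparison step can be pulled out and looked at on its own. The sketch below (the function name pairnameis is hypothetical) tests whether the pair at p has a given name, using the same pointer-plus-length technique. Notice that comparing the lengths first is what keeps a search for ``text'' from matching a field named ``textfield''.

```c
#include <string.h>

/* Hypothetical helper: p points at the start of a name=value pair inside
   a longer query string.  Return 1 if the pair's name is exactly
   fieldname, 0 otherwise.  The query string is not modified. */
int pairnameis(const char *p, const char *fieldname)
{
	const char *eq = strchr(p, '=');
	const char *amp = strchr(p, '&');
	size_t namelen;

	if(eq == NULL || (amp != NULL && eq > amp))
		return 0;		/* no = in this pair */
	namelen = (size_t)(eq - p);
	return namelen == strlen(fieldname) &&
		strncmp(p, fieldname, namelen) == 0;
}
```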

When (if) we find a matching name, it's time to decode and return the value. First, we call malloc to allocate the return buffer, remembering to add 1 to the length, to leave room for the terminating \0. We conservatively estimate the size of the buffer we'll need by using the length of the encoded string. If there are any % encodings, this means we'll allocate more than we need (the decoded string will end up shorter than the encoded string), but it's too hard to figure out in advance exactly how big a buffer we'd need, and in this case, the slight amount of wasted memory should be inconsequential.
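
If the wasted memory ever did matter, the exact size could be found with one extra pass over the encoded string, since each complete % sequence shrinks from three characters to one. Here is a sketch; the function name decodedlen is made up, and it assumes that a % with two characters after it is always a valid sequence.

```c
/* Hypothetical helper: return the number of characters that the first
   srclen characters of src will occupy once decoded (not counting the \0).
   Assumes every complete %xx sequence is valid. */
int decodedlen(const char *src, int srclen)
{
	int i, n = 0;

	for(i = 0; i < srclen; i++)
		{
		if(src[i] == '%' && i + 2 < srclen)
			i += 2;		/* %xx decodes to one character */
		n++;
		}
	return n;
}
```

For the seven-character encoded string a%20b+c, decodedlen returns 5.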

If we don't find a matching string, we take another trip through the main loop. Each trip through the loop, p points at the remainder of the string we're to examine. After discarding a name/value pair, therefore, we add len2 to p, which either moves it to the & or to the \0 at the very end of querystring. If we're now pointing at an &, we increment p by one more so that it points to the first character of the next name/value pair.

Here is the unescstring function. It accepts a pointer to the encoded string, the length of the encoded string, a pointer to the location to store the decoded result, and the maximum size of the decoded result (so that it can double-check it's not overflowing anything). We pass the encoded string in as a pointer plus a length, rather than as a 0-terminated string, because in most cases it will actually be an unterminated substring, and the length we receive will be such that we stop decoding just before the & that separates one name/value pair from the next.

static int xctod(int);

char *unescstring(char *src, int srclen, char *dest, int destsize)
{
char *endp = src + srclen;
char *srcp;
char *destp = dest;
int nwrote = 0;

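for(srcp = src; srcp < endp; srcp++)
	{
	/* always leave room for the terminating \0 */
	if(nwrote >= destsize - 1)
		return NULL;
	if(*srcp == '+')
		*destp++ = ' ';
	else if(*srcp == '%' && endp - srcp > 2)
		{
		/* only decode if both hex digits lie within the source */
		*destp++ = 16 * xctod(*(srcp+1)) + xctod(*(srcp+2));
		srcp += 2;
		}
	else	*destp++ = *srcp;
	nwrote++;
	}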

*destp = '\0';

return dest;
}

static int xctod(int c)
{
c = (unsigned char)c;	/* the <ctype.h> functions need a non-negative value */
if(isdigit(c))
	return c - '0';
else if(c >= 'A' && c <= 'F')
	return c - 'A' + 10;
else if(c >= 'a' && c <= 'f')
	return c - 'a' + 10;
else	return 0;
}

The basic operation is simple: copy characters one by one from the source to the destination, converting + signs to spaces, converting % sequences, and passing other characters through unchanged. It's mildly tricky to convert a % sequence to the equivalent character value. Here, we use an auxiliary function xctod which converts one hexadecimal digit to its decimal equivalent. (xctod is declared static because it's private to unescstring and to this source file.) xctod converts the digit characters 0-9 to the values 0-9, and a-f or A-F to the values 10-15. (It does so by subtracting character constants which provide the correct offsets, without our having to know the particular character set values.)

It would also be possible to convert the hexadecimal digits by calling sscanf, and thus dispense with xctod. The code would look like this:

	else if(*srcp == '%')
		{
		unsigned int x = 0;	/* %x stores through an unsigned int pointer */
		sscanf(srcp+1, "%2x", &x);
		*destp++ = x;
		srcp += 2;
		}

The %2x in the sscanf format string means to convert a hexadecimal integer that's at most two characters long.
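
One caution about the sscanf approach is that %x is rather forgiving: it skips leading whitespace and accepts a sign, so it would happily convert something like +5. If that matters, the two characters can be checked with isxdigit first. Here is a sketch of a standalone conversion function done that way; the name hexpair is made up for this example.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: convert the two hexadecimal digits at s to a
   value from 0 to 255, or return -1 if they are not both hex digits. */
int hexpair(const char *s)
{
	unsigned int x;

	if(!isxdigit((unsigned char)s[0]) || !isxdigit((unsigned char)s[1]))
		return -1;
	if(sscanf(s, "%2x", &x) != 1)
		return -1;
	return (int)x;
}
```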

And that's our introduction to writing CGI programs in C! I encourage you to type in the example and play with it (assuming you have access to a web server where you can install CGI programs at all, of course). If you would like more information on CGI programming, you can read the page http://hoohoo.ncsa.uiuc.edu/cgi/overview.html (which is where I learned everything I've presented here).