Effective AWK Programming

A User's Guide for GNU Awk

Edition 1.0.3

February 1997

Arnold D. Robbins


These commands are available on POSIX compliant systems, as well as on traditional Unix based systems. If you are using some other operating system, you still need to be familiar with the ideas of I/O redirection and pipes.


Often, these systems use gawk for their awk implementation!


The `#!' mechanism works on Linux systems, Unix systems derived from Berkeley Unix, System V Release 4, and some System V Release 3 systems.


The line beginning with `#!' lists the full file name of an interpreter to be run, and an optional initial command line argument to pass to that interpreter. The operating system then runs the interpreter with the given argument and the full argument list of the executed program. The first argument in the list is the full file name of the awk program. The rest of the argument list will either be options to awk, or data files, or both.


In POSIX awk, newlines are not considered whitespace for separating fields.


The sed utility is a "stream editor." Its behavior is also defined by the POSIX standard.


The internal representation uses double-precision floating point numbers. If you don't know what that means, then don't worry about it.


In POSIX awk, newline does not count as whitespace.


Some early implementations of Unix awk initialized FILENAME to "-", even if there were data files to be processed. This behavior was incorrect, and should not be relied upon in your programs.


Computer generated random numbers really are not truly random. They are technically known as "pseudo-random." This means that while the numbers in a sequence appear to be random, you can in fact generate the same sequence of random numbers over and over again.


This consequence was certainly unintended.


As of February 1997, with final approval and publication hopefully sometime in 1997.


A program is interactive if the standard output is connected to a terminal device.


Occasionally there are minutes in a year with a leap second, which is why the seconds can go up to 60.


This is because ANSI C leaves the behavior of the C version of strftime undefined, and gawk will use the system's version of strftime if it's there. Typically, the conversion specifier will either not appear in the returned string, or it will appear literally.


If you don't understand any of this, don't worry about it; these facilities are meant to make it easier to "internationalize" programs.


Not recommended.


Your version of gawk may use a directory that is different than `/usr/local/share/awk'; it will depend upon how gawk was built and installed. The actual directory will be the value of `$(datadir)' generated when gawk was configured. You probably don't need to worry about this though.


Some implementations of awk do not allow you to execute next from within a function body. Some other work-around will be necessary if you use such a version.


ASCII has been extended in many countries to use the values from 128 to 255 for country-specific characters. If your system uses these extensions, you can simplify _ord_init to simply loop from zero to 255.


This is the Epoch on POSIX systems. It may be different on other systems.


Examine the code in section Noting Data File Boundaries. Why must wc use a separate lines variable, instead of using the value of FNR in endfile?


On older, non-POSIX systems, tr often does not require that the lists be enclosed in square brackets and quoted. This is a feature.


This program was written before gawk acquired the ability to split each character in a string into separate array elements. How might this ability simplify the program?


"Real world" is defined as "a program actually used to get something done."


On some very old versions of awk, the test `getline junk < t' can loop forever if the file exists but is empty. Caveat Emptor.


The path may use a directory other than `/usr/local/share/awk', depending upon how gawk was built and installed.


In POSIX awk, newline does not separate fields.

This document was generated on 2 June 1997 using the texi2html translator version 1.51. Hacked to generate 8.3 filenames by Luis Hernández.