The split program splits large text files into smaller pieces.
Usage is as follows:
    split [-count] file [prefix]
By default, the output files are named xaa, xab, and so on. Each file has
1000 lines in it, with the likely exception of the last file. To change the
number of lines in each file, supply a number on the command line
preceded by a minus sign; e.g., -500 for files with 500 lines in them
instead of 1000. To change the names of the output files to something like
myfileaa, myfileab, and so on, supply an additional
argument that specifies the file name prefix.
Here is a version of split in awk. It uses the ord and
chr functions presented in
Translating Between Characters and Numbers.
The program first sets its defaults, and then tests to make sure there are
not too many arguments. It then looks at each argument in turn. The
first argument could be a minus sign followed by a number. If it is, this happens
to look like a negative number, so it is made positive, and that is the
count of lines. The data file name is skipped over and the final argument
is used as the prefix for the output file names:
    # split.awk --- do split in awk
    #
    # Requires ord and chr library functions
    # usage: split [-num] [file] [outname]

    BEGIN {
        outfile = "x"    # default
        count = 1000
        if (ARGC > 4)
            usage()

        i = 1
        if (ARGV[i] ~ /^-[0-9]+$/) {
            count = -ARGV[i]
            ARGV[i] = ""
            i++
        }
        # test argv in case reading from stdin instead of file
        if (i in ARGV)
            i++    # skip data file name
        if (i in ARGV) {
            outfile = ARGV[i]
            ARGV[i] = ""
        }

        s1 = s2 = "a"
        out = (outfile s1 s2)
    }
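The handling of the first argument can be tried in isolation. The following
is only a sketch and not part of split.awk: parse_count is a hypothetical
shell wrapper that hands one argument to awk with -v (an assumption made for
convenience; the real program reads ARGV) and prints the line count that the
same regexp test and unary minus would produce.

```shell
# parse_count ARG: hypothetical helper, for illustration only.
# Prints the line count that the BEGIN rule's test would choose
# if ARG were the first command-line argument.
parse_count() {
    awk -v arg="$1" 'BEGIN {
        count = 1000               # default, as in split.awk
        if (arg ~ /^-[0-9]+$/)     # a minus sign followed only by digits
            count = -arg           # unary minus turns "-500" into 500
        print count
    }'
}

parse_count -500       # prints 500
parse_count bigfile    # prints 1000, since this is not a count
```

Because the regexp is anchored at both ends, an argument such as -5x is not
treated as a count and is left alone.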
The next rule does most of the work. tcount (temporary count) tracks
how many lines have been printed to the output file so far. If it is greater
than count, it is time to close the current file and start a new one.
s1 and s2 track the current suffixes for the file name. If
they are both z, the file is just too big. Otherwise, if s2 is z, s1
moves to the next letter in the alphabet and s2 starts over again at
a; if s2 is not z, s2 simply moves to the next letter:
    {
        if (++tcount > count) {
            close(out)
            if (s2 == "z") {
                if (s1 == "z") {
                    printf("split: %s is too large to split\n",
                           FILENAME) > "/dev/stderr"
                    exit 1
                }
                s1 = chr(ord(s1) + 1)
                s2 = "a"
            }
            else
                s2 = chr(ord(s2) + 1)
            out = (outfile s1 s2)
            tcount = 1
        }
        print > out
    }
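To see the order in which the suffixes are generated, the stepping logic can
be run on its own. This is a sketch under stated assumptions: next_suffixes is
a hypothetical shell wrapper, and the chr and ord functions defined inside it
are minimal stand-ins for the library functions, not the library versions
themselves.

```shell
# next_suffixes N: hypothetical helper, for illustration only.
# Prints the first N (N >= 1) two-letter suffixes, one per line,
# stepping s1 and s2 the same way the main rule of split.awk does.
next_suffixes() {
    awk -v n="$1" '
    # Minimal stand-ins for the chr and ord library functions.
    function chr(c) { return sprintf("%c", c + 0) }
    function ord(s,    i) {
        for (i = 1; i < 256; i++)
            if (chr(i) == substr(s, 1, 1))
                return i
        return -1
    }
    BEGIN {
        s1 = s2 = "a"
        print s1 s2
        for (k = 2; k <= n; k++) {
            if (s2 == "z") {
                if (s1 == "z") {          # no suffix follows zz
                    print "out of suffixes" > "/dev/stderr"
                    exit 1
                }
                s1 = chr(ord(s1) + 1)     # advance the first letter
                s2 = "a"                  # and restart the second
            } else
                s2 = chr(ord(s2) + 1)     # advance the second letter only
            print s1 s2
        }
    }'
}

next_suffixes 28 | tail -n 3    # prints az, ba, bb on separate lines
```

There are 26 * 26 = 676 distinct suffixes, so with the default count of 1000
the input can have at most 676,000 lines before the program gives up.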
The usage function simply prints an error message and exits:
    function usage(   e)
    {
        e = "usage: split [-num] [file] [outname]"
        print e > "/dev/stderr"
        exit 1
    }
The variable e is used so that the function fits nicely on the page.
This program is a bit sloppy; it relies on awk to automatically close the last file
instead of doing it in an END rule.
It also assumes that letters are contiguous in the character set,
which isn't true for EBCDIC systems.
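One possible way to avoid the contiguity assumption, offered here as a
suggestion rather than as part of the original program, is to look up each
letter in an explicit alphabet string with index and substr instead of doing
arithmetic on character codes. The name next_letter is hypothetical, and the
shell wrapper exists only so that the function can be tried from the command
line.

```shell
# next_letter CH: hypothetical helper, for illustration only.
# Prints the letter that follows CH in the lowercase alphabet,
# or an empty line if CH is "z" or is not a lowercase letter.
next_letter() {
    awk -v ch="$1" '
    function next_letter(c,    letters, pos) {
        letters = "abcdefghijklmnopqrstuvwxyz"
        pos = index(letters, c)           # 0 if c is not in the alphabet
        if (pos == 0)
            return ""
        return substr(letters, pos + 1, 1)   # "" when c is "z"
    }
    BEGIN { print next_letter(ch) }'
}

next_letter a    # prints b
next_letter y    # prints z
```

With such a function, s2 = chr(ord(s2) + 1) could become
s2 = next_letter(s2), and likewise for s1. The first point could be handled
in a similar spirit by adding an END rule that calls close(out).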