An introduction to file handling in programs - buffering, standard in and out, and file handles

Archive - Originally posted on "The Horse's Mouth" - 2010-09-21 08:11:46 - Graham Ellis

Stdout and Stderr

Programmers typically don't write code to output to the screen / current window, as to do so would be to provide an inflexible system - instead, they output to what's known as "standard out", also known as stdout. Usually stdout defaults to the screen / current window so there's no difference as far as the user is concerned, but ...
• It can be redirected to a file in the command line / shell so that the user can run a program, saving the output for further processing / emailing
• It can be piped into another command through the shell - so that a whole series of programs can be bolted together to process data through a series of stages
• It can be sent by email to the instigator of batch or timed jobs (set up by at or cron / crontab for example) rather than the instigator having to be up in the middle of the night when a bookkeeping job runs
• It can be routed to a user's browser via mechanisms such as the Common Gateway Interface (CGI) through the Apache httpd web server - thus producing results via a web page

In addition to stdout, there's another output called "Standard error" or stderr - and that's where the programmer should send warnings and errors. The idea of stderr is that error messages come up on the screen where the user's running a program from, even if the user has chosen to send the main output from the program to a file by redirecting stdout. The user wants to know if a program has had problems without having to delve deep into the output reports!
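
As a small illustration of that split, here is a sketch in Perl (the language covered later in this article). The stock figures and the script name stock.pl are made up for the example; only print, STDOUT and STDERR come from Perl itself.

```perl
use strict;
use warnings;

# Hypothetical data for illustration: item name => quantity in stock
my %stock = (apples => 12, pears => 0, plums => 7);

my $problems = 0;
foreach my $item (sort keys %stock) {
    # The report itself goes to standard out
    print "$item: $stock{$item}\n";

    # Warnings go to standard error, so they stay visible
    # even when standard out has been redirected
    if ($stock{$item} == 0) {
        print STDERR "Warning: $item out of stock\n";
        $problems++;
    }
}
```

If this were saved as stock.pl and run as "perl stock.pl > report.txt", the three report lines should end up in report.txt while the warning about pears still comes up in the window the program was run from.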

File handles and buffering

Output from programs is "buffered" ... setting up a route from the processor to the screen / file / browser for every single character would be inefficient, so there's an intermediate element that's called a "file handle". When a programmer outputs something, it's actually held in a buffer attached to the file handle and then gets sent in batches - typically around 4k at a time - to wherever the output is routed. It's efficient to do it this way.

On input (from a file, or from "Standard In" a.k.a. stdin) there is also a file handle buffer; again, this allows a whole sector from the disc to be efficiently transferred into memory, whereas the program will typically call it up line by line.

There are times that buffering would be inconvenient, and it's common (but not universal) for output buffers to be flushed at times other than when they're full:
• When a user input is about to be requested (it's nice to be able to read the prompt before you have to answer a question)
• When the file is closed (so that the complete data is stored)
• When a new line character is added to the buffer on a file handle that's outputting to the screen (so that slower running processes will be seen to be making progress)
• and at other times under program control (to allow the programmer to output a series of dots as a progress bar and have them appear to indicate progress rather than in a big splurge when the job's complete!)
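
The last of those cases, the row of progress dots, might look something like this in Perl. This is a minimal sketch: the number of steps and the sleep that stands in for real work are assumptions for the example. The autoflush method comes from the standard IO::Handle module; older code often sets the special variable $| to 1 instead, which has the same effect on the currently selected output handle.

```perl
use strict;
use warnings;
use IO::Handle;   # provides the autoflush method on file handles

# Ask for STDOUT to be flushed after every print
STDOUT->autoflush(1);

my $steps = 3;    # hypothetical number of work units
my $done  = 0;
for my $step (1 .. $steps) {
    sleep 1;      # stands in for a slow piece of real work
    print ".";    # no new line, so without autoflush this could sit in the buffer
    $done++;
}
print " done\n";
```

With the autoflush line removed, the dots would typically all appear together when the new line is finally printed or the program ends.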

Opening and closing file handles

In most of the languages we teach, stdin, stdout and stderr are open and available to you in all of your programs, without the need to explicitly open them. There's a difference in PHP - which is first and foremost a web page language. The command line version of PHP predefines STDIN, STDOUT and STDERR for you, but elsewhere you'll need to open stdin as follows:
  $standard_in = fopen ("php://stdin","r");
before you can read from it as follows:
  $line = fgets($standard_in);

But if you want to access a file through a file handle (the usual way to do it!), you'll need to open the file with some sort of open or fopen statement first. That sets up the file handle buffers, etc, for you for the particular file you're going to be accessing - it would take an impractical amount of time and space to have every file on your disc automatically opened and available to you!

When you open a file handle for output, you've usually got a choice as to whether you open it for write (which means that any file that already exists with the same name in the same folder / directory will be completely overwritten), or for append (which means that anything you write to the file will be added onto the end of any pre-existing data).
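
The difference between write and append can be seen in a short Perl sketch such as the one below. The file name demo_log.txt and the handle name LOG are made up for the example, and the file is removed again at the end.

```perl
use strict;
use warnings;

my $file = "demo_log.txt";   # hypothetical file name

# Open for write: creates the file, or empties it if it already exists
open(LOG, ">", $file) or die "Cannot write to $file: $!";
print LOG "first run\n";
close(LOG);

# Open for write again: the earlier contents are lost
open(LOG, ">", $file) or die "Cannot write to $file: $!";
print LOG "second run\n";
close(LOG);

# Open for append: new output is added after the existing data
open(LOG, ">>", $file) or die "Cannot append to $file: $!";
print LOG "third run\n";
close(LOG);

# Read the file back to see what survived
open(LOG, "<", $file) or die "Cannot read $file: $!";
my @lines = <LOG>;
close(LOG);
print @lines;      # prints "second run" then "third run"
unlink $file;      # tidy up the demonstration file
```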

File Handles in Perl

In Perl 5 (and Perl 4 if you can remember it!), file handles are special variables which are expressed as bare words - there's no $ @ & % or * in front of the variable name. It's conventional to write them all in CAPITALS; that way, programmers coming to look at the code later know straight away that they're looking at a file handle and not some other sort of structure such as a predefined function or sub.

You have two standard output file handles available to you:
  STDOUT
  STDERR

You have two standard input file handles available to you:
  STDIN
  DATA
(DATA can read anything you've stored after an __END__ or __DATA__ line in your program)

There are two forms of the open statement - one with two parameters, and one with three. In the latter, three parameter case, the first parameter is the file handle name, the second indicates if it's read, write, or append, and the third is the file name. In the two parameter case, the mode (read, write, append) and name are combined; that's the older way, but it's less elegant and I recommend you always use the more modern, but slightly longer, version for any new code you write.
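
Side by side, the two forms might be written as in the following sketch. The file name demo_notes.txt and the handle name NOTES are made up for the example.

```perl
use strict;
use warnings;

my $name = "demo_notes.txt";    # hypothetical file name

# Three parameter form: handle, mode and file name kept separate
open(NOTES, ">", $name) or die "Cannot open $name: $!";
print NOTES "written with three parameter open\n";
close(NOTES);

# Two parameter form: mode and name combined into one string (older style)
open(NOTES, ">>$name") or die "Cannot open $name: $!";
print NOTES "appended with two parameter open\n";
close(NOTES);

# Three parameter form again, this time for read
open(NOTES, "<", $name) or die "Cannot open $name: $!";
my $count = 0;
while (my $line = <NOTES>) {
    $count++;
    print "$count: $line";
}
close(NOTES);
unlink $name;                   # tidy up the demonstration file
```

One practical reason to prefer three parameters: in the two parameter form, a file name that itself starts with a character such as > or < gets read as part of the mode, whereas the three parameter form always treats the third parameter purely as a name.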

When you're outputting to a file, Perl (as always) has a staggering variety of ways to do it, but I recommend print or printf be used in most cases. With both of those built-in functions, you specify the file handle directly after the function name (with no comma following it) and before a list of the values to be output. By default, print and printf send output to STDOUT, but there's no reason why you can't explicitly code this if you want to.
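
A minimal sketch of print and printf with file handles follows. The file name demo_prices.txt, the handle name OUT and the price data are all made up for the example; the script leaves the report file behind, as a real report writer would.

```perl
use strict;
use warnings;

my $file  = "demo_prices.txt";               # hypothetical file name
my %price = (tea => 1.5, coffee => 2.25);    # hypothetical data

open(OUT, ">", $file) or die "Cannot write to $file: $!";
foreach my $item (sort keys %price) {
    # File handle straight after printf, and no comma after it
    printf OUT "%-10s %6.2f\n", $item, $price{$item};
}
close(OUT);

print "Report saved\n";                      # STDOUT implicitly
print STDOUT "Saved to $file\n";             # STDOUT explicitly
print STDERR "Note: prices are samples\n";   # STDERR
```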

Input to a Perl program uses the < ... > operator, with the "..." being replaced by the file handle - that's STDIN or a named file handle. If you leave out the "..." completely, the operator reads from the files named on the command line (or from STDIN if none are given). If you save the result of a read into a list, you'll read all data from the source up to the end of file, with each line going into the next element of the list. Very useful - but not recommended if you've potentially got a huge file that may lead to memory overflow (we have customers with 10 GB files!), nor if you're reading from the keyboard, as few users will appreciate / understand that they need to type in ^D on Unix / Linux systems, or ^Z followed by a new line on Windows systems, to indicate "end of file".
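
The difference between reading a single line and reading into a list can be sketched as below. The file name demo_input.txt and the handle names MAKE and IN are made up for the example, which writes its own small input file first so that it can be run on its own.

```perl
use strict;
use warnings;

my $file = "demo_input.txt";    # hypothetical file name

# Create a small file to read back
open(MAKE, ">", $file) or die "Cannot write to $file: $!";
print MAKE "alpha\nbeta\ngamma\n";
close(MAKE);

# Saving into a scalar: just one line is read
open(IN, "<", $file) or die "Cannot read $file: $!";
my $first = <IN>;
chomp $first;

# Saving into a list: everything that remains, one line per element
my @rest = <IN>;
close(IN);
chomp @rest;

print "First line: $first\n";
print "Then ", scalar(@rest), " more: @rest\n";
unlink $file;                   # tidy up the demonstration file
```

For a large file, a while loop that reads one line at a time into a scalar keeps only the current line in memory, which is why it is usually the safer choice.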

See [here] for a Perl example including output to file, STDOUT implicitly, STDOUT explicitly, and STDERR.
