A Class for Reading Binary Files, by Jeff Heaton

Java contains an extensive array of classes for file access. A series of readers, writers, and filters make up the interface to the physical file system of the computer.

The advantage to this sort of system of classes is that the programmer is freed from the overhead of dealing with the physical layout of files. The main disadvantage to this architecture is that the programmer is isolated from the physical details of how a file is stored. Java programs have a distinct, well-defined way in which they store data to files. Unfortunately, this complicates matters when dealing with files created by other languages.

This article presents a reusable class that deals with binary files. Methods are provided that allow the programmer to read a variety of standard numeric and string formats. Additional methods are provided that take into account signed/unsigned, little/big-endian storage as well as file alignment. Using this class, the programmer can read nearly any sort of binary file. An example program is provided that will read the header from a GIF file.

One of the first problems to overcome is reading an unsigned byte. Java treats nearly all types as signed, but the arithmetic that later converts bytes into larger data types requires unsigned bytes. A protected method is provided to read bytes in unsigned form. It masks off all but the least significant eight bits of the value read and returns the result as a short, as the following method shows:

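protected short readUnsignedByte() throws java.io.IOException
{
    // readByte() returns a signed byte; the mask keeps only the low eight bits
    return (short)(_file.readByte() & 0xff);
}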
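To see why the mask matters, the following sketch compares a raw Java byte with its masked value. The class and method names are illustrative only and are not part of BinaryFile.

```java
// Illustrative helper; not part of the article's BinaryFile class.
final class UnsignedByteDemo {
    // Widen a signed Java byte to its unsigned value (0..255).
    static short toUnsigned(byte b) {
        // b is promoted to int with sign extension; the mask keeps the low 8 bits.
        return (short) (b & 0xff);
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xC8;               // bit pattern 1100 1000
        System.out.println(raw);              // prints -56
        System.out.println(toUnsigned(raw));  // prints 200
    }
}
```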

Using the BinaryFile Class
The BinaryFile class can be seen in BinaryFile.java. To use it, create a RandomAccessFile object for the file that you would like to work with. This file can be opened for read or write access. Then construct a BinaryFile object, passing your RandomAccessFile object to the constructor. The following two lines prepare to read from and write to a file called "test.dat".

file=new RandomAccessFile("test.dat","rw");
bin=new BinaryFile(file);

Once this is complete you can call the various methods provided to access different data types. The methods to access the various data types are prefixed with either read or write and then the type. For example, the method to read a fixed length string is readFixedLengthString.

String Data Types
There are many ways that strings are commonly stored in a binary file. The BinaryFile object supports four different string formats. The null-terminated and fixed-width null-terminated types used by C/C++ are supported. Fixed-width strings and the length-prefixed strings used by Pascal are supported as well.

Null-terminated strings are commonly used with C/C++ and other languages. In this format the characters of the string are stored one by one, with an ending zero character. This allows strings to be of any length. Strings stored in this format can contain any character, except for the zero character. Two types of null-terminated strings are supported.

The readZeroString and writeZeroString methods read and write null-terminated strings. This is an unlimited-length string that ends with a null (character 0). The readZeroString method accepts no parameters and returns a String object. The writeZeroString method accepts the String object to be written.
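As a rough illustration of the reading side, a null-terminated string can be consumed one byte at a time until the zero byte appears. This is a minimal sketch and not the BinaryFile implementation: the class name is hypothetical, and it assumes each byte maps to one character (ISO-8859-1 style). Because it accepts a java.io.DataInput, it works with a RandomAccessFile as well as a DataInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not the article's BinaryFile code.
final class ZeroStringSketch {
    // Reads bytes until a zero byte is found; the terminator is consumed but not returned.
    // Throws EOFException (an IOException) if the input ends before a terminator.
    static String readZeroString(DataInput in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.readUnsignedByte()) != 0) {
            sb.append((char) b);  // assumption: one byte per character
        }
        return sb.toString();
    }

    // Convenience wrapper for in-memory data.
    static String readZeroString(byte[] data) {
        try {
            return readZeroString((DataInput) new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```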

The readFixedZeroString and writeFixedZeroString methods are used to read and write fixed-length null-terminated strings. This is the type of string most commonly used by the C/C++ programming language. The amount of memory held by this sort of string is fixed, but the length of the string can vary from zero up to one less than the amount of memory reserved for it. In C/C++ this type of string is declared as:

char str[80];

This means that the str variable occupies 80 bytes. But its length can vary from zero to 79. No matter how long this string is, it is always stored to a disk file as exactly 80 bytes.
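The fixed-size layout can be sketched in Java with a zero-filled byte array. The names below are hypothetical, the one-byte-per-character encoding is an assumption, and the field size must be at least one so there is room for the terminator.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the article's BinaryFile code.
final class FixedZeroStringSketch {
    // Encode s into exactly `size` bytes: at most size-1 characters, then zero fill.
    static byte[] encode(String s, int size) {
        byte[] src = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] field = new byte[size];  // Java arrays start zero-filled
        System.arraycopy(src, 0, field, 0, Math.min(src.length, size - 1));
        return field;
    }

    // Decode up to the first zero byte (or the whole field if none is found).
    static String decode(byte[] field) {
        int len = 0;
        while (len < field.length && field[len] != 0) {
            len++;
        }
        return new String(field, 0, len, StandardCharsets.ISO_8859_1);
    }
}
```

With a size of 80, encode always returns 80 bytes, and a string longer than 79 characters is cut to 79 so the final byte stays zero.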

The Pascal language uses length-prefixed strings. The Macintosh operating system is based on Pascal strings and as a result length-prefixed strings are commonly found in files generated from the Macintosh platform.

The readLengthPrefixString and writeLengthPrefixString methods are used to read and write length-prefixed strings. The writeLengthPrefixString accepts a string and writes it out to the file. The readLengthPrefixString returns a String object read from the file. Length-prefixed strings occupy their length plus one byte in memory.
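A length-prefixed string can be sketched as a count byte followed by the characters. This sketch assumes a single unsigned length byte (consistent with the "length plus one byte" size above), which caps the string at 255 characters, and one byte per character. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the article's BinaryFile code.
final class PascalStringSketch {
    // Encode as [length byte][characters]; strings over 255 characters are cut.
    static byte[] encode(String s) {
        byte[] src = s.getBytes(StandardCharsets.ISO_8859_1);
        int len = Math.min(src.length, 255);  // one length byte holds 0..255
        byte[] out = new byte[len + 1];
        out[0] = (byte) len;
        System.arraycopy(src, 0, out, 1, len);
        return out;
    }

    // Decode a length-prefixed string that starts at `offset` in `data`.
    static String decode(byte[] data, int offset) {
        int len = data[offset] & 0xff;  // treat the length byte as unsigned
        return new String(data, offset + 1, len, StandardCharsets.ISO_8859_1);
    }
}
```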

The last, and simplest, string type supported by the BinaryFile object is the fixed-width string. A fixed-width string is simply an area of memory reserved for the string. The string occupies the beginning bytes of this buffer and any remaining space is padded with either zeros or spaces. It is not unusual to have to do a trim on a string just read in from this format. The readFixedString and writeFixedString methods are used to read and write fixed-width strings. The readFixedString method accepts a parameter to specify the length of the string and returns a String object read from the file. The writeFixedString method accepts a length parameter and a String object. The String object is then written to the file. If the string is longer than the specified length then the string is truncated. If the string length is less than the specified length then the string is padded.
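The truncate-or-pad behavior can be sketched as follows. The class name and the explicit pad-character parameter are assumptions made for illustration. Note that String.trim() removes both trailing spaces and trailing zero characters, but it also strips leading whitespace.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not the article's BinaryFile code.
final class FixedStringSketch {
    // Encode s into exactly `width` bytes, truncating or padding with `pad`.
    static byte[] encode(String s, int width, char pad) {
        byte[] src = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] field = new byte[width];
        Arrays.fill(field, (byte) pad);
        System.arraycopy(src, 0, field, 0, Math.min(src.length, width));
        return field;
    }

    // Decode the whole field and trim padding (spaces or zeros) from the ends.
    static String decode(byte[] field) {
        return new String(field, StandardCharsets.ISO_8859_1).trim();
    }
}
```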

Numeric Data Types
In Jonathan Swift's Gulliver's Travels, the nations of Lilliput and Blefuscu find themselves at war over which end of a hard boiled egg to cut before eating. Lilliput preferred the Little Endian approach whereas Blefuscu preferred to start with the large end. An inane controversy, indeed, but one that mirrors our own computer industry.

When an integer stored in memory occupies more than one byte, it is necessary to decide which byte to place first. Take, for example, the number 1025. This number has to be stored in two bytes. The high-order byte is four and the low-order byte is one, because the integer division of 1025 by 256 is four, with a remainder of one. Is this stored as 01 04 or as 04 01? Computer scientists call the two orderings little-endian and big-endian respectively, the same words that Swift used to describe the dilemma of the Lilliputians. The two systems can be seen in Table 1.

Table 1
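The two byte orders for 1025 can be checked with the standard java.nio.ByteBuffer class, which is independent of BinaryFile. The helper names here are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative check using the standard library; not part of BinaryFile.
final class EndianBytesSketch {
    // Split a 16-bit value into two bytes in the requested byte order.
    static byte[] wordToBytes(int value, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putShort((short) value).array();
    }

    // Render bytes as two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(wordToBytes(1025, ByteOrder.LITTLE_ENDIAN))); // prints 01 04
        System.out.println(hex(wordToBytes(1025, ByteOrder.BIG_ENDIAN)));    // prints 04 01
    }
}
```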

So which one is predominant in the industry? Unfortunately it's a near dead heat. Most of the UNIX variants and the Internet standards are big-endian. Motorola 680x0 microprocessors (and therefore Macintoshes), Hewlett-Packard PA-RISC, and Sun SuperSPARC processors are big-endian. Intel x86 processors, and therefore most PCs, are little-endian. The Silicon Graphics MIPS and IBM/Motorola PowerPC processors support both little- and big-endian operation. As a result, the binary file class presented in this article will handle both standards.

In order to accommodate little- and big-endian numbers, integers are first read in byte by byte and then converted into the correct data type. For four-byte numbers, the next four bytes from the file are read into the variables a, b, c, and d. Then, depending on the byte order, one of the following expressions is used.

result = ((a << 24) | (b << 16) | (c << 8) | d); // big endian
result = (a | (b << 8) | (c << 16) | (d << 24)); // little endian
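One detail worth spelling out in Java: shifting an int value of 128 or more left by 24 bits sets the sign bit, so an unsigned double word needs the top byte widened to long before the shift. The sketch below assumes a, b, c, and d already hold unsigned values in the range 0 to 255. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not the article's BinaryFile code.
final class DWordSketch {
    // Combine four unsigned bytes, first byte most significant.
    static long bigEndian(int a, int b, int c, int d) {
        // The cast widens a to long before shifting, so bit 31 stays positive.
        return ((long) a << 24) | (b << 16) | (c << 8) | d;
    }

    // Combine four unsigned bytes, first byte least significant.
    static long littleEndian(int a, int b, int c, int d) {
        return a | (b << 8) | (c << 16) | ((long) d << 24);
    }

    public static void main(String[] args) {
        System.out.println(bigEndian(0x00, 0x00, 0x04, 0x01));    // prints 1025
        System.out.println(littleEndian(0x01, 0x04, 0x00, 0x00)); // prints 1025
        System.out.println(bigEndian(0xff, 0xff, 0xff, 0xff));    // prints 4294967295
    }
}
```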

In addition to the issue of little-endian or big-endian, numeric data types can be stored as signed or unsigned. Unsigned numbers are virtually unheard of in Java, but they are all too common in other programming languages. This leaves four major categories of numbers to support: signed big-endian, unsigned big-endian, signed little-endian, and unsigned little-endian.

To accommodate these different systems, the methods setEndian and setSigned are provided. The setEndian method accepts either BinaryFile.BIG_ENDIAN or BinaryFile.LITTLE_ENDIAN, and a getEndian method reports the current mode. The setSigned method accepts a boolean: true indicates that the numbers are signed, and false indicates that they are unsigned. There is also a getSigned method to determine the current mode.

Signed numbers are stored in a format called two's complement. Two's complement uses the most significant bit as a sign flag: a value of one signifies a negative number, and a value of zero signifies zero or a positive number. Positive numbers are stored just as they normally would be. Negative values are stored by subtracting their magnitude from one beyond the highest value that an unsigned number of that type can hold. For example, -1 in a word is stored as 0x10000 - 1, or 0xffff.
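The word example can be verified directly in Java, where a cast to short reinterprets the low 16 bits as a two's complement value. The helper names are illustrative.

```java
// Illustrative helper; not part of the article's BinaryFile class.
final class TwosComplementSketch {
    // Interpret the low 16 bits of raw as a signed two's complement word.
    static int toSignedWord(int raw) {
        return (short) raw;
    }

    // Return the unsigned 16-bit pattern that stores a signed word value.
    static int toUnsignedWord(int value) {
        return value & 0xffff;
    }

    public static void main(String[] args) {
        System.out.println(toSignedWord(0xffff));                // prints -1
        System.out.println(toUnsignedWord(-1) == 0x10000 - 1);   // prints true
        System.out.println(toSignedWord(0x8000));                // prints -32768
    }
}
```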

In addition to signed or unsigned, the BinaryFile object can also read numbers of several sizes. The supported sizes are byte, word, and double-word. The methods used to read/write these types are readByte/writeByte, readWord/writeWord and readDWord/writeDWord. A byte occupies just one byte of memory and can be signed or unsigned; the endian setting does not affect byte reads and writes. A word occupies two bytes of memory and can be little- or big-endian, signed or unsigned. A double-word occupies four bytes of memory and, like the word, follows the endian and signed modes.

Each of the numeric read/write methods deals in Java types that are one size bigger than the underlying data type. A byte is stored in a short, a word is stored in an int, and a double-word is stored in a long. This is done to accommodate the unsigned data types. For example, the readByte method, when working in unsigned mode, can return numbers in the range of 0 to 255. That would overflow a Java byte, whose maximum value is 127, so readByte returns a short instead. These different types can be seen in Table 2.

Table 2
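For comparison, Java 8 and later include standard-library helpers that perform the same kind of widening: Byte.toUnsignedInt, Short.toUnsignedInt, and Integer.toUnsignedLong. The sketch below wraps them so the return types mirror the byte-to-short, word-to-int, and double-word-to-long pairing described above. The wrapper class is hypothetical and is not how BinaryFile is written.

```java
// Illustrative wrappers around Java 8+ library methods; not part of BinaryFile.
final class WideningSketch {
    // An unsigned byte (0..255) needs a short.
    static short unsignedByte(byte b) {
        return (short) Byte.toUnsignedInt(b);
    }

    // An unsigned word (0..65535) needs an int.
    static int unsignedWord(short w) {
        return Short.toUnsignedInt(w);
    }

    // An unsigned double word (0..4294967295) needs a long.
    static long unsignedDWord(int d) {
        return Integer.toUnsignedLong(d);
    }
}
```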

Alignment
Binary files are often aligned to certain boundaries, for example, "word aligned" or "double word aligned." If a record takes up only 10 bytes and the file is double word aligned, then enough padding bytes must be written before the next record so that it starts on a double word boundary. The next double word boundary after 10 is 12, so two extra bytes must be written to satisfy the alignment requirement.

The BinaryFile object accommodates alignment requirements through the align method. The align method accepts one parameter, the number of bytes in the boundary to align to. For example, if you were at file position 10 and called the align method with a value of four, you would be moved to file position 12, because 12 is the next multiple of four after 10.

The align method works for both read and write operations. It is important to remember that align is not a mode: it adjusts the file position only at the moment it is called. Therefore you will most likely call the align method just after a record has been written.
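The position arithmetic behind alignment can be sketched as a small function. This is an assumption about the calculation, not the BinaryFile source, and it assumes that a position already on a boundary is left unchanged.

```java
// Hypothetical helper for illustration; not the article's BinaryFile code.
final class AlignSketch {
    // Return the smallest multiple of boundary that is >= position.
    static long alignUp(long position, int boundary) {
        long remainder = position % boundary;
        return remainder == 0 ? position : position + (boundary - remainder);
    }

    public static void main(String[] args) {
        System.out.println(alignUp(10, 4)); // prints 12
        System.out.println(alignUp(12, 4)); // prints 12
    }
}
```

With a RandomAccessFile, the result could be passed to seek when reading, while a writer would emit padding bytes until the file pointer reaches it.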

The Example Programs
To test this class, I ran it on a variety of systems: the little-endian platforms of Windows NT and x86 Linux, and a big-endian Sun platform. There are two example programs given. The first, seen in ScanGIF.java, reads the header of a GIF file. The second, seen in BinaryExample.java, opens a file named "test.dat" and then writes several of the data types. The file is then closed and reopened, and the same data types are read back.

To read a GIF file header, the file is first opened and passed into a BinaryFile object. To match the format of a GIF file, the options of little-endian and unsigned are selected. A GIF file begins with a fixed-width type, then a fixed-width version, followed by a width and a height. This is read in with the following method calls.

type = bin.readFixedString(3);
version = bin.readFixedString(3);
width = bin.readWord();
height = bin.readWord();
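For readers without the BinaryFile source at hand, the same header can be parsed with the standard java.nio.ByteBuffer class. This is a stand-alone sketch with a hypothetical class name. It follows the GIF specification, in which the 16-bit little-endian logical screen width comes before the height.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Stand-alone sketch using the standard library; not the article's ScanGIF code.
final class GifHeaderSketch {
    final String type;     // "GIF"
    final String version;  // for example "87a" or "89a"
    final int width;       // logical screen width in pixels
    final int height;      // logical screen height in pixels

    private GifHeaderSketch(String type, String version, int width, int height) {
        this.type = type;
        this.version = version;
        this.width = width;
        this.height = height;
    }

    // Parse the first ten bytes of a GIF file.
    static GifHeaderSketch parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        byte[] tag = new byte[3];
        buf.get(tag);
        String type = new String(tag, StandardCharsets.US_ASCII);
        buf.get(tag);
        String version = new String(tag, StandardCharsets.US_ASCII);
        int width = buf.getShort() & 0xffff;   // mask to read the word as unsigned
        int height = buf.getShort() & 0xffff;
        return new GifHeaderSketch(type, version, width, height);
    }
}
```

For example, the bytes 47 49 46 38 39 61 40 01 F0 00 parse as type "GIF", version "89a", width 320, and height 240.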

Using the BinaryFile object, Java programs can easily access a variety of binary file types. Perhaps in the future, standards such as XML will make binary files obsolete. But for now, there are many such files out there that a Java program may need to be compatible with.

Author Bio
Jeff Heaton is a software designer for the Reinsurance Group of America (RGA), a college instructor, and a coauthor of Teach Yourself Visual C++ in 21 Days, Professional Reference Edition (Macmillan 1999).

