Preservation Home


Part I

This section describes the design goals and challenges involved in creating a new file format for floppy disk images.

The scope is limited to Japanese microcomputers utilizing IBM-style drives: X68000, PC-98, PC-88, etc. Disk duplication in Asia tended to be much more primitive than in the West, introducing additional complications.



Desired Characteristics

  • compact
  • consistent
  • comprehensive
  • elegant
  • durable

    Compactness

    Each disk image should be under 2 MB in size. For example, an image of the common 360 RPM, 77-cylinder, double-sided, double-track, high-density format should be around 1.2 MB.
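As a rough check of the 1.2 MB figure, the sketch below computes the size of an image that holds only sector data. The geometry of 8 sectors per track at 1024 bytes each is an assumption (it is the layout commonly associated with this 77-cylinder format, but it is not stated above), and the function name is purely illustrative.

```python
def standard_image_size(cylinders, heads, sectors_per_track, bytes_per_sector):
    """Return the size in bytes of an image holding only standard sector data."""
    return cylinders * heads * sectors_per_track * bytes_per_sector

# Assumed geometry: 77 cylinders, 2 sides, 8 sectors per track, 1024 bytes per sector.
size = standard_image_size(77, 2, 8, 1024)
print(size)                    # 1261568
print(round(size / 2**20, 2))  # 1.2
```

Under these assumptions the sector data alone accounts for nearly all of the 1.2 MB, which leaves little room for anything else if the 2 MB ceiling is to hold.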

    Consistency

    By nature, all floppy disks have areas that cannot be consistently read or even written (as during manufacturing). Controlling for this unfortunate aspect is critical, particularly when hashing is involved; the lowest-level formats are basically analog recordings and therefore every sample, even of the same physical disk, is unique. Even the highest-level formats currently available cannot consistently represent certain disks.

    Comprehensiveness

    Most existing formats cannot cope with all the disk features required for compatibility with the entire corpus of software. The two that do are esoteric and neither compact nor consistent, albeit ideal for archival or research purposes.

    Elegance

    Image files should not require significant amounts of parsing or error checking, nor should they contain unnecessary or subjective data (such as user-defined strings). No headers.

    Durability

    In the unlikely event an issue is discovered, it should be correctable without modifying existing files, or at least not any files subject to copyright.



    Prior Art

    Although not a perfect analog to floppy disks, compact discs (CDs) have some similarities and are commonly stored in a multi-file format that meets many of the above requirements. What follows is a brief introduction to the components of that format, which serves as a model for the proposed multi-file format for floppy disks.

    CUE/CCD/etc.

    Often text-based, this file contains information that doesn't fit well into the categories below, such as track layout information, ISRC strings, and other esoteric features. Certain discs are simple enough that their images can be used without this particular file.

    Note that CUE files do not appear to have any copyright entanglements. This makes it easy to correct, update, and redistribute them if needed (which can and does happen, albeit typically for minor reasons).

    BIN/IMG/ISO

    This file (or multiple files in some cases) contains the majority of the data on a CD. Not coincidentally, such data is also the most important and easiest to read.

    SUB

    Subchannel data is hard to read with typical consumer drives and software. It is typically not required in order to make use of disc images, although some do require it. Due to the above characteristics, it makes perfect sense to store such data separately.
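The separation described above can be sketched as follows. The sketch assumes a raw dump in which each 2352-byte sector is immediately followed by its 96 bytes of subchannel data; actual dumping tools may lay the data out differently, and the function name is hypothetical.

```python
SECTOR_SIZE = 2352      # raw main-channel bytes per CD sector
SUBCHANNEL_SIZE = 96    # subchannel bytes accompanying each sector

def split_raw_dump(raw):
    """Split an interleaved sector+subchannel dump into (main, sub) byte strings."""
    block = SECTOR_SIZE + SUBCHANNEL_SIZE
    if len(raw) % block != 0:
        raise ValueError("dump length is not a whole number of 2448-byte blocks")
    main = bytearray()
    sub = bytearray()
    for offset in range(0, len(raw), block):
        main += raw[offset:offset + SECTOR_SIZE]          # goes to the BIN/IMG file
        sub += raw[offset + SECTOR_SIZE:offset + block]   # goes to the SUB file
    return bytes(main), bytes(sub)
```

Once split, the main data can be used, hashed, and compared on its own, regardless of how well the subchannel was captured.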



    Floppy Disk Data Classification

    This subsection attempts to define and categorize the different forms of data found on a floppy disk.

    Miscellaneous Information

    This comprises layout, timing, metadata (disk dimensions, presence of writability notch, etc.), and other information that is best represented at a higher, abstract level than raw data bits.

    Not likely to be subject to copyright.

    Standard Data

    This term refers to the data contained within "standard" sectors, which are the sectors expected to be found in a properly formatted disk. The standard-sector inventory will vary depending on the disk density and low-level formatting utilized, although there aren't many variations.

    Such data is characterized by its reliability and reproducibility. Most of the data on a disk will fall under this classification. It is also the most important data, being critical for the software to run properly.

    Non-Standard Data

    This covers the remaining data on the disk: everything outside the standard sectors, including any non-standard sectors.

    Most of this data is not directly used by the software itself, although a small percentage of it may be utilized for copy protection on some disks.

    Whether this data is covered by copyright is debatable. Some of it, being simple repeated patterns or non-digital, likely is not. Is high-entropy garbage copyrightable?

    A few subcategories of inconsistent data are discussed below. It should become apparent why non-standard data should, at a minimum, be stored separately from the standard data.
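One practical consequence of separate storage is that each class of data can be hashed independently. The following is a minimal sketch under that assumption; the byte strings are invented stand-ins for two physical copies of the same master, and `component_hashes` is a hypothetical helper, not part of any proposed format.

```python
import hashlib

def component_hashes(standard, non_standard):
    """Hash the two data classes independently, as if each were its own file."""
    return {
        "standard": hashlib.sha256(standard).hexdigest(),
        "non_standard": hashlib.sha256(non_standard).hexdigest(),
    }

# Two hypothetical copies of one master: identical sectors, different gap contents.
copy_a = component_hashes(b"SECTOR DATA" * 100, b"\x4e" * 40 + b"\x13\x37")
copy_b = component_hashes(b"SECTOR DATA" * 100, b"\x4e" * 38 + b"\xa5\x5a")

print(copy_a["standard"] == copy_b["standard"])          # True
print(copy_a["non_standard"] == copy_b["non_standard"])  # False
```

Had both classes been stored in a single file, the two copies would have produced different hashes even though their standard data is identical.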

    Subcategories of non-standard data subject to manufacturing variances

    Non-Data

    Largely an artifact of the duplication process, this stuff does not contain any actual intelligence and can hardly be described as data. Highly inconsistent, it only has value in that its defective nature is occasionally leveraged for copy protection purposes.

    Although magnetic media are often conceptualized as being digital, in reality they are analog. Every floppy disk has areas that cannot be interpreted digitally; you can try, but the results usually won't be consistent between attempts, and even that level of unreliability won't be consistent between different instances of the same master disk due to manufacturing variances.

    Furthermore, the usually-random determination made when reading such "non-data" affects the immediately subsequent reads such that it can have a cascade effect on real data. Fortunately, this effect is well known and the non-digital areas are normally isolated from the standard data. It is only when non-data directly impinges on standard data, or non-standard data required by copy protection, that it creates major file-format difficulties.
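The inconsistency between attempts suggests one simple way to locate non-data: read the same track several times and note where the results disagree. The sketch below is a deliberate simplification that assumes the reads have already been decoded to bytes and aligned to equal length; because of the cascade effect described above, real reads would generally need alignment first. The function name and sample values are hypothetical.

```python
def unstable_positions(reads):
    """Return the indices at which equal-length reads of one track disagree."""
    if not reads:
        return []
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("all reads must have the same length")
    return [i for i in range(length) if len({r[i] for r in reads}) > 1]

# Three invented reads of the same region; only the middle byte is unstable.
reads = [
    bytes([0x4E, 0x4E, 0x12, 0xA1, 0xA1]),
    bytes([0x4E, 0x4E, 0x7F, 0xA1, 0xA1]),
    bytes([0x4E, 0x4E, 0x03, 0xA1, 0xA1]),
]
print(unstable_positions(reads))  # [2]
```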

    Otherwise-Consistent Data

    This category describes data that can be read consistently but will take different forms on each instance of the same master disk. By chance, you might find that some copies have matching data, but with a large enough sample size you will see many legitimate variants.

    This category of data is perfectly illustrated by a couple of commonly-used protection schemes where the critical non-standard data is mostly readable (except for an embedded non-digital area) but the length of certain patterns is subject to manufacturing variances. The software accounts for the expected variations, and thus there are many possible forms of data that will satisfy the check routine.
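To illustrate why many forms of the data can satisfy such a check, here is a minimal sketch of a length-tolerant test. It does not reproduce any actual protection scheme: the pattern byte, the accepted range of 30 to 50, and both function names are assumptions chosen only for illustration.

```python
def pattern_run_length(data, pattern_byte, start=0):
    """Count consecutive occurrences of pattern_byte beginning at start."""
    count = 0
    for value in data[start:]:
        if value != pattern_byte:
            break
        count += 1
    return count

def passes_check(data, pattern_byte=0x4E, min_len=30, max_len=50):
    """Accept any run whose length falls inside the expected manufacturing range."""
    return min_len <= pattern_run_length(data, pattern_byte) <= max_len

# Two invented copies with different run lengths both pass; a short run fails.
print(passes_check(b"\x4e" * 34 + b"\x00"))  # True
print(passes_check(b"\x4e" * 47 + b"\x00"))  # True
print(passes_check(b"\x4e" * 10 + b"\x00"))  # False
```

Since both passing inputs are legitimate, neither can be singled out as the one canonical form of the data, which is the difficulty this category poses for a preserved image.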

    Manufacturing Data

    This is the term we use to describe data added to a disk during manufacturing, such as time stamps, for quality control or other purposes. It could conceivably be added at the point the disk is created or during mass duplication or both.

    Such data is completely irrelevant to the software contained on the disk, but nevertheless is another factor that must be accounted for when attempting to create a single, canonical preserved image of a disk.



    Part II

    This section will describe the file format proposals in more detail.

    For now, we are merely submitting the general idea that a new format is desirable and that a multi-file approach is well suited to the complexities involved.
