Conventions - Floppy Disk Preservation

Conventions

Naming Forms

We generate titles in three forms. Titles can also be customized further, independently of their form.

Except for the Japanese-friendly form, all titles are based on the disk label, as this is the most consistent method for picking titles (since the label is almost always available).

Original / Unicode

This form attempts to match the title as shown on the label as closely as is practical.

Titles in this form require full Unicode support, as they mix characters that map to different local code pages.

Japanese / Shift-JIS

This form attempts to generate titles that are more Japanese-friendly, typically by substituting yomigana for romaji.

As not all labels include yomigana, we are willing to accept a rendering that does not appear on the label as long as it is from an official source, such as the manual or box. This is one area where there are gaps in the data, as we typically do not have access to all the materials. Please contact us if you have relevant scans.

Titles in this form are Unicode encoded but are restricted to characters in the Shift-JIS (932) character set.
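One way to check this restriction mechanically is to try encoding a title with Python's built-in cp932 codec. This is only a minimal sketch, not the project's actual tooling; the function name fits_shift_jis is made up for illustration.

```python
def fits_shift_jis(title: str) -> bool:
    """Return True if every character of `title` can be encoded in code page 932."""
    try:
        title.encode("cp932")
    except UnicodeEncodeError:
        return False
    return True


print(fits_shift_jis("メルヘンメイズ A"))                 # True
print(fits_shift_jis("MÄRCHEN MAZE（メルヘンメイズ） A"))  # False: "Ä" is not in code page 932
```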

English / ASCII

This is the basic 7-bit ASCII form.

Words written in non-English characters are romanized or translated, depending on their provenance. If a preferred romanized rendering exists and is known, it will be used (e.g. Tokyo instead of Toukyou). The titles may also be cleaned up slightly to better match English conventions, for example by adding spaces between words. Internally the original capitalization is preserved, although the default XML file generated for this form corrects it automatically, since most people don't like ALL CAPS.
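To give a rough idea of what an automatic capitalization fix might look like, here is a naive sketch. It is an assumption for illustration only and is not the rule used to generate the XML files; the function name soften_all_caps and the all-caps input strings are hypothetical.

```python
def soften_all_caps(title: str) -> str:
    """Capitalize words written entirely in upper case; leave all other words alone."""
    words = []
    for word in title.split(" "):
        # Single characters (such as a disk ID like "A") are left untouched.
        if len(word) > 1 and word.isupper():
            word = word.capitalize()
        words.append(word)
    return " ".join(words)


print(soften_all_caps("MAERCHEN MAZE A"))          # Maerchen Maze A
print(soften_all_caps("Etoile Princesse DATA 2"))  # Etoile Princesse Data 2
print(soften_all_caps("Detana!! TwinBee A"))       # Detana!! TwinBee A
```

A heuristic this simple would mangle Roman numerals and acronyms ("III" becomes "Iii"), so any real implementation would need a list of exceptions.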

Full (Extended Names)

This is not a true form but a special case of the Original form. Wherever possible, the official reading - if known - will be appended to the base name, even when it does not appear on the label.

The off-label yomigana are derived the same way as those used for the Japanese form.

Examples

The following are examples of the three different forms. Each group of three lists the Original (Unicode), Japanese, and English rendition, in that order. Additional fields and ornamentation are usually added to the final rendition, depending on the options selected, so these are "bare" examples that wouldn't ordinarily be used for file names.

MÄRCHEN MAZE（メルヘンメイズ） A

メルヘンメイズ A

Maerchen Maze A

出たな!!TwinBee（ツインビー） A

出たな!!ツインビー A

Detana!! TwinBee A

Étoile Princesse（エトワールプリンセス） DATA 2

エトワールプリンセス DATA 2

Etoile Princesse Data 2

Note that it is impossible to mix accented characters and katakana without Unicode, as seen in some of the examples, because no "ANSI"/MBCS code page contains both types of characters.
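The point about code pages can be verified with a short Python sketch that lists which characters of a title a given codec cannot represent. The helper name unencodable is made up for illustration; cp1252 (Western European) and cp932 (Shift-JIS) are used here as two representative code pages.

```python
def unencodable(title: str, codec: str) -> list[str]:
    """List the distinct characters of `title` that `codec` cannot represent."""
    missing = []
    for ch in title:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


title = "Étoile Princesse（エトワールプリンセス） DATA 2"
print(unencodable(title, "cp932"))   # ['É']
print(unencodable(title, "cp1252"))  # the full-width parentheses and every katakana character
```

Not every Original title is affected: 出たな!!TwinBee（ツインビー） A contains no accented letters and fits in code page 932 as it stands.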

Writability

Some software will reject a disk if it is in an unexpected write-protection state. The "[W]" flag is used to indicate disk images that are known to cause problems if they are read-only. At least one X68000 emulator will check for such flags in the filename and react accordingly.

Aside from user disks, we normally keep disk images marked read-only at the file system level to prevent them from being modified. Hence we don't typically identify situations where a particular disk is required to be read-only. Most commercially produced floppies we've seen are either unnotched or sealed, and it is good practice to keep images read-only whenever possible. That said, "[R]" can be used to explicitly indicate that a disk must be read-only, even though this should be the default assumption unless [W] or [m] is present.
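A tool that consumes these filenames could apply the rule above roughly as follows. This is a sketch under assumptions: it treats a flag as a single letter in square brackets anywhere in the name, and the filenames and the .img extension are hypothetical. It only encodes the stated default (read-only unless [W] or [m] is present) and does not interpret the flags any further.

```python
import re

# Assumed flag syntax: one letter enclosed in square brackets, e.g. "[W]".
FLAG_PATTERN = re.compile(r"\[([A-Za-z])\]")


def flags(filename: str) -> set[str]:
    """Return the set of single-letter flags found in `filename`."""
    return set(FLAG_PATTERN.findall(filename))


def should_be_read_only(filename: str) -> bool:
    """Read-only is the default unless a [W] or [m] flag is present."""
    return not (flags(filename) & {"W", "m"})


print(should_be_read_only("Example Game A.img"))      # True
print(should_be_read_only("Example Game A [W].img"))  # False
print(should_be_read_only("Example Game A [R].img"))  # True
```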

User Disks

We do not normally hash user disks, since their intended purpose requires them to be mutable. We make an exception for user disks that have a small number of possible forms after creation.

In order for a user disk to be included in our lists, the process for creating it must meet the following requirements:

100% of the data on the disk must be overwritten during creation

This requirement ensures that any disk can be used for creating the user disk; all pre-existing data will be wiped out.

Some games, Akumajou Dracula being a good example, do not overwrite the entire disk, meaning that there are an unlimited number of possible variations depending on what was on the disk originally.

There are few, if any, customizations or other changes made for a new user disk

An example of a game that fails this requirement is Death Bringer, which mandates that the protagonist be given a name.

Ys III, on the other hand, only allows for limited customization: one of three possible difficulty levels must be chosen. It is therefore reasonable to hash all three variations.

A common source of unsuitable user disks is games that change files immediately after creating the disk, which usually updates the modification date and time. This has the effect of making every user disk unique (barring unlikely coincidences).
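The first requirement can be illustrated with a small simulation: create the "same" user disk on two blank disks with different prior contents and count the distinct hashes. Everything here is a stand-in for illustration; the creation functions are toy models, the disks are 1 KiB byte strings, and SHA-1 is an assumption, since this page does not say which hash function is used.

```python
import hashlib


def distinct_forms(images: list[bytes]) -> set[str]:
    """Return the set of distinct hashes among the given disk images."""
    return {hashlib.sha1(image).hexdigest() for image in images}


def create_full(previous: bytes) -> bytes:
    """Toy creation routine that overwrites every byte of the disk."""
    return b"\x00" * len(previous)


def create_partial(previous: bytes) -> bytes:
    """Toy creation routine that rewrites only the first half of the disk."""
    half = len(previous) // 2
    return b"\x00" * half + previous[half:]


# Two "blank" disks whose leftover contents differ.
blank_a = b"\xe5" * 1024
blank_b = b"\xf6" * 1024

print(len(distinct_forms([create_full(blank_a), create_full(blank_b)])))        # 1
print(len(distinct_forms([create_partial(blank_a), create_partial(blank_b)])))  # 2
```

Only the first routine yields a single, hashable form; the second inherits whatever was on the disk beforehand.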

Variants

Sadly, we've found that legitimate variant versions of disks are surprisingly common, at least on the X68000. This makes preservation much more difficult, as we need pristine dumps of every variant. To make matters worse, some of these disks were shipped in a writable state and write to themselves as soon as they are booted.

Differentiating variants is usually done by adding what we call a "descriptor" to the filename. The descriptor is simply a string enclosed in parentheses appearing near the end of the filename. Descriptors are only used in multi-disk games if the various disks can be safely mixed and matched; otherwise a secondary ID (which appears prior to the disk ID) will be used, which keeps compatible disks grouped together.

Descriptors are chosen based on some arbitrary differentiating characteristic. There is no consistent way to interpret such strings; however, in most cases the resulting filenames will sort in order of age, to the extent we are able to determine this, and to the extent such a metric makes sense. Purely descriptive strings (e.g. "virus in master") will not necessarily sort in a particular order.

In other words, when filenames are sorted in ascending order, the "best" version will normally appear later in the list. However, in some cases the variants are all basically the same; some games were published in several variations with minor differences solely to make cracking more difficult. (The most extreme case of this we've seen has at least eight variants that we've identified so far!) In such cases there is no "best" or "newest" version, although using the one that sorts lower in the list (that is, later) is harmless.
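As a rough sketch of how a script might work with descriptors, the following extracts the last group in ASCII parentheses and picks the variant that sorts last. The filenames, the "rev N" descriptors, and the .img extension are hypothetical, and plain code-point sorting is an assumption; the actual filename layout may differ.

```python
import re
from typing import Optional

# ASCII parentheses only; the full-width （） used for readings are deliberately not matched.
DESCRIPTOR = re.compile(r"\(([^()]*)\)")


def descriptor(filename: str) -> Optional[str]:
    """Return the last parenthesized string in `filename`, or None if there is none."""
    matches = DESCRIPTOR.findall(filename)
    return matches[-1] if matches else None


def last_sorted_variant(filenames: list[str]) -> str:
    """Return the filename that appears last in an ascending sort."""
    return sorted(filenames)[-1]


variants = ["Example Game A (rev 2).img", "Example Game A (rev 1).img"]
print(descriptor(variants[0]))        # rev 2
print(last_sorted_variant(variants))  # Example Game A (rev 2).img
```

As noted above, the last-sorted variant is only "normally" the best one, so a script like this gives a sensible default rather than a guarantee.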

Finally, be aware that although we identify and track variants, we normally only apply descriptors when it is required to differentiate hashed files that are getting published. In other words, you won't see descriptors until we have hashes for two or more variants of the same disk that all meet our release criteria. This is done in case another variant appears in the meantime, which could require that the descriptor strings be reformulated. (In our limited experience, discovering two or three variants is not an extraordinary occurrence.)

Confidence Scores

Whether or not the hash for a particular disk image qualifies to be published depends on an internal number called its confidence score.

Disk dumps are analyzed and a number of factors are considered when calculating the confidence score. If it meets the threshold, then it may be included when generating the files we publish. (Other constraints may apply as well, depending on the subproject, and contributors can request that hashes of their dumps not be published.)

We are pretty conservative about this, due to the difficulties inherent in a writable medium. In the future, we may decide to lower the threshold slightly so that more hashes will be published, although doing so would increase the risk of non-pristine images slipping through.

As an aside, a similar concept is also used in the analysis of "fixed" images (as seen in the Jouyou set) although we don't track it numerically. It is largely based on the level of testing that was done.
