The problem with recovering text files

This is a question I get a lot, so I thought I would expand on it more formally: “Why can’t DiskDigger recover plain text (.txt) files in ‘deeper’ mode?”

To begin, let me remind everyone that DiskDigger can recover text files in “deep” mode, where it scans the structure of the file system and recovers files based on clues provided by the file system. This makes “deep” mode capable of recovering any type of file, including text files, but only from a file system that is still healthy.

However, in “deeper” mode, things are very different. Since DiskDigger no longer relies on the file system to parse the structure of files on the disk, it can only detect files based on byte sequences known to exist in certain types of files.  For example, all PNG image files begin with the byte sequence 89 50 4E 47. Therefore, DiskDigger can look at every sector of the disk, and if it begins with this sequence of bytes (files must be sector-aligned), it knows that there’s a PNG file at that location.
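To make the idea concrete, here is a minimal sketch of a sector-aligned signature scan in Python. It is not DiskDigger's actual code: the 512-byte sector size, the function name find_png_sectors, and the in-memory test image are all assumptions made for illustration.

```python
SECTOR_SIZE = 512  # assumed sector size; many modern disks use 4096
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47])  # the four bytes quoted above


def find_png_sectors(image: bytes, sector_size: int = SECTOR_SIZE) -> list[int]:
    """Return the byte offsets of sectors that begin with the PNG signature."""
    hits = []
    for offset in range(0, len(image), sector_size):
        # Only the first bytes of each sector are checked (sector alignment).
        if image.startswith(PNG_SIGNATURE, offset):
            hits.append(offset)
    return hits


# Hypothetical four-sector "disk": one aligned signature, one unaligned.
image = bytearray(4 * SECTOR_SIZE)
image[2 * SECTOR_SIZE:2 * SECTOR_SIZE + 4] = PNG_SIGNATURE
image[100:104] = PNG_SIGNATURE  # mid-sector, so the scan ignores it
print(find_png_sectors(bytes(image)))  # -> [1024]
```

A real scanner would read the disk in chunks rather than holding it in memory, and would go on to parse the file's internal structure to find where it ends.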

The same is true for many other types of files, like .JPG, .DOC, .MOV, etc.  It’s also true for files that are built from text, but have a consistent structure, such as .HTML, .XML, .RTF, etc.

So now we come to the problem with pure text files. Unlike other types of files, text files do not contain any identifiable sequence of bytes. They only contain… well… text!  There’s no underlying binary structure.

This makes it nearly impossible for DiskDigger to “pick out” a text file from all the other random content on a disk.

Despite all this, there are a few remote possibilities for recovering text files which are an active area of development in DiskDigger. None of these are perfect, but they may eventually lead to a solution for recovering some text files:

  • Some text files may be encoded in Unicode (specifically UTF-16). In this case, the text file will have a starting byte signature, which is either FE FF or FF FE. Unfortunately this signature is far too short to meaningfully identify UTF-16 files, and will produce too many false positives.
  • Since many text files are encoded in ASCII or UTF-8, and written in English, we can expect them to contain only printable characters between 0x20 and 0x7E (along with \n, \r, and \t). We can then perform a statistical analysis on each sector of data, and if it contains mostly characters within our desired range, we can consider it to be part of a text file. There are several problems with this approach, however:  we won’t be able to determine the size of the detected text file, and we won’t be able to tell where one file ends and another begins (if there are two text files next to each other on the disk). And since this is a statistical method, it will surely produce false positives.
  • Some text files do have some semblance of structure in them, in the sense that they have an identifying signature, but not in a consistent location. For example, most C and C++ source files have the word “#include” somewhere near the beginning. By searching an entire sector for this kind of signature (independent of offset), we can be somewhat certain about the presence of a particular file. This kind of functionality is actually already built into DiskDigger’s custom heuristics feature. This method, however, still has the problem of not being able to detect the size of the recoverable file.
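The last two ideas in the list above can be sketched roughly as follows, in Python. This is only an illustration, not how DiskDigger implements them: the function names are hypothetical, the 0.95 threshold is an arbitrary assumption, and counting carriage returns as text is my own choice.

```python
# Printable ASCII (0x20-0x7E) plus tab, line feed, and carriage return.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def text_ratio(sector: bytes) -> float:
    """Fraction of bytes in the sector that look like plain ASCII text."""
    if not sector:
        return 0.0
    return sum(b in TEXT_BYTES for b in sector) / len(sector)


def looks_like_text(sector: bytes, threshold: float = 0.95) -> bool:
    """Statistical guess; the default threshold is an arbitrary assumption."""
    return text_ratio(sector) >= threshold


def contains_marker(sector: bytes, marker: bytes = b"#include") -> bool:
    """Offset-independent signature: the marker may sit anywhere in the sector."""
    return marker in sector


# Two made-up 512-byte sectors: a padded C source file and binary noise.
c_source = b"/* demo */\n#include <stdio.h>\nint main(void) { return 0; }\n".ljust(512, b" ")
noise = bytes(range(256)) * 2
print(looks_like_text(c_source), contains_marker(c_source))  # -> True True
print(looks_like_text(noise), contains_marker(noise))        # -> False False
```

Note that neither check says anything about where the file ends, which is exactly the limitation described above.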

As discussed above, it’s generally not feasible to recover plain text files, because they have no discernible binary structure.

It is, however, possible to recover a text file using custom heuristics, as long as you know an exact sequence of letters that is certain to appear near the beginning of the file.  I will write a short tutorial on performing this task in a future article. Stay tuned!


Recovering QIC-150 tapes: a case study

A friend of mine recently found something very intriguing:  several backup tapes from his old Amiga 500 computer from around 1995.  Since he no longer has the original Amiga machine, he no longer has a way of reading the tapes, and thus no way of rediscovering the old documents, letters, or any other memories that were lost when the old computer was thrown away.

Without hesitation, I offered to help recover the data from the tapes, not only for the challenge of it, but also because I have never dealt with Amiga computers or Amiga-formatted media, and this would be a great opportunity to familiarize myself with a significant part of computing history, even if it has already come and gone.  This is a brief chronicle of the steps I took to recover the data, just in case someone else in the future (including myself) needs to perform a similar task.

One of the most pleasurable aspects of this experience was how easily everything came together, owing mostly to the abundance of information on the web regarding every step of these kinds of processes. Hopefully the information in this article will contribute to that abundance.

The tapes were QIC-150 cartridges (specifically Sony D6150). After rummaging a bit through my attic, I realized that I actually own a tape drive that’s capable of reading QIC-150 tapes. The drive is an Archive Viper 2150S, which uses a SCSI interface.

Since the drive is SCSI, I would need a SCSI host adapter card to interface with it. Luckily, after rummaging a bit more, I found an Adaptec AHA 2940UW adapter which looked like it would be perfect for the job.

I installed the adapter into an old PC that “still” has a PCI bus (how far we’ve come!), connected the drive to it with a 50-pin SCSI ribbon cable, and put a terminator on the end of the cable (I could have also used 8-pin SIP resistors for which the drive has sockets).  The jumpers on the drive were already set to have a SCSI ID of 0, so I didn’t have to make any physical changes to the drive or the adapter.

I booted up the PC, and was happy to see that the Adaptec card was correctly reporting the Viper 2150S as being connected with an ID of 0. So far, so good!

The computer booted into Windows XP, and at this point I encountered one of the very few hiccups in the whole process:  Windows XP does not have a driver for the Viper 2150S drive. Apparently Microsoft discontinued support for it after Windows 2000.  That was perfectly fine, since my next instinct was to boot into Linux (I simply used an Ubuntu 12.04 live CD).

Within Linux, the tape drive worked perfectly.  I inserted the first tape, and typed the command to rewind the tape to the beginning:

$ sudo mt -f /dev/nst0 rewind

The drive obeyed without any errors! So then, I decided to go for all or nothing:  I issued the command to simply dump the entire contents of the tape to a binary file:

$ sudo dd if=/dev/nst0 of=tape1.bin

The drive started whirring, and I watched in amazement as twenty-year-old technology worked flawlessly in the world of the future. After a few minutes, the data dump was complete, and I had a pristine image of the data from the tape. I repeated the same process for the other tapes, which also read without any issues or bad blocks. Three cheers for Sony for developing such a sturdy storage medium, and for the owner of the tapes for preserving them so well.

After reading the binary images of the tapes came the next hurdle: How is the data formatted?  Tapes do not have a “file system” like hard drives do, so the data on the tape is entirely application-specific. In order to make sense of the data, we would need to know the precise application that was used to write it!

Here’s what we know about the data from the tapes:

  • It’s from an Amiga system
  • It’s from around 1995

The owner of the tapes did not remember which software he used to make the backups, but we could assume that it was probably a “common” backup application at the time.  After some searching around the web, it became clear that the most common backup tools for Amiga at that time were Ami-Back, Quarterback, and Diavolo.

The next step was to set up an emulated Amiga environment, so that we could run the original Amiga software to restore the backups.

The de facto solution for Amiga emulation is WinUAE, which is a very impressive and virtually feature-complete emulator.  In order to work properly, WinUAE requires a ROM image (referred to as “Kickstart” in the Amiga world). After that, it can run bootable floppy images for Amiga, or a bootable Amiga hard drive (with AmigaOS loaded on it) that can be mapped to any Windows folder. (The ROM image and the AmigaOS files are the only semi-difficult things to obtain, and how to get them is beyond the scope of this article.)

I found an extremely helpful step-by-step guide for installing AmigaOS 3.9 under WinUAE.  To my surprise, I also found a vibrant and thriving community of Amiga users who are willing to help find and share old software.

In no time at all, I had a fully-operational AmigaOS system, ready to attempt some trial-and-error methods of restoring the backup files.

My plan was the following:  I would install all three of the most common backup tools for Amiga (Ami-Back, Quarterback, and Diavolo), and use each one to make a test backup.  I would then compare the binary format of the test backups to the tape images.  With any luck, one of the binary formats would match up.
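Whether the comparison is done by eye in a hex editor or with a script, the idea is the same: look at the first bytes of each file and see which test backup starts the same way as the tape image. Here is a rough Python sketch of automating that. The helper names are hypothetical, and the sample headers are made-up placeholders, not the real formats of any of these tools.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that two byte strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def rank_candidates(tape_head: bytes, samples: dict[str, bytes]) -> list[tuple[str, int]]:
    """Sort candidate tools by how many leading bytes match the tape image."""
    scores = [(name, common_prefix_len(tape_head, head)) for name, head in samples.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Placeholder data; in practice these would be the first bytes read from
# the tape image and from each test backup.
tape_head = b"FMT-A\x00\x01 rest of tape"
samples = {
    "tool A": b"FMT-A\x00\x01 other data",
    "tool B": b"XYZ\x00",
    "tool C": b"FMT-B",
}
print(rank_candidates(tape_head, samples))
# -> [('tool A', 8), ('tool C', 4), ('tool B', 0)]
```

A long shared prefix is only a hint, since a real header may contain dates or sizes that differ between backups, so the top candidate still needs to be confirmed by actually attempting a restore.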

And, lo and behold, the backup from Ami-Back was a match!  So now, knowing that Ami-Back was the correct utility, I mounted one of the tape images as a hard drive in the emulator, and used Ami-Back to “restore” from it.  It worked like a charm. The files poured out of the backup image, as I watched, yet again, in amazement.

It was a complete backup of the entire Amiga system from 1995. Since I had already set up the emulator correctly, I was able to make it boot directly from the restored system image. In effect, I was able to see the screen of the computer, exactly as it would have been seen nearly twenty years ago.


DiskDigger released

The latest version of DiskDigger is now available for download! Go to the DiskDigger website to check out the new features and download the updated program.