The problem with recovering text files

This is a question I get a lot, so I thought I would expand on it more formally: “Why can’t DiskDigger recover plain text (.txt) files in ‘deeper’ mode?”

To begin, let me remind everyone that DiskDigger can recover text files in “deep” mode, where it scans the structure of the file system and recovers files based on clues provided by the file system. This makes “deep” mode capable of recovering any type of file, including text files, but only when the file system is still healthy enough to be read.

However, in “deeper” mode, things are very different. Since DiskDigger no longer relies on the file system to parse the structure of files on the disk, it can only detect files based on byte sequences known to exist in certain types of files. For example, all PNG image files begin with the byte sequence 89 50 4E 47. Therefore, DiskDigger can look at every sector of the disk, and if a sector begins with this sequence of bytes (files must be sector-aligned), DiskDigger knows that there’s a PNG file at that location.

The same is true for many other types of files, like .JPG, .DOC, .MOV, etc.  It’s also true for files that are built from text, but have a consistent structure, such as .HTML, .XML, .RTF, etc.
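
To make the idea of signature matching more concrete, here is a minimal Python sketch of a sector-aligned scan. This is only an illustration, not DiskDigger’s actual code: the 512-byte sector size, the SIGNATURES table, and the function name scan_for_signatures are all assumptions made for the example, and only the PNG signature comes from the text above.

```python
SECTOR_SIZE = 512  # assumed sector size; many disks use 4096 instead

# Hypothetical signature table. Only the PNG entry is taken from the article.
SIGNATURES = {
    "png": bytes.fromhex("89504E47"),
}


def scan_for_signatures(data: bytes, sector_size: int = SECTOR_SIZE):
    """Yield (offset, file_type) for every sector that starts with a known signature."""
    for offset in range(0, len(data), sector_size):
        for file_type, magic in SIGNATURES.items():
            # Only the first bytes of each sector are checked,
            # because files are assumed to be sector-aligned.
            if data.startswith(magic, offset):
                yield offset, file_type


if __name__ == "__main__":
    # Build a tiny fake "disk" of four empty sectors.
    disk = bytearray(SECTOR_SIZE * 4)
    disk[SECTOR_SIZE * 2:SECTOR_SIZE * 2 + 4] = SIGNATURES["png"]  # aligned: found
    disk[10:14] = SIGNATURES["png"]  # not aligned: ignored
    print(list(scan_for_signatures(bytes(disk))))  # [(1024, 'png')]
```

Note that the scan only tells us where a file probably starts. Working out where it ends requires knowledge of each format’s internal structure, which is exactly what plain text lacks.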

So now we come to the problem with pure text files. Unlike other types of files, text files do not contain any identifiable sequence of bytes. They only contain… well… text!  There’s no underlying binary structure.

This makes it nearly impossible for DiskDigger to “pick out” a text file from all the other random content on a disk.

Despite all this, there are a few remote possibilities for recovering text files which are an active area of development in DiskDigger. None of these are perfect, but they may eventually lead to a solution for recovering some text files:

  • Some text files may be encoded in Unicode (specifically UTF-16). In this case, the text file will usually begin with a byte signature (the byte order mark), which is either FE FF or FF FE. Unfortunately this signature is far too short to meaningfully identify UTF-16 files, and will produce too many false positives.
  • Since many text files are encoded in ASCII or UTF-8, and written in English, we can expect them to contain only characters between 0x20 and 0x7E (along with \n, \r and \t). We can then perform a statistical analysis on each sector of data, and if it contains mostly characters within our desired range, we can consider it to be part of a text file. There are several problems with this approach, however: we won’t be able to determine the size of the detected text file, and we won’t be able to tell where one file ends and another begins (if there are two text files next to each other on the disk). Also, since this is a statistical method, it will surely produce false positives.
  • Some text files do have some semblance of structure in them, in the sense that they have an identifying signature, but not in a consistent location. For example, most C and C++ source files have the directive “#include” somewhere near the beginning. By searching an entire sector for this kind of signature (independent of offset), we can be somewhat certain about the presence of a particular type of file. This kind of functionality is actually already built into DiskDigger’s custom heuristics feature. This method, however, still has the problem of not being able to detect the size of the recoverable file.
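
The last two ideas in the list can be sketched in a few lines of Python. Again, this is a rough illustration under my own assumptions rather than how DiskDigger implements anything: the function names, the 95% threshold, and the choice to count tab, line feed and carriage return as text are all hypothetical.

```python
# Printable ASCII (0x20 through 0x7E) plus tab, line feed and carriage return.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def printable_ratio(sector: bytes) -> float:
    """Return the fraction of bytes in the sector that look like plain text."""
    if not sector:
        return 0.0
    return sum(b in TEXT_BYTES for b in sector) / len(sector)


def looks_like_text(sector: bytes, threshold: float = 0.95) -> bool:
    """Statistical guess: treat the sector as text if most bytes are printable.

    The threshold is an arbitrary assumption and would need tuning.
    """
    return printable_ratio(sector) >= threshold


def contains_keyword(sector: bytes, keyword: bytes = b"#include") -> bool:
    """Offset-independent search: is the keyword anywhere in the sector?"""
    return keyword in sector


if __name__ == "__main__":
    source = b'#include <stdio.h>\nint main(void) { return 0; }\n'
    print(looks_like_text(source), contains_keyword(source))
```

Both checks say only that a sector probably belongs to a text file. Neither one says how long the file is or where the next file begins, and a keyword that happens to straddle two sectors would be missed by this simple per-sector search.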

As discussed above, it’s generally not feasible to recover plain text files, because they have no discernible binary structure.

It is, however, possible to recover a text file using custom heuristics, as long as you know an exact sequence of letters that is certain to appear near the beginning of the file.  I will write a short tutorial on performing this task in a future article. Stay tuned!