How does file type detection work?

The file command

The file command (which is usually installed by default) is used to determine the type of a given file.

Interestingly, the result is almost always correct, which is impressive given the number of different file types (there are currently 2280 registered MIME types).

By comparison, file type detection on Windows is very naive: it is mostly based on the file extension, so you can easily fool it, for example by renaming a file.
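To illustrate why extension-based detection is fragile, here is a minimal sketch using Python's standard mimetypes module, which guesses a type from the file name alone. It is only an illustration of the general approach, not how Windows itself is implemented, and the file names are made up for the example.

```python
import mimetypes

# guess_type() only looks at the name; it never opens the file.
real_image = mimetypes.guess_type("photo.png")[0]

# A text file renamed to ".png" gets exactly the same answer,
# because the content is never inspected.
renamed_text = mimetypes.guess_type("notes-renamed.png")[0]

print(real_image)
print(renamed_text)
```

Both calls report the same type, because the decision depends only on the extension.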

You can find the complete manual of the file command here: man file.

The file command uses multiple techniques to determine the file type.

Magic bytes

The idea of the magic bytes is to provide the type information (and sometimes version) in the first few bytes of the file. It is usually used for binary files, but can also be used for text files (since characters are also bytes).

For example, the WebAssembly binary format (wasm) has the following definitions:

Definition
          
magic   ::= 0x00 0x61 0x73 0x6D
version ::= 0x01 0x00 0x00 0x00

module  ::= magic version ...
          
        
Where module is the binary file.

As you can see, the first 4 bytes represent the file type (\0asm for short) and the next 4 bytes the version (0x1 here).

Each file type has to be registered and has a unique sequence of magic bytes to avoid any conflicts.
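The WebAssembly header described above can be checked in a few lines. This is a minimal sketch, not the way the file command is implemented: the function name parse_wasm_header is hypothetical, and the code assumes the version field is a 4-byte little-endian unsigned integer, which matches the byte order shown in the definition.

```python
import struct

# The magic bytes from the definition: 0x00 0x61 0x73 0x6D ("\0asm").
WASM_MAGIC = b"\x00asm"

def parse_wasm_header(data: bytes):
    """Return the wasm version if data starts with a wasm header, else None."""
    if len(data) < 8 or data[:4] != WASM_MAGIC:
        return None
    # Assumption: the version is a 4-byte little-endian unsigned integer.
    (version,) = struct.unpack("<I", data[4:8])
    return version

header = bytes([0x00, 0x61, 0x73, 0x6D, 0x01, 0x00, 0x00, 0x00])
print(parse_wasm_header(header))     # 1
print(parse_wasm_header(b"<html>"))  # None
```

In practice you would read only the first 8 bytes of the file and pass them to the function, which is why this kind of check is cheap even for large files.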

RegExp

Text files usually cannot rely on this technique, so the file command uses a collection of regular expressions (RegExp) instead.

For example, to detect an HTML file, the following regular expressions are used:

Definition
          
\<head\>
\<title\>
\<a\ href=
          
        
While it's possible to have an HTML file without these tags, it's very unlikely in real world usage.
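The same idea can be sketched with Python's re module. The patterns below are adapted from the ones listed above (Python does not need "<", ">" or spaces escaped), and the function name looks_like_html is hypothetical. The real rules used by the file command are more elaborate, so treat this as an illustration only.

```python
import re

# Patterns adapted from the definition above, without the extra escaping.
HTML_PATTERNS = [re.compile(p) for p in (r"<head>", r"<title>", r"<a href=")]

def looks_like_html(text: str) -> bool:
    """Return True if at least one HTML pattern is found in the text."""
    return any(p.search(text) for p in HTML_PATTERNS)

print(looks_like_html("<html><head><title>Hi</title></head></html>"))  # True
print(looks_like_html("just some plain text"))                         # False
```

Note that this is a heuristic: a plain text file that happens to mention one of these tags would be classified as HTML too.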

In the case of scripts, the interpreter is usually indicated on the first line of the file (for example #!/usr/bin/python). This line is also used for file type detection.
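Reading the interpreter line could look like the following minimal sketch. The function name shebang_interpreter is hypothetical, and the code only extracts the interpreter path; how the file command maps that path to a file type is not shown here.

```python
def shebang_interpreter(first_line: str):
    """Return the interpreter path from a '#!' line, or None if there is none."""
    if not first_line.startswith("#!"):
        return None
    # Everything after "#!" is the interpreter, optionally followed by arguments.
    parts = first_line[2:].split()
    return parts[0] if parts else None

print(shebang_interpreter("#!/usr/bin/python"))  # /usr/bin/python
print(shebang_interpreter("print('hello')"))     # None
```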

Source

You can find the sources of the file command here and the rules for the detection here.

Contact me

Sven Sauleau

Say hello: sven.sauleau@xtuc.fr.

Ping me on Twitter: @svensauleau.
