Storing a million images in the filesystem

I’d recommend using a regular file system instead of a database. A file system is easier to work with than a database: you can use normal tools to access the files, and file systems are designed for this kind of usage. NTFS should work just fine as a storage system.

Do not store the actual path in the database. It is better to store the image’s sequence number in the database and have a function that generates the path from the sequence number, e.g.:

 File path = generatePathFromSequenceNumber(sequenceNumber);

This is easier to handle if you need to change the directory structure somehow. Maybe you need to move the images to a different location, or maybe you run out of space and start storing some of the images on disk A and some on disk B. It is easier to change one function than to change paths in the database.
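
To illustrate the disk A / disk B case, here is a minimal sketch of how the storage root could be chosen inside a single function. The class name, the two root directories and the cut-over sequence number are all hypothetical; only the idea of deriving the location from the sequence number comes from the text above.

```java
import java.io.File;

final class ImageLocator {
    // Hypothetical storage roots and cut-over point, chosen only for illustration.
    private static final File DISK_A = new File("/mnt/diskA/images");
    private static final File DISK_B = new File("/mnt/diskB/images");
    private static final long FIRST_SEQUENCE_ON_DISK_B = 1_000_000L;

    private ImageLocator() {}

    /** Picks the storage root for an image from its sequence number alone. */
    static File rootFor(long sequenceNumber) {
        return sequenceNumber < FIRST_SEQUENCE_ON_DISK_B ? DISK_A : DISK_B;
    }

    /** Combines the chosen root with a path relative to that root. */
    static File locate(long sequenceNumber, String relativePath) {
        return new File(rootFor(sequenceNumber), relativePath);
    }
}
```

If the images move again, only rootFor has to change; the sequence numbers stored in the database stay as they are.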

I would use this kind of algorithm for generating the directory structure:

  1. First pad your sequence number with leading zeroes until you have a string of at least 12 digits. This is the name of your file. You may want to add a suffix:
    • 12345 -> 000000012345.jpg
  2. Then split the string into 2- or 3-character blocks, where each block denotes a directory level. Have a fixed number of directory levels (for example 3 levels of 3 characters each):
    • 000000012345 -> 000/000/012
  3. Store the file under the generated directory:
    • Thus the full path and filename for the file with sequence id 12345 is 000/000/012/000000012345.jpg
    • For the file with sequence id 12345678901234 the path would be 123/456/789/12345678901234.jpg
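
The three steps above could be sketched in Java roughly as follows. This is one possible reading of the algorithm, assuming 3-character blocks, 3 directory levels and a ".jpg" suffix; the class name ImagePaths and the helper methods paddedName and relativePath are made up for this example.

```java
import java.io.File;
import java.util.Locale;

final class ImagePaths {
    private static final int BLOCK_SIZE = 3;      // characters per directory level
    private static final int LEVELS = 3;          // fixed number of directory levels
    private static final String SUFFIX = ".jpg";  // assumed file suffix

    private ImagePaths() {}

    /** Step 1: zero-pad the sequence number to at least 12 digits. */
    static String paddedName(long sequenceNumber) {
        if (sequenceNumber < 0) {
            throw new IllegalArgumentException("sequence number must not be negative");
        }
        return String.format(Locale.ROOT, "%012d", sequenceNumber);
    }

    /** Steps 2 and 3: use the leading blocks as directories, then append the file name. */
    static String relativePath(long sequenceNumber) {
        String name = paddedName(sequenceNumber);
        StringBuilder path = new StringBuilder();
        for (int level = 0; level < LEVELS; level++) {
            int start = level * BLOCK_SIZE;
            path.append(name, start, start + BLOCK_SIZE).append('/');
        }
        return path.append(name).append(SUFFIX).toString();
    }

    /** Same result as relativePath, wrapped in a File relative to the storage root. */
    static File generatePathFromSequenceNumber(long sequenceNumber) {
        return new File(relativePath(sequenceNumber));
    }
}
```

With these assumptions, relativePath(12345) gives 000/000/012/000000012345.jpg and relativePath(12345678901234L) gives 123/456/789/12345678901234.jpg.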

Some things to consider about directory structures and file storage:

  • The above algorithm gives you a system where every leaf directory holds at most 1000 files (as long as you have fewer than 1 000 000 000 000 files in total).
  • There may be limits on how many files and subdirectories a directory can contain; for example, the ext3 file system on Linux has a limit of 31998 subdirectories per directory.
  • Normal tools (WinZip, Windows Explorer, command line, bash shell, etc.) may not work very well if a directory contains a large number of files (> 1000).
  • The directory structure itself takes some disk space, so you do not want too many directories.
  • With the above structure you can always find the correct path for an image file just by looking at its filename, if you happen to mess up your directory structure.
  • If you need to access files from several machines, consider sharing the files via a network file system.
  • The above directory structure does not work well if you delete a lot of files, because that leaves “holes” in the directory structure. But since you are not deleting any files, it should be OK.
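
The point about finding the correct path from the filename alone can be sketched like this. The code assumes the layout described earlier (3 directory levels of 3 characters each, and a name of at least 12 digits); the class name PathRecovery and the method expectedDirectory are hypothetical.

```java
final class PathRecovery {
    private static final int BLOCK_SIZE = 3;   // characters per directory level
    private static final int LEVELS = 3;       // fixed number of directory levels
    private static final int MIN_DIGITS = 12;  // minimum length of the padded name

    private PathRecovery() {}

    /** Returns the directory a file belongs in, judging only by its file name. */
    static String expectedDirectory(String fileName) {
        // Drop the suffix, if there is one, to get the padded sequence number.
        int dot = fileName.lastIndexOf('.');
        String digits = dot >= 0 ? fileName.substring(0, dot) : fileName;
        boolean allDigits = digits.chars().allMatch(c -> c >= '0' && c <= '9');
        if (digits.length() < MIN_DIGITS || !allDigits) {
            throw new IllegalArgumentException("not a padded sequence number: " + fileName);
        }
        // Rebuild the directory levels from the leading blocks of the name.
        StringBuilder dir = new StringBuilder();
        for (int level = 0; level < LEVELS; level++) {
            if (level > 0) {
                dir.append('/');
            }
            int start = level * BLOCK_SIZE;
            dir.append(digits, start, start + BLOCK_SIZE);
        }
        return dir.toString();
    }
}
```

A repair script could walk the storage root, call expectedDirectory for each file it finds, and move any file whose current directory does not match.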
