

serve 1 000 000 images from one directory?




Posted by Elliot01, 08-13-2008, 06:56 AM
Hello, I plan to build a huge image gallery, using lighttpd to serve the images. It would be easiest for me to keep all of the files (more than 1,000,000) in one directory. Is that OK? I have no idea whether this can cause any problems. Should I split my files into several directories, and does that make serving any better/faster? Thank you for your help. Elliot.

Posted by anandkj, 08-13-2008, 07:52 AM
It'd be better if you could go for a database back-end.

Posted by zuborg, 08-13-2008, 08:32 AM
I've explained here how to store a large number of images on a server.

Posted by stephanhughson, 08-13-2008, 11:51 AM
I wouldn't recommend having them all in one directory; it's likely to cause you hassle. When I worked for a web hosting company, I once had a customer with thousands of files. He uploaded them using FTP but ran into problems where only the first X files were displayed, the directory listing took ages, etc. These aren't impossible issues to overcome, but sooner or later there will be hassle. You also need to think about how you are going to back all of these images up safely. Splitting them up logically into directories sounds right, as zuborg mentioned. The number of files you can store in a single directory is in theory quite high, but it's not worth it when you can just store them logically, e.g.:

giraffe.jpg   -> /home/whatever/g/i/giraffe.jpg
example.jpg   -> /home/whatever/e/x/example.jpg
excellent.jpg -> /home/whatever/e/x/excellent.jpg

Something like that should split them up fairly nicely; a script would automate it. If you are thinking of storing thousands of little files, remember that even a small file (e.g. 500 bytes) will still occupy a certain minimum amount of space, depending on how the drive has been formatted. You also need to pay attention to how many inodes are free.
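
A minimal sketch (in Python) of the first-two-letters scheme described above; the function names and the padding rule for very short filenames are my own assumptions, not from the post:

import os
import shutil

def letter_path(base_dir, filename):
    """Return base_dir/<first letter>/<second letter>/<filename>."""
    name = os.path.basename(filename).lower()
    # Pad very short names with 'x' so one-character names still map
    # somewhere sensible (an assumption, not covered in the post).
    first = name[0] if len(name) > 0 else "x"
    second = name[1] if len(name) > 1 else "x"
    return os.path.join(base_dir, first, second, os.path.basename(filename))

def store(base_dir, src):
    """Move src into its letter-based subdirectory, creating it if needed."""
    dest = letter_path(base_dir, src)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(src, dest)
    return dest

# letter_path("/home/whatever", "giraffe.jpg") -> "/home/whatever/g/i/giraffe.jpg"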

Posted by Harzem, 08-13-2008, 11:59 AM
In a Linux shell, some commands have problems working with directories that have over 2000 files (IIRC), so splitting into directories is definitely what I'd do. You may not want to split them by the first two letters, because some letter combinations occur much more frequently than others, so you may not be able to split the files into similarly sized directories. There are more complex ways, like hashing the file name and splitting by the first two or three characters of the hash. Or, best of all, use a database backend to store where each file is.

Posted by steve_c, 08-13-2008, 12:26 PM
100 directories x 100 sub-directories = 10,000 total directories. 10,000 total directories x 100 files each = 1,000,000 files. The path will be 001/001/abcdefg.jpg. Don't use the actual name of the file uploaded by the user. Instead, create a random token for the file and save the info in your database. Your database table may look something like this:

create table images (
    id int unsigned not null auto_increment,
    token varchar(50) unique not null,
    member_id int unsigned not null,
    title varchar(80) not null,
    description text not null,
    category_id
    file_path
    url
    etc ...

To serve the image, create a script that looks the file up in the database: http://www.yoursite.com?image=d7yjDU43np
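
A rough sketch of the token idea above, under my own assumptions (helper names, token length, and the random bucket choice are not from the post); the resulting path is stored in the database's file_path column and the ?image= script serves the file it finds there:

import os
import secrets

def new_token(length=10):
    """Random token used instead of the user's original filename."""
    return secrets.token_urlsafe(16)[:length]

def new_image_path(token, ext=".jpg"):
    """Pick one of 100 x 100 = 10,000 directories (000-099/000-099) at random;
    the resulting path is saved in the database alongside the token."""
    d1 = secrets.randbelow(100)
    d2 = secrets.randbelow(100)
    return os.path.join("%03d" % d1, "%03d" % d2, token + ext)

# Example: new_image_path("d7yjDU43np") might return "042/017/d7yjDU43np.jpg"
# (the directory numbers are random). The lookup script, e.g. ?image=d7yjDU43np,
# would SELECT file_path from the images table by token and serve that file.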

Posted by Elliot01, 08-14-2008, 05:35 PM
Thanks to everybody! I will try to work out some database-free solution; I would like to run lighttpd with static content only. Maybe I will go for something based on the first two or more letters, as some of you suggest: the number of files in each directory won't be the same, but it is probably still better than storing all the files in one directory.

Posted by RBBOT, 08-14-2008, 07:13 PM
Definitely do not put them all in one directory. Apart from the reasons already mentioned, the sheer amount of time it takes the filesystem to look up a particular file will cause a performance hit you needn't have. If you want a more even distribution for dividing them into subdirectories than taking the first few letters of the filename, calculate an MD5 hash of the filename and use the first N characters of the hash, which should be statistically evenly distributed over a large set of filenames.
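
A minimal sketch of the MD5-prefix bucketing described above; the directory depth and prefix width are my choices, not prescribed in the post:

import hashlib
import os

def md5_bucket_path(base_dir, filename, levels=2, chars_per_level=2):
    """Place a file under subdirectories taken from the hex MD5 of its name."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * chars_per_level:(i + 1) * chars_per_level]
             for i in range(levels)]
    return os.path.join(base_dir, *parts, filename)

# e.g. if the hex digest of "giraffe.jpg" happened to start with "8f1b",
# the file would land in base_dir/8f/1b/giraffe.jpg.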

Posted by Harzem, 08-14-2008, 07:16 PM
Still, as RBBOT and I said, you might want to use MD5'd versions of the file names for categorisation. Does lighttpd have PHP support?

Posted by plumsauce, 08-14-2008, 08:26 PM
Following the preceding advice would be a good idea. That said, MD5 is more work than necessary in this situation, because security is not a concern. A CRC-32 of the name would be just as good, and faster; faster meaning fewer CPU cycles used per lookup. If you use the hexadecimal representations of the individual bytes of the CRC, you get 256 directories per level using just two of the bytes. In two levels you already have 256*256 = 65,536 directories. Put 256 files in each, and you have 16,777,216 files, e.g.:

00
  -- 00
  -- 01
  -- 02
  ...
  -- FF
01
  -- 00
  ...
FF

To find a file, do the CRC-32, transform it to a directory, prepend that to the file name, and that's where it's going to be.
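
The same idea with CRC-32, roughly as described above (a sketch; zlib's crc32 is one readily available implementation, and which two bytes of the CRC to use is my choice):

import os
import zlib

def crc_bucket_path(base_dir, filename):
    """Use two bytes of the CRC-32 of the name, written as hex, as two
    directory levels: 256 * 256 = 65,536 buckets."""
    crc = zlib.crc32(filename.encode("utf-8")) & 0xFFFFFFFF
    level1 = "%02X" % ((crc >> 24) & 0xFF)   # most significant byte of the CRC
    level2 = "%02X" % ((crc >> 16) & 0xFF)   # next byte of the CRC
    return os.path.join(base_dir, level1, level2, filename)

# To find a file later, recompute the CRC of the name and rebuild the same
# path; no database lookup is needed.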

Posted by RBBOT, 08-15-2008, 04:18 PM
MD5 is superior to CRC-32 for this purpose: the MD5 algorithm is designed so that the bytes of the hash appear with equal statistical distribution over a large population, so you should get a fairly even number of files in each bucket. CRC-32 may be simpler and faster to calculate, but does not have this property. The CPU time taken to MD5 something as short as a filename is so small it shouldn't enter into the equation, even though it is more than a CRC-32. Also, most programming languages have a built-in function to calculate the hash.
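
If you'd rather measure the distribution claim on your own filenames than take it on faith, a quick sketch that counts how many names land in each 256-way top-level bucket under both hashes (the sample filenames are hypothetical stand-ins):

import hashlib
import zlib
from collections import Counter

def bucket_counts(filenames):
    """Count names per 256-way top-level bucket for MD5 and CRC-32."""
    md5_buckets = Counter(hashlib.md5(n.encode()).digest()[0] for n in filenames)
    crc_buckets = Counter((zlib.crc32(n.encode()) >> 24) & 0xFF for n in filenames)
    return md5_buckets, crc_buckets

# names = ["img%06d.jpg" % i for i in range(1000000)]   # stand-in filenames
# m, c = bucket_counts(names)
# print(max(m.values()) - min(m.values()), max(c.values()) - min(c.values()))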

Posted by JBapt, 08-15-2008, 05:35 PM
Hi, I've had issues with directories with too many files; system commands like ls, for instance, tend to break. My advice is to use a database for it. It may look like overhead in developing what you need, but I reckon that in time you will see the benefits, especially when managing the server.

Posted by brianoz, 08-18-2008, 11:51 AM
The issue is that, the way Unix is constructed, there's an exponential performance drop-off on directory scans once the size goes over a particular maximum, I think possibly around 2000 entries. For this reason it makes sense to use subdirectories. My guess is that performance is probably roughly similar to that of a database and blobs, but I'm only guessing here; certainly plain files are a lot easier to deal with.

Posted by zuborg, 08-20-2008, 04:57 AM
That's no longer true. First, inode lookups are cached: the next access to the same file will not scan the directory that contains the file to find its inode. Second, modern file systems use directory indexing. FreeBSD does too, but doesn't save the index to disk (for compatibility); it just builds the directory index at first access. Lookup through this index is quite fast and takes O(log(number of files)) time.

Posted by CoderJosh, 08-20-2008, 10:05 AM
I'd also build some hash-driven hierarchy with maybe two levels; that should be much less painful than having that many files in one single directory. Some tools use iterators that aren't too efficient...


