
Unsolved


July 5th, 2011 07:00

programming best practice

Hello,

I'm looking to back up a large number of files (I estimate somewhere around 200 million). File sizes vary from a few KB to a few GB.

Due to the way the files are being generated, my current approach would be to store one file per C-Clip. This would mean 200,000,000 C-Clips in 5 years.

Is this feasible (and what would be the best practice approach)?

Thanks a lot,

Virgil

409 Posts

July 5th, 2011 07:00

For each object you ingest into Centera, a CDF and a blob are written.  The CDF is mirrored (2 copies).  The blob is also mirrored if the cluster is using CPM; if it is using CPP, the blob is cut into 6 fragments, a parity fragment is calculated, and the 7 fragments are written to the Centera.  For the time being, let's just consider CPM (mirroring).

So by default, for each of your objects, 4 Centera objects are created (2 CDF copies plus 2 blob copies).  So your 200M objects become 800M Centera objects.

Each Centera node can hold 100M objects, so 8 nodes would get you there.
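To make the arithmetic above easy to rerun with other numbers, here is a minimal Python sketch. The copy counts and the 100M-per-node figure are taken from this reply; the function names are made up for illustration and nothing here talks to a Centera.

```python
import math

FILES = 200_000_000               # file count from the question
CDF_COPIES = 2                    # the CDF is mirrored
BLOB_COPIES = 2                   # the blob is mirrored under CPM
OBJECTS_PER_NODE = 100_000_000    # per-node figure quoted in this reply


def centera_objects(files, cdf_copies=CDF_COPIES, blob_copies=BLOB_COPIES):
    """Centera objects created when every file becomes one C-Clip."""
    return files * (cdf_copies + blob_copies)


def nodes_needed(objects, per_node=OBJECTS_PER_NODE):
    """Smallest whole number of nodes that can hold the given object count."""
    return math.ceil(objects / per_node)


total = centera_objects(FILES)
print(total, nodes_needed(total))
```

With the figures from this thread it gives 800,000,000 objects and 8 nodes.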

You can reduce the number of Centera objects needed by enabling embedded blobs, which base64-encodes the blob and adds it as an attribute to the CDF. This halves the Centera object count, but only for objects <= 100 KB in size.  So if you are writing a lot of small files (<100 KB), this is a good thing to enable.
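As a rough sketch of what embedding buys you, the snippet below estimates the object count for a given share of small files. It assumes an embedded file produces only the 2 mirrored CDF copies while every other file still produces 4 objects under CPM. The `small_fraction` parameter is hypothetical, since the real size distribution is not known here.

```python
def objects_with_embedding(files, small_fraction):
    """Estimate Centera objects when small files are embedded in the CDF.

    Assumption: embedded files cost 2 objects (mirrored CDF only),
    all other files cost 4 (mirrored CDF plus mirrored blob).
    """
    if not 0.0 <= small_fraction <= 1.0:
        raise ValueError("small_fraction must be between 0 and 1")
    small = round(files * small_fraction)
    large = files - small
    return small * 2 + large * 4


# Hypothetical distributions, for illustration only.
for fraction in (0.0, 0.5, 1.0):
    print(fraction, objects_with_embedding(200_000_000, fraction))
```

If half of the 200M files qualified for embedding, the estimate would drop from 800M to 600M objects.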

As you can't give a breakdown of the file size distribution, I can't give you a definitive best practice, but if you were storing lots of large files then I would suggest using parity protection, which reduces the storage overhead needed to protect each object.
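To put a number on that overhead, here is a small sketch comparing raw blob capacity under mirroring and under the 6+1 parity scheme described earlier. It only counts blob data and ignores the CDF copies and any other internal overhead, so treat it as an approximation.

```python
from fractions import Fraction


def mirror_overhead(copies=2):
    """Raw capacity used per byte of blob data when the blob is mirrored."""
    return Fraction(copies)


def parity_overhead(data_fragments=6, parity_fragments=1):
    """Raw capacity used per byte of blob data with data plus parity fragments."""
    return Fraction(data_fragments + parity_fragments, data_fragments)


blob_gb = 1  # example blob size in GB, chosen for illustration
print("mirrored:", float(blob_gb * mirror_overhead()), "GB")
print("parity:  ", round(float(blob_gb * parity_overhead()), 3), "GB")
```

Under these assumptions a 1 GB blob takes 2 GB of raw capacity when mirrored and about 1.167 GB with parity.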

If you were storing lots of small files then you might want to consider containerisation: aggregating the small files into a large container and writing that to the Centera.  But if you want retention and/or deletion to work at the individual file level, you can't really do this.
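A minimal sketch of the containerisation idea, using a plain tar archive as the container. The 100 KB limit and the function name are assumptions for illustration, and the step that actually writes the container to the Centera is left out because it depends on your application and SDK.

```python
import os
import tarfile


def build_container(paths, container_path, small_limit=100 * 1024):
    """Pack files no larger than small_limit bytes into one tar archive.

    Returns (packed, skipped). Skipped files are the larger ones, which
    would be stored individually. Members are stored under their base
    name, so callers must make sure base names are unique.
    """
    packed, skipped = [], []
    with tarfile.open(container_path, "w") as tar:
        for path in paths:
            if os.path.getsize(path) <= small_limit:
                tar.add(path, arcname=os.path.basename(path))
                packed.append(path)
            else:
                skipped.append(path)
    return packed, skipped
```

You would call it once per batch (for example once a day) and then store the resulting archive as a single object. Keep in mind the caveat above: retention and deletion then apply to the whole container, not to the files inside it.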

July 5th, 2011 08:00

Hello Paul and thank you for your answer!

Some specs - we will then go for CPM. Our Centera has only 4 nodes.

Would it make any difference if I embedded the BLOBs for all the files from one day (approx. 100,000) within a single C-Clip?

Also, most of the files are below 1 MB, but there are rare exceptions when we get >1GB.

Does any of this change your statement :-)?

Thank you,

Virgil
