Checksums and Verification
Part 2: Define and Decide
This is a multi-part blog. As more posts are published, links to them will be added here. Before reading further, we encourage you to begin with Checksums Part 1: The 5 W's, which defines what a checksum is and why it's important in the media and entertainment industry.
For Part 2, we asked YOU what you were most interested in, and many answered that they would like each checksum defined and explained, along with which is better to use in different situations. Obviously you are all using ShotPut Pro for your offloading, right? Perhaps you are researching which offloading application is right for you. This blog will also discuss how to fairly compare offload applications and their use of checksums.
We will not reinvent the wheel, but we will give credit where credit is due! All sources used to help explain and define checksums have been appropriately included. If you're looking for some light reading... these are not for you! But if you're looking for more in-depth information, they have plenty.
xxHash: xxHash is an extremely fast, non-cryptographic hash algorithm, working at speeds close to RAM limits. It is proposed in two flavors, 32 and 64 bits. (SMHasher on github.io)
For ShotPut Pro, ShotSum and PreRoll Post we use 64-bit xxHash. We recommend xxHash as the checksum type unless you have a requirement for another one. xxHash can outperform MD5, for example, because it can run at the speed of your RAM, whereas MD5 is a CPU-bound process.
It is worth noting here that some UK-based companies do not accept xxHash, so always verify with your customer or insurance company whether there is a preference or requirement.
MD5: The MD5 algorithm is a widely used hash function producing a 128-bit hash value. Although no longer considered secure against deliberate tampering, it remains well suited as a checksum to verify data integrity against accidental corruption. (Wikipedia MD5)
For years MD5 was the de facto standard checksum. Although xxHash is becoming more widely used, there are still many companies that require MD5.
The primary things to consider when choosing a checksum are its speed and, when talking about files, the chance of a checksum collision. The collision chance is the probability that two different files will map to the same checksum value. xxHash is great because it is fast while still having a low probability of collision. Many even consider hash functions more secure than a byte-by-byte comparison, because the chance that hardware returns the wrong results is in many cases higher than the chance of a checksum collision.
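To put rough numbers on that collision chance, the standard birthday-bound approximation can be sketched in a few lines of Python. This is our own back-of-the-envelope illustration, not code from any of the products mentioned:

```python
import math

def collision_probability(n_files: int, hash_bits: int) -> float:
    """Birthday-bound estimate of the chance that at least two of
    n_files map to the same checksum value."""
    space = 2.0 ** hash_bits
    # p ~= 1 - exp(-n(n-1) / (2 * 2^bits)); accurate while p is small
    return 1.0 - math.exp(-n_files * (n_files - 1) / (2.0 * space))

# One million files through a 64-bit checksum such as xxHash64:
p64 = collision_probability(1_000_000, 64)   # about 3e-8, roughly 1 in 37 million
# The same million files through 128-bit MD5: so small it underflows
# to zero in double-precision arithmetic.
p128 = collision_probability(1_000_000, 128)
```

Even at a million files, a 64-bit checksum's collision odds sit far below the odds of undetected hardware error, which is the point made above.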
SHA-1: (Secure Hash Algorithm 1) is a cryptographic hash function that produces a 160-bit (20-byte) hash value known as a message digest. A SHA-1 hash value is typically rendered as a 40-digit hexadecimal number. (Wikipedia SHA-1)
SHA-2: (Secure Hash Algorithm 2, in 256- and 512-bit variants) is an upgrade to SHA-1 and includes six hash functions; Imagine Products applications offer two of the six. Cryptographic hash functions are mathematical operations run on digital data; by comparing the computed "hash" (the output from execution of the algorithm) to a known and expected hash value, a person can determine the data's integrity. (Wikipedia SHA-2)
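That "compare the computed hash to a known and expected value" step can be sketched with Python's standard hashlib module, which implements MD5, SHA-1 and the SHA-2 family. The helper name file_checksum and the filename in the comment are our own, for illustration only:

```python
import hashlib

def file_checksum(path: str, algorithm: str = "sha256",
                  chunk_size: int = 1 << 20) -> str:
    """Stream a file through the named hash in 1 MB chunks so that
    large media files never need to fit in RAM; returns the hex digest."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Verifying integrity: recompute and compare against the recorded digest.
# expected = "..."  # the digest noted at offload time
# assert file_checksum("clip0001.mov", "md5") == expected
```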
MD5 and some of the SHA checksum algorithms are still sometimes required by insurance companies or by the government because they are older and established, but they were designed as cryptographic hashes. Cryptographic hashes were originally designed for security tasks such as protecting passwords. They were meant to be complex, and sometimes even deliberately slow, to keep passwords safe. That isn't ideal for many of us, which is why xxHash was created.
Unless you have a specific requirement, we recommend xxHash (or sometimes MD5) because the speed of the others is usually not worth the trade-off in collision space. For SSD-to-SSD copies, you will most likely see a performance hit with any of the algorithms other than xxHash.
Comparing Checksum Applications
It's a good idea to choose the right tools for the job. Obviously we think our products are the best (and so do thousands of others!), but it's a good idea to test different workflow applications to be sure you are getting what you need to ensure data integrity and accuracy.
Here's one thing to remember...
Computers are designed with a combination of caches along the data handling stream. They're present in hard disks, in the connection ports and inside the computer's operating system. The idea is to speed up the return of data requests for 'known' recently accessed items.
Think of how your web browser caches web pages you've previously visited. Caching browser history allows the browser to quickly present recent pages when requested again without having to download the entire page each time.
The Apple operating system has a similar methodology (as do drives themselves). Items recently accessed are kept in a revolving cache of RAM for fast presentation.
So when an application asks the operating system to read back the most recent file from the output hard disk, macOS says "Oh! No need to go get that again, I have a copy right here!" and simply returns the cached information. This is great for general performance, but not if you're trying to actually compare and verify one copy against another. What comes back is just a repeat of the source file, not a fresh, full read from the output disk. That is meaningless for verification purposes and doesn't even offer the security of comparing file sizes.
In fact, with Apple operating systems you cannot obtain a true hard-disk read of a file (rather than a cached copy) without explicitly circumventing the cache, a method only seasoned programmers would be aware of, or those incredibly interested in how checksums actually work - like yourselves!
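For the curious, that circumvention looks roughly like this in Python: macOS exposes an F_NOCACHE flag through the standard fcntl module that tells the OS to bypass its buffer cache for a given file descriptor. This is a sketch under that assumption; on non-Mac platforms the flag doesn't exist and the code falls back to an ordinary (possibly cached) read:

```python
import fcntl
import os

def read_uncached(path: str, chunk_size: int = 1 << 20) -> bytes:
    """Read a file while asking macOS to bypass its buffer cache, so the
    bytes come from the disk itself rather than from RAM."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(fcntl, "F_NOCACHE"):          # macOS only
            fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)  # 1 = don't cache this fd
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```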
To test different offloading applications, open the Activity Monitor utility on the Mac and click the 'Disk' tab. Then open the Terminal utility and flush the cache by typing the command "sudo purge" (Unix commands are case-sensitive, so enter it in lowercase).
Once you're ready, do a reasonably sized offload - say 15 GB. Then look at the Disk Read and Write totals in the lower table. For a true checksum comparison, the GBs read should be double the GBs written. That's because you're reading once from the source, writing once to the output drive, then reading back from the output drive to calculate the checksums.
If the application or method you're considering doesn't show roughly double the read GBs compared to the write GBs, then it isn't actually retrieving the destination disk's content to compare with the source files, and it is less secure than a true checksum comparison. In other words, the final destination copy of your files may not actually match the source, which is a big problem if you are truly concerned with data integrity.
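That rule of thumb is easy to express as a quick sanity check. The helper name and the 10% tolerance below are our own choices, picked only for illustration:

```python
def verifies_with_full_readback(read_gb: float, write_gb: float,
                                tolerance: float = 0.1) -> bool:
    """True when disk reads are roughly double disk writes: the signature
    of source read + destination write + destination read-back."""
    return write_gb > 0 and abs(read_gb / write_gb - 2.0) <= tolerance

# A 15 GB offload with a genuine read-back verification pass:
print(verifies_with_full_readback(30.2, 15.0))   # True
# Reads equal writes: the destination copy was never read back:
print(verifies_with_full_readback(15.1, 15.0))   # False
```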
Remember, if it seems too good to be true - it probably is.