Make my data trustworthy

Research data is predicted to be the next target of anti-science, anti-research, and other nefarious actors, who may maliciously modify it in an attempt to discredit science, research, and researchers.  Data can also be corrupted by natural phenomena, e.g., a bit flip caused by a cosmic ray hitting the storage medium.  Fortunately, file checksums give you a reliable way to verify data integrity: if a file's checksum still matches the value recorded earlier, it is very unlikely that the file has changed.  You can use the following recipe to check that the data and results you create remain trustworthy as they pass through the stages of your research workflow.  For instance, once data has come off an instrument, you can verify its integrity after processing, after it has been sitting on disk for a while, or after retrieval from a long-term archive.
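
To illustrate why a checksum catches even the smallest corruption, here is a minimal Python sketch that flips a single bit in a block of data and compares digests.  The helper names (sha256_hex, flip_bit) and the sample data are hypothetical, and SHA-256 is used here only for illustration; the recipe below relies on the tools' own defaults.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, byte_index: int, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit_index
    return bytes(corrupted)


# Stand-in for instrument data (hypothetical content).
original = b"temperature,42.0\n" * 1000
# Simulate a single bit flip somewhere in the middle of the data.
corrupted = flip_bit(original, byte_index=500, bit_index=3)

print(sha256_hex(original))
print(sha256_hex(corrupted))
print("digests differ:", sha256_hex(original) != sha256_hex(corrupted))
```

The two blocks of data have the same length and differ in one bit, yet their digests are different, which is the property the rest of this recipe depends on.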

Prerequisites

  1. Secure your own environment.

Prepare for verifying data trustworthiness

  1. Place the data files in a folder, e.g., "mydata".  Let's assume the folder path to be C:\mydata on Windows and ~/mydata on MacOS/Linux.  If there are many files, zip them up into a few large files.
  2. Compute file checksums:
    1. Windows:
      1. Create a folder somewhere to receive the Microsoft File Checksum Integrity Verifier (FCIV) tool.  Let's choose a folder called FCIV under the top level (C:\FCIV).
      2. Follow steps 1-6 in https://www.lifewire.com/how-to-download-and-install-file-checksum-integrity-verifier-fciv-2625185 to download the tool to the FCIV folder.  Upon download, it will appear in the folder as an executable file called fciv.exe.
      3. Open a command prompt (directions for various Windows versions are provided here: https://www.lifewire.com/how-to-open-command-prompt-2618089).
      4. Navigate to the FCIV folder by running the following command: cd C:\FCIV
      5. Run the tool by typing:  .\fciv.exe -add C:\mydata -r -xml checksums
      6. This tells the tool to use C:\mydata as the folder that has the data files, run recursively (repeat for all files and subfolders), and create file checksums in XML format in a new file called checksums.
    2. MacOS:
      1. Open a terminal window.
      2. Go to the folder where the data files are located: cd ~/mydata
      3. At the command prompt, type: shasum * > checksums 
      4. This will calculate SHA-1 file checksums (the shasum default) and store them in a newly created file called checksums.
    3. Linux:
      1. Open a terminal window.
      2. Go to the folder where the data files are located: cd ~/mydata
      3. At the command prompt, type: md5sum * > checksums
      4. This will calculate MD5 file checksums and store them in a newly created file called checksums.
  3. Compute a checksum of the checksums file itself so that you can verify its integrity later:
    1. Windows:
      1. .\fciv.exe checksums
      2. This should print out something like:

        //
        // File Checksum Integrity Verifier version 2.05.
        //
        79ac8d043dc8739f661c45cc33fc07ac checksums

    2. MacOS:
      1. shasum checksums
      2. This should print out something like: 

        f1d2d2f924e986ac86fdf7b36c94bcdf32beec15 checksums

    3. Linux:
      1. md5sum checksums
      2. This should print out something like: 

        cf1d2edf9bcdef8234ac66fcbaeec15 checksums

  4. Save the checksum value from step 3 somewhere safe where you can find it easily (e.g., in a password manager).
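
For workflows that need one script on all three platforms, the preparation steps above can be sketched in Python roughly as follows.  This is a minimal sketch under stated assumptions, not part of the recipe itself: the function names (file_digest, write_manifest) are hypothetical, SHA-256 is swapped in for the tools' default algorithms, and the output uses the "digest, two spaces, relative path" line layout that shasum and md5sum write, so it should be readable by tools such as shasum -a 256 -c.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data files need not fit in memory."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(data_dir: Path, manifest: Path, algorithm: str = "sha256") -> str:
    """Write '<digest>  <relative path>' lines for every file under data_dir.

    Returns the digest of the manifest itself, which is the value to save
    somewhere safe (step 4 above).
    """
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        if path.resolve() == manifest.resolve():
            continue  # never checksum the manifest into itself
        rel = path.relative_to(data_dir).as_posix()
        lines.append(f"{file_digest(path, algorithm)}  {rel}\n")
    # Write bytes so line endings stay '\n' on every platform.
    manifest.write_bytes("".join(lines).encode("utf-8"))
    return file_digest(manifest, algorithm)


if __name__ == "__main__":
    data_dir = Path.home() / "mydata"  # folder from step 1 (assumed location)
    if data_dir.is_dir():
        print(write_manifest(data_dir, data_dir / "checksums"))
```

Unlike the shasum * and md5sum * commands, this sketch walks subfolders as well, similar to the -r option of FCIV.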

Verify trustworthiness

  1. The following example assumes three files in the mydata folder, namely datafile0, datafile1, and datafile2, the last of which has been either tampered with or changed for some other reason.
    1. Windows:
      1. Type at the command prompt: .\fciv.exe checksums
      2. Verify that the checksum matches the value you saved from Step 3.1 (Windows) above.
      3. Type at the command prompt: .\fciv.exe -v -xml checksums
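      4. This will re-checksum each data file and compare it against what is in the checksums file.  The sample output below shows two separate runs: one in which all files verify successfully, and a later one in which datafile2 has changed: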

        Starting checksums verification : 11/19/2021 at 13h28'25

        All files verified successfully

        End Verification : 11/19/2021 at 13h28'25

        Starting checksums verification : 11/19/2021 at 13h30'22

        List of modified files:
        -----------------------
        C:\mydata\datafile2
        Hash is : 3d2333478e0ef1cf5de4ea5685266276
        It should be : 1e2cad9ca0915a7a1a8defb28b754b74

        End Verification : 11/19/2021 at 13h30'22

    2. MacOS:
      1. Type at the command prompt: shasum checksums
      2. Verify that the checksum matches the value you saved from Step 3.2 (MacOS) above.
      3. Type at the command prompt: shasum -c checksums
      4. This will re-checksum each data file and compare it against what is in the checksums file.  You should see output like this:

        datafile0: OK
        datafile1: OK
        datafile2: FAILED
        shasum: WARNING: 1 computed checksum did NOT match

    3. Linux:
      1. Type at the command prompt: md5sum checksums
      2. Verify that the checksum matches the value you saved from Step 3.3 (Linux) above.
      3. Type at the command prompt: md5sum -c checksums
      4. This will re-checksum each data file and compare it against what is in the checksums file.  You should see output like this:

        datafile0: OK
        datafile1: OK
        datafile2: FAILED
        md5sum: WARNING: 1 computed checksum did NOT match

  2. Delete the unencrypted checksums file.
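
The two-stage check above (first the checksums file, then each data file) can be sketched in Python as follows.  This is a rough equivalent of shasum -c and md5sum -c, not a replacement for them.  It assumes the checksums file uses the layout those tools write (digest, a space, a one-character mode marker, then the file name), that it sits in the same folder as the data files, and that the algorithm can be inferred from the digest length; the names verify_manifest and ALGORITHM_BY_LENGTH and the status strings are hypothetical.

```python
import hashlib
from pathlib import Path

# Digest length in hex characters -> hashlib algorithm name.
ALGORITHM_BY_LENGTH = {32: "md5", 40: "sha1", 64: "sha256"}


def file_digest(path: Path, algorithm: str) -> str:
    """Hash a file in 1 MiB chunks and return the hex digest."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_manifest(manifest: Path, saved_manifest_digest: str) -> dict:
    """Return {file name: 'OK' | 'FAILED' | 'MISSING'} for each manifest entry.

    Raises ValueError if the manifest itself does not match the saved digest,
    because its entries cannot be trusted in that case.
    """
    saved = saved_manifest_digest.strip().lower()
    if file_digest(manifest, ALGORITHM_BY_LENGTH[len(saved)]) != saved:
        raise ValueError("checksums file does not match the saved value")

    results = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        digest, name = line.split(" ", 1)
        name = name[1:]  # drop the mode marker: ' ' (text) or '*' (binary)
        path = manifest.parent / name
        if not path.is_file():
            results[name] = "MISSING"
        elif file_digest(path, ALGORITHM_BY_LENGTH[len(digest)]) == digest.lower():
            results[name] = "OK"
        else:
            results[name] = "FAILED"
    return results
```

A call such as verify_manifest(Path.home() / "mydata" / "checksums", saved_value), where saved_value is the string you stored earlier, would return a dictionary in which a changed datafile2 is reported as FAILED.  File names containing newlines or backslashes, which the command-line tools escape, are not handled by this sketch.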

Email securemyresearch@iu.edu if you have other questions about cybersecurity or compliance related to research.

We want your feedback

Please email securemyresearch@iu.edu to report errors/omissions and send critiques, suggestions for improvements, new use cases/recipes, or any other positive or negative feedback you might have.  It will be your contribution to the Cookbook and appreciated by all who use it.