Dataset preprocessing with AUDIT

In this tutorial, we will walk you through the preprocessing steps of the LUMIERE dataset using the AUDIT library. Our goal is to demonstrate how to clean, organize, and standardize the dataset to prepare it for further analysis. We'll cover key operations such as file and folder cleaning, reorganization, and renaming, all while explaining each step in detail.

Let's get started!

0. Import Library's Functions

Before we dive into the dataset preprocessing, we first need to import the necessary functions from the AUDIT library. These utility functions will help us clean and reorganize the dataset. Importing these functions at the start will give us all the tools we need for the upcoming steps.

If you have already installed AUDIT a library, then you'll be able to import the main functions.

import os
from audit.utils.commons.file_manager import (
    list_dirs,
    list_files,
    delete_files_by_extension,
    delete_folders_by_pattern,
    move_files_to_parent,
    organize_subfolders_into_named_folders,
    rename_files,
    add_string_filenames,
    rename_directories
)

1. Data understanding

Before diving into the preprocessing tasks, it is essential to gain an understanding of the dataset's structure. This helps us identify the key elements we will be working with, such as the available sequences and segmentation data. The LUMIERE dataset contains images captured over multiple time points, so we'll need to identify and remove the unwanted data, keeping only what’s relevant for analysis.

In this section, we will explore the overall structure of the dataset, focusing on the folders and files that need our attention.

To follow along, download the LUMIERE dataset using this link: DOWNLOAD LUMIERE

root_data_path = "./datasets/LUMIERE/"

Exploring the Dataset

We begin by checking the main directory of the dataset to understand its organization. We observe that there are 91 directories, each corresponding to a patient.

print(list_dirs(root_data_path)[:5])
print(list_dirs(root_data_path)[-5:])

['Patient-001', 'Patient-002', 'Patient-003', 'Patient-004', 'Patient-005']
['Patient-087', 'Patient-088', 'Patient-089', 'Patient-090', 'Patient-091']

Each patient folder contains subdirectories for different timepoints, typically named in the format "week-XXX", where XXX corresponds to the week the image was taken. We may also encounter directories with additional suffixes like "-N" for specific timepoints.

To streamline our tutorial, we will focus only on the core timepoints and exclude those with the "-N" suffix.

print(list_dirs(f"{root_data_path}Patient-001"))
print(list_dirs(f"{root_data_path}Patient-091"))

['week-000-1', 'week-000-2', 'week-044', 'week-056']
['week-000', 'week-001', 'week-014', 'week-026', 'week-036', 'week-043']

Reviewing the data at each timepoint

Each timepoint contains a variety of sequences along with segmentation data. For example, in the case of patient 091, we examine the week-0000 folder to find different sequence data along with segmentation predictions generated by two models: "DeepBraTumIA-segmentation" and "HD-GLIO-AUTO-segmentation".

For this tutorial, we will consider the segmentation provided by "DeepBraTumIA-segmentation" as the ground truth.

print(list_dirs(f"{root_data_path}Patient-091/week-000"))
print(list_files(f"{root_data_path}Patient-091/week-000"))

['DeepBraTumIA-segmentation', 'HD-GLIO-AUTO-segmentation']
['CT1.nii.gz', 'FLAIR.nii.gz', 'T1.nii.gz', 'T2.nii.gz']

The sequences that are of primary interest to us are stored in the "atlas/skull_strip" directory, and the corresponding segmentation is found in the "atlas/segmentation" directory.

print(list_files(f"{root_data_path}Patient-091/week-000/DeepBraTumIA-segmentation/atlas/skull_strip"))
print(list_files(f"{root_data_path}Patient-091/week-000/DeepBraTumIA-segmentation/atlas/segmentation"))

['brain_mask.nii.gz', 'ct1_skull_strip.nii.gz', 'flair_skull_strip.nii.gz', 't1_skull_strip.nii.gz', 't2_skull_strip.nii.gz']
['measured_volumes_in_mm3.json', 'seg_mask.nii.gz']

2. Files cleaning

Now that we understand the structure, the next step is to clean the dataset by removing unnecessary files. This involves eliminating sequences and files that we won’t use in our analysis. The files we want to remove are mainly the raw image sequences ('CT1.nii.gz', 'FLAIR.nii.gz', 'T1.nii.gz', 'T2.nii.gz'), as well as other irrelevant files like 'brain_mask.nii.gz' and 'measured_volumes_in_mm3.json'.

In this section, we will walk you through the process of cleaning these files using AUDIT functions.

delete_files_by_extension(
    root_dir=root_data_path,
    ext='CT1.nii.gz',
    safe_mode=True
)

[SAFE MODE] Would delete: ./datasets/LUMIERE/Patient-001/week-000-1/CT1.nii.gz
[SAFE MODE] Would delete: ./datasets/LUMIERE/Patient-001/week-000-2/CT1.nii.gz
[SAFE MODE] Would delete: ./datasets/LUMIERE/Patient-001/week-044/CT1.nii.gz
....
....
[SAFE MODE] Would delete: ./datasets/LUMIERE/Patient-091/week-036/CT1.nii.gz
[SAFE MODE] Would delete: ./datasets/LUMIERE/Patient-091/week-043/CT1.nii.gz

First, we use the delete_files_by_extension function in safe mode to preview the files that will be deleted. After confirming that these are indeed the unnecessary files, we proceed to remove them.

sequences_to_delete = ['CT1.nii.gz', 'FLAIR.nii.gz', 'T1.nii.gz', 'T2.nii.gz']
for seq in sequences_to_delete:
    delete_files_by_extension(
        root_dir=root_data_path,
        ext=seq,
        safe_mode=False
    )

After deletion, we can verify that the unnecessary files have been removed from the timepoint folders.

print(list_dirs(f"{root_data_path}Patient-091/week-000"))
print(list_files(f"{root_data_path}Patient-091/week-000"))

Additionally, we clean up other unnecessary files like 'brain_mask.nii.gz' and 'measured_volumes_in_mm3.json'.

files_to_delete = ['.json', 'brain_mask.nii.gz']
for file in files_to_delete:
    delete_files_by_extension(
        root_dir=root_data_path,
        ext=file,
        safe_mode=False
    )

3. Folders cleaning

In this step, we remove unnecessary folders that contain irrelevant data, such as "HD-GLIO-AUTO-segmentation" and "DeepBraTumIA-segmentation/atlas/". This helps us further simplify the structure, keeping only the essential files and folders for our analysis.

delete_folders_by_pattern(
    root_dir=root_data_path,
    pattern="HD-GLIO",
    safe_mode=True
)

[SAFE MODE] Would delete: ./datasets/LUMIERE/Patient-088/week-000-2/HD-GLIO-AUTO-segmentation
[SAFE MODE] Would delete: ./datasets/LUMIERE/Patient-088/week-000-1/HD-GLIO-AUTO-segmentation
....
[SAFE MODE] Would delete: ./datasets/LUMIERE/Patient-078/week-119/HD-GLIO-AUTO-segmentation

delete_folders_by_pattern(
    root_dir=root_data_path,
    pattern="HD-GLIO",
    safe_mode=False
)

delete_folders_by_pattern(
    root_dir=root_data_path,
    pattern="native",
    safe_mode=False
)

After cleaning the folders, we verify that the unnecessary directories have been removed.

# No unnecessary files in the folders.
print(list_files(f"{root_data_path}Patient-091/week-000/"))
print(list_files(f"{root_data_path}Patient-091/week-000/DeepBraTumIA-segmentation/"))
print(list_files(f"{root_data_path}Patient-091/week-000/DeepBraTumIA-segmentation/atlas"))

# No unnecessary directories in the folders.
print(list_dirs(f"{root_data_path}Patient-091/week-000/"))
print(list_dirs(f"{root_data_path}Patient-091/week-000/DeepBraTumIA-segmentation/"))
print(list_dirs(f"{root_data_path}Patient-091/week-000/DeepBraTumIA-segmentation/atlas"))

[]
[]
[]
['DeepBraTumIA-segmentation']
['atlas']
['segmentation', 'skull_strip']

4. Files organization

Now that we have cleaned up the unnecessary files and folders, it’s time to organize the remaining files. Currently, they are nested in deep directory structures, and we need to move them to the parent folders for easier access.

In this section, we will use the move_files_to_parent function to simplify the file structure by moving the necessary files to their corresponding parent directories.

move_files_to_parent(
    root_dir=root_data_path,
    levels_up=3,
    ext=None,
    safe_mode=False
)

This will move the files up from their current deep folder structure to the appropriate parent folders. Let's verify that everything has been moved correctly.

print(list_files(f"{root_data_path}Patient-091/week-000/DeepBraTumIA-segmentation/atlas/skull_strip"))
print(list_files(f"{root_data_path}Patient-091/week-000/DeepBraTumIA-segmentation/atlas/segmentation"))
print(list_files(f"{root_data_path}Patient-091/week-000/DeepBraTumIA-segmentation/atlas/"))
print(list_files(f"{root_data_path}Patient-091/week-000/DeepBraTumIA-segmentation/"))
print(list_files(f"{root_data_path}Patient-091/week-000/"))
print(list_files(f"{root_data_path}Patient-002/week-047/"))

[]
[]
[]
[]
['ct1_skull_strip.nii.gz', 'flair_skull_strip.nii.gz', 'seg_mask.nii.gz', 't1_skull_strip.nii.gz', 't2_skull_strip.nii.gz']
['ct1_skull_strip.nii.gz', 'flair_skull_strip.nii.gz', 'seg_mask.nii.gz', 't1_skull_strip.nii.gz', 't2_skull_strip.nii.gz']

We also remove any remaining directories we no longer need.

delete_folders_by_pattern(
    root_dir=root_data_path,
    pattern="DeepBraTumIA-segmentation",
    safe_mode=False
)

[]
['ct1_skull_strip.nii.gz', 'flair_skull_strip.nii.gz', 'seg_mask.nii.gz', 't1_skull_strip.nii.gz', 't2_skull_strip.nii.gz']

5. Folders organization

At this point, we need to organize the folders further. To align with the expected structure for AUDIT, we will move the timepoint folders to the root level. This way, each subject and their respective timepoints will be placed in a well-organized directory.

The function organize_subfolders_into_named_folders will help us organize subdirectories by joining their parent and child folder names. The "join" argument defines the string used to concatenate the parent folder and the child folder. Check the documentation for more detailed information.

organize_subfolders_into_named_folders(
    root_dir=root_data_path,
    join_char="-",
    safe_mode=True
)

[SAFE MODE] Would move: ./datasets/LUMIERE/Patient-001/week-000-2/t2_skull_strip.nii.gz -> ./datasets/LUMIERE/Patient-001-week-000-2/t2_skull_strip.nii.gz
[SAFE MODE] Would move: ./datasets/LUMIERE/Patient-001/week-000-2/ct1_skull_strip.nii.gz -> ./datasets/LUMIERE/Patient-001-week-000-2/ct1_skull_strip.nii.gz
....
[SAFE MODE] Would move: ./datasets/LUMIERE/Patient-091/week-014/flair_skull_strip.nii.gz -> ./datasets/LUMIERE/Patient-091-week-014/flair_skull_strip.nii.gz

Now, we will move the subfolders into the root directory and rename them as necessary.

organize_subfolders_into_named_folders(
    root_dir=root_data_path,
    join_char="-",
    safe_mode=False
)

print(list_dirs(root_data_path)[:6])

['Patient-001-week-000-1', 'Patient-001-week-000-2', 'Patient-001-week-044', 'Patient-001-week-056', 'Patient-002-week-000', 'Patient-002-week-003']

As mentioned earlier, for simplicity, we will only keep the timepoints that do not have a "-N" suffix at the end of the corresponding week.

pattern_to_delete = r"^Patient-\d{3}-week-\d{3}-\d"

delete_folders_by_pattern(
    root_dir=root_data_path,
    pattern=pattern_to_delete,
    safe_mode=False
)

print(list_dirs(root_data_path)[:6])

['Patient-001-week-044', 'Patient-001-week-056', 'Patient-002-week-000', 'Patient-002-week-003', 'Patient-002-week-021', 'Patient-002-week-037']

6. Sequences name standardization

Finally, to follow a more standardized naming convention, such as the one used in the BraTS dataset, we will rename the sequences and the segmentation to follow a similar pattern. Typically, MRI sequences are named t1, t2, t1ce, and flair, and the segmentation is named seg. However, the names we currently have do not follow this convention. Let's use rename_files to modify them.

print(list_files(f"{root_data_path}Patient-001-week-044"))

['ct1_skull_strip.nii.gz', 'flair_skull_strip.nii.gz', 'seg_mask.nii.gz', 't1_skull_strip.nii.gz', 't2_skull_strip.nii.gz']

old_names = ['ct1_skull_strip.nii.gz', 'flair_skull_strip.nii.gz', 'seg_mask.nii.gz', 't1_skull_strip.nii.gz', 't2_skull_strip.nii.gz']
new_names = ['t1ce.nii.gz', 'flair.nii.gz', 'seg.nii.gz', 't1.nii.gz', 't2.nii.gz']

for o, n in zip(old_names, new_names):
    rename_files(
        root_dir=root_data_path,
        old_name=o,
        new_name=n,
        safe_mode=True
    )

[SAFE MODE] Would rename: ./datasets/LUMIERE/Patient-012-week-016/ct1_skull_strip.nii.gz -> ./datasets/LUMIERE/Patient-012-week-016/t1ce.nii.gz
[SAFE MODE] Would rename: ./datasets/LUMIERE/Patient-023-week-001/ct1_skull_strip.nii.gz -> ./datasets/LUMIERE/Patient-023-week-001/t1ce.nii.gz
....
[SAFE MODE] Would rename: ./datasets/LUMIERE/Patient-077-week-083/t2_skull_strip.nii.gz -> ./datasets/LUMIERE/Patient-077-week-083/t2.nii.gz

for o, n in zip(old_names, new_names):
    rename_files(
        root_dir=root_data_path,
        old_name=o,
        new_name=n,
        safe_mode=False
    )

Additionally, to allow AUDIT to locate each image simply by the subject ID, we will name each image with the corresponding subject identifier along with the sequence name. To do this, we will use the add_string_filenames function, which allows us to add both suffixes and prefixes to specific files.

for subject in list_dirs(root_data_path):
    add_string_filenames(
        root_dir=os.path.join(root_data_path, subject),
        prefix=f"{subject}_",
        ext=None,
        safe_mode=False
    )

With this, we would have organized the project as required to work with AUDIT. Additionally, we recommend that the images (sequences and segmentations provided by the medical experts) be placed in a directory called DATASET_images, so that the segmentations from each model are contained in the DATASET_seg directory. Therefore, to conclude, we'll rename the LUMIERE directory to LUMIERE_images.

rename_directories(
    root_dir="./datasets/",
    old_name="LUMIERE",
    new_name="LUMIERE_images",
    safe_mode=False
)

Conclusion

By following these steps, we have successfully cleaned, organized, and standardized the LUMIERE dataset, making it ready for further analysis. The AUDIT library has provided a powerful toolkit for efficiently preprocessing and structuring the data.