Simplifying Machine Learning Workflow with YAML Files

Apr 4, 2024

Use YAML files to store and manage configuration settings for your Machine Learning models, promote code reusability, manage MLOps pipelines, and more. In this blog, we will explore the role of YAML files in Machine Learning projects.

Outline

Some Background Story
YAML and Its Syntax
Reading and Creating a YAML file in Python
Advantages of using YAML files in Machine Learning Projects
Conclusion

Some Background Story

When I started working on Machine Learning projects ~5 years back, I thought of YAML as some Cloud DevOps nonsense. As you know, Machine Learning practitioners and DevOps don’t usually go well together, as we like to focus on the more “cool” part, i.e., ML.

However, as I started working on larger projects that involved hundreds of experiments every month, I began to appreciate YAML for experimentation more and more.

Given the NDA constraints of my work, I can’t share much about exactly how I used it, but I can say that it made my life significantly easier.

The use of YAML in ML projects is not widely discussed, so I’m writing about it because it deserves more recognition.

⚠️ Please note: if you’re already familiar with YAML and its syntax, navigate to the 👉🏼 “Advantages of using YAML files in Machine Learning Projects” section for the real deal :)

YAML and Its Syntax

YAML (YAML Ain’t Markup Language) is a human-readable data serialization format used for structuring and representing data. It was designed to be easily readable and writable by humans, while still being machine-readable. YAML is commonly used for configuration files, data exchange, and data serialization in various applications, including Machine Learning projects.

YAML Syntax

The structure of a YAML file is based on key-value pairs and indentation. It supports diverse data structures such as scalars (strings, numbers, booleans), lists (arrays), dictionaries (maps), and nested structures. Below is an example YAML document for reference:

---
# Example YAML file
# This is a comment

person: &person_anchor  # anchor for 'person' mapping
  name: John Doe
  age: 35
  address:
    street: 123 Main St.
    city: Anytown
    state: CA
    zip: '12345'
  interests:
    - reading  # sequence (list)
    - hiking
    - traveling

# Using an anchor reference
another_person: *person_anchor

multi_line_string: |  # multi-line string
  This is a
  multi-line
  string.

# Scalar examples
integer_scalar: 42
float_scalar: 3.14
boolean_scalar: true
null_scalar: null

---
# Another YAML document
# This is a separate document

Basic details on elements of the YAML document shared above:

Mappings (Dictionaries): In YAML, mappings represent dictionaries or key-value pairs. Mappings can be nested, as with “address” inside “person” above.
Sequences (Lists): Sequences represent lists or arrays in YAML.
Scalars: Scalars are single values like strings, numbers, booleans, and null.
Anchors and Aliases: YAML supports anchors (“&name”) and aliases (“*name”), which allow you to reuse and reference values across the document.
Multi-line Strings: YAML can represent multi-line strings using the pipe “|” character.
Comments: YAML supports comments, which start with the “#” symbol.
Document Separators: Multiple YAML documents in one file are separated by “---” at the beginning of each document.
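To see how anchors and aliases behave once parsed, here is a minimal sketch using PyYAML (the library used later in this post). The `defaults` and `experiment_*` keys are made up for illustration, and the `<<` merge key is a YAML 1.1 convenience that PyYAML supports; other parsers may not.

```python
import yaml

doc = """
defaults: &defaults
  optimizer: adam
  learning_rate: 0.001

# Plain alias: reuses the anchored mapping as-is
experiment_a: *defaults

# Merge key: copies the anchored keys, then overrides one of them
experiment_b:
  <<: *defaults
  learning_rate: 0.01
"""

cfg = yaml.safe_load(doc)

# With PyYAML, an alias resolves to the very same Python object as its anchor
print(cfg["experiment_a"] is cfg["defaults"])
print(cfg["experiment_b"])
```

Because the alias shares one object with its anchor, mutating `cfg["experiment_a"]` in Python also changes `cfg["defaults"]`; copy it first if you need independent values.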

For a better understanding of YAML and its syntax, I strongly suggest reading this blog:

Learn YAML in five minutes!

This quick read will teach you the basics of YAML markup language in the time it takes to make a cup of tea :)

www.codeproject.com

Reading and Creating a YAML file in Python

Reading a YAML file:

To read and work with YAML files in Python, we can use the PyYAML library. The example file above contains two documents separated by “---”, so we use yaml.safe_load_all, which yields one Python object per document (yaml.safe_load accepts only a single document and raises an error on a multi-document file). Here is how to convert the previous YAML file content into Python objects using PyYAML:

import yaml

# Read the YAML file; safe_load_all returns a generator, one item per document
with open('example.yaml', 'r') as file:
    data = list(yaml.safe_load_all(file))

# Access the data
print(data)

The resulting output will look like this (reformatted here for readability). The second document contains only comments, so it is parsed as None:

[
    {
        'person': {
            'name': 'John Doe',
            'age': 35,
            'address': {
                'street': '123 Main St.',
                'city': 'Anytown',
                'state': 'CA',
                'zip': '12345'
            },
            'interests': [
                'reading',
                'hiking',
                'traveling'
            ]
        },
        'another_person': {
            'name': 'John Doe',
            'age': 35,
            'address': {
                'street': '123 Main St.',
                'city': 'Anytown',
                'state': 'CA',
                'zip': '12345'
            },
            'interests': [
                'reading',
                'hiking',
                'traveling'
            ]
        },
        'multi_line_string': 'This is a\nmulti-line\nstring.\n',
        'integer_scalar': 42,
        'float_scalar': 3.14,
        'boolean_scalar': True,
        'null_scalar': None
    },
    None
]

Creating a YAML file is a breeze! 😃

Here’s how to do it:

import yaml

# Simply input your data either in Python dict format or read from JSON.
data = {
  "model": {
    "type": "cnn",
    "filters": 32,
    "kernel_size": 3
  }
}
# Create a YAML file
with open("config.yaml", "w") as file:
    yaml.dump(data, file, default_flow_style=False)

The resulting “config.yaml” file contains the following (note that yaml.dump sorts keys alphabetically by default):

model:
  filters: 32
  kernel_size: 3
  type: cnn
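If you would rather keep the keys in the order you wrote them, PyYAML (5.1 and later) accepts `sort_keys=False`. The sketch below dumps to a string instead of a file so the result is easy to inspect, then loads it back to confirm the round trip; it reuses the same illustrative `data` dictionary as above.

```python
import yaml

data = {"model": {"type": "cnn", "filters": 32, "kernel_size": 3}}

# sort_keys=False keeps the insertion order of the dict (PyYAML >= 5.1)
text = yaml.safe_dump(data, default_flow_style=False, sort_keys=False)
print(text)

# Loading the dumped text gives back an equal dictionary
restored = yaml.safe_load(text)
print(restored == data)
```

Using `safe_dump` rather than `dump` restricts the output to standard YAML tags, which pairs naturally with `safe_load` on the reading side.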

Advantages of using YAML files in Machine Learning Projects

You might have already come across YAML files being used in different AI repos and packages. For example:

YOLO models use YAML to configure model hyperparameters, classes, data, etc.
Detectron2 by Meta uses YAML to configure model weights, etc.

As a bonus 🙂, in the upcoming fourth part of the PyTorch Fine-Tuning series on this blog, I will share boilerplate code for fine-tuning a vision model. This code will be similar to what I use for running hundreds of experiments monthly. There, I’ll show how YAML files are used for managing model configuration. You can read the remaining parts of this series here:

Ruman

Ultimate Guide to Fine-Tuning in PyTorch

Some key benefits of YAML for Machine Learning:

Model Configuration Management

YAML files can be used to store and manage configuration settings for your Machine Learning models, such as hyperparameters, model architectures, data preprocessing steps, and other configurations. This promotes code reusability and makes it easy to experiment with different configurations.
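One practical caveat when storing hyperparameters: PyYAML follows YAML 1.1 scalar rules, under which a value like `1e-3` (no decimal point) is loaded as a string rather than a float. The sketch below shows the behaviour and a small, hypothetical `require_number` helper that fails early instead of letting a string reach the optimizer; the key names are illustrative only.

```python
import yaml

cfg = yaml.safe_load("""
learning_rate: 1e-3      # no decimal point, so PyYAML reads this as a string
weight_decay: 1.0e-4     # decimal point and signed exponent, so this is a float
""")

def require_number(config, key):
    """Return config[key] as a float, failing loudly on any non-numeric value."""
    value = config[key]
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(
            f"{key!r} should be a number, got {type(value).__name__}: {value!r}"
        )
    return float(value)

print(type(cfg["learning_rate"]).__name__, type(cfg["weight_decay"]).__name__)
print(require_number(cfg, "weight_decay"))
```

Writing such values as `0.001` or `1.0e-3` in the YAML file avoids the problem altogether.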

Example 1: Defining Model Architecture

You can define the model architectures in YAML files, specifying the layers, activation functions, and other parameters. This promotes code reusability and allows you to experiment with different architectures without modifying your core training code.

# model_architecture.yaml
model:
  name: ResNet50
  architecture:
    - layer:
        type: Conv2D
        filters: 64
        kernel_size: 7
        stride: 2
        activation: relu
        padding: same
    - layer:
        type: MaxPooling2D
        pool_size: 3
        stride: 2
    - layer:
        type: ResidualBlock
        filters: 64
        blocks: 3
    - layer:
        type: ResidualBlock
        filters: 128
        blocks: 4
        stride: 2
    - layer:
        type: ResidualBlock
        filters: 256
        blocks: 6
        stride: 2
    - layer:
        type: ResidualBlock
        filters: 512
        blocks: 3
        stride: 2
    - layer:
        type: GlobalAveragePooling2D
    - layer:
        type: Dense
        units: 1000
        activation: softmax

In the above example, we have defined the model architecture for the ResNet50 convolutional neural network using a YAML file. The architecture is represented as a list of layers, with each layer specifying its type (e.g., Conv2D, MaxPooling2D, ResidualBlock), its parameters (e.g., filters, kernel_size, stride), and, where applicable, its activation function.

Example 2: Using YAML for Model Hyperparameters

Say you’re fine-tuning a vision model for image classification, and you want to experiment with different hyperparameters such as the learning rate, batch size, and the number of filters in the convolutional layers. You can create a separate YAML file for each experiment configuration:
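How the training code consumes such a file is up to you. One common pattern, sketched here under assumptions, is a registry that maps each `type` string to a builder function. The `LAYER_BUILDERS` registry and `build_model` function are hypothetical, the YAML is a shortened version of the file above, and the builders return plain descriptions where real code would construct framework layers.

```python
import yaml

ARCH_YAML = """
model:
  name: ResNet50
  architecture:
    - layer:
        type: Conv2D
        filters: 64
        kernel_size: 7
    - layer:
        type: ResidualBlock
        filters: 64
        blocks: 3
    - layer:
        type: Dense
        units: 1000
"""

# Hypothetical registry: maps each `type` string to a builder function.
# The builders return descriptions here; real ones would return framework layers.
LAYER_BUILDERS = {
    "Conv2D": lambda p: f"Conv2D({p['filters']} filters, {p['kernel_size']}x{p['kernel_size']})",
    "ResidualBlock": lambda p: f"{p['blocks']} x ResidualBlock({p['filters']} filters)",
    "Dense": lambda p: f"Dense({p['units']} units)",
}

def build_model(config):
    """Turn the `architecture` list into a list of built layers."""
    layers = []
    for entry in config["model"]["architecture"]:
        spec = dict(entry["layer"])  # copy so the loaded config stays untouched
        layer_type = spec.pop("type")
        if layer_type not in LAYER_BUILDERS:
            raise ValueError(f"Unknown layer type: {layer_type}")
        layers.append(LAYER_BUILDERS[layer_type](spec))
    return layers

layers = build_model(yaml.safe_load(ARCH_YAML))
for layer in layers:
    print(layer)
```

Adding a new layer type then means registering one more builder, while the YAML file and the loop stay unchanged.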

# model_hyperparameters.yaml
model:
  name: VGG16
  hyperparameters:
    optimizer:
      type: adam
      learning_rate: 0.001
      beta_1: 0.9
      beta_2: 0.999
    loss: categorical_crossentropy
    metrics:
      - accuracy
      - top_k_categorical_accuracy

training:
  batch_size: 32
  epochs: 100
  early_stopping:
    monitor: val_loss
    patience: 10
    restore_best_weights: true

data:
  augmentation:
    horizontal_flip: true
    rotation_range: 20
    width_shift_range: 0.2
    height_shift_range: 0.2
    shear_range: 0.2
    zoom_range: 0.2

In the above example, we have defined the hyperparameters for a VGG16 model using a YAML file. The hyperparameters include optimizer settings (e.g., Adam with specific beta values), loss function, evaluation metrics, batch size, number of epochs, and early stopping criteria.

Additionally, we included the data augmentation techniques, such as horizontal flipping, rotation, shifting, shearing, and zooming, which can be applied during the training process.
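When running many experiments, it can help to keep one base file and describe each experiment only by what it changes. The following is a minimal sketch of that idea, not the exact setup from my projects: `deep_update`, the `sweep` dictionary, and the experiment names are all hypothetical, and the base YAML is a trimmed version of the file above.

```python
import copy
import yaml

BASE_YAML = """
model:
  name: VGG16
  hyperparameters:
    optimizer:
      type: adam
      learning_rate: 0.001
training:
  batch_size: 32
  epochs: 100
"""

def deep_update(base, overrides):
    """Recursively merge `overrides` into a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = value
    return merged

base = yaml.safe_load(BASE_YAML)

# Hypothetical sweep: each experiment changes only what differs from the base
sweep = {
    "exp_lr_high": {"model": {"hyperparameters": {"optimizer": {"learning_rate": 0.01}}}},
    "exp_small_batch": {"training": {"batch_size": 16}},
}

experiments = {name: deep_update(base, overrides) for name, overrides in sweep.items()}

for name, cfg in experiments.items():
    # Each string could be written to its own file, e.g. f"{name}.yaml"
    print(f"# {name}.yaml")
    print(yaml.safe_dump(cfg, sort_keys=False))
```

Saving the fully merged file alongside each run's results makes it possible to reproduce an experiment later without remembering which overrides were applied.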

Example 3: Data Preprocessing Configurations

YAML can also be used to manage the configurations for data preprocessing steps, such as feature scaling, text preprocessing, image augmentation, etc. This keeps your preprocessing logic separate from your model training code, making it easier to experiment with different preprocessing strategies.

# data_preprocessing.yaml
data:
  dataset: CIFAR-10
  preprocessing:
    image_processing:
      resize:
        height: 32
        width: 32
      normalization:
        mean: [0.4914, 0.4822, 0.4465]
        std: [0.2023, 0.1994, 0.2010]
    text_processing:
      tokenizer:
        type: word_level
      vectorizer:
        type: tfidf
        max_features: 10000
    tabular_processing:
      numerical_features:
        - age
        - income
        - credit_score
      categorical_features:
        - education
        - marital_status
      encoding:
        numerical: standard
        categorical: one_hot

In the above example, we define the data preprocessing configurations for different data types: images, text, and tabular data. The YAML file specifies the dataset name (CIFAR-10) and includes separate preprocessing steps for each data type.

For image processing, we have defined resizing and normalization operations. For text processing, we have defined the tokenization technique (word-level) and vectorization method (TF-IDF with a maximum of 10,000 features). For tabular data processing, we have listed the numerical and categorical features, and defined the encoding methods for each (standard scaling for numerical features and one-hot encoding for categorical features).
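As a small illustration of how preprocessing code might read these values, the sketch below pulls the normalization statistics out of the config and applies them to a single RGB pixel in pure Python. The `normalize_pixel` helper is hypothetical; in practice the same `mean` and `std` lists would typically be passed to your framework's normalization transform.

```python
import yaml

PREPROCESS_YAML = """
data:
  preprocessing:
    image_processing:
      normalization:
        mean: [0.4914, 0.4822, 0.4465]
        std: [0.2023, 0.1994, 0.2010]
"""

def normalize_pixel(pixel, mean, std):
    """Apply per-channel (value - mean) / std to one RGB pixel scaled to [0, 1]."""
    return [(value - m) / s for value, m, s in zip(pixel, mean, std)]

cfg = yaml.safe_load(PREPROCESS_YAML)
norm = cfg["data"]["preprocessing"]["image_processing"]["normalization"]

# A pixel equal to the channel means normalizes to zero in every channel
print(normalize_pixel(norm["mean"], norm["mean"], norm["std"]))
```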

MLOps Pipelines

YAML is commonly used in MLOps pipelines for defining and orchestrating the various steps involved in the ML lifecycle, such as data ingestion, preprocessing, model training, evaluation, and deployment.

In a production ML system for predictive maintenance, you can define the entire pipeline in a YAML file, including steps for data ingestion from various sources, model training with specific algorithms, evaluation metrics, and deployment strategies.

Example 1: MLOps Pipeline for Deep Learning

Let’s assume you have a vision project where you need to train and deploy a YOLO model for object detection. You can define the entire MLOps pipeline using YAML, including data processing, model training, evaluation, and deployment stages.

# mlops_pipeline.yaml
data_processing:
  image_preprocessing:
    resize:
      width: 224
      height: 224
    normalization: true

model:
  type: cnn
  architecture:
    - layer: Conv2D
      filters: 32
      kernel_size: 3
      activation: relu
    - layer: MaxPooling2D
      pool_size: 2
    # ... (additional layers)

training:
  optimizer:
    type: adam
    learning_rate: 0.001
  loss: categorical_crossentropy
  metrics:
    - accuracy
  epochs: 50
  batch_size: 32

evaluation:
  metrics:
    - precision
    - recall
    - f1-score

deployment:
  platform: kubernetes
  resources:
    cpu: 2
    memory: 4Gi
  ingress:
    host: object-detection.example.com

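A YAML definition like this can serve as the single source of truth for the pipeline. Note that MLOps tools such as Kubeflow, AWS SageMaker Pipelines, and Google Vertex AI each define their own pipeline specification formats, so a custom file like the one above needs to be mapped to the chosen tool’s format, or read by your own pipeline code, before the tool can provision the necessary resources and execute each stage of the pipeline.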
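To make the idea of executing stages from a YAML file concrete without depending on any particular orchestrator, here is a minimal, hypothetical stage runner. It is not how Kubeflow, SageMaker, or Vertex AI work internally; the `STAGES` registry and the `run_*` handlers are assumptions for illustration, and the YAML is a trimmed version of the file above.

```python
import yaml

PIPELINE_YAML = """
data_processing:
  image_preprocessing:
    resize: {width: 224, height: 224}
training:
  epochs: 50
  batch_size: 32
evaluation:
  metrics: [precision, recall]
"""

# Hypothetical stage handlers; each receives only its own section of the config
def run_data_processing(cfg):
    size = cfg["image_preprocessing"]["resize"]
    return f"resize to {size['width']}x{size['height']}"

def run_training(cfg):
    return f"train for {cfg['epochs']} epochs, batch size {cfg['batch_size']}"

def run_evaluation(cfg):
    return "evaluate " + ", ".join(cfg["metrics"])

STAGES = {
    "data_processing": run_data_processing,
    "training": run_training,
    "evaluation": run_evaluation,
}

def run_pipeline(config):
    """Run every stage in the order it appears in the YAML file."""
    log = []
    for stage_name, stage_cfg in config.items():
        handler = STAGES.get(stage_name)
        if handler is None:
            raise KeyError(f"No handler registered for stage {stage_name!r}")
        log.append(f"{stage_name}: {handler(stage_cfg)}")
    return log

log = run_pipeline(yaml.safe_load(PIPELINE_YAML))
print("\n".join(log))
```

Raising on an unknown stage name is a deliberate choice here: a typo in the YAML file stops the run immediately instead of silently skipping a stage.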

Human-Readable and Editable

YAML files are human-readable and easily editable, making it easier for data scientists and engineers to collaborate and share configurations or data structures without needing to modify code directly.

Language-Agnostic

YAML is a language-agnostic format, which means that data and configurations stored in YAML can be easily shared and consumed across different programming languages and frameworks used in your Machine Learning project.
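A quick way to see this in practice: once loaded, a YAML config is just plain dictionaries, lists, strings, numbers, and booleans, so it can be re-serialized as JSON for a service written in another language. The sketch below uses only PyYAML and the standard library; the config keys are borrowed from the hyperparameter example above.

```python
import json
import yaml

cfg = yaml.safe_load("""
training:
  batch_size: 32
  early_stopping:
    monitor: val_loss
    restore_best_weights: true
""")

# The parsed result contains only plain Python types,
# so the standard json module can serialize it directly
as_json = json.dumps(cfg, indent=2)
print(as_json)
```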

Conclusion

YAML is a powerful and versatile data serialization format that can greatly benefit Machine Learning projects.

Its human-readable syntax, hierarchical structure, and language-agnostic nature make it an ideal choice for managing configurations, serializing data, and orchestrating MLOps pipelines.

By leveraging YAML, you can streamline your Machine Learning projects, promote code reusability, and collaborate more effectively.

To conclude,

There are numerous ways in which ML engineers are leveraging YAML in their projects, from training to deployment. It ultimately boils down to how you optimize your ML project workflow using a YAML file.

If you enjoyed this article, your applause would be greatly appreciated!