How to Build Lightweight Data Pipeline and MLOps Server in On-premises Environment
Streamlined Airflow & MLflow Operations, Enhanced by PostgreSQL. Leverage Feast’s Online Feature Store with Redis, and Tap into Kafka. Monitor with NannyML and Process data with Polars.
🌟 From Data Science to MLOps: Navigating the On-Premises Frontier
As the sun sets on my days as a data scientist, I find myself standing at the crossroads of transformation. The journey from feature engineering to model deployment has led me to a new horizon: MLOps. In this article, I’ll unravel the intricacies of building a lightweight data pipeline and MLOps server within the confines of an on-premises environment. Buckle up—we’re about to embark on a voyage where code meets infrastructure, and insights collide with reality. 🚀🔍
P.S: I was motivated to share this post because I have experienced the transition from a Data Science (DS) to a Machine Learning (ML) role, which required hands-on MLOps work in an on-premises environment.
Ready to embark on this transformative journey? Let’s dive in! 🌟
Unlock the Power of On-Premises AI/ML Deployment with Our Cutting-Edge Template
Embark on a journey of seamless data management and model deployment with our meticulously crafted template, tailored for organizations that demand efficiency and precision. This solution is the perfect fit for handling lightweight batch data processing, streaming data, and deploying a select suite of models in an on-premises setting.
At the heart of our environment lies a robust trio: Airflow, MLflow, and Feast, each anchored by a PostgreSQL database, ensuring unparalleled data integrity. Our online feature store taps into the speed of Redis, while the Feast push server and Apache Kafka support cater to your streaming ingestion needs. This infrastructure is not just about managing workflows; it’s about empowering your data pipeline and elevating model tracking to new heights.
For those who venture into the realm of big data, our template is designed to grow with you. Integrate powerhouse technologies like Kubernetes, Kubeflow, Hadoop, Hive, and Apache Spark to unlock distributed computing capabilities that will scale your data processing to meet the demands of the big data challenge.
The diagram below depicts the overall view of our MLOps environment:
Building a Robust MLOps Environment with Docker Compose
Embark on the creation of a state-of-the-art MLOps system that is both versatile and robust. Our journey begins with our Docker Compose file, which outlines a suite of services designed to streamline your data operations.
Database Service: The Foundation
Our system kicks off with a PostgreSQL database service, constructed from a custom Dockerfile in the ./docker/postgres directory. It is configured to restart automatically, and its logs are rotated (JSON driver, five files of 10 MB each) to keep disk usage in check. The service is exposed on port 32769 and includes volume mappings for persistent data storage and initialization scripts. Crucial environment variables define the user, the password, and multiple databases (mlops_db, mlflow_db, feast_db) to cater to our diverse data needs. A health check ensures the database is ready to handle queries before other services start.
######################################################
# DATABASE SERVICE
######################################################
postgres:
  build: './docker/postgres'
  restart: always
  container_name: postgres
  logging:
    driver: "json-file"
    options:
      max-file: "5"
      max-size: "10m"
  ports:
    - "32769:5432"
  volumes:
    - ./mnt/postgres/pgdata:/var/lib/postgresql/data/pgdata
    - ./docker/postgres/entrypoint_initdb:/docker-entrypoint-initdb.d
  environment:
    - POSTGRES_USER=mlops
    - POSTGRES_PASSWORD=mlops
    - POSTGRES_MULTIPLE_DATABASES=mlops_db, mlflow_db, feast_db
    - PGDATA=/var/lib/postgresql/data/pgdata
  healthcheck:
    test: [ "CMD", "pg_isready", "-q", "-d", "mlops_db", "-U", "mlops" ]
    timeout: 45s
    interval: 10s
    retries: 10
POSTGRES_MULTIPLE_DATABASES is not a standard PostgreSQL environment variable; it is a custom variable read by our initialization script, and it is what makes multiple-database creation from a single compose definition possible. By mounting the entrypoint_initdb scripts into /docker-entrypoint-initdb.d, the script runs automatically when the container launches and creates one database for each name listed in this variable.
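To confirm the initialization worked, you can connect to the published port and list the databases. Below is a minimal sketch using psycopg2; the host port 32769 and the credentials come from the compose file above, and everything else is illustrative.

import psycopg2

# Connect to the PostgreSQL service that docker-compose publishes on the host.
conn = psycopg2.connect(
    host="localhost",
    port=32769,
    dbname="mlops_db",
    user="mlops",
    password="mlops",
)

with conn, conn.cursor() as cur:
    # List non-template databases; we expect mlops_db, mlflow_db and feast_db.
    cur.execute("SELECT datname FROM pg_database WHERE datistemplate = false;")
    print([row[0] for row in cur.fetchall()])

conn.close()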
Adminer: Database Management Made Simple
Adminer, a lightweight database management tool, is set up to connect to our PostgreSQL service. It provides a web interface, accessible on port 32767, for easy database operations.
adminer:
  image: wodby/adminer:latest
  restart: always
  container_name: adminer
  logging:
    driver: "json-file"
    options:
      max-file: "5"
      max-size: "10m"
  ports:
    - "32767:9000"
  environment:
    - ADMINER_DEFAULT_DB_DRIVER=psql
    - ADMINER_DEFAULT_DB_HOST=postgres
    - ADMINER_DEFAULT_DB_NAME=mlops_db
  healthcheck:
    test: [ "CMD", "nc", "-z", "adminer", "9000" ]
    timeout: 45s
    interval: 10s
    retries: 10
Open a browser and go to http://localhost:32767 to launch the UI whenever you need it.
Airflow: Orchestrating Data Workflows
The Airflow service, built from ./docker/airflow, orchestrates our data pipelines. It depends on the health of the PostgreSQL service and mounts several volumes for configuration, DAGs, logs, plugins, and projects. Airflow’s webserver is available on port 8080, and a health check ensures it is ready to manage our workflows.
######################################################
# AIRFLOW
######################################################
airflow:
  build: ./docker/airflow
  restart: always
  container_name: airflow
  depends_on:
    postgres:
      condition: service_healthy
  volumes:
    - ./mnt/airflow/airflow.cfg:/opt/airflow/airflow.cfg
    - ./mnt/airflow/dags:/opt/airflow/dags
    - ./mnt/airflow/logs:/opt/airflow/logs
    - ./mnt/airflow/plugins:/opt/airflow/plugins
    - ./mnt/airflow/feast:/opt/airflow/projects
  ports:
    - 8080:8080
  healthcheck:
    test: [ "CMD", "nc", "-z", "airflow", "8080" ]
    timeout: 45s
    interval: 10s
    retries: 10
The mounted directory structure looks like this:
airflow
│ airflow.cfg
│
├───dags
│ README.md
│ sample_feast_dag.py
│ sample_mlflow_dag.py
│ sample_postgres_dag.py
│
├───files
│
├───nannyml
│
├───scripts
│
├───sql
│ birth_date.sql
│ pet_schema.sql
│ populate_pet_table.sql
│
├───feast
│
├───logs
│
└───plugins
The key to harnessing the full potential of Airflow lies in understanding its directory structure—a well-organized framework that ensures every piece of your data pipeline is exactly where it needs to be. Let’s dive into the anatomy of an Airflow setup:
Root Directory: The Command Center. At the heart of it all is the airflow directory, your command center. Here you’ll find the airflow.cfg file, the mastermind configuration that dictates how your Airflow environment behaves.
DAGs Folder: The Blueprint Collection. Within the dags folder, creativity meets execution. It’s home to your DAGs (Directed Acyclic Graphs), the blueprints of your data workflows; a minimal example DAG is sketched right after this list. This folder includes:
sample_feast_dag.py: A sample DAG demonstrating Feast integration.
sample_mlflow_dag.py: An example DAG for MLflow operations, training a scikit-learn/CatBoost model on the Iris dataset.
sample_postgres_dag.py: A template DAG for PostgreSQL tasks.
Files Folder: The Resource Library. The files directory stands ready to store any additional resources your DAGs might need, acting as a centralized place for intermediate data and files.
NannyML Folder: This folder is intended to store NannyML outputs and plots used to monitor the performance and health of your ML models.
Scripts Folder: The scripts directory is the place to save any supporting code for your work.
SQL Folder: The Query Vault. The sql folder is the query vault for Airflow DAGs, containing SQL scripts such as:
birth_date.sql: A script that queries the database by birth date.
pet_schema.sql: The schema definition for a pet database.
populate_pet_table.sql: A script to populate the pet table with data.
Feast Folder: The Feature Feast. The feast directory is where you’ll save Feast work produced from Airflow, an essential feature store for your ML models, ensuring they’re always fed with the freshest data.
Logs Folder: The Chronicle. The logs folder is the chronicle of your Airflow journey, capturing every step, every task, and every run, providing you with a comprehensive history of your operations.
Plugins Folder: The Customization Corner. Last but not least, the plugins directory is the customization corner, allowing you to extend Airflow’s capabilities with custom plugins tailored to your needs.
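As promised in the DAGs item above, here is a minimal sketch of what a DAG living in the dags folder might look like. The DAG id, schedule, and task body are placeholders rather than code from the repository; the real sample DAGs talk to PostgreSQL, Feast, and MLflow.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_pipeline_step():
    # Placeholder task body: in the real sample DAGs this is where data is
    # pulled from PostgreSQL, features are pushed to Feast, or runs are logged to MLflow.
    print("running a pipeline step")


with DAG(
    dag_id="sample_pipeline_dag",       # hypothetical DAG id for illustration
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    step = PythonOperator(
        task_id="run_pipeline_step",
        python_callable=run_pipeline_step,
    )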
This structured approach not only brings order to potential chaos but also empowers you to build, manage, and monitor your data workflows with unparalleled efficiency. Embrace the structure, and let Airflow propel your data operations to new heights.
MLflow: Tracking Experiments with Ease
MLflow is our experiment tracking tool, built from ./docker/mlflow. It stores data in a dedicated volume and offers its UI on port 5000. Like Airflow, it waits for PostgreSQL to be healthy before starting.
######################################################
# MLFLOW
######################################################
mlflow:
  build: ./docker/mlflow
  restart: always
  container_name: mlflow
  depends_on:
    postgres:
      condition: service_healthy
  volumes:
    - ./mnt/mlflow:/mlflow
  ports:
    - 5000:5000
The Setup: With a swift update to the package list and installation of essential build dependencies, we pave the way for MLflow and its PostgreSQL companion, psycopg2. This ensures seamless interaction with the PostgreSQL database, setting the stage for robust experiment tracking.
# Base Image
FROM python:3.8-slim-buster
WORKDIR /mlflow
RUN apt-get update && \
apt-get -y install libpq-dev gcc && \
pip install mlflow psycopg2
ENV BACKEND_URI postgresql+psycopg2://mlops:mlops@postgres:5432/mlflow_db
EXPOSE 5000
CMD mlflow server --backend-store-uri $BACKEND_URI --host 0.0.0.0 --port 5000
We leave the --default-artifact-root option at its default, so MLflow artifacts are saved in the container’s mounted folder for data persistence.
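Once the server is up, any client that can reach it can log runs. The snippet below is a minimal sketch: the tracking URI assumes you are calling from the host machine (from another container on the compose network it would be http://mlflow:5000), and the experiment name, parameter, and metric are placeholders.

import mlflow

# Point the MLflow client at the tracking server defined in docker-compose.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("demo_experiment")  # hypothetical experiment name

with mlflow.start_run():
    # Log a parameter and a metric; artifacts land under the default artifact root.
    mlflow.log_param("model_type", "catboost")
    mlflow.log_metric("accuracy", 0.93)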
Feast: Feature Store for Machine Learning
Our feature store is powered by Feast and Redis. Redis, running on the redis:7.2-alpine image, provides a fast in-memory NoSQL data store, while Feast serves the online feature store and push server, facilitating streaming ingestion with links to Redis. Both services have health checks to ensure reliability.
######################################################
# FEAST
######################################################
redis:
  image: redis:7.2-alpine
  restart: always
  volumes:
    - ./mnt/redis:/data
  ports:
    - "6379:6379"
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
    interval: 10s
    timeout: 30s
    retries: 50
    start_period: 30s
feature_server:
  container_name: feast_feature_server
  build: ./docker/feast
  restart: always
  depends_on:
    postgres:
      condition: service_healthy
  ports:
    - "6566:6566"
  links:
    - redis
  healthcheck:
    test: "${DOCKER_HEALTHCHECK_TEST:-curl localhost:6566/health}"
    interval: "5s"
    retries: 5
push_server:
  container_name: feast_push_server
  build: ./docker/feast
  restart: always
  depends_on:
    feature_server:
      condition: service_healthy
    postgres:
      condition: service_healthy
  ports:
    - "6567:6566"
  links:
    - redis
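With the feature server listening on port 6566, online features can be retrieved over plain HTTP. Here is a minimal client sketch; the feature names and entity keys follow the driver-stats example from the Feast workshop and are assumptions about your own feature repository.

import json
import requests

# Query the Feast feature server (port 6566 in docker-compose) for online features.
request_body = {
    "features": ["driver_hourly_stats:conv_rate", "driver_hourly_stats:acc_rate"],
    "entities": {"driver_id": [1001, 1002]},
}

response = requests.post(
    "http://localhost:6566/get-online-features",
    data=json.dumps(request_body),
)
print(response.json())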
The push server you’ve deployed using Docker Compose on port 6567 offers several benefits, particularly in the context of feature management in machine learning operations (MLOps). Here’s why it’s beneficial:
Independent Scaling: By separating the write operations (pushing features) from the read operations (serving features), you can scale each part independently based on their specific load and performance requirements. For instance, if you have a high frequency of feature updates but relatively stable read patterns, you can allocate more resources to the push server without over-provisioning the serving component.
Resource Isolation: This separation also means that intensive write operations won’t interfere with the performance of read operations. It ensures that the feature serving remains fast and responsive, even when there’s a heavy load on the push server.
Simplified Architecture: Pushing features through an HTTP request to the push server simplifies the architecture.
Flexibility in Feature Management: The ability to push features through an HTTP request provides flexibility. You can easily integrate this process with different parts of your MLOps pipeline or even trigger pushes programmatically from various services or applications.
Enhanced Security: Having a dedicated push server allows you to implement security measures specifically for write operations, such as authentication and authorization, without complicating the read-side security.
We can push features to the online store from a Python module with a request like the one below.
requests.post("http://push_server:6567/push", data=json.dumps(push_data))
Kafka: Real-Time Data Streaming
The Kafka ecosystem includes ZooKeeper and a broker, ensuring a robust messaging system for real-time data streaming. The broker service exposes port 9092 for external clients on the host and port 29092 for internal communication within the Docker network.
######################################################
# KAFKA
######################################################
zookeeper:
  image: confluentinc/cp-zookeeper:7.0.1
  container_name: zookeeper
  restart: always
  environment:
    ZOOKEEPER_CLIENT_PORT: 2181
    ZOOKEEPER_TICK_TIME: 2000
broker:
  image: confluentinc/cp-kafka:7.0.1
  container_name: broker
  restart: always
  ports:
    - "9092:9092"
    - "29092:29092"
  depends_on:
    - zookeeper
  environment:
    KAFKA_BROKER_ID: 1
    KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_INTERNAL:PLAINTEXT
    KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092,PLAINTEXT_INTERNAL://broker:29092
    KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
    KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
Kafka Events: Custom Event Handling
A custom Kafka Events service is built to handle specific event processing needs, completing our MLOps system setup.
kafka_events:
  build: ./docker/kafka
  restart: always
  depends_on:
    - broker
  container_name: kafka_events
The Dockerfile below invokes a Kafka producer that reads the Parquet data and streams it online.
FROM python:3.8-slim
# Copy app handler code
COPY ./kafka_sample.py kafka_sample.py
COPY ./driver_stats.parquet .
# Install dependencies
RUN pip3 install kafka-python==2.0.2
RUN pip3 install pandas
RUN pip3 install pyarrow
CMD [ "python", "-u", "kafka_sample.py", "--mode", "setup", "--bootstrap_servers", "broker:29092"]
In Conclusion
By leveraging Docker Compose, we’ve created an MLOps environment that is not only comprehensive but also modular and versatile. Each service is meticulously configured to work in harmony, providing a seamless experience for managing lightweight data pipelines and deploying AI/ML models.
Summary of Work
Functionality:
Streamlined Workflow Management: The use of Airflow allows for the orchestration of complex data workflows, automating and simplifying the process of managing data pipelines.
Experiment Tracking: MLflow provides a platform for tracking experiments, managing the ML lifecycle, and storing artifacts, making it easier to reproduce results and iterate on models.
Feature Store Integration: Feast serves as an online feature store, interfacing with Redis for quick data retrieval, which is essential for real-time machine learning applications.
Real-Time Data Streaming: The inclusion of Apache Kafka enables the handling of high-throughput, real-time data streams, which is crucial for timely insights and decision-making.
Database Service: A robust PostgreSQL backend ensures data integrity and provides a solid foundation for all data-related operations within the MLOps environment.
Benefits:
Efficiency: Automating data pipeline and model deployment processes reduces manual effort and increases productivity.
Reliability: Health checks and automatic restarts ensure that each service is always running optimally, providing a reliable infrastructure.
Flexibility: The modular nature of Docker Compose allows for easy updates and customization of the MLOps environment to fit specific requirements.
Reproducibility: With detailed tracking and version control, experiments and models can be easily replicated, facilitating collaboration and knowledge sharing.
Future Work: Elevating Scalability to New Heights
As we look to the horizon, our MLOps system stands poised for a transformative leap in scalability. The foundational architecture is meticulously designed to embrace expansion, accommodating the burgeoning demands of big data processing. Here’s what the future holds:
Kubernetes Integration: We will harness the power of Kubernetes to orchestrate containerized applications across a cluster of machines, ensuring seamless scalability and management. This will enable our system to dynamically adjust resources, handle increased loads, and maintain high availability with minimal downtime.
Hadoop Ecosystem: By incorporating Hadoop and its suite of tools like Hive and Spark, we will unlock distributed computing capabilities. This will allow us to process vast datasets efficiently, store data reliably in HDFS, and perform complex analytics at scale.
Advanced Monitoring Solutions: To keep pace with the expanded infrastructure, we will implement sophisticated monitoring solutions using tools such as Grafana. These will provide deeper insights into system performance, resource utilization, and operational health, ensuring that as we grow, we maintain the highest standards of reliability and performance.
Cloud Integration: We are committed to continuous innovation. Our roadmap includes exploring cutting-edge technologies and methodologies that can further extend our system’s scalability into the cloud.
Community Collaboration: Recognizing the power of collective wisdom, we will foster a community of users and contributors. Their feedback and contributions will be invaluable in refining our scalability solutions, ensuring that our system not only meets but exceeds the expectations of the data science and machine learning communities.
The journey ahead is filled with exciting possibilities. As we expand our MLOps system, we remain dedicated to providing a robust, scalable solution that empowers organizations to harness the full potential of their data.
Additional Notes
For those who seek to understand the whole picture, including the detailed process, definitions, and architecture of MLOps, the excellent paper below is a recommended reference.
[1] Dominik Kreuzberger, Niklas Kühl, and Sebastian Hirschl. 2023. Machine learning operations (mlops): Overview, definition, and architecture. IEEE Access (2023).
References & Links
Work cited: feast-dev/feast-workshop, a workshop with several modules to help learn Feast, an open-source feature store: https://github.com/feast-dev/feast-workshop
GitHub repo: takehiro177/mlops-onpremises-slim (github.com)
LinkedIn profile: www.linkedin.com/in/takehiro-ohashi-b54101174