Senior AI Infrastructure & Platform Engineer - Riyadh, KSA

DeepSource Technologies


Date: 4 hours ago
City: New Delhi, Delhi
Contract type: Full time

Role Overview

We are seeking a highly skilled Senior AI Infrastructure & Platform Engineer to join our client’s team in Riyadh. In this role, you’ll be responsible for building, managing, and optimizing scalable AI infrastructure and compute environments that support high-performance workloads, including GPU-accelerated AI/ML pipelines, cluster scheduling, and orchestration.

Key Responsibilities

  • Deploy, maintain, and optimize GPU-based compute clusters and infrastructure.
  • Manage and operate GPU orchestration tools and platforms such as:
    • Nvidia Base Command Manager (critical)
    • Nvidia AI Enterprise Suite
    • Nvidia GPU and Network Operators
    • Nvidia NIMs and Blueprints
  • Configure, deploy, and maintain compute workloads using scheduling and orchestration tools including:
    • Slurm (critical)
    • Vanilla Kubernetes
  • Install, configure, and maintain the underlying OS (e.g. Canonical Ubuntu) and supporting system software.
  • Monitor and troubleshoot infrastructure performance, availability, and reliability; ensure high uptime for AI/ML workloads.
  • Work with data scientists, ML engineers, and dev teams to define infrastructure requirements, resource allocation, and deployment workflows.
  • Develop automation scripts, CI/CD pipelines, and best practices for infrastructure provisioning and management.
  • Document architecture, configurations, and operational procedures; enforce security, compliance, and backup policies.

Requirements

Required Skills & Experience

  • Proven experience managing GPU-based AI/ML infrastructure and compute clusters.
  • Hands-on experience with:
    • Nvidia Base Command Manager
    • Nvidia AI Enterprise Suite
    • Nvidia GPU/Network Operators, NIMs, Blueprints
  • Strong experience with Slurm and/or Kubernetes orchestration.
  • Solid Linux system administration skills — preferably on Ubuntu or similar distributions.
  • Strong scripting/automation ability (e.g. Bash, Python, or relevant tooling) for provisioning, deployment, and maintenance.
  • Excellent troubleshooting and performance-tuning skills.
  • Experience collaborating with ML/data science teams and integrating infrastructure with their workflows.
  • Strong understanding of networking, security, resource allocation, and cluster management best practices.

Preferred Qualifications

  • Previous experience working in a high-performance computing (HPC) or AI-focused infrastructure team.
  • Knowledge of containerization, container orchestration, and GPUs in cloud or on-prem environments.
  • Experience with CI/CD, infrastructure-as-code (e.g. Terraform, Ansible), monitoring tools, and logging setups.
  • Familiarity with workload scheduling, job queuing, resource quotas, and shared-GPU environments.

