AI Infra Engineer – SRE (Kubernetes)

Remote Full-time
Job Category: Software Engineering
Job Type: Full Time
Job Location: Hybrid Remote

About The Role
We are a fast-growing AI infrastructure company building cutting-edge GPU cloud platforms and high-performance inference solutions that empower AI developers, startups, and enterprises worldwide. As we scale our global operations, we are looking for a skilled, hands-on AI Infra Engineer – SRE (Kubernetes) to join our Global Infrastructure team.

Role Overview
This is a critical hands-on position focused on the reliability, performance, and operational excellence of large-scale, high-performance AI/ML GPU clusters in our data centers. As an AI Infra Engineer – SRE (Kubernetes), you will design, operate, and optimize Kubernetes-based infrastructure to ensure maximum uptime, efficiency, and scalability for demanding AI workloads.

You will bring deep expertise in system-level troubleshooting, GPU cluster management, and automation to keep our platforms running at peak performance.

Key Responsibilities
• Design, build, and maintain scalable, production-grade AI/ML infrastructure using Kubernetes.
• Proactively monitor GPU cluster health, performance, and utilization across compute, accelerators, storage, and networking layers, performing root-cause analysis and resolution.
• Develop and implement automation for infrastructure provisioning, configuration, and ongoing management.
• Own the complete GPU node lifecycle in Kubernetes environments, including provisioning, dynamic scaling, maintenance, zero-downtime upgrades, and decommissioning.
• Build and improve CI/CD pipelines for reliable infrastructure deployment and orchestration.
• Enforce security best practices, compliance standards, and operational excellence across the infrastructure stack.
• Lead incident response and post-incident improvements for issues related to GPUs, CPUs, high-speed storage, and networks.
• Manage end-to-end customer GPU resource provisioning — from request intake and configuration to onboarding, troubleshooting, and support — ensuring high levels of customer satisfaction.
• Stay up to date with the latest GPU hardware, software, and orchestration technologies, integrating relevant advancements into our platforms.
• Be available for occasional regional or international travel to data center locations as required.

Requirements
• Bachelor’s degree in Computer Science, Engineering, or a related technical field.
• 3+ years of practical experience in data center operations, infrastructure engineering, or site reliability engineering.
• Strong background in infrastructure automation using tools such as Terraform and Ansible.
• Deep hands-on experience with Kubernetes in large-scale environments, including:
    • NVIDIA GPU Operator for GPU driver management, device plugins, container toolkit, and monitoring (DCGM).
    • NVIDIA Network Operator for high-performance networking, RDMA, and GPUDirect support.
    • CNI (Container Network Interface) and CSI (Container Storage Interface) plugins tailored for AI/ML workloads.
    • Integration with job schedulers such as Slurm in Kubernetes clusters.
• Proficiency in Linux system administration and scripting (Python, Bash).
• Experience with observability stacks including Prometheus, Grafana, and Loki.
• Solid understanding of GPU architecture, NVIDIA CUDA, NCCL, and AI/ML frameworks is a strong plus.
• Excellent troubleshooting skills with the ability to analyze complex system logs and performance metrics.
• Strong communication and collaboration skills to work effectively with engineering and operations teams.