Software Engineer - Reliability

Remote, Full-time
About Luma AI

Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work.

Where You Come In

We are looking for a hands-on, first-principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure. You will build, maintain, and scale Luma’s infrastructure across on-prem and multi-vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.

What You’ll Do

• Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won’t just maintain existing clusters; you will help define how our next-generation infrastructure operates.
• Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
• Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment.
• Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level.
• Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
• Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.

Who You Are

• 8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
• Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance.
• Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
• Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect.
• Startup DNA: You are energetic and thrive in a less structured, fast-paced environment.
• Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
• Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs.

What Sets You Apart (Bonus Points)

• Deep expertise with GPU tooling for NVIDIA and AMD GPUs like DCGM or ROCm.
• Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).
• Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.
