Consultant HPC Infrastructure Engineer

Remote Full-time
We are looking for a curious and driven engineer eager to step into the world of high-performance computing and AI infrastructure. In this role, you’ll gain hands-on experience supporting NVIDIA GPU clusters and automation pipelines that power some of the world’s most advanced AI workloads. Working alongside seasoned engineers, you’ll learn to apply Linux, Kubernetes, Terraform, and Prometheus in real-world environments where precision and scale truly matter. If you’re passionate about technology that defines the future of computers, this is your chance to grow within a team shaping that frontier.

Office Travel: Frequent on-site work is required for this position (2–3 days/week) at our Santa Clara, CA office.

Job responsibilities

You will act as the initial responder to monitoring alerts, ensuring timely acknowledgment and preliminary triage of operational issues.
You will automate operational procedures and diagnostics using established Infrastructure as Code (IaC) tools, including Bash, Python, Ansible, Terraform, and Helm, under the guidance of senior engineers.
You will execute foundational diagnostics such as NCCL tests, DCGM (Data Center GPU Manager), Fabric Diagnostics, and designated test workloads for training and inference, following standard procedures.
You will apply a proactive and action-oriented mindset, resolving documented issues efficiently and suggesting improvements to runbooks or automation scripts based on recurring patterns.
You will analyze and interpret diagnostic outputs to assess system health and identify early signs of degradation or instability.
You will document all operational activities, system status changes, and troubleshooting steps with accuracy, clarity, and timeliness.
You will use observability tools such as Prometheus and Grafana to analyze logs and metrics, supporting senior engineers in the root cause isolation process.
You will develop hands-on familiarity with HPC workload management tools, including Slurm and/or Kubernetes.
You will actively participate in training sessions and knowledge-sharing initiatives to deepen your understanding of the GB200/GB300 architecture and operational best practices.
You will maintain a high level of discipline, attention to detail, and consistency across all operational tasks.

Job qualifications

Technical Skills

You have foundational knowledge of Linux operating systems and are comfortable with the Unix command line, including using awk, Bash, and Python for log parsing and basic automation.
You are familiar with or have exposure to HPC systems, including HPC schedulers (e.g., Slurm) or container orchestration tools (e.g., Kubernetes).
You are comfortable using observability platforms such as Prometheus and Grafana for log and metric visualization.
You are familiar with Infrastructure as Code (IaC) concepts and can execute automation using tools like Ansible or Terraform.
You have familiarity with GPU-based workloads and are eager to deepen your understanding of AI and HPC operations.

Professional Skills

You demonstrate strong analytical ability and can follow complex procedures while interpreting technical results (e.g., NCCL tests).
You communicate with clarity and accuracy, producing clear documentation and reports for both peers and senior engineers.
You collaborate effectively with cross-functional teams, embracing mentorship and continuous feedback.
You bring curiosity, persistence, and discipline, with a strong desire to learn and grow in advanced HPC operations.
You work with attention to detail, ensuring consistency and accuracy in every task you undertake.
You thrive in an environment that values learning, precision, and shared ownership.

Growth Expectation

We value curiosity and a growth mindset. Candidates are expected to bring a strong foundation in Linux and scripting from academic or prior professional experience.
Proficiency in advanced scripting, IaC practices, and observability tooling (e.g., Prometheus, Grafana) may be developed within the first six months through structured on-the-job training and mentorship from senior engineers.

Other things to know

Learning & Development

There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.

About Thoughtworks

Thoughtworks is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we’re pushing boundaries through our purposeful and impactful work. For 30+ years, we’ve delivered extraordinary impact together with our clients by helping them solve complex business problems with technology as the differentiator. Bring your brilliant expertise and commitment to continuous learning to Thoughtworks. Together, let’s be extraordinary.

Salary Benefits: The annual salary range posted is subject to many factors and may vary depending on experience, geographic location, job responsibilities, performance, skills and/or training.
Salary: $108,100–$162,000 USD

See our AI policy.