Sr Director of Engineering - Infinia
Overview

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." - IDC

"The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments" - Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence.

Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management. Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage.
Job Description

Sr Director of Engineering - Infinia Distributed Platform

We are looking for an experienced and technically driven Director of Engineering to lead the Infinia Distributed Platform organization, the foundational team powering DDN’s flagship AI-native distributed data platform.

In this role, you will oversee engineering teams responsible for the core systems that enable Infinia’s performance, scalability, and reliability at global scale. This includes mission-critical components such as task scheduling, distributed tracing, memory management, SPDK data access, profiling, networking, reliability, distributed locking, internal key-value stores, and filesystem clients, all orchestrated within a multi-tenant, high-throughput environment.

You will define the strategy, scale execution, and mentor engineering leaders to deliver production-grade systems that meet the demands of AI/ML, high-performance computing, and enterprise analytics.

This is a hands-on technical leadership role at the heart of Infinia’s distributed architecture, where decisions today shape how data moves tomorrow.

Key Responsibilities

Core Systems Leadership

Lead and scale multiple engineering teams focused on critical path components of the Infinia platform:
- Task scheduling and orchestration
- Tracing and observability infrastructure
- Memory management and performance tuning
- SPDK-based I/O data path
- Reliability and fault-tolerance systems
- Networking stack optimization and event-driven IO
- TDS (Tenant Data Services) and multi-tenant isolation
- DLM (Distributed Lock Manager) and concurrency control
- Internal KVStore for system metadata and state
- FS client for scalable POSIX-like access

Technical Strategy & Execution

- Own the end-to-end architecture, roadmap, and execution for all core components.
- Guide technical design reviews, enforce performance standards, and align cross-team priorities to platform milestones.
- Collaborate with architecture and infrastructure teams to evolve platform interfaces, service contracts, and internal APIs.

Organizational Growth & Team Development

- Hire, mentor, and develop engineering managers and senior ICs to build a culture of accountability, innovation, and technical rigor.
- Drive a results-oriented mindset focused on high-velocity, high-reliability software delivery.
- Set clear goals and foster professional growth through coaching, feedback, and performance management.

Cross-Functional Collaboration

- Partner with product management, field engineering, and customer teams to shape feature priorities and ensure core platform needs are anticipated early.
- Interface with support and site reliability teams to define SLAs, improve telemetry, and reduce MTTR for platform incidents.
- Contribute to platform-wide initiatives in multi-tenancy, fault isolation, observability, and performance benchmarking.

Platform Reliability & Performance

- Champion operational excellence across core services, including incident response, regression testing, and release stability.
- Optimize memory usage, lock contention, thread scheduling, and task pipelines to deliver microsecond-level performance where required.
- Establish strong internal metrics and observability standards to measure system health, responsiveness, and uptime.

Required Qualifications

- 12+ years of engineering experience in distributed systems, operating systems, or storage platform engineering.
- 5+ years of experience leading multi-team organizations delivering core systems software in production environments.
- Strong expertise in systems programming (C, C++, Rust) and deep knowledge of concurrency, memory models, and network programming.
- Proven track record designing and scaling services related to task scheduling, locking, memory, and I/O performance.
- Experience managing components at the intersection of infrastructure and application performance, especially in multi-tenant platforms.
- Excellent communication, roadmap planning, and cross-functional leadership skills.

Preferred Qualifications

- Experience with SPDK, RDMA, DPDK, or high-performance storage stacks.
- Knowledge of distributed coordination protocols, key-value stores, or scalable metadata architectures.
- Background in AI/ML, HPC, or cloud-native infrastructure (Kubernetes, microservices, etc.).
- Familiarity with observability tools (e.g., tracing frameworks, profilers, Prometheus, OpenTelemetry).

Success Metrics – First 30 Days

Strategic Alignment
- Ramp up on all core components, existing technical challenges, and roadmap priorities.
- Meet with team leads and cross-functional partners to assess execution readiness and architectural cohesion.

Early Impact
- Identify 2–3 areas for performance optimization, team structure refinement, or architectural alignment.
- Deliver a 90-day strategy plan outlining key initiatives across reliability, latency, and scalability.

Team Integration
- Build trust and alignment with engineering managers and ICs.
- Assess hiring needs and begin shaping the next phase of team growth.

Success Metrics – Beyond 30 Days

- Timely, high-quality delivery of core platform milestones aligned to product roadmap.
- Improvements in performance, fault-tolerance, and memory/network efficiency across key subsystems.
- Clear reduction in escalations, latency spikes, and cross-component coordination complexity.
- Team health, engagement, and velocity aligned with long-term technical and business goals.

Originally posted on Himalayas