Software Engineer, TT-Distributed
Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. They are seeking a TT-Distributed Software Engineer to develop and optimize distributed software systems for AI and HPC clusters, focusing on distributed programming and scalable architectures.

Responsibilities
- Architect, implement, and optimize distributed software systems that coordinate computation and communication across clusters of AI accelerators and CPUs
- Design and build distributed APIs enabling data-parallel and tensor-parallel AI workloads
- Leverage MPI-based technologies and related frameworks to scale programming models across multiple hosts and compute nodes
- Develop robust systems using IPC, inter-node sockets, and distributed communication primitives to ensure reliability and high performance
- Build and maintain testing, debugging, profiling, and monitoring tools for large-scale distributed workloads, and collaborate with model and systems teams on cluster bring-up

Skills
- Strong C or C++ engineer with solid foundations in systems programming, operating systems, and distributed systems principles
- Enthusiastic about distributed computing, including IPC, socket programming, and cluster resource coordination
- Comfortable reasoning about scalability, fault tolerance, and performance across multi-node environments
- Curious, first-principles thinker who challenges conventional approaches to distributed system design
- Motivated to grow into a deep technical expert in large-scale distributed AI infrastructure

Benefits
- Highly competitive compensation package and benefits

Company Overview
Tenstorrent develops AI hardware and software solutions for data processing and machine learning applications. It was founded in 2016 and is headquartered in Toronto, Ontario, Canada, with a workforce of 501-1000 employees. Its website is http://tenstorrent.com.