Member of Technical Staff (AI Infrastructure Engineer)
Perplexity
We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.
Responsibilities
Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
Manage and optimize Slurm-based HPC environments for distributed training of large language models
Develop robust APIs and orchestration systems for both training pipelines and inference services
Implement resource scheduling and job management systems across heterogeneous compute environments
Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands
Qualifications
Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
Experience with deploying and managing distributed training systems at scale
Deep understanding of container orchestration and distributed systems architecture
High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query and Grouped-Query Attention, distributed training strategies)
Experience managing GPU clusters and optimizing compute resource utilization
Required Skills
Expert-level Kubernetes administration and YAML configuration management
Proficiency with Slurm job scheduling, resource management, and cluster configuration
Python and C++ programming with a focus on systems and infrastructure automation
Hands-on experience with