Staff Software Engineer - GenAI Performance and Kernel
Databricks
P-1285
About This Role
As a staff software engineer for GenAI Performance and Kernel, you will own the design, implementation, optimization, and correctness of the high-performance GPU kernels powering our GenAI inference stack. You will lead development of highly tuned, low-level compute paths, manage trade-offs between hardware efficiency and generality, and mentor others in kernel-level performance engineering. You will work closely with ML researchers, systems engineers, and product teams to push the state of the art in inference performance at scale.
What You Will Do
- Lead the design, implementation, benchmarking, and maintenance of core compute kernels (e.g. attention, MLP, softmax, layernorm, memory management) optimized for various hardware backends (GPUs and other accelerators)
- Drive the performance roadmap for kernel-level improvements: vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, auto-tuning, etc.
- Integrate kernel optimizations with higher-level ML systems
- Build and maintain profiling, instrumentation, and verification tooling to detect correctness and performance regressions, numerical issues, and hardware utilization gaps
- Lead performance investigations and root-cause analysis on inference bottlenecks, e.g. memory bandwidth, cache contention, kernel launch overhead, tensor fragmentation
- Establish coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend portability, and maintainability
- Influence system architecture decisions to make kernel improvements more effective (e.g. memory layout, dataflow scheduling, kernel fusion boundaries)
- Mentor and guide other engineers working on low-level performance, provide code reviews, and help set best practices
- Collaborate with infrastructure, tooling, and ML teams to roll out kernel-level optimizations into production, and monitor their impact
What We Look For
- BS/MS/PhD in Computer Science or a related field
- Deep hands-on experience writing and tuning compute kernels (CUDA, Triton, OpenCL, LLVM IR, assembly, or similar) for ML workloads
- Strong knowledge of GPU/accelerator architecture: warp structure, memory hierarchy (global, shared, register, L1/L2 caches), tensor cores, scheduling, SM occupancy, etc.
- Experience with advanced optimization techniques: tiling, blocking, software pipelining, vectorization, fusion, loop transformations, auto-tuning
- Familiarity with ML-specific kernel libraries (cuBLAS, cuDNN, CU
About the company
Databricks
Unified analytics and data lakehouse platform.