Machine Learning - Infrastructure

United Kingdom, On-site, Engineering

Our mission is general causal intelligence, AI that is capable of (1) predicting the future and (2) identifying the optimal actions to change that future.

To achieve this breakthrough, we are building a Large Physics foundation Model (LPM), because domains governed by physics have inherent cause-and-effect relationships, unlike visual or textual data.

Weather is the ideal training ground for an LPM. It is the best-observed physical system, offering rapid, objective ground-truth feedback from sensory observations, and data at a scale that dwarfs what is used to train today’s LLMs.

Causal Labs is a team of researchers and engineers from self-driving, drug discovery, and robotics (including Google DeepMind, Cruise, Waymo, Insitro, and Nabla Bio) who believe general causal intelligence will be the most important technical breakthrough for civilization.

We look for infrastructure engineers who are excited to tackle unsolved problems.

Our training and inference challenges demand deep expertise in setting up distributed training clusters and optimizing performance for large models. If you have experience building large-scale ML infrastructure in related fields such as language and vision models, robotics, or biology, join us on this mission.

Responsibilities

Design, deploy, and maintain large distributed ML training and inference clusters
Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle
Research and test various training approaches including parallelization techniques and numerical precision trade-offs across different model scales
Analyze, profile, and debug low-level GPU operations to optimize performance
Stay up to date on research and bring new ideas into our work

What we’re looking for

We value a relentless approach to problem-solving, rapid execution, and the ability to quickly learn in unfamiliar domains.

Strong grasp of state-of-the-art techniques for optimizing training and inference workloads
Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) to train large foundation models
Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings
Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)
Background working on distributed task management systems and scalable model serving & deployment architectures
Understanding of monitoring, logging, observability, and version control best practices for ML systems

You don’t have to meet every single requirement above.
