Job Description
About Etched
Etched is building AI chips that are hard-coded for individual model architectures. Our first product (Sohu) supports only transformers, but delivers an order of magnitude higher throughput and lower latency than a B200. With Etched ASICs, you can build products that would be impossible with GPUs, like real-time video generation models and extremely deep & parallel chain-of-thought reasoning agents.
Job Summary
We are seeking a highly skilled and motivated ML Software Architect to join the Etched Inference Serving Stack team. You will be responsible for the deployment and optimization of large-scale AI models, particularly Mixture of Experts (MoE) architectures. This role focuses on developing strategies, software, and infrastructure for efficiently mapping and executing MoE models, including next-generation large language models, across distributed compute environments. You will play a pivotal role in ensuring the performance, scalability, and reliability of our largest AI workloads, collaborating closely with AI researchers and infrastructure teams.
Key responsibilities
MoE Model Mapping and Partitioning: Design, develop, and implement algorithms and software for partitioning and mapping MoE models (including experts and gating networks) across multiple Sohu accelerator hosts (a minimal mapping sketch follows this list).
Distributed Systems Performance Optimization: Analyze and optimize the performance (latency, throughput, resource utilization) of distributed MoE model inference, focusing on computation, memory bandwidth, and network communication efficiency.
Distributed ML Deployment Frameworks: Design, build, and maintain software tools and frameworks automating deployment, scaling, and management of distributed MoE models.
Orchestration and Integration: Integrate MoE deployment strategies with cluster management and orchestration systems (e.g., Kubernetes, Slurm) and ML platforms (e.g., Kubeflow, Ray).
System Validation and Correctness: Develop and execute comprehensive test plans validating functionality, performance, scalability, and numerical correctness of distributed MoE model deployments.
Collaboration and Troubleshooting: Collaborate closely with ML engineers and infrastructure/hardware teams to understand the inference stack and hardware capabilities/constraints, and diagnose and resolve complex distributed-systems issues affecting large model deployment.
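To give a flavor of the first responsibility above, here is a minimal, illustrative sketch of mapping MoE experts across devices. It is plain PyTorch on generic devices, not Etched's Sohu stack; the class name, round-robin placement policy, and top-2 gating are assumptions for illustration only.

```python
# Illustrative sketch only (not Etched's stack): a top-2 gated MoE layer whose
# experts are placed round-robin across a list of devices, showing the kind of
# expert-to-device mapping this role designs at much larger scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartitionedMoE(nn.Module):
    def __init__(self, d_model, d_ff, num_experts, devices, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Hypothetical placement policy: expert i lives on devices[i % len(devices)].
        self.expert_devices = [devices[i % len(devices)] for i in range(num_experts)]
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)).to(dev)
            for dev in self.expert_devices
        ])
        self.gate = nn.Linear(d_model, num_experts)  # gating network stays with the input

    def forward(self, x):  # x: [tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)
        weights, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, (expert, dev) in enumerate(zip(self.experts, self.expert_devices)):
            token_mask = (topk_idx == e).any(dim=-1)
            if token_mask.any():
                # Ship only the routed tokens to the expert's device and back.
                routed = expert(x[token_mask].to(dev)).to(x.device)
                gate_w = weights[token_mask][topk_idx[token_mask] == e].unsqueeze(-1)
                out[token_mask] += gate_w * routed
        return out

if __name__ == "__main__":
    devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu"]
    moe = PartitionedMoE(d_model=64, d_ff=256, num_experts=4, devices=devices)
    print(moe(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```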
Representative projects
Develop and implement a novel mapping strategy for a large MoE language model onto a heterogeneous cluster of GPUs, TPUs or similar accelerators.
Build performance analysis tools to identify bottlenecks in distributed MoE inference across hundreds of nodes.
Optimize network communication patterns (e.g., expert routing via NCCL/MPI) for a specific MoE architecture (see the communication sketch after this list).
Integrate automated, optimized MoE model deployment into an MLOps CI/CD pipeline.
Debug and resolve performance degradation or correctness issues in a large-scale MoE deployment.
Evaluate and compare different model parallelism techniques (e.g., expert, tensor, pipeline parallelism) for upcoming MoE models.
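As a concrete illustration of the expert-routing communication project above, here is a minimal sketch of the all-to-all token exchange that dominates MoE routing traffic. It assumes a PyTorch distributed job launched with torchrun and a backend that implements all-to-all (NCCL on GPUs in practice; gloo is used here only so the sketch can run on CPUs).

```python
# Illustrative sketch: exchange routed tokens between ranks with all-to-all.
# Assumes `torchrun --nproc_per_node=N this_file.py` so the process group
# environment variables are already set.
import torch
import torch.distributed as dist

def dispatch_tokens(tokens_per_peer):
    """Exchange routed tokens with every other rank via a single all-to-all.

    tokens_per_peer[r] holds the tokens this rank wants evaluated by the
    experts hosted on rank r.
    """
    world = dist.get_world_size()
    # Tell every peer how many tokens to expect from us (one count per rank).
    send_counts = torch.tensor([t.shape[0] for t in tokens_per_peer])
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)
    # One fused all-to-all for the token payloads themselves.
    d_model = tokens_per_peer[0].shape[1]
    send_buf = torch.cat(tokens_per_peer, dim=0)
    recv_buf = torch.empty(int(recv_counts.sum()), d_model)
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return recv_buf  # tokens this rank's local experts must now process

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")  # CPU-friendly for illustration
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.manual_seed(rank)
    # Pretend the gate routed a random number of tokens to each peer.
    outgoing = [torch.randn(torch.randint(1, 5, ()).item(), 16) for _ in range(world)]
    received = dispatch_tokens(outgoing)
    print(f"rank {rank}: received {received.shape[0]} tokens")
    dist.destroy_process_group()
```

In practice, this is the kind of collective whose message sizes, placement, and overlap with compute the projects above would profile and reshape.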
You may be a good fit if you have
Proficiency in Python and C++, along with ML frameworks such as PyTorch and JAX.
Strong understanding of large language model architectures, particularly Mixture of Experts (MoE).
Solid understanding of distributed systems concepts, algorithms, and challenges (e.g., consensus, consistency, communication patterns).
Experience in developing and optimizing software for distributed computing environments.
Strong understanding of operating systems (Linux preferred) and underlying hardware, including accelerator architectures (GPUs, TPUs) and high-speed interconnect technologies (e.g., NVLink, InfiniBand).
Experience analyzing performance traces and logs from distributed systems and ML workloads.
Strong candidates may also have
Experience with cluster orchestration technologies (Kubernetes, Slurm) and ML platforms (Ray, Kubeflow).
Experience with specific model parallelism libraries or frameworks (e.g., DeepSpeed, Megatron-LM, FairScale, GSPMD, vLLM).
Familiarity with high-performance communication libraries (e.g., MPI, NCCL).
Experience with ML-specific profiling and debugging tools (e.g., NVIDIA Nsight Systems, PyTorch Profiler, TensorBoard).
Ideal background
Candidates with experience deploying large-scale LLMs in distributed environments.
Candidates with a deep understanding and hands-on experience designing, implementing, or optimizing MoE architectures.
Candidates who have developed software specifically for optimizing distributed computation (e.g., communication optimization, load balancing, custom kernels).
Candidates experienced in performance analysis, bottleneck identification, and optimization for distributed AI/HPC workloads.
Candidates familiar with mapping complex computational graphs onto parallel and distributed hardware.
Candidates with a background in High-Performance Computing (HPC) applied to Machine Learning problems.
Benefits
Full medical, dental, and vision packages, with 100% of premiums covered
Housing subsidy of $2,000/month for those living within walking distance of the office
Daily lunch and dinner in our office
Relocation support for those moving to West San Jose
How we’re different
Etched believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.
We are a fully in-person team in West San Jose, and greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.
Etched.ai is the developer of Sohu, an AI chip optimized for transformer models. By embedding the transformer architecture into its chips, Etched.ai is pioneering purpose-built servers for transformer inference.