Description
Opportunity & Product
Join an agile team with deep startup roots. Following our recent acquisition, we operate as a high-velocity ‘startup-within-Salesforce.’ You’ll be managed by the same founders and engineers who built the original company, which gives you the autonomy of a small team backed by the global scale and trust of Salesforce.
We have successfully moved past the "0 to 1" phase. We have a product that works, customers who love it, and the backing of Salesforce. Now, we are entering the "1 to 100" phase: scaling our architecture to handle global demand, hardening our systems for enterprise-grade resilience, and integrating deeply with the Agentforce ecosystem. This is your chance to help lead that transition.
What You’ll Do
As a Production Support Engineer (LMTS), you will be a senior technical lead within our embedded reliability team. You aren’t building the foundation alone—you’ll work alongside a group of engineers and product owners to ensure the Agentforce for Supply Chain platform is the most reliable AI-powered engine in the industry.
This is a role for an engineer who loves the "scaling" problem. You will focus on production excellence, performance tuning, and infrastructure automation. Because you are embedded in the product organization, you’ll have a seat at the table during design reviews, ensuring that as we add new agentic capabilities, they are built to scale from day one.
Responsibilities
Scaling & Reliability: Own the reliability roadmap for major product areas, working to transition our systems from startup-speed architectures to highly available, global-scale enterprise solutions.
Collaborative Leadership: Partner with PMTS-level engineers to refine our infrastructure strategy, contributing senior-level perspectives on system design, capacity planning, and bottleneck identification.
Infrastructure as Code: Maintain and evolve our automated environments, focusing on making our "infrastructure-as-plugins" model more robust and developer-friendly.
AI Operations (AIOps): Support the scaling of our AI/ML infrastructure, ensuring our models have the GPU resources and data pipelines required to deliver real-time supply chain insights.
Production Excellence: Lead the "1 to 100" hardening of our observability stack. You won’t just respond to incidents; you’ll build the tooling that prevents them and the telemetry that explains them.
Performance Engineering: Deep-dive into SQL optimization, API latency, and cross-service communication to ensure our data-intensive supply chain platform remains performant under heavy load.
AI-First Workflow: Lean into the future of engineering by using AI tools such as Claude Code to automate routine operational tasks and accelerate infrastructure delivery.
Contribute to building and maintaining the shared system context, an explicit repository of system designs, constraints, and standards that enables AI to operate accurately and reliably.
Critically evaluate code (human- or AI-generated) for correctness, quality, security, and performance.
Required Qualifications
5+ years of experience in SRE, Production Engineering, or Backend Engineering with a heavy focus on operations and scale.
Proven Scaling Experience: You have previously helped take a product through a high-growth phase (the "1 to 100" journey), dealing with the technical debt and architectural shifts that come with it.
Technical Breadth: Strong proficiency in Kubernetes, Terraform/OpenTofu, and AWS/GCP/Azure.
Coding Mastery: Ability to write and review production-level code in Golang, TypeScript, or Python—you view automation as a software engineering problem.
Systems Expert: Deep understanding of distributed systems, including how to debug complex interactions between microservices, databases, and AI agents.
Low-Ego Collaboration: Experience working within a senior team of Principal engineers, capable of both leading specific initiatives and supporting the broader group’s technical vision.
A demonstrated, genuine AI-first approach to engineering: using AI to move faster, build fluency across the stack, and contribute well beyond your core specialty.
Experience using AI tools (e.g., Claude Code, GitHub Copilot, Codex, Cursor) in development workflows.
Advanced prompt engineering skills, including the ability to write precise, structured prompts and to cultivate the system context that makes AI outputs reliable, secure, and production-ready.
Preferred Qualifications
M.S. in Computer Science or equivalent practical experience.
Database Specialist: Strong experience with PostgreSQL at scale (partitioning, indexing, query tuning).
Distributed Systems: Advanced knowledge of microservice orchestration and durability patterns, including hands-on experience with Temporal for workflow reliability and service mesh for secure, observable service-to-service communication in high-growth SaaS environments.
Supply Chain/Logistics: Experience with the unique data constraints and reliability requirements of manufacturing or global logistics.
Salesforce Knowledge: Familiarity with Salesforce infrastructure, Hyperforce, or Data Cloud is a plus.
Public Cloud Expertise: Deep knowledge of networking, security, and identity management within major cloud providers.
For roles in San Francisco and Los Angeles: Pursuant to the San Francisco Fair Chance Ordinance and the Los Angeles Fair Chance Initiative for Hiring, Salesforce will consider for employment qualified applicants with arrest and conviction records.