Description
About the Role
We are seeking a highly skilled and motivated Lead AI Platform Engineer to play a pivotal role in the development of our ML/AI platform. This role will be instrumental in building, maintaining, and scaling the core infrastructure, platform services, and CI/CD pipelines that underpin our machine learning initiatives and product launches. You will work on critical projects that directly impact the organization's marketing, sales, service, and product growth verticals.
This isn't a traditional infrastructure role. You should be open to wearing multiple hats: infrastructure, software engineering, UI/UX development, and AI-native tooling. We're looking for engineers who don't just build platforms for AI; they use AI to build the platform. You ship faster because you've made Claude Code, autonomous agents, and AI-powered developer tools part of your daily workflow, not an experiment you're still evaluating.
We want innovative, out-of-the-box thinkers who aren't afraid to experiment, build complex systems, and tackle challenges across the full stack with AI as the force multiplier at every layer.
What You'll Do
AI-Native Engineering & Developer Velocity
Use Claude Code (CLI) as a primary engineering tool: writing, refactoring, debugging, and reviewing infrastructure and platform code, with AI pair programming as the default, not the exception.
Build and publish reusable AI tools, skills, and integrations in internal tool marketplaces so that platform capabilities are discoverable and reusable across engineering teams.
Design and deploy autonomous agents that accelerate developer workflows: self-healing CI pipelines, automated onboarding bots, infrastructure diagnosis agents, and documentation generation.
Author and maintain CLAUDE.md files across platform repos, encoding platform conventions, deployment patterns, and team knowledge so that AI tools produce high-quality, context-aware output from day one.
Define and enforce AI-first engineering standards across the team: how engineers prompt, how context is managed, how agent output is reviewed before it ships.
Infrastructure Development
Design, implement, and manage secure and scalable cloud infrastructure (primarily AWS) including IAM permissions management, data management, and Kubernetes.
Leverage AI tooling (Claude Code, autonomous agents) to accelerate infrastructure-as-code authoring, drift detection, and security review, reducing manual toil on repeatable tasks.
ML Platform Services
Develop and maintain core ML platform components: Model Registry, permissions services for project access, SageMaker default setup and deployment tooling.
Use AI-assisted development to accelerate the build-out of platform APIs, internal UIs, and self-service tooling for ML engineers and data scientists.
CI/CD and Workflow Automation
Build and optimize CI/CD pipelines using GitHub Actions for efficient and secure code deployment, Docker and package building, and security scanning.
Embed autonomous agent steps into pipelines (auto-diagnosis on failure, AI-generated PR summaries, automated dependency updates) so pipelines are self-documenting and partially self-healing.
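To give a concrete sense of the "auto-diagnosis on failure" pattern above, here is a minimal, illustrative sketch of the kind of script a pipeline step might run before handing a failed build to an agent or a human. Everything here (the function name, the log format, the keyword list) is hypothetical, not part of any existing pipeline:

```python
import re

def triage_log(log_text: str, context: int = 1) -> list[str]:
    """Extract likely failure lines from a CI log, with surrounding context.

    Returns the matching lines (plus `context` lines on each side) in order,
    ready to be passed to an agent prompt or posted as a PR comment.
    """
    lines = log_text.splitlines()
    # Case-insensitive match on common failure keywords.
    pattern = re.compile(r"\b(error|failed|traceback|exception)\b", re.IGNORECASE)
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if pattern.search(line):
            # Include neighboring lines so the excerpt stays readable.
            for j in range(max(0, i - context), min(len(lines), i + context + 1)):
                keep.add(j)
    return [lines[j] for j in sorted(keep)]

sample_log = """\
step 1: checkout ok
step 2: build started
ERROR: package 'foo' not found
build failed with exit code 1
step 3: skipped
"""
summary = triage_log(sample_log)
```

In a real pipeline this step would run on failure only, and its output would seed the agent's context rather than the full multi-megabyte log.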
Tooling, Developer Experience & Marketplace
Build and curate an internal AI tool and skills marketplace where engineers can discover, reuse, and extend trusted integrations that connect AI agents to platform data sources, APIs, and services via MCP servers.
Develop internal developer tools (web interfaces, AI assistants, CLI tools) that let ML engineers and data scientists self-serve without platform team involvement.
Implement secrets management, package/dependency management, testing frameworks, and observability integrations, and use AI tooling to keep them maintained and documented at scale.
Architecture
Maintain a comprehensive view of how all platform components work together: infrastructure, agent harnesses, tool marketplace, evaluation pipelines, and observability.
Create architecture diagrams and own the long-term platform vision; be the person who can articulate both where we are and where we're going.
Monitoring and Reliability
Establish monitoring solutions (Grafana, PagerDuty) and integrate security scanning to ensure platform health.
Use autonomous agents for first-line incident response: alert triage, log summarization, runbook execution, and escalation routing.
Security & Compliance
Participate in security reviews and ensure all platform components, including AI tooling and agent infrastructure, adhere to security best practices and compliance requirements.
Own the security posture of AI tool integrations: sandboxed execution, auditable agent traces, least-privilege tool permissions.
Collaboration & Documentation
Work closely with ML engineers, data scientists, and product managers to deliver robust, high-performance solutions.
Use AI-assisted documentation generation to keep platform docs, runbooks, and user guides current; documentation that drifts is a platform liability.
What We're Looking For
Required
9+ years of proven experience as a Platform Engineer, Software Engineer, or ML Infrastructure Engineer.
Demonstrated AI-native engineering practice: you actively use tools like Claude Code (CLI), Cursor, or equivalent AI pair programmers as part of your daily engineering workflow; this is visible in your work, not aspirational.
Experience building or contributing to an internal tool or skills marketplace: publishing reusable integrations, MCP servers, or AI building blocks that other teams depend on.
Experience designing and deploying autonomous agents that perform real engineering tasks: CI diagnosis, infrastructure ops, developer onboarding, documentation generation.
Strong software engineering skills in Python for building scalable tools, automation scripts, and platform components.
Experience with MCP (Model Context Protocol) servers: building, hosting, and securing tool integrations for AI agents.
Strong expertise in AWS (IAM, EKS, S3, SageMaker, Lambda, etc.).
Extensive experience with CI/CD tools, especially GitHub Actions and ArgoCD.
Proficiency in infrastructure-as-code (Terraform).
Experience with containerization (Docker) and orchestration (Kubernetes).
Experience with MLOps concepts and tools.
Experience with model and agent evaluation.
Familiarity with monitoring and alerting systems (Grafana, PagerDuty).
Familiarity with Okta or similar IAM systems.
Experience with tenant and project onboarding in multi-tenant environments.
Familiarity with security best practices and conducting security reviews.
Experience developing internal developer tools (web, AI assistants, CLIs).
Ability to manage multiple priorities; excellent problem-solving and communication skills.
Preferred
Experience with the Salesforce ecosystem.
Familiarity with agent memory patterns: context management, long-term retrieval, episodic memory.
Experience with unstructured databases (vector or graph) and RAG pipelines.
Experience with modern data platforms: Snowflake, Kafka, Flink.
Experience with Feature Stores (e.g., Feast).
Experience with A/B testing and experimentation platforms.
Knowledge of Airflow or other workflow orchestration tools.