Are you passionate about ensuring the reliability and performance of mission-critical cloud services? Salesforce is seeking a talented Site Reliability Engineer to join our team in Denver, CO, supporting our GovCloud environment. As a key member of our Site Reliability organization, you’ll play a vital role in maintaining 99.99% uptime for customer-facing services, proactively addressing issues, and ensuring the security of our data. We foster a collaborative and innovative culture, where you’ll work alongside skilled engineers to solve complex problems and drive continuous improvement.

Please Note: This position requires a successful background investigation and the ability to obtain and maintain a specific level of U.S. government background clearance. Details will be provided during the interview process.

Shift Requirements: This role involves shift work, including night shifts, as part of a 24/7 support team. We provide a rotating schedule and shift-differential compensation.

About the Role:

The Site Reliability team at Salesforce is the backbone of our cloud operations, working around the clock to keep our services available and our customers protected. You will be a crucial part of the GovCloud Incident Response (GIR) team, which maintains the current infrastructure through day-to-day alert response, smart hands support, and comprehensive incident management, including retrospectives and long-term remediation.

Your Responsibilities:

Ensure 99.99% uptime for customer-facing services by proactively monitoring and maintaining the health of supporting systems, contributing directly to customer satisfaction and trust.
Act in key support roles during major incidents (e.g., Sev0, Sev1) and participate in technical incident reviews for problem management.
Contribute to Problem Management by populating and participating in Root Cause Analyses (RCAs) and handing them off to the Global Solutions team.
Ensure all work carried out by the Site Reliability team aligns with the company’s internal compliance policies and directives.
Collaborate with technical staff to solve complex technical issues and customer concerns.
Lead and mentor other team members in staying abreast of industry innovations and technologies, and support the team's development and growth.
Thrive in a fast-paced environment, solving sophisticated issues quickly and successfully balancing multiple priorities.
Automate the detection and resolution of recurring issues in the production environment.
Help create and improve current processes to reduce operational and engineering toil, including the implementation of AI-driven automation for routine tasks.

Basic Requirements:

Citizenship: U.S. citizen (U.S. born or naturalized) who does not hold dual citizenship. You agree to complete a Minimum Background Investigation (MBI) for a Moderate Public Trust position with the U.S. federal government or other clearances as deemed appropriate for the role.
Education: Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field.
Experience: Systems engineering experience in an enterprise-scale internet service engineering or support role.
Technical Skills:
- Expertise in TCP/IP related technologies (networking protocols, network programming, etc.).
- Expertise in CLI enterprise support of Unix variants (Linux/Solaris/BSD), with significant exposure to Red Hat Enterprise Linux and Solaris.
- Strong understanding of monitoring, security, and systems administration.
- Experience provisioning, operating, and running AWS/C2S based infrastructure and systems.
- Proficiency in scripting with Python, Go, or other languages.
Communication: Strong written and oral communication skills.
Incident Management: Past experience in Incident Management and a good understanding of ITIL service operations.
Availability: Ability to participate in a 24/7 on-call rotation supporting large data center operations and be available for shift work.

Preferred Qualifications:

Prior experience with Chef/Puppet or automated deployment. (This helps streamline our infrastructure management.)
Prior experience with Jenkins/Bamboo/Spinnaker pipeline execution. (This aids in our continuous integration and deployment processes.)
Experience supporting and maintaining monitoring and alert systems. (Ensures proactive issue detection.)
Experience supporting and maintaining Java applications. (Supports our application stack.)
Hands-on experience configuring and running AWS (Amazon Web Services) using the CLI/SDKs. (Essential for our cloud infrastructure.)
Certifications such as Linux+, Red Hat, and AWS. (Validates technical expertise.)
Experience supporting and leading Kubernetes-based applications and services. (Supports our containerized environment.)
Familiarity with Agile processes and DevOps practices. (Enables efficient workflow and collaboration.)
Experience participating in blameless retrospectives, learning from incidents, and conducting post-incident investigations, with an interest in how AI can assist in root cause analysis and pattern identification. (Promotes a culture of continuous improvement.)
Working knowledge of and interest in resilience engineering, including concepts such as Safety II and proactive problem prevention, leveraging AI for proactive risk identification and system optimization. (Enhances system reliability.)
Experience with AI/ML concepts and tools for operational insights, predictive maintenance, or intelligent automation.
Familiarity with data analysis and visualization tools to interpret AI-generated insights.

The candidate must be a U.S. citizen (U.S. born or naturalized) who does not hold dual citizenship and must agree to complete a U.S. federal government Minimum Background Investigation (MBI) for a Moderate Public Trust position.

Apply now to join our dynamic team and help us drive incident response efficiency and system resilience.
