We are seeking an experienced site reliability engineer for a direct-hire role in the Austin area. The organization has a collaborative, welcoming work culture and offers competitive compensation along with great benefits and perks.
Job Responsibilities
- Serve as an Incident Commander and lead incident response
- Lead post-mortem retrospective meetings as well as create relevant post-incident documentation and communications
- Implement and administer monitoring and alerting tooling to enable proactive incident response processes
- Build and configure integrations between systems for monitoring, alerting and reporting system health
- Track, report and effectively communicate system availability and performance metrics
- Facilitate the creation of operational runbooks and document common recovery actions
- Work with teams to define Service Level Objectives
- Collaborate with software developers and architects to identify improvements that will increase the reliability of our systems
Required Skills & Qualifications
- Experienced SRE with a strong background acting as Incident Commander and leading post-mortems
- Scripting experience is highly preferred. The team relies heavily on PowerShell, and Python use is growing
- Experience with Dynatrace or a similar platform is a huge plus
- Experience with Ansible and Microsoft Flows in PowerApps is a plus
- Experience implementing monitoring, logging or alerting