We are seeking an experienced site reliability engineer for a direct-hire role in the Austin area. The organization has a collaborative and welcoming work culture and offers competitive compensation along with great benefits and perks.

Job Responsibilities

  • Serve as an Incident Commander and lead incident response
  • Lead post-mortem retrospective meetings as well as create relevant post-incident documentation and communications
  • Implement and administer monitoring and alerting tooling to enable proactive incident response processes
  • Build and configure integrations between systems for monitoring, alerting and reporting system health
  • Track, report and effectively communicate system availability and performance metrics
  • Facilitate the creation of operational runbooks and document common recovery actions
  • Work with teams to define Service Level Objectives
  • Collaborate with software developers and architects to identify improvements that will increase the reliability of our systems

Required Skills & Qualifications

  • Experienced SRE with a strong background acting as Incident Commander and leading post-mortems
  • Scripting experience is highly preferred. The team relies heavily on PowerShell, and Python use is growing
  • Experience with Dynatrace or a similar platform is a huge plus
  • Experience with Ansible and Microsoft Flows in PowerApps is a plus
  • Experience implementing monitoring, logging or alerting