We are seeking an experienced Site Reliability Engineer (SRE) for a direct-hire role in the Austin area. The organization has a collaborative, welcoming work culture and offers competitive compensation along with great benefits and perks.
Responsibilities
* Serve as an Incident Commander and lead incident response
* Lead post-mortem retrospective meetings and create relevant post-incident documentation and communications
* Implement and administer monitoring and alerting tooling to enable proactive incident response processes
* Build and configure integrations between systems for monitoring, alerting and reporting system health
* Track, report and effectively communicate system availability and performance metrics
* Facilitate the creation of operational runbooks and document common recovery actions
* Work with teams to define Service Level Objectives
* Collaborate with software developers and architects to identify improvements that will increase the reliability of our systems
Required Skills
* Experienced SRE with a strong background acting as Incident Commander and leading post-mortems
* Scripting experience is highly preferred. The team relies heavily on PowerShell, and Python is seeing growing use
* Experience with Dynatrace or a similar platform is a huge plus
* Experience with Ansible and Microsoft Flows in PowerApps is a plus
* Experience implementing monitoring, logging or alerting