What You'll Do
- Plan, design, and execute chaos and performance experiments for our enterprise system.
- Maintain and improve reliability of core software systems.
- Prioritize customer satisfaction in all efforts.
- Continuously learn and adapt to new technologies and methodologies.
- Collaborate effectively with stakeholders and other engineers.
- Quickly respond to changes and resolve issues.
- Take accountability for issue resolution and prevention.
- Utilize automation tools to streamline processes and minimize manual intervention.
What You Bring (At least 5-7 years' experience)
- Bachelor's degree in Computer Science, Information Technology, or equivalent education and experience.
- Expertise in application performance monitoring, observability, and proactive alert correlation, including monitoring containers and failure-based alerting.
- Skilled in defining service level objectives, measuring service level indicators, and setting up error budgets.
- Strong understanding of SRE practices: incident response, change/release management, capacity planning, infrastructure automation, elastic environments, chaos engineering, and blameless postmortems.
- Successful in improving CI/CD pipelines and build/release processes.
- Experienced in creating an SRE adoption framework and onboarding procedures.
Technology Stack
- Cloud Computing Platform: OpenShift, any flavour of Kubernetes, Docker
- Monitoring and Logging Tools: Splunk, Thanos, Dynatrace, Prometheus
- Networking Technology: Protocols, Load Balancers, Firewalls
- Programming: Java, Python, YAML
- Code Repos: GitHub
- Infrastructure as Code: Terraform
- Automation Tools: Jenkins, Chef, Puppet