As a Site Reliability Engineer, you will be a core contributor on the Juniper Cloud platform team. Your core responsibility is to provide operational support for cloud-based SaaS applications running on cloud infrastructure, with an emphasis on deployment, scalability, and reliability.
We are looking for a highly motivated, self-driven, and dedicated Site Reliability Engineer with the following hands-on experience and skills:
Experience building and running large-scale, fault-tolerant production cloud systems on AWS and/or GCP.
Experience coding infrastructure automation with Terraform, Packer, and Ansible.
Experience with Linux/Unix operating system internals, file systems, system tuning, administration, and networking.
Deep experience in microservice technologies, container orchestration, and continuous deployment (Kubernetes, Docker, Helm, GitOps with Flux CD/Argo CD).
Experience designing, building, and maintaining production services, and troubleshooting large-scale distributed systems.
Experience with technologies like Apache Kafka, Redis/Valkey, Postgres, Elasticsearch.
Experience with observability tools and methodology (monitoring, logging, tracing, SLOs/SLIs) for detecting and diagnosing issues before they cause customer impact or performance degradation.
Strong software development skills in Python.
A drive to deliver quickly and effectively.
Strong problem-solving and debugging skills with a high sense of ownership.
Responsibilities
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Support services before they go live, starting from the planning phase, through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
Provide technical leadership and guidance to other team members on managing the availability and performance of mission-critical services, on building automation to prevent problem recurrence, and on building automated responses for non-exceptional service conditions.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
Plan capacity for the growth of cloud infrastructure.
Improve operational processes such as deployments and upgrades.
Manage execution of project priorities, deadlines, and deliverables.
Be on an on-call rotation to respond to incidents that impact platform availability.
Use your on-call shift to prevent incidents from ever happening.
Participate in incident response, including conducting post-mortems and implementing lessons learned, to enhance system reliability.
Preferred Qualifications
8+ years of engineering or systems experience.
Experience leveraging cloud architecture, applying site reliability principles, and/or demonstrating sensitivity to operational concerns.
Strong understanding of network design and architecture.
Experience scaling and managing distributed systems.
Significant experience with monitoring and observability platforms.
Demonstrated ability to debug, fix, and optimize code.
Troubleshooting skills across network, application, and distributed services layers.
The ability to learn quickly and adapt to new technologies is essential.
Excellent communication skills, both verbal and written.