Gwalior
Full-time
Skilled work
Global MNC Tech is seeking a highly skilled and proactive Site Reliability Specialist – High Availability to join our global technology operations team. This role is critical in ensuring the reliability, scalability, and performance of our mission-critical systems that support millions of users worldwide. You will work at the intersection of software engineering and IT operations, applying engineering principles to build resilient, self-healing, and highly available platforms.
As a Site Reliability Specialist, you will be responsible for designing, implementing, and maintaining systems that meet strict uptime and performance targets. You will collaborate closely with development, infrastructure, security, and business teams to continuously improve system reliability while driving automation and operational excellence.
Design, implement, and manage highly available and fault-tolerant systems across cloud and hybrid environments.
Monitor system performance, availability, latency, and capacity using advanced observability tools.
Develop and maintain automated monitoring, alerting, and incident response frameworks.
Lead root cause analysis (RCA) for major incidents and implement long-term corrective actions.
Drive automation of operational tasks using scripting and infrastructure-as-code (IaC) tools.
Participate in on-call rotations and provide support for production systems to ensure 24/7 reliability.
Collaborate with engineering teams to improve system architecture and apply best practices for high availability.
Conduct regular disaster recovery (DR) drills and ensure business continuity plans are up to date.
Optimize system performance, reduce downtime, and improve mean time to recovery (MTTR).
Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
Strong experience in Site Reliability Engineering (SRE), DevOps, or Production Engineering roles.
Hands-on expertise with cloud platforms such as AWS, Azure, or Google Cloud.
Solid understanding of high availability architectures, load balancing, clustering, and failover mechanisms.
Proficiency in scripting and programming languages such as Python, Bash, Go, or Java.
Experience with containerization and orchestration tools (Docker, Kubernetes).
Knowledge of monitoring and observability tools like Prometheus, Grafana, ELK, Datadog, or New Relic.
Familiarity with CI/CD pipelines and automation tools (Jenkins, GitLab CI, Terraform, Ansible).
Strong understanding of networking concepts, security principles, and system performance tuning.
Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
4–8 years of experience in SRE, DevOps, Systems Engineering, or similar roles.
Proven track record of managing high-availability production systems in large-scale environments.
Experience working in fast-paced, high-growth technology organizations is highly desirable.
Full-time position with flexible working hours.
Participation in 24/7 on-call rotation as part of a global support team.
Hybrid or remote work options depending on business requirements and location.
Deep understanding of distributed systems and reliability engineering principles.
Strong analytical and problem-solving skills with attention to detail.
Ability to work under pressure and manage critical incidents effectively.
Excellent communication skills and ability to collaborate with cross-functional teams.
Strong documentation and knowledge-sharing mindset.
Passion for automation, continuous improvement, and operational excellence.
Competitive salary and performance-based incentives.
Comprehensive health and life insurance coverage.
Flexible work arrangements and work-life balance initiatives.
Professional development programs and technical training opportunities.
Access to cutting-edge technologies and global projects.
Paid time off, holidays, and wellness programs.
At Global MNC Tech, you will be part of a forward-thinking organization that values innovation, reliability, and engineering excellence. We offer a collaborative and inclusive culture where your ideas matter and your contributions directly impact global platforms used by millions. This role provides a unique opportunity to work on complex, large-scale systems and grow your career in one of the most in-demand technology domains.
Interested candidates are encouraged to submit their updated resume along with a brief cover letter highlighting their experience in high availability and site reliability engineering. Shortlisted candidates will be contacted for technical and HR interviews.