SRE Observability Lead - Remote Kubernetes Clusters

Location

Singapore

Job Type

Full-time

Experience

Skilled work

Job Description

Job Summary

Startup Inno is seeking an experienced SRE Observability Lead to drive the monitoring, reliability, and performance of our cloud-native infrastructure. This role will focus on building end-to-end observability across Kubernetes clusters, ensuring optimal system performance, scalability, and reliability. You will partner closely with engineering and DevOps teams to design, implement, and maintain robust monitoring, logging, and alerting systems, enabling proactive incident detection and response. This is an excellent opportunity for a proactive, technically skilled leader to shape the observability culture in a fast-paced, innovative startup environment.


Key Responsibilities

  • Lead the design, implementation, and management of observability solutions across multiple Kubernetes clusters.

  • Develop and maintain monitoring, logging, tracing, and alerting systems to ensure service reliability.

  • Collaborate with SRE, DevOps, and engineering teams to define SLIs, SLOs, and error budgets.

  • Proactively identify potential system performance bottlenecks and recommend scalable solutions.

  • Implement automated tools for system health checks, incident response, and postmortem analysis.

  • Mentor and guide junior SRE and engineering team members on observability best practices.

  • Work closely with product and engineering teams to provide operational insights that inform architecture and development decisions.

  • Continuously evaluate and recommend new technologies and tools to enhance observability capabilities.


Required Skills and Qualifications

  • Strong expertise in Kubernetes architecture and operations.

  • Proven experience with observability tools: Prometheus, Grafana, Jaeger, OpenTelemetry, ELK stack, or equivalent.

  • Solid understanding of cloud platforms (AWS, GCP, or Azure) and container orchestration.

  • Proficiency in scripting and automation (Python, Go, Bash, or similar).

  • Experience in monitoring distributed systems and microservices architectures.

  • Strong incident management and troubleshooting skills in complex production environments.

  • Excellent collaboration, leadership, and communication skills.


Experience

  • 5–7 years of experience in Site Reliability Engineering, DevOps, or cloud infrastructure roles.

  • At least 3 years in observability-focused roles with hands-on experience in Kubernetes environments.

  • Experience leading or mentoring teams in observability, monitoring, and reliability practices.


Working Hours

  • Full-time, remote position.

  • Flexible hours, with occasional on-call duty for incident management.

  • Overlap with global engineering teams may be required for collaboration.


Knowledge, Skills, and Abilities

  • Deep understanding of distributed systems, containerized workloads, and cloud-native architectures.

  • Ability to analyze metrics, logs, and traces to identify patterns, anomalies, and performance issues.

  • Strong problem-solving skills with the ability to make quick, data-driven decisions.

  • Skilled in designing scalable, highly available systems with a focus on operational excellence.

  • Exceptional interpersonal skills to communicate complex technical information to non-technical stakeholders.


Benefits

  • Competitive salary with performance-based bonuses.

  • Fully remote work with flexible schedules.

  • Professional development and training opportunities.

  • Access to cutting-edge observability and cloud-native tools.

  • Health, wellness, and insurance packages (where applicable by region).

  • Collaborative and innovative startup culture with opportunities for impact.


Why Join Startup Inno?

  • Be part of a fast-growing, innovative startup shaping the future of cloud-native solutions.

  • Work with a highly skilled and collaborative team passionate about technology and innovation.

  • Take ownership of critical reliability and observability initiatives that directly impact product performance.

  • Gain access to continuous learning opportunities and career growth in a cutting-edge tech environment.


How to Apply

To apply, send us your resume and a cover letter highlighting your relevant observability and Kubernetes experience. Please include examples of prior work with monitoring systems, distributed architectures, or Kubernetes reliability projects.
