Munich
Full-time
Skilled work
Startup Inno is seeking a highly skilled SRE Observability Lead to join our dynamic engineering team. In this role, you will lead the design, implementation, and optimization of our observability frameworks for Kubernetes clusters and cloud-native environments. You will be responsible for ensuring high reliability, performance, and availability of our services while providing actionable insights through advanced monitoring, logging, and alerting systems. This is a remote role offering the flexibility to work from anywhere while contributing to a fast-growing, innovative startup.
Lead the design and implementation of observability solutions across Kubernetes clusters, microservices, and cloud-native applications.
Develop and maintain metrics, logging, tracing, and alerting systems to ensure optimal system performance and reliability.
Define and implement SLOs, SLIs, and error budgets to measure system health and reliability.
Collaborate with development and DevOps teams to embed observability best practices into CI/CD pipelines.
Identify system bottlenecks, troubleshoot incidents, and implement long-term solutions to improve system resilience.
Mentor and guide junior SREs and engineers on observability practices and reliability engineering.
Evaluate and integrate new monitoring, logging, and tracing tools to enhance observability capabilities.
Contribute to disaster recovery planning, incident response procedures, and postmortem analyses.
Strong experience with Kubernetes, including cluster administration and deployment strategies.
Hands-on expertise with observability tools such as Prometheus, Grafana, Jaeger, OpenTelemetry, the ELK stack, or equivalents.
Proficiency in scripting and automation using Python, Go, Bash, or similar languages.
Deep understanding of distributed systems, cloud architecture (AWS, GCP, or Azure), and microservices patterns.
Strong knowledge of CI/CD pipelines and modern DevOps practices.
Experience implementing SLOs, SLIs, and error budgets for large-scale production systems.
Excellent problem-solving, analytical, and troubleshooting skills.
At least 5 years of experience in Site Reliability Engineering, DevOps, or related roles.
Proven track record of leading observability initiatives in Kubernetes or cloud-native environments.
Experience working in high-growth, fast-paced startup or enterprise environments is a plus.
Fully remote with flexible working hours.
Working hours are expected to overlap with UTC ±3 hours for team meetings and, as needed, on-call rotations.
Strong communication skills for cross-functional collaboration with engineering and product teams.
Ability to drive observability strategies and influence technical decisions across the organization.
Analytical mindset with the ability to translate metrics into actionable insights.
Self-motivated, detail-oriented, and proactive in identifying potential reliability issues.
Competitive salary and performance-based bonuses.
Flexible remote work with a fully distributed team.
Health, dental, and vision insurance (where applicable).
Professional development allowance for certifications, conferences, and courses.
Generous paid time off and parental leave.
Cutting-edge technology environment with opportunities to shape observability practices at scale.
At Startup Inno, you will be part of a forward-thinking team driving innovation in cloud-native technologies. We foster a culture of collaboration, continuous learning, and ownership. You will have the opportunity to implement best-in-class observability practices, influence key technical decisions, and make a tangible impact on the reliability and scalability of our products.
Interested candidates are invited to submit a resume and a cover letter highlighting relevant experience. Please include "SRE Observability Lead – Remote" in the subject line.