System Reliability Engineer

Date: May 23, 2021

Location: Heredia, H, CR, 40101

Company: GFT Technologies SE

Our Ideal Candidate:


Is ready to join a diverse, highly technical, and dynamic team of engineers that builds and delivers critical internal infrastructure services. Our engineers ensure that our services meet the needs of our customers with the desired levels of reliability, performance, and availability by developing continuous improvements, tools, and automation. We're looking for people who are passionate about technology and bring a business mindset to their everyday work, who have a strong sense of ownership and accountability and the ability to influence those around them, and who combine intellectual curiosity with a passion for performance debugging and benchmarking.


The Role and Responsibilities:


Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to our engineering principles. SRE is also an engineering approach to building and running production systems: we engineer solutions to operational problems. As SREs are responsible for overall system operation, we use a breadth of tools and approaches to solve a broad set of problems. Our practices include limiting time spent on operational work, blameless postmortems, and the proactive identification and prevention of potential outages.


• Deploy and maintain production environments that require 24/7 availability

• Ensure our services meet reliability, stability, performance, security, and availability requirements

• Develop performant, scalable, and maintainable software solutions and tools for internal use

• Perform proactive troubleshooting & performance analysis of Snap services and cloud environments

• Diagnose and help build robust, self-healing features and automation that reduce operational effort and improve service uptime

• Participate in programs to deploy prerelease products and code in production and provide direct feedback to product development teams

• Interact with other product & development teams, gather requirements, and perform analysis to determine appropriate solutions

• Actively engage in design reviews, code reviews, and operational reviews

• Develop strategic designs and requirements for small systems or modules of large systems

• Effectively analyze the root causes of issues and contribute to unfamiliar code written by team members

• Participate in the estimation process, use case specifications, reviews of test plans and test cases, requirements, and project planning

• Introduce tools and automate repetitive processes to reduce operational burden


Skills Required


• Degree in systems engineering or related field

• 4-5 years of experience in DevOps/SRE

• Experience in benchmarking and performance analysis of parallel applications & workflows

• Broad experience in programming languages: Java, Python, Node.js

• Strong experience in public cloud platforms (AWS and Azure)

• Proficient in containerization (Kubernetes, Docker)

• Networking experience: load balancing, network security, standard network protocols (HTTP/S, DNS, etc.)

• API lifecycle management and message bus technologies experience

• Operating system experience: Linux, Windows

• Experience in monitoring and data analytics tools such as Prometheus, Grafana, Dynatrace, OpenTelemetry, ELK

• Experience in project coordination, SLA adherence, and the hands-on, end-to-end software development lifecycle using SAFe, Agile, and DevOps practices, including supporting data-centric applications in production

• Proficiency in CI/CD tooling such as Jenkins, GitLab, and others

• Experience deploying infrastructure as code (Terraform