Site Reliability Engineer
Hanoi, VN, 10000; Ho Chi Minh City, VN, 700000
What do we do?
As a pioneer for digital transformation, GFT develops sustainable solutions across new technologies – from cloud engineering and artificial intelligence to blockchain/DLT. With its deep technological expertise, strong partnerships, and comprehensive market know-how, GFT advises the financial and insurance sectors as well as the manufacturing industry. Through the intelligent use of IT solutions, GFT increases productivity and creates added value for clients. Companies gain easy and safe access to scalable IT applications and innovative business models.
Who are we?
Having started in Germany in 1987, GFT Technologies has grown to become a trusted software engineering and consulting specialist for the international financial industry, counting many of the world’s largest and best-known banks as our clients. We are an organization that empowers you not only to explore but also to raise your potential and seek out opportunities that add value. At GFT, diversity, equality, and inclusion are at the core of who we are. Ensuring a diverse and inclusive working environment for all communities is one of the main pillars of our diversity strategy, based on our core values and culture. We have been certified for 2022/23 as a ‘Great Place to Work’ in the APAC region. So, if you want the opportunity to work with an outstanding and progressive organization, this position could be right for you.
Role Summary
As a Site Reliability Engineer (SRE), you will play a critical role in ensuring the reliability, availability, and performance of our systems and services. You will work closely with development and operations teams to build and maintain observable, scalable, and reliable infrastructure on AWS, using Kubernetes for orchestration and Python for automation. Proficiency in resilience testing and capacity management is also essential for this role.
Key Responsibilities
- Develop and maintain automation scripts and tools using Python to improve operational efficiency.
- Conduct resilience testing to ensure system reliability under varying conditions.
- Perform capacity planning and management to ensure systems can handle growth and peak demand.
- Monitor system performance and reliability, and respond to incidents to minimize downtime.
- Collaborate with development teams to establish best practices for observing, deploying, and maintaining applications.
- Implement and manage monitoring and alerting solutions to proactively identify and resolve issues.
- Participate in on-call rotations to provide 24/7 support for critical systems.
Required Skills and Qualifications
- Strong experience with AWS, including services such as EC2, EKS, Lambda, S3, RDS, and VPC.
- Proficiency in managing and orchestrating containerized applications using Kubernetes.
- Solid scripting skills in Python for automation and tool development.
- Experience with infrastructure as code (IaC) tools like Terraform or CloudFormation.
- Experience with resilience testing methodologies and tools.
- Proven ability to perform capacity planning and management.
- Familiarity with logging and monitoring tools such as Sumo Logic, Prometheus, Grafana, or similar.
- Excellent problem-solving skills and the ability to troubleshoot complex issues in a distributed system.
- Strong communication and collaboration skills, with the ability to work effectively in a team environment, and to stay calm and composed in high-pressure situations.
- Ability to work with application teams to guide the design of SLOs.
Preferred Skills
- Experience with OpenTelemetry, Prometheus, and Sumo Logic for observability and monitoring.
- Familiarity with incident management tools such as PagerDuty.
- Experience with Jira and Confluence for project management and documentation.
- Knowledge of CI/CD pipelines and experience with tools such as Harness.
- Understanding of modern development frameworks and languages, including Kotlin, Spring Boot, Kafka, and Postgres.
What can we offer you?
- Competitive salary
- 13th-month salary guarantee
- Performance bonus
- Professional English course for employees
- Premium health insurance