Platform • SRE • Automation • AI

Application Engineer (Platform / Product SRE)

Application Engineer (Platform/Product SRE) with 13+ years of experience improving reliability, performance and operational efficiency across large-scale environments. Strong background in cloud platforms (AWS, Kubernetes, OpenShift), infrastructure as code, observability and automation. Completed postgraduate studies in AI & Machine Learning (Texas), providing a solid foundation in applied ML and emerging AI-driven automation patterns. Currently advancing skills in Agentic AI approaches and exploring how they can be safely incorporated into SRE workflows for intelligent alerting, incident assistance and self-healing capabilities.

AWS • Kubernetes • OpenShift
Terraform • Ansible • CI/CD
Observability • SLOs • DR
AI & ML • Agentic AI (learning)

Quick snapshot

📍 Based in Reading, UK
🏢 Application Engineer (Platform / Product SRE)
☁️ AWS, OpenShift, Kubernetes, Azure DevOps
📊 Observability, SLOs & reliability
🧠 PG in AI & ML (Texas) • Agentic AI learner

About

I'm an Application Engineer working in the Platform/Product SRE space, with over 13 years of experience improving reliability, performance and operational efficiency across large-scale environments. My work spans cloud migrations, Kubernetes and OpenShift platforms, observability, DR automation and SRE best practices.

I enjoy building tools and frameworks that help product teams ship safely and operate confidently: from deployment inspectors and automated dashboards to DR utilities and chaos experiments. I like problems where reliability, automation and clean engineering meet.

With a postgraduate background in AI & Machine Learning from Texas, I'm now focused on how **Agentic AI** patterns can be applied responsibly within SRE teams—for example, in intelligent alert enrichment, guided incident response and self-healing workflows, always with reliability and safety as the first concern.

Skills

Cloud & Platform

• AWS (EC2, RDS, EKS, VPC, CloudWatch)
• OpenShift / Kubernetes
• Azure DevOps (pipelines, repos)
• Terraform, Ansible, Helm

SRE & Observability

• SLIs • SLOs • Error Budgets
• Journey-based monitoring
• Datadog, ELK / Kibana, Grafana, Prometheus
• Incident response, RCA, resilience testing

Automation & CI/CD

• Jenkins, GitHub/GitLab CI, Azure DevOps
• Python, Shell, Groovy
• Deployment automation & platform tooling
• DR runbooks and automation

Foundations & AI Focus

• Linux systems, networking fundamentals
• ITIL, Agile ways of working
• Architecture & design reviews, mentoring
• PG in AI & ML (Texas) • Agentic AI (learning & exploration)

Experience

Discover Financial Services UK

Application Engineer (Platform / SRE) • Nov 2021 – Present • Reading, UK

• Built Postgres Analyzer automation to improve cost visibility and optimisation across AWS RDS fleets.
• Developed OCP Deployment Inspector to help product teams validate deployments against platform best practices for resiliency, probes, resources and security.
• Developed a Disaster Recovery automation framework (traffic flip, validation and reporting), significantly reducing manual effort and improving repeatability.
• Automated observability dashboards in Kibana for new and existing microservices, standardising logging and reducing onboarding time.
• Implemented chaos experiments using Chaos Toolkit to validate failover and error-handling behaviour for workloads on the platform.
• Designed a zero-downtime PostgreSQL upgrade process that kept replication running and workloads available throughout the upgrade.
• Supported migrations from Pivotal Cloud Foundry (PCF) to OpenShift Container Platform (OCP), modernising application deployments, environments and monitoring.
• Defined SLIs/KPIs aligned with SLOs and contributed to journey-based monitoring for customer-critical flows.
• Served as an on-call engineer in the RRT, leading incident response and driving RCAs to reduce MTTR and prevent repeat issues.
• Applying postgraduate AI/ML knowledge to evaluate the feasibility of, and guardrails for, emerging AI-enabled automation opportunities within SRE.
• Actively learning and experimenting with Agentic AI patterns to identify workflows where intelligent orchestration and operational decision-support can reduce manual toil and improve reliability signals.

Vodafone UK

Senior Site Reliability Engineer / Site Reliability Engineer • Sep 2016 – Nov 2021

• Executed large-scale migrations from on-prem infrastructure to AWS using Terraform and automated CI/CD pipelines.
• Deployed and managed Kubernetes clusters (EKS and on-prem), automating cluster operations with Ansible.
• Built observability dashboards using Prometheus, Splunk, AppDynamics and Datadog to align infra and app metrics.
• Implemented Azure DevOps pipelines for continuous delivery to AWS workloads.
• Performed capacity analysis, availability reviews and resilience improvements across environments.
• Handled incident response, root-cause analysis and post-incident automation to improve service availability.

Early Career — DevOps Consultant / Middleware Specialist

2012 – 2016

• Delivered CI/CD transformations using Jenkins, Maven and Git to reduce manual deployment effort.
• Automated Linux server provisioning and configuration using Kickstart, Chef and Ansible.
• Installed and supported enterprise middleware (WebLogic, WebSphere, Apache) in production environments.
• Provided proactive troubleshooting and monitoring to maintain platform stability and performance.

Selected Work & Tools

OCP Deployment Inspector

Platform tool that inspects OpenShift namespaces and deployments against best practices for resiliency, probes, resources and image policies, giving teams a clear view of deployment health.

Python • OpenShift • Kubernetes • CI/CD

Postgres Analyzer

Automation to review Postgres instances and configuration across regions, enabling cost optimisation and operational visibility for RDS fleets.

Python • AWS RDS • Reporting

DR Automation Framework

Utility to orchestrate DR activities including traffic flip, validation checks and reporting, reducing manual effort and improving confidence in failover plans.

AWS • Scripting • Runbooks • Automation

Observability Dashboard Factory

Automated creation of Kibana dashboards for services, helping teams get consistent logging views with minimal setup effort.

Kibana / ELK • Automation

Contact

I'm open to conversations around platform engineering, SRE, observability, automation and AI-assisted operations.

📧 bhargav.sutapalli@gmail.com

📍 Reading, United Kingdom