Engineer
The Judge Group
Irving, TX, USA
Principal Engineer, Platform Engineering & Production Support
Locations: Irving, TX; Charlotte, NC; Minneapolis, MN
Work Model: Hybrid (3 days/week in-office)
Job Type: 12-month contract (potential extension or conversion)
Start Date: As soon as possible
About the Role
We are seeking a Principal Engineer to join our Platform Engineering team, focused on production support, reliability, and scalability of critical applications. This role is highly hands-on and requires deep expertise in DevOps and Site Reliability Engineering (SRE), with a strong focus on observability, incident management, and cloud-native environments.
You will work in a fast-paced, production-critical environment, ensuring application health, preventing outages, and improving system reliability through automation and modern engineering practices.
What You’ll Do
- Lead production support for a portfolio of 20+ applications, ensuring high availability and performance
- Design and implement monitoring, alerting, and observability solutions using tools such as Splunk, Grafana, AppDynamics, and Prometheus
- Proactively identify risks through gap analysis, anomaly detection, and predictive alerting
- Troubleshoot complex issues in distributed microservices architectures and reduce mean time to resolution (MTTR)
- Drive adoption of SRE best practices, including automation, AIOps, and intelligent monitoring
- Support and scale applications running on OpenShift and cloud-native platforms
- Partner with development teams to ensure production readiness during release cycles
- Participate in an on-call rotation and respond to incidents with urgency and ownership
- Mentor engineers and elevate team capabilities in DevOps and platform engineering
- Serve as a technical leader managing competing priorities in a high-impact environment
Minimum Qualifications
- 7+ years of engineering experience or equivalent practical experience
- 10+ years of experience in platform engineering and production support
- 5+ years of experience with:
  - Red Hat Linux, OpenShift, Kubernetes
  - Java, Spring Boot, Python, microservices architectures
  - Observability tools (Grafana, Splunk, AppDynamics)
  - Incident management and alerting systems (AIOps, ServiceNow, BigPanda)
- 4+ years of experience with:
  - Distributed systems and cloud-native architectures
  - React.js, Kafka, Apache, and relational databases
Preferred Qualifications
- Experience in financial services or highly regulated industries
- Background in software development (especially Java-based ecosystems)
- Strong ability to operate across SRE, DevOps, and production support roles
- Demonstrated ability to manage multiple priorities in high-pressure environments
- Experience with proactive monitoring, automation, and reliability engineering practices
Work Environment & Schedule
- Hybrid model with three in-office days per week (8 hours per in-office day)
- Standard 40-hour workweek
- Typical working hours between 8:00 AM and 8:00 PM
- Monthly on-call rotation (with offshore support; minimal extended hours expected)
About the Team
The Platform Engineering team focuses on stabilizing, scaling, and operating applications post-deployment. This is an application-centric role (not traditional infrastructure support), emphasizing reliability, performance, and operational excellence in cloud environments.