Site Reliability Engineer
physIQ is a transformational leader in applying highly sophisticated technology to solve some of the most pressing problems in healthcare. More specifically, we are forging the frontier of healthcare delivery at the intersection of mobile technology and artificial intelligence. Our team consists of veteran technologists and world-class data scientists, and our solutions set the market standard for scalability and sophistication. Furthermore, we are implementers with a proven track record of transforming an audacious technological vision into mission-critical solutions for our customers.
- Our core values are simple: integrity, passion, and a relentless drive toward solving the impossible.
- We are a team in its purest definition. We all pull on the rope together, in the same direction, with the same intensity.
- Our customers and their patients depend on us to deliver technology that will forever change healthcare. We are literally keeping people out of the hospital. We are changing lives.
In our world, amazing things only happen when people make them happen. If you want to make things happen and do it with a world-class team of visionaries and doers, we encourage you to apply.
Members of the Engineering team at physIQ are highly motivated, engaged, curious, and bright. Guided by Agile/Scrum principles, much of our energy is focused on the continuous improvement of our team culture and development practices. We are dedicated to efficient delivery of value to our users and work closely in small cross-functional teams, engaging regularly with Product and other stakeholders to communicate issues and ensure alignment. We also support knowledge sharing among team members with specific technical skill sets and strive to foster those Communities of Practice so that we are always learning and growing. Innovation and collaboration are at the heart of our values and processes, and we believe that diversity, in all its forms, is the key to discovering new ways to contribute to the betterment of our team, our products, and the people who use them.
We are looking for an analytically minded, experienced Site Reliability Engineer willing to work and innovate with the vigor of a fast-growing business. As a Site Reliability Engineer, you will be responsible for ensuring that deployment and integration tasks are completed successfully and for keeping our cloud platforms running smoothly so that our cloud solutions meet business and platform requirements.
- Maintain the operational availability, scalability, efficiency, monitoring, overall security, and service reliability of the physIQ platform.
- Develop automation to continue to drive proactive detection and remediation of issues.
- Act as the first responder to all alerts and incidents while managing communications between departments and handling incident documentation and postmortems
- Develop processes and identify opportunities for efficiency and automation.
- Manage task assignments to ensure commitments are delivered on time
- Prioritize operational deliverables with the Product Operation team
- Interface with development teams to communicate issues and recommend improvements in their respective applications
- Continuously monitor systems to ensure stability and reliability
- Incrementally automate tasks to streamline standard operating procedures
- Perform patch management and vulnerability remediation for Red Hat, Windows, Kubernetes, and third-party software.
- Continually assess the infrastructure for vulnerabilities and suggest de-risking efforts
- Execute on improvement efforts in information security and privacy
- Maintain production security systems and audits
- Participate in incident management and on-call support rotation
- Be open and flexible to working in a 24x7 environment.
- Foster our policies and procedures in the areas of incident response and monitoring
- Coordinate technical resolution of major incidents
- 4 or more years of hands-on system administration experience with Red Hat Linux and Kubernetes in an internet-facing production environment.
- Production experience supporting Kubernetes, Red Hat VMs, IAM, VPC, buckets, and databases on public cloud (GCP preferred, or AWS or Azure)
- Experience with Kubernetes deployments, stateful sets, config maps, and secrets
- Experience configuring and consuming system and application monitoring (Prometheus, Grafana, Zabbix, New Relic/Datadog/Dynatrace)
- Experience with SOC-2, NIST, or other applicable security frameworks
- Understanding of cloud infrastructure security and privacy tools
- Thorough understanding of networking concepts and Internet protocols (TCP/IP, ssh, HTTP/HTTPS, NAT, firewalls, load balancers)
- Experience performing Kubernetes deployments from Helm charts
- Experience deploying REST API endpoints to API gateways for backend Java services (Kong preferred, or Apigee)
- Hands-on experience with infrastructure as code tools (Terraform preferred, Cloud Deployment Manager, CloudFormation)
- Experience managing Linux VMs using configuration management tools (Salt preferred, or Ansible, Puppet, or Chef)
Nice to have
- Experience with Apache Flink, Apache Kafka, Apache Cassandra, GitLab CI, SonarQube
- Exceptional interpersonal skills including teamwork, facilitation, and negotiation
- Excellent written, verbal, and presentation skills in English
- Ability to understand business context and requirements
- Ability to hit the ground running and provide value immediately
- Willingness to work in a fast-paced environment
- Ability to learn quickly and adapt to changing priorities and requirements
- Effective at driving short-term actions that are consistent with long-term goals