Kevin O'neill
$50/hr or $80,000/yr


Member since Feb 2026

Staff-Track Backend Engineer

Software Engineer
Available for hire
Years of experience: 7+ years
Experience level: Senior

Staff-track backend engineer with 7+ years of contract experience building and scaling distributed systems across healthcare, e-commerce, hospitality, and SaaS. Known for taking ownership of hard reliability and architecture problems, delivering measurable outcomes, and raising the engineering bar on every team. What sets this track record apart is not just technical depth but architectural judgment — knowing when a pattern like CQRS or an outbox solves the actual problem, when a query needs an index versus a data model redesign, and when the right call is to push back rather than ship. Across six engagements, a consistent pattern has emerged: identify the systemic issue behind the immediate pain, make the right architectural decision, and deliver something the team can rely on long after the contract ends.

Skills

No skills.

Languages

No languages.

Employment History

Staff-Track Backend Engineer at Stedi Current 2024 - Now
- Partner transactions were arriving out of order and causing duplicate processing; designed idempotent event-driven pipelines on AWS Lambda + SQS/SNS with exactly-once guards and adaptive back-pressure, handling 7,200+ payer connections without data loss.
- Integration drift between internal teams was causing weekly regressions; led OpenAPI-first contract design for X12/EDI endpoints (837P/I/D, 270/271), wired contract tests into GitHub Actions CI, and automated SDK generation, cutting integration-related incidents by ~70%.
- Eligibility reads were timing out under query spikes; designed single-table DynamoDB access patterns alongside CQRS Postgres read models, bringing p95 read latency to under 10 ms while keeping analytics workloads fully decoupled.
- External payer outages were triggering cascading failures and unnecessary pages; added circuit breakers, exponential-backoff retries, and dead-letter queues to all EDI/API call paths, reducing pager-fatigue incidents by ~60%.
- Service boundaries lacked consistent auth controls, creating HIPAA exposure risk; implemented OAuth2 + JWT with fine-grained RBAC and automated KMS secrets rotation across all services, passing SOC 2 Type II audit with zero findings.
- On-call engineers had no actionable signal during degradations; standardized structured CloudWatch logging, distributed traces, and Prometheus-style metrics with SLO dashboards covering p95/p99 latency and error budgets, reducing mean time to detect (MTTD) by ~45%.
- Critical claim-submission flows had no tested failure playbooks; led threat-modeling and chaos drills (fault injection, payer-outage simulation), producing runbooks that cut mean time to resolve (MTTR) from ~40 min to under 15 min.
- Enterprise customers struggled to onboard complex X12 EDI requirements; collaborated with network-ops and implementation-engineering teams to translate EDI schemas into developer-friendly JSON API contracts, reducing customer integration timelines by ~30%.
Senior Backend Engineer at Talon.One GmbH 2022 - 2024
- Real-time promotion evaluation was breaching SLAs during peak traffic (Black Friday, brand launches); profiled and re-engineered Go hot paths (goroutine pools, connection reuse, rule short-circuiting), reducing p99 evaluation latency by ~55% under 3× normal load.
- Campaign authoring writes were contending with read-heavy evaluation queries and causing timeouts; proposed and implemented CQRS with a DDD-modeled rules engine (custom Go Lisp dialect), decoupling write and read paths and eliminating evaluation timeouts during authoring peaks.
- Coupon double-spend and reward duplication were eroding client trust; designed and shipped a transactional outbox pattern with message-level deduplication for reward issuance, achieving exactly-once semantics and reducing fraud incidents to near zero.
- High-risk rollouts to enterprise clients (Adidas, Sephora, Carlsberg) were too risky to ship all at once; built canary and shadow deployment pipelines in CI/CD, enabling staged validation and reducing production incidents from new releases by ~65%.
- Promotion lookups were causing database saturation during flash-sale peaks; tuned Postgres with composite indexes and partitioning and layered Redis caching for hot campaign reads, sustaining 5× peak load without vertical scaling.
- Alert fatigue was degrading on-call quality; established SLOs and error budgets and wired Prometheus alerts to scoped runbooks, improving the actionable-to-noise alert ratio by ~3× and reducing mean time to acknowledge (MTTA) by ~40%.
- Mid-level engineers were producing inconsistent API designs and under-tested services; ran regular Go code reviews, load-testing workshops, and architecture design sessions across squads, measurably improving p99 latency baselines and test coverage on reviewed services.
- Three major API versions needed to ship without breaking existing enterprise integrations; drove versioned, backward-compatible OpenAPI contract strategy across the promotion engine, enabling zero-breaking-change upgrades and unblocking client adoption in EMEA and APAC.
Senior Backend Engineer at Mews 2021 - 2022
- A tightly coupled .NET monolith was blocking independent deployments across reservations, invoicing, and payments; owned and drove the strangler-fig decomposition with Clean Architecture boundaries on .NET 5→6, enabling teams to ship independently and eliminating cross-domain deploy conflicts.
- Payment and reservation workflows were losing events during downstream failures; redesigned them as async event-driven flows on Azure Service Bus topics, improving fault isolation and reducing lost-event incidents to zero over the final 6 months.
- Schema drift with Mews Marketplace's 500+ partner integrations was causing weekly regression incidents; introduced OpenAPI-driven contracts with automated schema validation in CI, cutting partner integration regressions by ~80%.
- High-read reservation endpoints were overloading Azure SQL under peak hotel check-in periods; added composite indexes, query-plan tuning, and Redis-backed response caching, contributing to a 2× reduction in p95 API response time achieved during the .NET 6 migration.
- Deployments were causing brief downtime that violated hotel-operations SLAs; implemented zero-downtime blue/green slot swaps on Azure App Service with readiness probes and graceful shutdown in Kubernetes, achieving 100% zero-downtime deploys post-rollout.
- Per-tenant authorization was inconsistent across services, creating cross-tenant data exposure risk; unified auth with OAuth2 scope-based access control and per-tenant RBAC, and centralized secrets in Azure Key Vault, remediating all critical findings from an internal security review.
- Core guest journey reliability was invisible until customers reported issues; instrumented end-to-end distributed traces and error-budget dashboards in New Relic for booking, check-in, and payment flows, reducing customer-reported reliability incidents by ~50%.
- Analytics refresh was running on multi-hour batch cycles, making operational reports stale; designed and delivered Azure Databricks ingestion pipelines into the shared data warehouse, cutting analytics update latency from hours to under 15 minutes.
Backend Engineer at Fishbrain AB 2019 - 2020
- GraphQL and REST endpoints for user-generated content were timing out under angler-activity spikes; optimized Rails 6 query layers, added efficient cursor pagination and Pundit-scoped authorization caching, stabilizing p99 latency for 14 M+ user interactions.
- Catch-data aggregation jobs were blocking the Rails process pool and degrading API response times; migrated ingestion pipelines to concurrent Go microservices, improving throughput ~3× and eliminating p99 degradation during aggregation runs.
- Heavy analytics computations were spiking DB CPU during peak hours; designed Postgres composite indexes and offloaded computation to Sidekiq background jobs, keeping real-time API p99 latency flat during data-processing windows.
- Primary DB was receiving all reads and writes equally, causing avoidable load spikes; implemented read-replica routing for reporting endpoints as part of the team's Aurora MySQL migration, reducing primary DB load by ~40% at peak without custom connection-routing code.
- New features were rolling out to all 14 M users simultaneously with no rollback path; added feature-flag hooks and A/B test instrumentation to several endpoints, enabling the team to run controlled rollouts and make data-driven activation decisions.
- Media-serving origin was absorbing high bandwidth cost from angler photo and video uploads; implemented S3 lifecycle policies and CloudFront CDN caching for the media pipeline, reducing origin load by ~55% and cutting egress spend materially.
- The platform team had no proactive alerting on API errors or DB lag, reacting only to user reports; added structured logging and error-rate alerts on key endpoints and DB replica lag metrics, shifting incident detection from user-reported to automated.
- Product had limited visibility into where users dropped off in key funnels; instrumented conversion events and retention signals on several backend flows, enabling product experiments that improved feature activation in subsequent A/B tests.
Full Stack Engineer at Toggl OÜ 2018 - 2019
- The V8 time-entry API was a bottleneck for billing and reporting services due to synchronous coupling; built Go backend services with gRPC for low-latency inter-service communication, reducing reporting query latency by ~40% and enabling independent scaling.
- Analytics and billing reconciliation depended on stale batch exports, causing delayed reporting; implemented PostgreSQL logical replication and CDC pipelines for near-real-time data feeds, cutting analytics staleness from hours to under 2 minutes.
- Frequent, risky deployments to a monolithic GCE setup were limiting release confidence; contributed to the team's GKE migration by containerizing owned services and configuring HPA and PodDisruptionBudgets, helping increase deploy frequency and eliminate service-level downtime during releases.
- Inconsistent error models and missing request validation across Go services were causing integration bugs; authored OpenAPI specs with middleware-level request validation and uniform error response structures for owned services, improving API contract consistency.
- Infrastructure for owned GCP services was manually provisioned and not reproducible; wrote Terraform modules for GKE workloads, Cloud SQL instances, and Cloud Storage buckets, reducing manual setup time and enabling consistent environment creation.
- Hot paths in the high-frequency time-entry write flow were causing latency spikes at scale; used RED/USE metrics and distributed tracing to identify bottlenecks, then optimized the critical write path, reducing average response time by ~35% on targeted endpoints.
- Flaky tests were causing spurious CI failures and slowing review cycles; identified root causes of the most frequent failures in owned test suites and fixed ~60% of them, contributing to a broader cross-team effort that cut the overall flaky-test rate from ~15% to under 2%.
- Reporting dashboards showed stale data minutes after backend updates; built React UI components consuming updated WebSocket-backed API endpoints, reducing the visible data-freshness gap from minutes to seconds for active workspace users.
Software Engineer Intern at Storyblok 2018 - 2018
- Content publishing was failing silently when webhook payloads were malformed, confusing early customers; implemented TypeScript/Node.js validation middleware and error-response standardization for the webhook dispatch flow, reducing unhandled silent failures on assigned publishing endpoints.
- Several GraphQL query fields lacked input validation, causing unpredictable backend errors; added schema-level validation and sanitization for assigned query types under senior guidance, improving error consistency on the content delivery API.
- Content retrieval queries on large workspaces were noticeably slow; identified missing indexes on frequently filtered columns, wrote and tested Postgres migrations, and verified query-plan improvements, reducing p95 response time on the targeted retrieval endpoints.
- The content API had no automated regression safety net for schema changes; wrote Jest unit and contract tests for the features I owned and integrated them into the CI pipeline, catching several schema regressions before they reached staging.
- Local development environments differed from CI, causing 'works on my machine' failures; dockerized owned Node.js services using multi-stage images with consistent base versions and added readiness probes, aligning local and CI build behavior.
- Code style was inconsistently enforced across the Node.js codebase, causing noisy PR reviews; set up ESLint rules and Husky pre-commit hooks for the services I worked on, reducing style-related review comments on my pull requests.
- Handoff of completed features was slow because context was only in Slack threads; wrote concise documentation for each feature I shipped, covering API behavior, edge cases, and rationale, making it easier for the senior engineer reviewing my work to validate and merge faster.

Education

Bachelor of Science (B.S.) in Computer Science at Dublin City University 2014 - 2018