Machinify is the leading provider of AI-powered software products that transform healthcare claims and payment operations.
The company addresses more than $200B in claims mispayments across the healthcare industry, a problem that creates waste and frustration for patients, providers, and payers.
As a Data Engineer, you will transform raw external data into powerful, trusted datasets that drive payment, product, and operational decisions.
You will collaborate with product managers, data scientists, subject matter experts, engineers, and customer teams to build, scale, and refine production pipelines, ensuring data is accurate, observable, and actionable.
Your role will involve onboarding new customers and integrating their raw data into internal models.
The pipelines you create will power the company’s ML models, dashboards, and core product experiences.
Responsibilities:
You will design and implement robust, production-grade pipelines using Python, Spark SQL, and Airflow to process high-volume file-based datasets (CSV, Parquet, JSON); a minimal sketch of such a pipeline follows this section.
You will lead efforts to canonicalize raw healthcare data into internal models and own the full lifecycle of core pipelines from file ingestion to validated, queryable datasets.
You will build resilient transformation logic with data quality checks, validation layers, and observability.
You will refactor and scale existing pipelines, tune Spark jobs, and implement schema enforcement aligned with internal data standards.
You will monitor pipeline health, participate in on-call rotations, and debug production data flow issues.
You will contribute to the evolution of the data platform and build streaming pipelines where needed to support near-real-time data needs.
You will help develop and champion internal best practices around pipeline development and data modeling.
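As a loose illustration of the kind of pipeline work described above, below is a minimal sketch, assuming Airflow 2.x's TaskFlow API and PySpark, of a DAG that canonicalizes a raw claims extract with Spark SQL and applies a basic data-quality check before publishing. All paths, table names, and column names here are hypothetical, and a production pipeline would more likely submit the Spark job to a cluster than run it inline on the worker.

```python
# Minimal sketch only: paths, schema, and column names are hypothetical.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def canonicalize_claims():
    @task
    def transform_and_validate(raw_path: str = "s3://example-bucket/raw/claims/*.parquet"):
        # Import inside the task so the DAG file stays cheap to parse.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("canonicalize_claims").getOrCreate()

        # Ingest the raw file-based dataset and expose it to Spark SQL.
        spark.read.parquet(raw_path).createOrReplaceTempView("raw_claims")

        # Map raw columns into a (hypothetical) canonical claims model.
        canonical = spark.sql(
            """
            SELECT
                claim_number                       AS claim_id,
                CAST(svc_date AS DATE)             AS service_date,
                CAST(billed_amt AS DECIMAL(12, 2)) AS billed_amount
            FROM raw_claims
            WHERE claim_number IS NOT NULL
            """
        )

        # Basic data-quality checks: fail the task (and surface an alert)
        # if the output is empty or required fields did not parse.
        if canonical.count() == 0:
            raise ValueError("No canonical claims produced; check the upstream feed")
        bad_rows = canonical.filter("claim_id IS NULL OR billed_amount IS NULL").count()
        if bad_rows > 0:
            raise ValueError(f"{bad_rows} rows failed required-field validation")

        # Publish a validated, queryable dataset for downstream models and dashboards.
        canonical.write.mode("overwrite").parquet("s3://example-bucket/canonical/claims/")

    transform_and_validate()


canonicalize_claims()
```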
Requirements:
You must have 4+ years of experience as a Data Engineer (or equivalent), building production-grade pipelines.
Strong expertise in Python, Spark SQL, and Airflow is required.
You should have experience processing large-scale file-based datasets (CSV, Parquet, JSON, etc.) in production environments.
Experience mapping and standardizing raw external data into canonical models is necessary; a brief schema-enforcement sketch follows this list.
Familiarity with AWS (or any cloud) is required, including file storage and distributed compute concepts.
You should have experience onboarding new customers and integrating external customer data with non-standard formats.
The ability to work across teams, manage priorities, and own complex data workflows with minimal supervision is essential.
Strong written and verbal communication skills are necessary to explain technical concepts to non-engineering partners.
You should be comfortable designing pipelines from scratch and improving existing pipelines.
Experience working with large-scale or messy datasets (healthcare, financial, logs, etc.) is required.
Experience building streaming pipelines with tools such as Kafka or SQS, or a willingness to learn them, is preferred.
Familiarity with healthcare data (837, 835, EHR, UB-04, claims normalization) is a plus.
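To make the canonical-mapping and schema-enforcement requirements concrete, here is a small sketch of how a non-standard customer CSV feed might be read with an explicit, enforced schema, routing malformed rows to a quarantine location rather than letting them pass silently. The customer name, paths, and schema are invented for illustration.

```python
# Illustrative only: the customer feed, paths, and schema are invented.
from pyspark.sql import SparkSession
from pyspark.sql.types import (DateType, DecimalType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.appName("onboard_customer_feed").getOrCreate()

# Enforce an explicit schema rather than relying on inference, so rows that
# do not match the internal standard are captured instead of silently coerced.
claims_schema = StructType([
    StructField("claim_id", StringType(), nullable=False),
    StructField("member_id", StringType(), nullable=True),
    StructField("service_date", DateType(), nullable=True),
    StructField("billed_amount", DecimalType(12, 2), nullable=True),
    StructField("_corrupt_record", StringType(), nullable=True),
])

raw = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")                   # keep malformed rows for inspection
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema(claims_schema)
    .csv("s3://example-bucket/customer-feeds/acme/2024-01/*.csv")
)
raw.cache()  # read once, reuse for both the quarantine and canonical writes

good = raw.filter("_corrupt_record IS NULL").drop("_corrupt_record")
bad = raw.filter("_corrupt_record IS NOT NULL")

# Quarantine malformed rows so onboarding issues stay observable and debuggable.
bad.write.mode("overwrite").json("s3://example-bucket/quarantine/acme/claims/2024-01/")
good.write.mode("overwrite").parquet("s3://example-bucket/canonical/acme/claims/")
```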
Benefits:
You will have the opportunity to make a real impact, as your pipelines will directly support decision-making and claims payment outcomes from day one.
The role offers high visibility, allowing you to partner with ML, Product, Analytics, Platform, Operations, and Customer teams on critical data initiatives.
You will have end-to-end ownership of the core datasets that power the platform, driving their full lifecycle.
Your work will drive successful customer onboarding and data integration, giving it direct customer-facing impact.