Requirement ID: 91321
Job Title: Senior Data Engineer — Platform, ML & Analytics
Job Type: Contract
Duration: 6–9 months
Location: Rhode Island/Remote (USA)
Job Description:

Core Responsibilities
• Own and operate the raw ingestion pipeline: stream Health Data, Local Temp/News, Benefits Data, and Click-Thru
External Files (CSV/Parquet) via Kafka and Dataflow Streaming Ingest into Cloud Storage Raw Buckets and BigQuery
ingestion datasets.
• Build and maintain the transformation pipeline: execute ETL/ELT jobs using Dataflow and Dataproc (Apache Spark /
DataSpark) to produce cleaned and normalized BigQuery datasets from raw sources.
• Drive normalization to curated data marts: produce denormalized 360-degree data views in BigQuery across Rx,
Benefits, behavioral, and clinical signals for downstream Feature Store ingestion.
• Own the Feature Store population pipeline: ingest Behavior Signals, Insurance Coverage Signals, Clinical Signals,
Engagement Signals, Rx Signals, and Contextual Features from curated BigQuery data marts into the GCP Feature
Store.
• Design and maintain ML training dataset pipelines for NBA and AI Insight models: offline batch paths, online serving
feature paths, training/eval splits, and dataset versioning.
• Integrate Adobe Analytics event data as a behavioral signal source, aligning it with clinical and benefits data for
multi-source model training.
• Operate and tune Dataproc Spark Jobs (DataSpark) for large-scale feature engineering and model training data
preparation.
• Monitor feature freshness, training data drift, and model data quality in partnership with Vertex AI pipelines.
• Implement and monitor data quality checks, SLAs, and alerting across all pipeline stages; implement schema validation
and anomaly detection.
CVS Digital | AI Insight & NBA Engine Data Layer — Job Descriptions
For Recruitment Use Only — Not for Distribution Page 3
• Manage schema evolution, partitioning strategies, and cost optimization for BigQuery tables.
• Design and implement disaster recovery (DR) zones for the data platform: define RTO/RPO targets, configure
cross-region BigQuery dataset replication, set up Cloud Storage DR buckets, replicate Feature Store snapshots, and
document and test failover runbooks.
• Implement data archival strategies across all tiers: design lifecycle policies for BigQuery table expiration, Cloud Storage
object tiering (Nearline/Coldline/Archive), archive historical ML training snapshots, and ensure HIPAA-mandated
retention windows are met for PHI-regulated datasets.
• Enforce HIPAA-compliant PHI handling across all pipeline and ML stages: apply PHI data classification, implement
field-level encryption and masking for Protected Health Information, apply de-identification techniques (Safe Harbor or
Expert Determination) before model training, manage access controls and audit logging per the HIPAA Security Rule,
and ensure PHI is never written to unencrypted storage or non-compliant destinations.
• Collaborate with backend domain engineers to define Kafka event schemas that feed ML operations.
• Document pipeline architecture, data contracts, Feature Store definitions, training data lineage, DR runbooks, and
archival policies.
 

Required Qualifications
• 7–9 years of hands-on data engineering or ML data engineering experience in a production GCP environment.
• Strong proficiency in Python, Java, or Node.js for pipeline development, feature engineering scripts, and automation.
• Strong hands-on experience with BigQuery (partitioning, clustering, cost management, complex SQL, ML-optimized
table design).
• Proficiency with Apache Kafka for real-time streaming ingestion.
• Experience with Dataflow (Apache Beam) for both streaming and batch pipelines.
• Proficiency with Apache Spark (PySpark or Scala); DataSpark experience a strong plus.
• Solid familiarity with the GCP ecosystem: Cloud Storage, Pub/Sub, Dataproc, Cloud Composer/Airflow.
• Experience building ML training pipelines and Feature Stores (GCP Feature Store preferred); understanding of the ML
lifecycle including feature engineering, data versioning, and train/eval splits.
• Experience with Vertex AI Pipelines or similar MLOps tooling.
• Demonstrated experience designing disaster recovery zones and failover strategies for cloud data platforms:
cross-region replication, RTO/RPO definition, and DR testing.
• Experience with data archival design: BigQuery table lifecycle management, Cloud Storage tiered storage policies, and
long-term retention for regulated datasets.
• Hands-on experience handling PHI under HIPAA: field-level encryption and masking, de-identification techniques, audit
logging, access control policies, and HIPAA Security Rule compliance for data at rest and in transit.
• Strong SQL and data modeling skills; experience with layered data lake or lakehouse architecture.
 

Nice to Have
• Experience integrating Adobe Analytics data streams or Adobe Experience Platform.
• Familiarity with Looker or Vertex AI as downstream consumers.
• Knowledge of Click-Thru file formats and external table patterns in BigQuery.
• Experience with GCP CMEK (Customer-Managed Encryption Keys) for PHI dataset protection.
• Familiarity with HIPAA BAA requirements in cloud vendor agreements.
• Familiarity with NIST or HITRUST frameworks as applied to ML data pipelines.
• Background in healthcare data: Rx, clinical, or benefits domain.

 
