| Job Description: |
CVS Digital | AI Insight & NBA Engine Data Layer — Job Descriptions
For Recruitment Use Only — Not for Distribution

Core Responsibilities
• Own and operate the raw ingestion pipeline: stream Health Data, Local Temp/News, Benefits Data, and Click-Thru External Files (CSV/Parquet) via Kafka and Dataflow Streaming Ingest into Cloud Storage Raw Buckets and BigQuery ingestion datasets.
• Build and maintain the transformation pipeline: execute ETL/ELT jobs using Dataflow and Dataproc (Apache Spark / DataSpark) to produce cleaned and normalized BigQuery datasets from raw sources.
• Drive normalization to curated data marts: produce denormalized 360-degree data views in BigQuery across Rx, Benefits, behavioral, and clinical signals for downstream Feature Store ingestion.
• Own the Feature Store population pipeline: ingest Behavior Signals, Insurance Coverage Signals, Clinical Signals, Engagement Signals, Rx Signals, and Contextual Features from curated BigQuery data marts into the GCP Feature Store.
• Design and maintain ML training dataset pipelines for NBA and AI Insight models: offline batch paths, online serving feature paths, training/eval splits, and dataset versioning.
• Integrate Adobe Analytics event data as a behavioral signal source, aligning it with clinical and benefits data for multi-source model training.
• Operate and tune Dataproc Spark Jobs (DataSpark) for large-scale feature engineering and model training data preparation.
• Monitor feature freshness, training data drift, and model data quality in partnership with Vertex AI pipelines.
• Implement and monitor data quality checks, SLAs, and alerting across all pipeline stages; implement schema validation and anomaly detection.
• Manage schema evolution, partitioning strategies, and cost optimization for BigQuery tables.
• Design and implement disaster recovery (DR) zones for the data platform: define RTO/RPO targets, configure cross-region BigQuery dataset replication, set up Cloud Storage DR buckets, replicate Feature Store snapshots, and document and test failover runbooks.
• Implement data archival strategies across all tiers: design lifecycle policies for BigQuery table expiration, Cloud Storage object tiering (Nearline/Coldline/Archive), archive historical ML training snapshots, and ensure HIPAA-mandated retention windows are met for PHI-regulated datasets.
• Enforce HIPAA-compliant PHI handling across all pipeline and ML stages: apply PHI data classification, implement field-level encryption and masking for Protected Health Information, apply de-identification techniques (Safe Harbor or Expert Determination) before model training, manage access controls and audit logging per the HIPAA Security Rule, and ensure PHI is never written to unencrypted storage or non-compliant destinations.
• Collaborate with backend domain engineers to define Kafka event schemas that feed ML operations.
• Document pipeline architecture, data contracts, Feature Store definitions, training data lineage, DR runbooks, and archival policies.

Required Qualifications
• 7–9 years of hands-on data engineering or ML data engineering experience in a production GCP environment.
• Strong proficiency in Python, Java, or Node.js for pipeline development, feature engineering scripts, and automation.
• Strong hands-on experience with BigQuery (partitioning, clustering, cost management, complex SQL, ML-optimized table design).
• Proficiency with Apache Kafka for real-time streaming ingestion.
• Experience with Dataflow (Apache Beam) for both streaming and batch pipelines.
• Proficiency with Apache Spark (PySpark or Scala); DataSpark experience a strong plus.
• Solid familiarity with GCP ecosystem: Cloud Storage, Pub/Sub, Dataproc, Cloud Composer/Airflow.
• Experience building ML training pipelines and Feature Stores (GCP Feature Store preferred); understanding of the ML lifecycle including feature engineering, data versioning, and train/eval splits.
• Experience with Vertex AI Pipelines or similar MLOps tooling.
• Demonstrated experience designing disaster recovery zones and failover strategies for cloud data platforms: cross-region replication, RTO/RPO definition, and DR testing.
• Experience with data archival design: BigQuery table lifecycle management, Cloud Storage tiered storage policies, and long-term retention for regulated datasets.
• Hands-on experience handling PHI under HIPAA: field-level encryption and masking, de-identification techniques, audit logging, access control policies, and HIPAA Security Rule compliance for data at rest and in transit.
• Strong SQL and data modeling skills; experience with layered data lake or lakehouse architecture.

Nice to Have
• Experience integrating Adobe Analytics data streams or Adobe Experience Platform.
• Familiarity with Looker or Vertex AI as downstream consumers.
• Knowledge of Click-Thru file formats and external table patterns in BigQuery.
• Experience with GCP CMEK (Customer-Managed Encryption Keys) for PHI dataset protection.
• Familiarity with HIPAA BAA requirements in cloud vendor agreements.
• Familiarity with NIST or HITRUST frameworks as applied to ML data pipelines.
• Background in healthcare data: Rx, clinical, or benefits domain.
|