STAR ETL Master

Description

The steps below describe how to run both the STAR ETL Main process (aka STAR Quarterly Refresh) and the STAR COVID-19 mini-CDM. Aside from a few configuration differences, the processing steps are the same for both.

Technical Info

Server Names

  • Unix Development:  plt-nosql01.dhe.duke.edu
  • Unix Production:  plp-nosql01.dhe.duke.edu
  • Windows (both Dev and Prod): vwp-pmndmc.dhe.duke.edu

Prerequisites

  • Must have an Airflow user account for the particular server. Contact the ACE Data Engineering team to request access.
  • Must have RDP access to vwp-pmndmc.dhe.duke.edu. Create a ServiceNow ticket to request access.

Processing Steps

  1.  For the quarterly refresh only, load the latest RxNorm data used by the medication subject area processing.
    1. Log in via PuTTY (or another SSH client) to the appropriate Unix server (plt-nosql01 for development, plp-nosql01 for production).
    2. Change directory to /dwhome/airflow_ci/master/dags/CDRN_ETL/RxNormLoad/
    3. Run RxNormLoad.sh (a sketch of the session is shown below).
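    A minimal sketch of the shell session for this step; the nohup redirection and log file name are illustrative, not part of the documented procedure:

      cd /dwhome/airflow_ci/master/dags/CDRN_ETL/RxNormLoad/
      nohup ./RxNormLoad.sh > rxnorm_load.log 2>&1 &    # keep the load running if the SSH session drops
      tail -f rxnorm_load.log                           # watch progress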
  2. Log in to the appropriate Airflow server via a web browser
    1. Development: http://plt-nosql01.dhe.duke.edu:8083/admin/
    2. Production: http://plp-nosql01.dhe.duke.edu:8083/admin/
  3. Edit the Airflow variable CDRN_ETL_Options_STAR
    1. Click Admin→Variables at the top of the Airflow web page.
    2. Click the Edit icon (pencil icon) for the entry CDRN_ETL_Options_STAR
  4. Update the following entries in the Airflow variable CDRN_ETL_Options_STAR (an example value follows this list):
    1. Set loadToDt to the previous day for the mini-CDM refresh, or the last day of the previous month for the quarterly refresh, in YYYY-MM-DD format.
    2. Set network to CVD19 for the mini-CDM refresh or STAR for the quarterly refresh.
    3. Set prevPrefix to CVDNEW_ for the mini-CDM refresh, or the V##_ prefix used for the last STAR refresh.
    4. Set newPrefix to CVDNEW_ for the mini-CDM refresh, or the next available V##_ prefix for the current STAR refresh.
    5. Set tempPrefix to ETL_
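
    For illustration only, and assuming the variable is stored as a JSON dictionary (as the entry names above suggest), the value for a quarterly refresh might look like the example below. The dates and V##_ prefixes are made-up; leave any other entries already present in the variable unchanged.

      {
        "loadToDt": "2024-03-31",
        "network": "STAR",
        "prevPrefix": "V12_",
        "newPrefix": "V13_",
        "tempPrefix": "ETL_"
      }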

  5. Start DAG process
    1. Click DAGs at the top of the Airflow web page.
    2. Scroll down the DAG list to the entry for STAR_ETL_Master.
    3. On the right side of the row there are several icons. Click the Play button (Trigger DAG).
    4. Click OK on the confirmation popup window. This will start the DAG process.
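    If the Airflow command-line interface is available on the server, the same DAG can also be triggered from a shell session; this is a sketch assuming the Airflow 1.x CLI implied by the /admin/ URLs above:

      airflow trigger_dag STAR_ETL_Master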
  6. Monitor DAG process
    1. Click on the link for the DAG STAR_ETL_Master from the main Airflow dashboard. This will open the DAG details page.
    2. Click either Graph View or Tree View to see the progress status of the DAG.
      1. Graph View - Displays a graph of all tasks being run and the status of each task (Running, Finished, Failed, etc.). Task Duration Time is not displayed correctly when hovering over a task.
      2. Tree View - Displays a tree of a set number of recent runs, including the current run. Task Duration Time is displayed correctly here when hovering over a task.
    3. Typical run time for the entire process is 10-12 hours.
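    As an alternative to the web views, the state of a run can also be checked from a shell session if the Airflow 1.x CLI is available; the execution date below is a placeholder:

      airflow dag_state STAR_ETL_Master 2024-03-31T00:00:00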
  7. For quarterly refresh only, generate STAR hash token data.
    1. Log in via PuTTY (or another SSH client) to the same server.
    2. Change directory to /dwhome/airflow_ci/master/dags/CDRN_ETL/Datavant/
    3. Run hash_token_load_master.sh STAR (a sketch of the session is shown below).
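    A minimal sketch of the shell session for this step; the nohup redirection and log file name are illustrative:

      cd /dwhome/airflow_ci/master/dags/CDRN_ETL/Datavant/
      nohup ./hash_token_load_master.sh STAR > hash_token_star.log 2>&1 &
      tail -f hash_token_star.log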
  8. For quarterly refresh only, execute the post-ETL process
    1. Click DAGs at the top of the Airflow web page.
    2. Scroll down the DAG list to the entry for STAR_ETL_PostOp and trigger it with the Play button (Trigger DAG), as in step 5.
    3. Run the following SQL to backfill any hash tokens missed by the previous step:
      1. INSERT INTO HSC_PCOR.<<V##>>_HASH_TOKEN
         SELECT PATID, 'XXX - S00000', 'XXX - S00000', 'XXX - S00000', 'XXX - S00000', 'XXX - S00000', 'XXX - S00000', 'Backfill'
         FROM HSC_PCOR.<<V##>>_DEMOGRAPHIC
         WHERE PATID IN
         (
           SELECT PATID FROM HSC_PCOR.<<V##>>_DEMOGRAPHIC
           MINUS
           SELECT PATID FROM HSC_PCOR.<<V##>>_HASH_TOKEN
         );
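      To confirm that nothing is still missing after the backfill, the same MINUS logic can be re-run as a count (a sketch; it should return 0 once the INSERT above has been run and committed):

        SELECT COUNT(*) FROM
        (
          SELECT PATID FROM HSC_PCOR.<<V##>>_DEMOGRAPHIC
          MINUS
          SELECT PATID FROM HSC_PCOR.<<V##>>_HASH_TOKEN
        );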
    4. Once the DAG finishes, log in via PuTTY (or another SSH client) to the same server.
    5. Change directory to /dwhome/apps/cdrnLoad
    6. Run the following command to generate optimizer statistics for the PCORNet tables:
      1. nohup sqlplus 'HSC_PCOR/<<PASSWORD>>'@eclrprd1 @gatherStats_HSC_PCOR.sql &
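      Because the command is started with nohup and its output is not redirected, the sqlplus output is typically written to nohup.out in the current directory and can be watched with:

        tail -f nohup.out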

  9. For mini-CDM only, generate and send N3C data
    1. Login to the Windows server vwp-pmndmc.dhe.duke.edu via Remote Desktop Protocol (RDP).
    2. Open Windows File Explorer.
    3. Navigate to D:\Apps\N3C\Project\
    4. Double-click N3C_Main.bat to start the N3C data script.
    5. Typical run time for the batch file is 8-10 hours.
    6. Check for errors by reviewing the log file H:\N3C\Project\N3C_debug.log
  10. For mini-CDM only, generate and send N3C hash tokens
    1. Go back to the Airflow server from step 1.
    2. Scroll down the DAG list to the entry for STAR_N3C_Hash_Tokens.
    3. Log in via PuTTY (or another SSH client) to the same server.
    4. Change directory to /dwhome/airflow_ci/master/dags/CDRN_ETL/Datavant/
    5. Run hash_token_load_master.sh N3C
    6. Transfer the ZIP file created in the previous step to your local computer.
    7. Log in to the N3C hash token SFTP site at 34.132.183.26, port 2222 (setup required).
      1. An SFTP private key file is needed.
    8. Upload the ZIP file to the N3C hash token SFTP site (see the example session below).
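    A sketch of the SFTP upload, assuming an OpenSSH command-line client; the username, key file path, and ZIP file name are placeholders to be replaced with the values provided during N3C setup:

      sftp -P 2222 -i ~/.ssh/n3c_hash_token_key <n3c_user>@34.132.183.26
      sftp> put STAR_N3C_hash_tokens.zip
      sftp> exit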

Detailed Information

SQL Script Migration


#  | Subject Area | SQL Scripts (Links to Git Repository Location) | Notes | Converted to SQL Server?
1  | Python DAG | STAR_ETL_Master.py | Embedded SQL within Python code. May be able to pull out and place into own SQL file. |
2  | COVID-19 | CVD19GetPatKeys.sql | |
3  | Infrastructure | STAR_create_etl_tables.sql | |
4  | Infrastructure | STAR_create_etl_tables_CDM61.sql | |
5  | Infrastructure | STAR_create_indexes.sql | |
6  | Infrastructure | STAR_replace_etl_views.sql | |
7  | Address History | cdrn_addr_info.sql | |
8  | Encounter | cdrn_all_enc_clin.sql | |
9  | Encounter | cdrn_all_enc_hosp.sql | |
10 | Encounter | cdrn_all_enc_oa.sql | |
11 | Medication | cdrn_meds_info.sql | |
12 | Infrastructure | compare_rpt.sql | |
13 | Condition | condition.sql | |
14 | Death | death.sql | |
15 | Death Cause | death_cause.sql | |
16 | Infrastructure | demo_rpt.sql | |
17 | Demographic | demographic.sql | |
18 | Diagnosis | diagnosis.sql | |
19 | Dispensing | dispensing_per_year.sql | |
20 | Dispensing | dispensing_per_year_Surescript_Step1.sql | |
21 | Dispensing | dispensing_per_year_Surescript_Step2.sql | |
22 | Dispensing | dispensing_use_inc_load.sql | |
23 | Encounter | encounter.sql | |
24 | Encounter | encounter_dedup.sql | |
25 | Enrollment | enrollment.sql | |
26 | Harvest | harvest.sql | |
27 | Harvest | harvest_CDM61.sql | |
28 | Hash Token | hash_token.sql | |
29 | Hash Token | hash_token_CDM61.sql | |
30 | Hash Token | hash_token_N3C.sql | |
31 | Hash Token | hash_token_backfill.sql | |
32 | Immunization | immunization.sql | |
33 | Lab Result | lab_result_cm_all_years.sql | |
34 | Address History | lds_address_history.sql | |
35 | Medication Administration | med_admin.sql | |
36 | Observation, Clinical | obs_clin_all.sql | |
37 | Observation, Clinical | obs_clin_flowsheet.sql | |
38 | Observation, Clinical | obs_clin_pft.sql | |
39 | Observation, Clinical | obs_clin_smoking.sql | |
40 | Observation, Clinical | obs_clin_vitals.sql | |
41 | Observation, Clinical | obs_clin_vitals_legacy.sql | |
42 | Observation, General | obs_gen.sql | |
43 | Observation, General | obs_gen_adt.sql | |
44 | Observation, General | obs_gen_sdoh.sql | |
45 | Observation, General | obs_gen_vent.sql | |
46 | PCORNet Trial | pcornet_trial.sql | |
47 | Prescribing | prescribing.sql | |
48 | Prescribing | prescribing_mixture_broken_out.sql | |
49 | Prescribing | prescribing_update_rx_basis.sql | |
50 | Address History | private_address_geocode.sql | |
51 | Address History | private_address_history.sql | |
52 | Demographic | private_demographic.sql | |
53 | Demographic | private_demographic_dedup.sql | |
54 | PRO CM | pro_cm.sql | |
55 | PRO CM | pro_cm_sdoh.sql | |
56 | Procedure | procedures.sql | |
57 | Procedure | procedures_caboodle.sql | |
58 | Provider | provider.sql | |
59 | Referral | referral.sql | |
60 | Infrastructure | refresh_stats_update.sql | |
61 | Infrastructure | remove_dup_keys.sql | |
62 | Infrastructure | rename_etl_tables_no_drop.sql | |
63 | Infrastructure | rename_etl_tables_w_drop.sql | |
64 | Vital | vital.sql | |