Commonly-Linked Data Assets
The following files are often linked to insurance claims during analysis. It is important to identify the files that your study will use prior to requesting data for NC Medicaid, SEER-Medicare, or CMS assets since you are only permitted to link those assets listed in an approved DUA application.
ResearchDataGov.org aggregates restricted federal data that is accessible through a standard application process. The agencies providing data assets can be found here, and include the Census Bureau, Bureau of Labor Statistics, National Center for Health Statistics, and SAMHSA Center for Behavioral Health Statistics and Quality.  Resources are filterable by agency, topic, linking variable, and availability of a public use file (PUF). Please work with the appropriate Duke contracting offices, IRBs, and data security teams before employing these data.
National Center for Health Statistics (NCHS) survey data linked to Medicaid data (requires CDC approval)
National Center for Health Statistics (NCHS) survey data linked to Medicare data (requires CDC approval)
National Hospital Care Survey (NHCS) linked to Medicare data (requires CDC approval)
This government report on different socioeconomic status (SES) measures describes several measures that are not listed below but may be pertinent to your study and DUA application. Among other things, it compares SES measures and categorizes area-level SES measures by level of disaggregation (census block, census tract, ZIP, county, person-level).
The table below provides key information about data assets that are frequently utilized in DataShare analyses.
Commonly-Linked File | Cost1 | Reason for Use | Justification Language for the DUA | How It Is Linked to the Primary Data | File Location | Data Documentation |
---|---|---|---|---|---|---|
Yes; see DataShare price list | Has information regarding hospitals.2
| AHA Annual Survey Files will be used to obtain detailed information about hospital systems and services, such as organizational structure, facilities and services, beds and utilization, staffing, expenses, physician arrangements, system affiliation, geographic indicators, and accreditations and approval codes by credentialing organizations. | Linked via hospital provider ID (PROVIDER) in the claims files | | | |
Free | Geography-based ranking derived from SES variables covering the theoretical domains of income, education, employment, and housing quality. | Area Deprivation Index files will be linked via the ZIP+4 and will provide a measure of neighborhood socioeconomic status (SES) disadvantage for evaluation of SES impact on outcomes. Justification for EDB file: [Year/percent] EDB 9-digit ZIP code files contain a key variable needed for linkage to the detailed geographic measures present in the Area Deprivation Index... (also list other linked geography files, if applicable) | Linked via ZIP code + 4 or census tract. Medicare: Must also request the EDB 9-digit ZIP code files for the same year/percent as your claims files. NC Medicaid: Not usable because most 9-digit ZIP codes in the data end in “0000” and there is no census tract variable. | |||
American Medical Association (AMA) Physician Masterfile | Contains information about physicians, including:
Does not contain race or ethnicity. | The AMA Physician Masterfile data will be used to obtain taxonomy (specialty) information and other physician characteristics. | Research ID if linked to SEER-Medicare NPI for other data sources | No communal files. Must be purchased through the NCI and from MMS, Inc. on a per-DUA basis when linking SEER-Medicare data. | ||
Free | Geographic boundary files for mapping; basic healthcare utilization rates for different geographic aggregations | The Dartmouth Atlas of Health Care will be used to obtain geographic boundaries for mapping and basic healthcare utilization rates for geographic aggregations. | Linked via patient or provider ZIP code in the claims data files | | | |
(CMS PUF) | Free | Has information about hospital performance metrics | Hospital Compare Results will be used to describe hospital performance metrics, such as overall quality of care ratings, patient-reported quality of communication with providers, risk-adjusted 30-day readmission and 30-day mortality rates for specific conditions, complication rates, and outcome metrics. | Linked via hospital provider ID (PROVIDER) in the claims files |
| |
Free | Lots of area-level information for different geographic aggregations, including information on:
| HRSA Area Health Resources Files will be used to ascertain area-level information for geographic aggregations including, but not limited to, data about health facilities, health professions, measures of resource scarcity, health status, economic activity, health training programs, and socioeconomic and environmental characteristics. | Linked via patient or provider ZIP code, county, or state in the claims data files | |||
NC Health Professions Data System (HPDS) (a.k.a. Health Workforce NC data) | Maybe | Descriptive data for selected licensed NC providers from 1979 to the present. Includes provider-level race/ethnicity variables. | The HPDS data will be used to obtain taxonomy (specialty) information and other health care provider characteristics. | Linked via provider NPI. Most useful for linkage to NC Medicaid claims. | Contact nchealthworkforce@unc.edu to obtain additional data. | |
NC Office of State Budget and Management’s rural/urban classification | Free | NC DHHS’ preferred rural/urban definition for NC Medicaid analysis. Note: As of January 2022, NC DHHS is considering transitioning to another rural/urban classification scheme. | We may link publicly available, area-level data to the claims files using enrollees’ county of residence and providers’ county of practice. Potential public-use data files include the NC Office of State Budget and Management’s rural/urban definition or other sources. | Linked via patient or provider county in claims | Crosswalk (download) | |
(CMS PUF) | Free | Used almost exclusively for provider taxonomy (specialty) information. Can also provide physician ZIP code (though PROVZIP on carrier files or the provider ZIP from the POS file for OP/IP files is preferred). Also request UPIN if using pre-2007 CMS data. | The NPI/NPPES Registry will be used to obtain taxonomy (specialty) information and other health care provider characteristics. | Linked to physician NPI IDs (e.g., PRF_NPI, OP_NPI) in the claims data files. | ||
Provider of Service (POS) File2 (CMS PUF) | Free | Has lots of information regarding hospitals including but not limited to:
| The Provider of Service File will be used to obtain information about healthcare facilities including, but not limited to, geography, ownership characteristics, affiliated services, off-site services, bed size, procedure room characteristics, and affiliations with medical schools. | Linked via facility provider ID (PROVIDER) in the claims files | Methodology and data dictionaries are on the download pages. | |
(CMS PUF) | Free | Has provider attributes and policy rates set annually by CMS for the Prospective Payment System. It covers inpatient, SNF, home health agency, hospice, inpatient rehab, long-term care and inpatient psychiatric facility providers. Includes:
| | | ||
Free | A more granular urban/rural classification (10 classifications) than the binary indicator available in other files. Takes actual commuting patterns into account when defining rurality, e.g., distinguishes between a rural area where most people commute into a large metro area daily and a rural area that is isolated and has few people commuting into larger metro areas. | RUCA Code files will be used to classify the urban or rural status of geographic areas. | Linked via patient or provider ZIP codes in the claims files | | | |
Free | Rural/urban classification that primarily derives rurality from population density and geographic proximity to metro areas (9 classifications). | RUCC files will be used to classify the urban or rural status of geographic areas. | Linked via patient or provider county | | ||
Free | Database that aggregates variables from 47 data assets to provide social, economic, educational, physical infrastructure, and healthcare characteristics. It includes data from:
| The AHRQ Social Determinants of Health Database will be used to describe the social characteristics (e.g., demographics, veteran status, socioeconomic disadvantage), economic status (e.g., income, unemployment rate, poverty), educational characteristics (e.g., attainment, literacy), physical infrastructure (e.g., housing, crime, transportation), and healthcare context (e.g., provider characteristics, measures of resource scarcity, healthcare quality) of geographic areas. | Linked via patient, provider, or facility counties, ZIP codes, and census tracts in the claims files. Note: Census tract-level files are not available in the VRDC library. |
| ||
Free | Has detailed information for many biomedical vocabularies. We use it to provide code descriptions for things like:
| The UMLS Metathesaurus will be used to provide code descriptions (i.e., for ICD-9-CM diagnosis and procedure codes, ICD-10-CM diagnosis codes, ICD-10-PCS procedure codes, CPT/HCPCS procedure codes, and BETOS codes) and to map ICD-9 codes to ICD-10 codes. | Descriptions are linked to the diagnosis and procedure codes in the claims data. | | ||
Free | As with the NPI Registry, this is almost exclusively used for specialty information. Only relevant for pre-2007 data. | The UPIN Directory will be used to obtain taxonomy (specialty) information for physicians associated with claims prior to 2007. | Linked to physician UPIN (e.g., PRF_UPIN, OP_UPIN) in the claims data files. | VRDC: 2003 - 2007 are in the PROVIDER library; CMS must include the following files and EPPE codes on the DUA’s approved file list8:
| ||
Free | Geographic level (state, metropolitan area, zip code, etc.) aggregated demographic, economic, and SES information drawn from the US Census and the American Community Survey. Data includes many averaged metrics including but not limited to:
| US Census data will be used to obtain geographic-level aggregated demographic, economic, and socio-economic information. | Linked via patient or provider state or ZIP codes in the claims files using Census ZIP Code Tabulation Areas. Sometimes we first link Medicare data to another source (like GWTG or DEDUCE) that may have more detailed geographic information. In the case of DEDUCE, we have geocoded addresses, which we can link to Census data at various levels of geography, down to the census block or tract. | |
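
Several rows in the table above link area-level files to claims through a 9-digit ZIP code, and the Area Deprivation Index row notes that values ending in "0000" cannot support ZIP+4 linkage. The following is a minimal Python sketch of how one might check ZIP+4 usability before attempting such a link. The field name `zip9`, the function names, and the sample rows and lookup values are all hypothetical and made up for illustration; they are not drawn from any actual claims file or ADI release.

```python
from typing import Dict, List, Optional


def normalize_zip9(raw: object) -> Optional[str]:
    """Return the 9 digits of a ZIP+4 value, or None if it is not 9 digits long."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return digits if len(digits) == 9 else None


def has_usable_plus4(raw: object) -> bool:
    """True when the value is a 9-digit ZIP whose +4 suffix is not '0000'."""
    zip9 = normalize_zip9(raw)
    return zip9 is not None and zip9[5:] != "0000"


def plus4_coverage(claims: List[dict], zip_field: str = "zip9") -> float:
    """Share of rows with a usable ZIP+4 (0.0 for an empty list)."""
    if not claims:
        return 0.0
    usable = sum(has_usable_plus4(row.get(zip_field, "")) for row in claims)
    return usable / len(claims)


def link_area_measure(
    claims: List[dict], lookup: Dict[str, int], zip_field: str = "zip9"
) -> List[dict]:
    """Attach an area-level value to each row; rows that cannot be linked get None."""
    linked = []
    for row in claims:
        raw = row.get(zip_field, "")
        key = normalize_zip9(raw) if has_usable_plus4(raw) else None
        value = lookup.get(key) if key is not None else None
        linked.append({**row, "area_measure": value})
    return linked


# Made-up rows and lookup values, for illustration only.
claims = [
    {"claim_id": 1, "zip9": "27701-1234"},  # usable ZIP+4
    {"claim_id": 2, "zip9": "277050000"},   # placeholder suffix, not usable
    {"claim_id": 3, "zip9": "27710"},       # 5-digit ZIP only, not usable
]
area_lookup = {"277011234": 42}

linked = link_area_measure(claims, area_lookup)
coverage = plus4_coverage(claims)
```

Checking `coverage` on a sample of the data before drafting the DUA may help decide whether a ZIP+4 linked file is worth requesting or whether a county- or ZIP-level file is the more realistic choice.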
1 Cost to projects that pay the annual DataShare infrastructure fee
2 If all that is needed from this file are things like bed size and teaching hospital status, the Provider of Service file may be a suitable substitute. The American Hospital Association file may have more detailed information about hospital systems and services available at affiliated hospitals, however.
3 Approved by ORC for upload into the VRDC.
4 Available directly from CMS in the VRDC. Request access in the free-text portion of the specification document and/or request access directly from GDIT following approval.
5 RUCA and RUCC classifications for a patient’s residence at the time of diagnosis are included in SEER-Medicare data.
6 Include this non-CMS file in all CMS DUA applications. Available on Oracle in PACE as five tables with the REFLIB schema: CPT_HCPCS, ICD10CM, ICD10PCS, ICD9CM_DX, ICD9CM_PX; the GEMS and other files are from other sources. Michael Stagner can provide the five SAS dataset files for upload into the VRDC.
7 Submit a request to the CCW help desk with an approved DUA Attachment A that includes the file in the linked files list.
8 Request the 100% extracts and needed data-years for the UPIN files on the DUA specification worksheet’s Annual Extract Summary\Miscellaneous\Other section. If necessary, this can be done via correspondence with ResDAC after CMS approves the linkage.
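
The NPI Registry and UPIN Directory rows describe two routes to physician specialty: NPI for most claims and UPIN for pre-2007 data. The sketch below shows one way that lookup could be arranged in Python, trying the NPI first and falling back to the UPIN. The fallback order is an assumption rather than a documented DataShare procedure. The function name and the sample identifiers and specialties are hypothetical; `PRF_NPI` and `PRF_UPIN` are the example field names given in the table.

```python
from typing import Dict, Optional


def lookup_specialty(
    claim: dict,
    npi_specialty: Dict[str, str],
    upin_specialty: Dict[str, str],
    npi_field: str = "PRF_NPI",
    upin_field: str = "PRF_UPIN",
) -> Optional[str]:
    """Return a specialty for the claim's physician, or None if neither ID matches.

    Assumption: the NPI is preferred when present and matched; the UPIN is
    used only as a fallback (e.g., for older claims that lack an NPI).
    """
    npi = (claim.get(npi_field) or "").strip()
    if npi and npi in npi_specialty:
        return npi_specialty[npi]

    upin = (claim.get(upin_field) or "").strip()
    if upin:
        return upin_specialty.get(upin)
    return None


# Made-up identifiers and specialties, for illustration only.
npi_specialty = {"1234567890": "Cardiology"}
upin_specialty = {"A12345": "Internal Medicine"}

newer_claim = {"PRF_NPI": "1234567890", "PRF_UPIN": ""}
older_claim = {"PRF_NPI": "", "PRF_UPIN": "A12345"}
unmatched_claim = {"PRF_NPI": "0000000000", "PRF_UPIN": None}

newer_specialty = lookup_specialty(newer_claim, npi_specialty, upin_specialty)
older_specialty = lookup_specialty(older_claim, npi_specialty, upin_specialty)
unmatched_specialty = lookup_specialty(unmatched_claim, npi_specialty, upin_specialty)
```

Because both lookups depend on files named in the DUA, a study spanning the pre-2007 period would need both the NPI and UPIN sources on its approved linked-files list for a fallback like this to be possible.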