TRLN Discovery/POD Integration Plan

 Purpose

This document defines the requirements to change the data source for current integrations to TRLN Discovery and POD from Aleph to Alma.  TRLN is the Triangle Research Libraries Network and POD is the Platform for Open Discovery through IvyPlus. POD data supports both BorrowDirect and TRLN borrowing with ReShare.

Alma to TRLN/POD Data Pipeline Diagram

 Overview and Reference Links

Diagrams describing data storage to support this and other Alma integrations

(CI Pipeline = Continuous Integration Pipeline)


TRLN Architecture Overview Diagram

This diagram was provided by TRLN staff and describes the data architecture from a TRLN perspective.


For historical reference, the diagram below describes the data pipeline architecture used from Aleph to TRLN and POD.  

TRLN Discovery OPAC

This wiki page documents the current Aleph to TRLN Discovery integration: TRLN Discovery OPAC

Customizations

More customizations: https://github.com/trln/marc-to-argot/tree/main/lib/data/duke


Aleph-based Pipeline
add_transfer_url="https://trln-discovery-ingest.cloud.duke.edu/add_update_ingests"
delete_transfer_url="https://trln-discovery-ingest.cloud.duke.edu/delete_ingests"
pod_transfer_url="https://pod.stanford.edu/organizations/duke/uploads"
Source: https://gitlab.oit.duke.edu/aleph/home/-/blob/23_prod/home-slashaleph/opac/call_extract_scripts.sh

Metadata Management Imp Team documents

POD code in Github

https://github.com/pod4lib/aggregator

 Data Extract Criteria
This section defines which records should be extracted from Alma for this integration.  
  • Extract instance, holding, and item records from Alma that aren't suppressed from discovery and aren't marked with a "delete" tag on the record.
  • Extract records that are newly suppressed and newly deleted so that they can be removed from TRLN Discovery.

See the Alma Configuration Dependencies section below for detailed integration profile documentation.

 Frequency
  • The Aleph updates-only extract ran every 30 minutes. We expect the Alma updates-only extract to run every hour; this schedule is controlled by Ex Libris and we cannot make it run more often.
  • If no records are identified during the current Aleph harvest, a zero-record file is generated; that practice should be continued for Alma.

  • Historically, a set of 5,000 records is extracted every hour on the 45th minute so that the entire set of inventory records is eventually sent to downstream systems. We do not yet know if this is needed for Alma; we have not attempted to replicate this workflow and won't do so until we know that it's needed.

 Data Volume
  • Full extract is approx 7.15 million records
 Alma Configuration Dependencies

This section describes the Alma configurations required to support this integration.

OAI-PMH

An Alma integration profile is used to define the specifications for publishing records via OAI-PMH.

Documentation is in progress.

Profile details

  • Name: OAI-PMH for TRLN Discovery
  • Profile description: OAI-PMH for TRLN Discovery
  • Publishing Parameters:
    • Status: Active
    • Scheduling: Every 1 hour(s)
    • Email notifications: no address
  • Content:
    • Set name: Entire repository default set
    • Additional set name: (blank)
    • Publish the entire repository: checked
    • Filter out the data using: (blank)
    • Publish on: Bibliographic level
    • Output format: MARC21 Bibliographic
  • Publishing protocol
    • FTP: (not checked)
    • OAI: checked
    • Set Spec: trln_discovery_spec
    • Set Name: trln_discovery_name
    • Metadata prefix: marc21
    • Z39.50: (not checked)
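Given the profile above (OAI protocol, set spec trln_discovery_spec, metadata prefix marc21), a harvester issues standard OAI-PMH requests. The following is a minimal sketch of building a ListRecords request URL; the base URL shown is a placeholder assumption, not our actual Alma OAI endpoint:

```python
from urllib.parse import urlencode

# Placeholder: the real Alma OAI endpoint for our institution is an assumption here
# (Alma endpoints typically look like https://<domain>/view/oai/<INST_CODE>/request).
OAI_BASE = "https://example.alma.exlibrisgroup.com/view/oai/01DUKE_INST/request"

def list_records_url(set_spec="trln_discovery_spec", metadata_prefix="marc21",
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for this publishing profile."""
    if resumption_token:
        # Per OAI-PMH, resumptionToken must be the only argument besides verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix,
                  "set": set_spec}
    return OAI_BASE + "?" + urlencode(params)

print(list_records_url())
```

Subsequent pages are fetched by passing the resumptionToken returned in each response.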

Data Enrichment

  • Bibliographic normalization
    • Correct the data using normalization processes: Delete tags for TRLN extract

      (needs documentation)

    • Linked data enrichment: unchecked
  • Bibliographic Enrichment
    • Add management information: checked
    • Repeatable field: 942
    • Publish suppressed records as deleted: checked
    • all other fields are empty
  • Related Records Enrichment
    • Add related records information: checked
    • Relation type field:  TYPE
    • Relation record MMS ID subfield: a
    • Relation type subfield: 8
    • Related fields enrichment:
      • Related tag: 245, related subfield: a, Bib tag: RELATED, Bib subfield: a, relation type All
    • Add holdings/items of related records: checked
  • Authority enrichment
    • Add authority information: unchecked
  • Physical Holdings Enrichment
    • Add holdings information: checked
Holdings tag | Holdings subfield | Bib tag | Bib subfield
852 | h | 852 | h
852 | i | 852 | i
852 | j | 852 | j
852 | k | 852 | k
852 | l | 852 | i
852 | m | 852 | m
866 | a | 866 | a
866 | z | 866 | z
867 | a | 867 | a
868 | a | 868 | a
852 | b | 852 | b
852 | c | 852 | c
852 | q | 852 | q
  • Exclude suppressed records: (checked)


  • Physical items enrichment
    • Add items information: checked
    • Repeatable field: 940
Item field | Subfield
Item PID | k
Barcode | p
Item Policy | o
Description | n
Current library | b
Current location | c
Call number type | d
Call number | h
Public note | z
Create date | s
Update date | u
Holdings ID | e


  • Electronic Inventory Enrichment
    • Planned implementation as of August 20, 2024, in the 943 field:

Field | Subfield
Portfolio PID | 8
URL Type | a
Access URL | u
Link Resolver Base URL | e
Static URL | h
Electronic Material Type | q
Proxy Select | b
Proxy Enabled | c
Authentication Note | x
Public Note | y
Direct Link | d
Service ID | g
  • Digital Inventory Enrichment
    • Add Digital Representation Information: Unchecked
    • Add File Information: Unchecked
    • Add Remote Representation Information: Unchecked
  • Collection Enrichment
    • Add Collection Information: Unchecked


 Data Mapping

This section describes any data mapping requirements for this integration. 

Alma publishing creates MARCXML records, and the downstream ingest processes map that data to the TRLN discovery layer and to POD. The work required to change the current Aleph-sourced process to an Alma-sourced process may not require new data mappings. For reference, here are links to the data mappings already in use in production with Aleph; it is assumed that the new integration with Alma will continue to use these mappings:

Mapping Tables

Additionally, this is a working spreadsheet to track any mapping changes and example records for validation: https://duke.box.com/s/3rrlpx2soe54ir587ij3j1st3wh9b3st

Note: needs to be updated for Alma

Mapping spreadsheet created 3/29/23 to track MARC to TRLN mapping from Aleph and FOLIO: https://duke.box.com/s/uyebv605j1lma0ex90csrpg30qlg6gys

 Data Transformation Rules

Aleph to TRLN Transformations

In the data pipeline diagram below, the gray oval representing "PERL bib data extract scripts" performs several data processes:
  1. Identify which bib records should be extracted from Aleph.
  2. Extract MARC items and holdings records for those bib records.
  3. Extract additional item data from Aleph that isn't included in MARC for each bib ("enhanced item data").
  4. Create a file containing records in MARCXML format.

The following table describes where each data processing task occurs in the current Aleph data pipeline so that the team can validate that each task is covered in the new Alma-sourced data pipeline to TRLN. 

  • The rows contain the list of data processing tasks that are occurring against Aleph data (grey column).
  • The blue columns identify where each process will be handled in the new Alma data pipeline.
Tracking columns for each task: No action required | Migration of Aleph records to Alma | Alma Integration Profile | Alma to TRLN scripts | TRLN Discovery App | Other/notes

1. Non-unicode character cleanup
2. Aleph records are excluded from the current TRLN/POD extract based on various criteria such as:
   • Item Process Status (IPS) = DEL or STA
   • has no items and is not electronic
   • non-electronic bib with only withdrawn items
   NEED TO REVIEW THIS (Alma doesn't support suppressing items)
   Note: These records are loaded to Alma and tagged with the "suppressed from discovery" flag so that they'll be excluded from Alma publishing. Note from Julie - confirm that this is true for Alma.

3. Drop specified local use fields such as the 029 - see notes for link to full list
   Note: The data elements listed in the Aleph Alphabetic and 9xx Fields.xlsx spreadsheet as "Do not migrate" are not migrated on records loaded to Alma (confirm that this is true for Alma).
   To do: Review the current Aleph to TRLN scripts to confirm that everything currently dropped is covered on this spreadsheet, to confirm how that was handled in the Alma migration.

4. Identify which bib records should be extracted: updates-only during the 30-minute window, skipping suppressed records
   Note: Follow up with Matt/Jeff/Ayse to see if TRLN is using the same criteria as the current MADS criteria for determining which records are pulled.
5. Extract MARC items and holdings records for each bib record that meets the extract criteria
6. Extract additional item data that isn't included in MARC for each bib, such as location and availability
7. Extract an additional 5,000 records every hour so that the entire dataset is eventually refreshed incrementally in downstream systems.
   Note: This was originally done as a compensatory maintenance measure for Endeca. We're not sure if it will need to be done for Alma, but are tracking it here for verification either way.
8. Convert extracted data to MARCXML format

"New" data processing for Alma data (these processes weren't needed for Aleph data)No action requiredMigration of Aleph records to AlmaAlma Integration ProfileAlma to TRLN scriptsTRLN Discovery AppOther/notes
9. Declare the namespace at the beginning of the MARCXML output, needed for Marc-to-Argot
10. Specify the record type as Bibliographic in the XML
11. Remove marc: namespace prefixes - this was needed for FOLIO; is it needed for Alma?
12. Remove the unnecessary metadata node from the XML
13. Add a collection node to the XML
End of "New" data processing for Alma data.
14. Run Marc-to-Argot with the Duke-specific overrides
    Note: Does this need to be updated for Alma? The MARC 001 field control number needs to be transformed with the string 'DUKE' as a prefix and the first four characters stripped, so a FOLIO 001 containing 'in00009142400' needs transformation to 'DUKE009142400'. Currently handled in the marc-to-argot scripts (added 3/29/23).
15. Move data via the Spofford app
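The 001 transformation described in row 14 can be sketched as a small function. The function name is illustrative; the real logic lives in the marc-to-argot scripts:

```python
def local_id_from_001(control_number: str) -> str:
    """Prefix 'DUKE' and strip the first four characters of the 001,
    e.g. a FOLIO 001 of 'in00009142400' becomes 'DUKE009142400'."""
    return "DUKE" + control_number[4:]

print(local_id_from_001("in00009142400"))  # DUKE009142400
```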


Enhanced item data

The following data elements are extracted from Aleph to enhance what is included in MARC

Aleph Oracle Table | Data element | Description (from Aleph documentation)
Z30 | sublibrary | Code of the sublibrary that "owns" the item.
Z30 | collection | Collection code of the item.
Z30 | call_no_type | Defines the type of location assigned in the Z30-CALL-NO field.
Z30 | call_no | Shelving location of the item.
Z30 | description | Description of the item (for multi-volume monographs or serial items) to help the user identify the item they are interested in. For items created through the Serials function, this field is automatically set to the enumeration and chronology description of the issue.
Z30 | note_opac | Note field. This note is displayed in the OPAC.
Z30 | item_status | Status of the copy for loan purposes.
Z30 | item_process_status | Defines the item's processing status (e.g. on order, cancelled, binding, etc.).
Z30 | material | Material type. This can be VIDEO, BOOK, ISSUE, etc. "ISSUE" has special functionality within the system; all other types are used in an equal manner.
Z30 | barcode | Unique identifier of the item.
Z30 | open_date | Creation date of the copy.
Z30 | update_date | Date copy was last updated.
Z30 | date_last_return | Date copy was last returned.
Z30 | no_loans | Number of times the item was loaned.
Z30 | hol_doc_number_x | System number of the holdings record to which the item is attached.
Z30 | temp_location | A toggle field used to control the overriding of the location fields by the holdings record. Values are Y and N.
Z36 | due_date | Active due date (see also ORIGINAL-DUE-DATE and RECALL-DUE-DATE). This field changes according to loan transactions (e.g. renew loan).
Z103 | lkr_type | Defines the type of link. ADM = link from ADM record to BIB record (the link is built from BIB to ADM). ITM = link between a BIB record and the items of another BIB record.
Z37 | status | Hold request status.
Z37 | end_request_date | Last date of interest for the hold request.
Z36 | id | Patron's ID.
Z36 | recall_due_date | Due date computed as a result of the recall transaction. The date might actually be later than the due date, if the recall was generated close to the end of the loan period. It is retained for computing the fine owed for late return of a recalled item.
Z30 | rec_key 1,9 | System number of the administrative record (ADM) associated with the item.



Need to update for Alma?

FOLIO to TRLN Transformations

LOC Local/Obsolete Data Elements Present in Harvest
  • 261
  • 262
  • 400
  • 410
  • 440
Source: https://www.loc.gov/marc/bibliographic/bdapndxh.html
Local Fields Present in Harvest
  • 049, 092
  • 690, 692, 693, 694, 695, 697, 698
  • 901, 902, 903, 904, 905, 909
  • 910, 914, 915, 916
  • 935, 938
  • 943, 948, 949
  • 951, 952, 955, 956
  • 980, 981, 987
  • 994, 996, 998, 999


To do: Validate for Alma. FOLIO includes the following location and call number information in the OAI-PMH harvest using this mapping:

FOLIO Inventory app field | MARC | Additional rules
Effective location: Institution | 952$a |
Effective location: Campus | 952$b |
Effective location: Library | 952$c |
Effective location: Name | 952$d |
Effective call number components: call number | 952$e |
Effective call number components: prefix | 952$f |
Effective call number components: suffix | 952$g |
Effective call number components: type | 952$h |
Material type | 952$i |
Volume | 952$j |
Enumeration | 952$k |
Chronology | 952$l |
Barcode | 952$m |
Copy number | 952$n |
Electronic access: URI | 856$u | Create a separate datafield for each URL. Use URLs from each record level (holding and item).
Electronic access: Link text | 856$y | Create a separate datafield for each URL. Use URLs from each record level (holding and item).
Electronic access: Materials specified | 856$3 | Create a separate datafield for each URL. Use URLs from each record level (holding and item).
Electronic access: Public note | 856$z | Create a separate datafield for each URL. Use URLs from each record level (holding and item).
Electronic access: Relationship: No display constant generated | 856, 1st indicator 4, 2nd indicator 8 |
Electronic access: Relationship: No information provided | 856, 1st indicator 4, 2nd indicator blank | This indicator handling also applies to an empty "Relationship" value.
Electronic access: Relationship: Related resource | 856, 1st indicator 4, 2nd indicator 2 |
Electronic access: Relationship: Resource | 856, 1st indicator 4, 2nd indicator 0 |
Electronic access: Relationship: Version of resource | 856, 1st indicator 4, 2nd indicator 1 |
Electronic access: Relationship: empty value | 856, 1st indicator 4, 2nd indicator empty |




Transformations from FOLIO output to match TRLN input


Transformation from the command line:
ruby -pi.bak -e "gsub(/<metadata>/, '')" file.xml                               # remove opening <metadata> tags
ruby -pi.bak -e "gsub(/<\/metadata>/, '')" file.xml                             # remove closing </metadata> tags
sed -i -f - file.xml < <(sed 's/^/1i/' xml_header.txt)                          # insert the contents of xml_header.txt at the top of the file
ruby -pi.bak -e "gsub(/marc:/, '')" file.xml                                    # strip the marc: namespace prefixes
ruby -pi.bak -e "gsub(/<record>/, '<record type=\"Bibliographic\">')" file.xml  # mark each record as Bibliographic
echo -n '</collection>' >> file.xml                                             # close the collection element at the end of the file
Contents of xml_header.txt
<?xml version="1.0" encoding="utf-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
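For illustration, the same rewrites can be combined into a single pass. This sketch mirrors the string-replacement approach of the commands above (it is not a real XML-aware transformation, and the sample input is invented):

```python
# Single-pass equivalent of the command-line steps above. The header string
# matches the contents of xml_header.txt.
XML_HEADER = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<collection xmlns="http://www.loc.gov/MARC21/slim" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.loc.gov/MARC21/slim '
    'http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">\n'
)

def transform(marcxml: str) -> str:
    """Apply the same rewrites as the shell/ruby commands, in order."""
    out = marcxml.replace("<metadata>", "").replace("</metadata>", "")
    out = out.replace("marc:", "")  # strip namespace prefixes
    out = out.replace("<record>", '<record type="Bibliographic">')
    return XML_HEADER + out + "</collection>"

# Invented sample input for illustration only.
sample = "<metadata><marc:record><marc:leader>00000nam</marc:leader></marc:record></metadata>"
result = transform(sample)
```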



Argot Configuration

Defaults https://github.com/trln/marc-to-argot/tree/main/lib/data/argot

List of Duke-specific Overrides

These are attributes that Duke does not want to be parsed by the default argot configuration: https://github.com/trln/marc-to-argot/blob/main/lib/data/duke/overrides.yml

id
local_id
rollup_id
oclc_number
institution
items
holdings
names
url
access_type
date_cataloged
primary_oclc
physical_media



Mapping Holdings Summaries from Alma to TRLN Discovery

Aleph output the holdings summary to the 852$a. That mapping does not work for Alma, which outputs the institution code to the 852$a.

For Alma, we map the 866$a and 866$z to the holdings summary for display, with handling for repeatable fields.

We do not display the 866$x if present.
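As a sketch of this 866 handling (the helper function and sample record are illustrative; the production mapping lives in the TRLN ingest code), the following keeps $a and $z from each repeated 866 field and skips $x:

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}

# Illustrative MARCXML fragment with a repeated 866 field (invented data).
SNIPPET = """\
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="866" ind1=" " ind2=" ">
    <subfield code="a">v.1-v.10 (1990-1999)</subfield>
    <subfield code="x">internal note, not displayed</subfield>
  </datafield>
  <datafield tag="866" ind1=" " ind2=" ">
    <subfield code="a">v.11-v.20 (2000-2009)</subfield>
    <subfield code="z">Some issues missing</subfield>
  </datafield>
</record>
"""

def holdings_summaries(record_xml):
    """Collect display summaries from each repeated 866: keep $a and $z, skip $x."""
    record = ET.fromstring(record_xml)
    summaries = []
    for field in record.findall('m:datafield[@tag="866"]', NS):
        parts = [sf.text for sf in field.findall("m:subfield", NS)
                 if sf.get("code") in ("a", "z")]
        if parts:
            summaries.append(" ".join(parts))
    return summaries

print(holdings_summaries(SNIPPET))
```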



Inventory of MARC fields with indication of where/how each value is set (ex: yml file, macro)

To see full list, open this link to the file in Box: https://duke.box.com/s/a98wrly961091bxnj5bj3ysnslrvqo15


Record identifiers in Books & Media - Aleph-born records and Alma-born record

When migrating records from Aleph to Alma, the Aleph bib ID is transformed to become the Alma MMS ID. See Ex Libris record number documentation: link

The format of the migrated record MMS ID is 99 + Aleph bib ID + 010 + 8501

  • 99 indicates the type of record (bibliographic)
  • 010 is a bibliographic library identifier added by the Ex Libris migration process
  • 8501 is our Alma institutional ID

For an Alma-born record, the format of the MMS ID is 99 + record identifier + 8501

  • 99 indicates the type of record (bibliographic)
  • 8501 is our Alma institutional ID
  • The record identifier is a unique identifier for the record.

Books & Media URLs (find.library.duke.edu...)

  • For Aleph-born records, we strip the leading '99' and trailing '0108501' so that the URL does not change at Alma cutover.
    • E.g., for a record with Aleph bib "006288172", the Alma MMS ID is "990062881720108501", and the Books & Media URL is https://find.library.duke.edu/catalog/DUKE006288172
  • For Alma-born records, the MMS ID is generated when the record is created in Alma, and it becomes the identifier in Books & Media.
    • E.g., a new title is ordered in Alma and assigned an MMS ID of "99112812262508501". The Books & Media URL is https://find.library.duke.edu/catalog/DUKE99112812262508501
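The identifier rules above can be sketched as follows. The detection heuristic for a migrated record (an 18-character MMS ID starting with '99' and ending with '0108501') is our reading of the documented format, not code from the actual pipeline:

```python
def books_and_media_id(mms_id: str) -> str:
    """Derive the Books & Media catalog identifier from an Alma MMS ID.

    Migrated (Aleph-born) MMS IDs have the shape 99 + <9-digit Aleph bib> + 010 + 8501;
    for those we strip the wrapper so URLs don't change at cutover. Alma-born MMS IDs
    are used as-is.
    """
    if len(mms_id) == 18 and mms_id.startswith("99") and mms_id.endswith("0108501"):
        return "DUKE" + mms_id[2:-7]   # recover the original Aleph bib ID
    return "DUKE" + mms_id

print(books_and_media_id("990062881720108501"))  # DUKE006288172
print(books_and_media_id("99112812262508501"))   # DUKE99112812262508501
```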


Default availability value for Rubenstein and University Archives

Because of the process we are using with Alma to obtain availability data with Summon, we have run into issues disambiguating some Rubenstein and University Archives holdings: we cannot always tell which holdings record in the OAI output corresponds to which Summon record.

Because we do not have a straightforward way to disambiguate (the Summon APIs do not give us collection codes), we will default to showing Rubenstein and University Archives as available in the underlying data sent to Books & Media. When they appear in search results, the circ status API will check their availability in real time and update the display to a different value if there is one.

 Data Validation - Testing Plan

The purpose of this section is to describe the validation steps needed to confirm that this integration is working successfully. Since this integration changes the data source from Aleph to Alma, we expect the end of the data pipeline to mirror what currently flows from Aleph. We want to confirm that the switch from Aleph to Alma doesn't impact how data is displayed in the TRLN catalog, POD, or Summon.

  1. Is the expected volume of records moving from Alma to TRLN and POD? 
  2. For various record types, is the data from Alma displaying in the TRLN catalog as expected? 
  3. For various record types, is the data from Alma mapping to POD as expected?

The following types of records should be reviewed during testing. Links to specific records of each type are maintained here: https://duke.box.com/s/3rrlpx2soe54ir587ij3j1st3wh9b3st

Type of Record | Fields of Interest | General Fields of Interest
book | 650, 901, 904, 905, 951, 952, 998 |
bound-withs | 987, 943 |
e-books | 914 |
"funky" serials record | |
Lok Sabha debates | 952 (145 times) |
map | |
microfiche | |
microfilm | |
musical score | 348, 655, 914 |
physical media (record) | 700 |
readers digest record | 952 (205 times) |
rubenstein | |



The following library staff should be included in validation efforts once a testing infrastructure is ready to help confirm that data sourced from Alma displays in the TRLN catalog as expected:

Name | Title | Library
Andy Armacost | Head of Collection Development and Curator of Collections | Rubenstein
Sean Chen | Head of Cataloging and Metadata Services | Goodson Law Library
Bethany Costello | Resource Access Librarian | Ford Library
Neal Fricks | Content and Discovery Specialist | Medical Center Library
Jessica Janecki | Team Lead for Original Cataloging | DUL Collections Services
Ryan Johnson (for music cataloging) | Special Formats Description Librarian | DUL Collections Services
Meghan Lyon | Head of Technical Services | Rubenstein
Erin Nettifee | IT Business Analyst | DST - LSIS
Lauren Reno (for single item cataloging) | Section Head, Rare Materials Cataloging | Rubenstein
Jacquie Samples | Head, Metadata & Discovery Strategy | DUL Collections Services
 Spofford Ingests
Transfer URLs
add_transfer_url="https://trln-discovery-ingest.cloud.duke.edu/add_update_ingests"
delete_transfer_url="https://trln-discovery-ingest.cloud.duke.edu/delete_ingests"
pod_transfer_url="https://pod.stanford.edu/organizations/duke/uploads"

Source: https://gitlab.oit.duke.edu/aleph/home/-/blob/23_prod/home-slash-aleph/opac/call_extract_scripts.sh
The above locations would need to be changed for testing with a test server.

 Assumptions


 Questions

Open

  1. How are configuration changes communicated to POD (TRLN has YAML files - does POD have something similar)?
  2. Are full loads ever harvested or is this always an updates-only process?

Resolved 

  1. We are currently sending Summon MARC; do we know if Summon will accept MARCXML? Per Jacquie Samples on 3/23, we will need to transform to MARC for Summon.
  2. Does Summon need availability information?  Per Jacquie Samples on 3/23, Yes - RTAC data will need to be included for Summon.
 Development JIRA(s)

Duke DST Alma Integration Project Jiras for Epic "TRLN Discovery/POD": https://duldev.atlassian.net/jira/software/c/projects/AI/boards/87/backlog?issueParent=35040

 Change Log

For documents that have been approved and are in a finalized state (and locked for edits), include this change log to keep track of future changes. 

Date | Description of Changes | Updated By
1/30/2024 | Page created. | J. Brannon
4/8/2024 | Replaced the FOLIO data pipeline diagram with the proposed Alma to TRLN/POD diagram. | J. Brannon
4/8/2024 | Removed "Summon" references since that feed will be handled in Alma through a separate process, not as part of this data pipeline. | J. Brannon
6/4/24 | Updated the new TRLN/POD data pipeline diagram. | J. Brannon
6/11/24 | Updated data transformation rule #7 to indicate it might not be needed (incremental extract of 5,000 records for background refresh). Added section "Record identifiers in Books & Media - Aleph-born records and Alma-born records" to document the decision on Books & Media identifier formats. | E. Nettifee
6/28/24 | Added information about mapping holdings summaries, as this changes from Aleph to Alma. | E. Nettifee
7/26/24 | Added diagrams about data storage to the overview section. | J. Brannon
8/20/2024 | Copied in planned implementation of 943 e-resource enrichment from Teams chat (into the Configuration Dependencies section). | E. Nettifee
10/4/2024 | Moved data pipeline diagram out of expand section and made minor updates to the Overview and Reference Links section. | J. Brannon