Understanding the pipeline for publishing data to the Books & Media Catalog

A note of caution about this page - library staff are continuing active work to improve the pipeline, so parts of the workflow may change before this documentation catches up. When in doubt about something you see in Books & Media, Summon, or elsewhere, please submit a ticket to the Alma group at https://support.lib.duke.edu.

Part 1: Alma publishes three sets of data

Part 1 diagram showing Alma publishing profiles

We use Alma's publishing profile functionality to include specific MARC and other fields in published data, allowing us to use that information further on in the pipeline.

In the OAI-PMH publishing profile (step 2a), Alma tells us what type of resource a record is, which determines how the record is enriched as the pipeline proceeds.

In addition to the MARC fields, Alma publishes:

  • Physical record information in one or more 940 fields

  • Electronic record information in one or more 943 fields

  • Collection information in a 944 field (note - collection records as currently published don't have all the needed information, and the integrations team is working to fix this)

That means the pipeline can tell what type of resource a record contains based on which local 9xx fields are present. Library staff don't see those fields in records - they're added specifically to support the publishing pipeline.
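
For illustration only, here is a minimal sketch of how a downstream script could check which of those local 9xx fields are present, assuming standard MARCXML and Python's built-in XML parser; the function name and labels are examples, not our production code.

```python
# Illustrative sketch only: infer what inventory a published record carries
# from the local 9xx fields described above. Function name and labels are
# made up for the example.
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

def inventory_types(record: ET.Element) -> set[str]:
    """Return the kinds of inventory present on one published MARCXML record."""
    types = set()
    for field in record.findall("marc:datafield", MARC_NS):
        tag = field.get("tag")
        if tag == "940":
            types.add("physical")      # physical record information
        elif tag == "943":
            types.add("electronic")    # electronic record information
        elif tag == "944":
            types.add("collection")    # collection information
    return types
```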

We also run a job using the Primo publishing profile (step 2b). We run this job because the OAI-PMH publishing profile does not include item availability information, which we need for the Books & Media Catalog. The Primo publishing profile gives us specific record IDs that we can use later in the pipeline.

And finally, Alma publishes records to Summon (step 2c). Though we primarily direct patrons to Summon for articles, all of our electronic and print resources are published to Summon and can be viewed by checking or unchecking the appropriate content types. Because Summon is offered by Ex Libris, we use Alma's integrated publishing profile to send data, and no further pipeline work is needed.

Specific field mappings in the OAI-PMH publishing profile can be found in the Digital Strategies & Technology wiki here: https://duldev.atlassian.net/wiki/spaces/DSTP/pages/3870195734/TRLN+Discovery+Publishing+Profile+Enrichment. Note that the wiki has limited access - if you can't view the page, submit a ticket to the Alma queue at https://support.lib.duke.edu and a staff member will send you the information.


Part 2: Custom Duke scripts begin enriching the Alma data with additional information

Part 2 diagram showing data processing

First, we retrieve the records published by the Primo publishing job, extract the record IDs, and keep the information in dedicated storage (labeled in the diagram as step 3). This ID information is needed so we can retrieve item availability, which we do not get from the OAI-PMH publishing profile (step 2a, above). This process must complete before the pipeline can continue.
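
As a rough illustration of step 3, a script along these lines could pull the record IDs out of the Primo-published MARCXML and keep them in a small database. Reading the ID from the 001 control field and using SQLite as the "dedicated storage" are assumptions made for the example, not a description of the production setup.

```python
# Illustrative sketch of step 3: extract record IDs from Primo-published
# MARCXML and keep them in dedicated storage. The 001 control field and
# SQLite are assumptions made for this example.
import sqlite3
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

def store_primo_ids(marcxml_path: str, db_path: str = "primo_ids.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS primo_ids (record_id TEXT PRIMARY KEY)")
    for record in ET.parse(marcxml_path).iter("{http://www.loc.gov/MARC21/slim}record"):
        ctl = record.find("marc:controlfield[@tag='001']", MARC_NS)
        if ctl is not None and ctl.text:
            conn.execute(
                "INSERT OR IGNORE INTO primo_ids (record_id) VALUES (?)",
                (ctl.text.strip(),),
            )
    conn.commit()
    conn.close()
```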

Then we retrieve the records published by the OAI-PMH publishing profile and store them on a local server (4a). These files are in MARCXML format.
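
The harvest itself follows the standard OAI-PMH protocol. A simplified sketch, with a placeholder endpoint, set name, and output directory, might look like the following; Alma's actual OAI configuration (base URL, metadata prefix, set) comes from the publishing profile.

```python
# Simplified sketch of step 4a: page through an OAI-PMH ListRecords response
# and save each page of MARCXML locally. The endpoint, set name, and output
# directory are placeholders.
import pathlib
import xml.etree.ElementTree as ET

import requests

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def harvest(base_url: str, set_spec: str, out_dir: str = "oai_dump") -> None:
    pathlib.Path(out_dir).mkdir(parents=True, exist_ok=True)
    params = {"verb": "ListRecords", "metadataPrefix": "marc21", "set": set_spec}
    page = 0
    while True:
        resp = requests.get(base_url, params=params, timeout=60)
        resp.raise_for_status()
        pathlib.Path(out_dir, f"page_{page:05d}.xml").write_bytes(resp.content)
        # OAI-PMH signals that more pages remain with a non-empty resumptionToken
        token = ET.fromstring(resp.content).find(".//oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
        page += 1
```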

We then run a script that processes the XML files from OAI-PMH (labeled in the diagram as steps 4b through 5f). A simplified sketch of this branching logic appears after the list below.

  • If a record is marked suppressed or deleted, we log the record's MMS ID; we will send the ID to TRLN Discovery for deletion later in the pipeline

  • If a record contains physical inventory, we query the dedicated database of IDs from Primo for the record ID, which we then use to query Summon for the item availability. We add the availability information as enrichment data to the MARCXML file (step 5d)

  • If the record does not have any inventory (step 5e), then we send the record for deletion later in the pipeline.
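
Putting those three branches together, the triage step looks roughly like the sketch below. The helper functions are placeholders standing in for the real suppressed-record check, the ID lookup from step 3, the Summon availability call, and the enrichment write; only the decision structure mirrors this page.

```python
# Condensed, illustrative sketch of the branching in steps 4b-5f.
# The four helpers below are placeholders, not the real implementations.
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

def is_suppressed_or_deleted(record): ...             # placeholder: suppression/deletion check
def lookup_primo_id(mms_id): ...                       # placeholder: query the ID store (step 3)
def fetch_summon_availability(record_id): ...          # placeholder: ask Summon for availability
def add_availability_field(record, availability): ...  # placeholder: write enrichment into the XML

def triage(record: ET.Element, deletes: list):
    """Return an enriched record, or None if it should be deleted downstream."""
    mms_id = record.findtext("marc:controlfield[@tag='001']", namespaces=MARC_NS)
    if is_suppressed_or_deleted(record):
        deletes.append(mms_id)                         # sent to TRLN Discovery for deletion later
        return None
    has_physical = record.find("marc:datafield[@tag='940']", MARC_NS) is not None
    has_electronic = record.find("marc:datafield[@tag='943']", MARC_NS) is not None
    if has_physical:                                   # step 5d: enrich with availability
        availability = fetch_summon_availability(lookup_primo_id(mms_id))
        add_availability_field(record, availability)
    if not has_physical and not has_electronic:        # step 5e: no inventory at all
        deletes.append(mms_id)
        return None
    return record
```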


Part 3: Duke scripts complete the data transformation and send the records to the Books & Media Catalog

Part 3 diagram showing final publishing steps to the Books & Media Catalog

We copy all of the enriched XML files that we created into a specific directory, where they are prepared for the Books & Media Catalog (step 6).

We then run a program that transforms the enriched XML to JSON format following Argot, TRLN Discovery's shared ingest format. You may hear developers refer to this program as MARC-to-Argot. (If you'd like to learn more about Argot, there's documentation here: data-documentation/argot/README.adoc at main · trln/data-documentation)
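
As a very small illustration of the kind of mapping this step performs (not the MARC-to-Argot code itself), a 245 title can be pulled into a JSON document like the one below. The field names here are simplified for the example; the Argot documentation linked above is the authoritative reference for the real schema.

```python
# Tiny illustration of an XML-to-JSON mapping in the spirit of MARC-to-Argot.
# Field names are simplified; see the Argot docs for the actual schema.
import json
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

def minimal_argot(record: ET.Element) -> str:
    record_id = record.findtext("marc:controlfield[@tag='001']", namespaces=MARC_NS)
    title = record.findtext(
        "marc:datafield[@tag='245']/marc:subfield[@code='a']", namespaces=MARC_NS
    ) or ""
    doc = {
        "id": record_id,
        "title_main": [{"value": title.strip(" /:")}],
    }
    return json.dumps(doc)
```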

MARC-to-Argot is where we do the majority of record transformation, making sure each MARCXML field value ends up in the right field and displays in the right place in the Books & Media Catalog interface. For example, MARC-to-Argot is where we:

  • Map the record call number into the right call number facet - for example, ensuring that a record with call number BF411 .I343 appears nested within the BF309 - BF499 call number facet (a simplified sketch of this range lookup appears after this list)

  • Create the URLs for electronic resources, based on 943 enrichment data

  • Determine what names to display under “Authors, etc.” on a Books & Media Catalog bibliographic display page

  • Map edition information to display if present in the MARC 250, 251 or 254

  • Map variant title information if the MARCXML has 210, 222, 246 or 247 fields

  • Map information like the library, location, and barcode for physical holdings and items

  • Normalize call numbers and store them on the record to support searching

  • Add donor information to the local note for display
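
Here is the simplified sketch of the call number range lookup mentioned above. It is illustrative only, not the actual MARC-to-Argot logic: the real tool covers the full LC schedule and many more edge cases, and the two ranges below are example entries.

```python
# Simplified, illustrative call-number-to-facet lookup. The range table is
# a stand-in for the full LC schedule used in production.
import re

# (range start, range end, facet label) - illustrative entries only
LC_FACET_RANGES = [
    ("BF1",   "BF308", "BF1 - BF308"),
    ("BF309", "BF499", "BF309 - BF499"),
]

def lc_sort_key(call_number: str):
    """Split an LC class like 'BF411' into letters and number for comparison."""
    match = re.match(r"([A-Z]+)\s*(\d+(?:\.\d+)?)", call_number.upper())
    if not match:
        return (call_number.upper(), 0.0)
    return (match.group(1), float(match.group(2)))

def facet_for(call_number: str):
    """Return the facet label whose range contains the call number, if any."""
    key = lc_sort_key(call_number)
    for start, end, label in LC_FACET_RANGES:
        if lc_sort_key(start) <= key <= lc_sort_key(end):
            return label
    return None

# facet_for("BF411 .I343") -> "BF309 - BF499"
```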

And finally, to finish the pipeline in step 8, we send the processed records in batches to TRLN Discovery's shared index, which the Books & Media Catalog uses as its shared backend to provide records to patrons.
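
For a rough sense of what that last step involves, a batched send might look like the sketch below. The ingest URL, payload shape, and batch size are placeholders; the real transmission is handled by the TRLN Discovery ingest tooling.

```python
# Illustrative sketch of step 8: send transformed records to a shared-index
# ingest endpoint in batches. URL, payload shape, and batch size are placeholders.
import json

import requests

def send_in_batches(argot_records: list, ingest_url: str, batch_size: int = 500) -> None:
    for start in range(0, len(argot_records), batch_size):
        batch = argot_records[start:start + batch_size]
        resp = requests.post(
            ingest_url,
            data="\n".join(json.dumps(rec) for rec in batch),  # newline-delimited JSON
            headers={"Content-Type": "application/json"},
            timeout=120,
        )
        resp.raise_for_status()  # surface any ingest failure for this batch
```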
