Updates related to DUL Discovery/Catalog Functionality

October 9, 2024

The records from 9/18 are all fully ingested into the common data stream, so the catalog data should now be up to date as of 9/18.

Another fun change that came with this update: you can now search by MMS ID in the Books & Media Catalog! Searching for the MMS ID (the new Alma ID number) should work in either “all fields” searches or “ISBN/ISSN/barcode” searches.

 

[Screenshot: example of searching for an MMS ID using the “ISBN/ISSN/barcode” search scope.]

October 7, 2024

Quick update on the data pipeline. The records harvested from Alma on 9/18/24 have just about finished their load into the TRLN common data stream. Many updates are already available in the catalog, like improvements to the New Titles results. In other cases, the updated data may just be the first step toward getting the catalog to work correctly.

We have also just published a new document with more detail about the workarounds for several of the major issues still affecting the Books & Media Catalog. Spoiler alert: searching via Summon is often a good alternative.

Be on the lookout for invitations to staff listening sessions related to the catalog! Thomas Crichlow will be hosting a series of meetings to hear about issues people are experiencing and offer additional support.

October 3, 2024

This update is a bit of a state-of-the-union for the Books and Media Catalog extended universe.

When something doesn’t look or work quite right in the Books and Media Catalog, the true issue could lie in a variety of places. The following summarizes recent and upcoming work related to: the underlying data, the data pipeline, the Books and Media Catalog itself, and the Catalog Request System.

Underlying Data

Context

Some of the issues related to what is showing up in the catalog can be traced back to issues with the underlying data.

Recent updates to the underlying data

  • Consolidated duplicate e-resource records (migrated from both Aleph and 360KB)

  • Collection code table in Caiasoft has been updated to allow RL to send items to LSC for accessioning

  • Adjusted the work order for Lilly items housed in Perkins so they can be requested

  • Lilly materials at remote storage were placed in a separate Lilly Renovation work order to facilitate identifying needed record updates (e.g., cleanup for Lilly items in Perkins stacks, non-Lilly materials sent to Clancy due to mis-shelving, etc.)

  • Fixed an issue where Lilly DVDs requested by staff were being recalled by the Lilly Renovation work order

  • We’ve stripped out Library of Congress Table of Contents links from print materials in Alma so they don’t end up in the catalog with a confusing “View Online” button. Regular analysis will be undertaken to remove any future LC TOCs that slip in.

Upcoming underlying data work

  • Continuing to consolidate duplicate e-resource records

  • Address DKU items and holdings that remain in Alma with invalid permanent locations, since they are still showing up in the catalog

  • Ongoing work on data underlying bound-withs to prevent requesting errors

  • Address issue with Withdrawn items showing up in the catalog by moving them to a separate library in Alma

    • The Withdrawn Library appears in the catalog but should be suppressed from display to prevent patron requests for materials no longer held

  • Address issue with Lost items showing up in the catalog by moving them to a separate library in Alma

    • The Lost Library appears in the catalog but should be suppressed from display to prevent patron requests for materials no longer held

  • Items with a “technical migration” status type are undergoing review to determine the correct Alma status type. These include items that had an Item Process Status (IPS) in Aleph that did not map easily to Alma; IPS values in this state include missing, long missing, withdrawn, lost, on order, arrived, cancelled order, suppressed, etc.

Data Pipeline

Context

The data used for the catalog comes from Alma, but it takes a lot of transformation to get everything in the right format for the catalog. (Because we share our catalog and its data stream with our TRLN partner institutions, we all have to conform to an agreed-upon standard data structure.)

Our data pipeline from Alma to TRLN Discovery includes a complicated extraction process from Alma. The steps (for a full export and publish cycle) are roughly:

  1. Set up a publishing profile in Alma to determine what metadata elements are included when we harvest records

  2. Kick off a publishing job in Alma

  3. Harvest all of the published records from Alma

  4. Enrich the records with additional information not available from the harvest (e.g., availability information)

  5. Transform the final records into the correct format for TRLN Discovery

  6. When needed, test a subset of records in a Duke sandbox to see how they will look in the catalog before we push them into the live application

  7. Publish the records into the TRLN Discovery data stream

A note about the stale data in the catalog: when we have to run the full pipeline on all ~8 million records that go into the catalog, the process can take almost a month to complete. Since the Alma cutover in July, we’ve had to make several changes to the pipeline, and each change meant we had to reprocess all of the records again.

As the pipeline nears its final state, we are working to transition to “intermittent updates.” This means that instead of reprocessing all of our millions of records each time, we can just ask Alma for the records that have been updated since our last run. Switching to intermittent updates takes some extra coding to automate each part of the pipeline, and that work is currently underway. Once we can make that switch, we hope to be processing updated records every hour. In the meantime, changes in Alma will still take several weeks to appear in the catalog.
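
For anyone who wants to picture what “intermittent updates” means for the harvesting step, here is a minimal sketch in Python, assuming the Alma publishing profile is exposed over OAI-PMH (a standard harvesting protocol). The base URL and set name below are placeholders rather than our real configuration; the key idea is the “from” parameter, which lets us ask only for records updated since the last run instead of re-harvesting all ~8 million.

    import xml.etree.ElementTree as ET
    import requests

    # Placeholder endpoint and set name; the real publishing profile details differ.
    OAI_BASE = "https://example.alma.exlibrisgroup.com/view/oai/01DUKE_INST/request"
    OAI_SET = "trln_discovery"
    NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

    def harvest(updated_since=None):
        """Yield records from the publishing profile's OAI-PMH feed.

        With no timestamp this walks the full record set (a full publish);
        with updated_since it pulls only records changed since the last run
        (the planned "intermittent updates" mode).
        """
        params = {"verb": "ListRecords", "metadataPrefix": "marc21", "set": OAI_SET}
        if updated_since:
            params["from"] = updated_since  # e.g. "2024-09-18T06:00:00Z"
        while True:
            root = ET.fromstring(requests.get(OAI_BASE, params=params, timeout=60).content)
            for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
                yield record
            token = root.find(".//oai:resumptionToken", NS)
            if token is None or not (token.text or "").strip():
                break
            # Later pages are requested with only the resumption token.
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}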

Recent updates to the data pipeline

  • Latest data run:

    • On 9/18 we began another publishing job from Alma (step 2 above), so the next update to the data in the catalog will include changes up to the morning of September 18

    • As of 9/25, we expect the enrichment of the records (step 4 above) to be completed by 9/30 or so. It will likely take another week (around 10/7) before we see the data from 9/18 in the catalog.

  • One of the data quality issues in the catalog right now involves some missing data for electronic records, and as a result the records do not show the URL to access the resource. The needed fields have now been included in the publishing profile and the data transformation scripts have been updated to get those fields into the catalog data. When the 9/18 data goes live in October, we will see the correct access URLs for those electronic records.

  • Another data quality issue in the catalog right now is that print records were not getting a value for the “date-cataloged” field, which is what our catalog uses to display new titles. Several widgets used throughout the library website rely on this data as well. With some new fields in the publishing profile and new logic in the transformation scripts, this issue should be resolved with the next full data refresh in October.

  • We also recently fixed an issue with our availability information so that it will show up correctly on TRLN partner pages and in the availability filter.

  • We’ve been continuing to improve the publishing profile to include the fields needed for discovery.

  • We’ve removed records that shouldn’t show up in the discovery layer (e.g., a large batch of deleted and suppressed records that accidentally got included in the catalog)

  • Modified enrichment code to request records in batches of 50, instead of one at a time.
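
For the curious, the batching change in the previous bullet boils down to grouping record IDs before calling the API. Here is a rough sketch of the idea; fetch_records() is a hypothetical stand-in for the real call that enriches a whole list of records in one request.

    from itertools import islice

    BATCH_SIZE = 50  # records per enrichment request, per the change above

    def batched(iterable, size=BATCH_SIZE):
        """Yield lists of up to `size` IDs from any iterable of record IDs."""
        it = iter(iterable)
        while chunk := list(islice(it, size)):
            yield chunk

    def enrich(record_ids, fetch_records):
        """Enrich records in batches instead of one at a time.

        fetch_records is a placeholder for whatever call retrieves the
        enrichment data (e.g., availability) for a list of IDs at once.
        """
        enriched = []
        for chunk in batched(record_ids):
            enriched.extend(fetch_records(chunk))  # one request per 50 records
        return enriched

Grouping this way cuts the number of enrichment requests to roughly one-fiftieth of what per-record calls required.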

Upcoming data pipeline work

  • The scripts that complete the various pipeline steps need to be configured to run on a schedule so that we can perform the entire pipeline in an automated fashion every hour or so. So far, we are ready to run the harvesting step on a schedule, but we still need to complete the work for scheduling the enrichment, transformation, and publishing steps (a rough sketch of the overall idea follows this list).

  • Explore modifying the data transformation process to build data that better matches how the catalog expects temporary locations to work

  • Automate publishing our records to the Platform for Open Discovery (POD).

  • Trigger updates to the pipeline and catalog mappings when new libraries/locations are added in Alma

  • Create additional documentation about data flow from Alma to the catalog

  • Create a separate process to delete records from the catalog when previously published records become suppressed or deleted in Alma
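
For the scheduling item at the top of this list, the remaining work is mostly glue: chain the step scripts together and run them on a timer. A bare-bones sketch of the shape of that work follows; the four step functions are placeholders for the real scripts, and in practice this would more likely be a cron job or CI schedule than a long-running loop.

    import time
    from datetime import datetime, timezone

    def run_pipeline(harvest, enrich, transform, publish, updated_since=None):
        """Chain the remaining steps; each argument is a callable standing in
        for the corresponding script from the pipeline overview above."""
        publish(transform(enrich(harvest(updated_since))))

    def run_hourly(harvest, enrich, transform, publish):
        """Repeat the whole pipeline roughly once an hour. After the first
        full pass, only records updated since the previous run are requested."""
        last_run = None
        while True:
            started = datetime.now(timezone.utc).isoformat()
            run_pipeline(harvest, enrich, transform, publish, updated_since=last_run)
            last_run = started
            time.sleep(60 * 60)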

Books and Media Catalog

Context

The Books and Media Catalog we host at Duke is a collaboration with our TRLN partner institutions. There is a common data stream that includes records from Duke, UNC, NC State, and NC Central. There is also a common application (“TRLN Discovery”) that sets up a basic catalog interface for all of us. NC Central uses the basic application, if you want to see what that looks like. The rest of us add customizations on top of the basic application to tailor it to our needs.

Each catalog allows users to toggle the search context from their local university (e.g., Duke) to all of TRLN. This allows users at each institution to quickly see and request items that are available at another TRLN institution. This is only possible because we are all transforming our data into a common format and publishing it to a shared data stream.

Most of the issues we’ve been addressing with work on the catalog stem from a difference in how Alma handles requesting and availability: Alma focuses on title-level requesting and availability, while the catalog was built around item-level information.

Getting our data out of Alma is also more difficult because we have to request data through APIs instead of having direct database access like we had with Aleph. There are limits to how many times we can query the APIs each day, and the data we need for the catalog has to be blended from multiple places. To limit our usage of the APIs, we have adjusted our approach to showing availability: we now retrieve availability from Alma only at the title (or holdings) level, instead of at the item level, for everything except Rubenstein and University Archives items. For those items, users can still see item-level availability on the item detail pages in the catalog. For all print materials, though, users can click the green Request button to see availability at the item level.
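
To make that trade-off a bit more concrete, here is a minimal sketch of a title-level availability lookup, assuming Alma’s Bibs API with its physical-availability (p_avail) expansion, which appends AVA availability fields to the returned record. The hostname, API key handling, and subfield codes shown are illustrative, and the real availability code in the Request app is considerably more involved.

    import xml.etree.ElementTree as ET
    import requests

    # Placeholder values; the real hostname, institution path, and key management differ.
    ALMA_BIB_URL = "https://api-na.hosted.exlibrisgroup.com/almaws/v1/bibs/{mms_id}"
    API_KEY = "REDACTED"

    def holdings_availability(mms_id):
        """Fetch availability for one title with a single rate-limited API call.

        The p_avail expansion returns one AVA field per holdings record, so we
        get holdings-level availability for the whole title without spending a
        separate API call on every individual item.
        """
        resp = requests.get(
            ALMA_BIB_URL.format(mms_id=mms_id),
            params={"expand": "p_avail", "apikey": API_KEY},
            timeout=30,
        )
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        holdings = []
        for field in root.iter():
            if not field.tag.endswith("datafield") or field.get("tag") != "AVA":
                continue
            # Subfield codes below are illustrative; inspect a real response for the full set.
            subfields = {sf.get("code"): sf.text for sf in field if sf.tag.endswith("subfield")}
            holdings.append({
                "library": subfields.get("b"),
                "location": subfields.get("c"),
                "call_number": subfields.get("d"),
                "availability": subfields.get("e"),
            })
        return holdings

One call like this covers every holdings record attached to a title, which is why holdings-level availability is so much cheaper against the daily API limits than item-by-item lookups.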

Recent updates to the Books and Media Catalog

  • We have completed work on Duke customizations to show availability at the holdings level on both search results and item detail pages, regardless of whether the user is searching just Duke’s records or all of the TRLN records.

  • Recently, we also updated the common catalog application so our TRLN partner institutions can also display our holdings-level availability data.

  • Added a clearer banner to alert patrons to ongoing issues and better match the style of the banner on our public website

  • Added the barcode display back in to facilitate various workflows

  • Improved the real-time availability check at the holdings level for search results pages and at the holdings and items levels for item detail pages.

  • Fixed the “show all items” functionality for long lists of items

  • Fixed an issue where items in temporary locations weren’t showing up correctly or linking to the correct location map. Fixing this also fixed a problem where Rubenstein records were not showing all items on the item details page.

  • In some cases, Alma can’t determine availability at the holdings level, and we need to point people to the Request app so they can view availability at the item level. We now display a link to the Request page labeled “Availability Details” in place of the availability information at the holdings level.

  • Fixed an issue where clicking on the map icon returned an unhelpful error message

  • Fixed an issue where more than just Rubenstein and Archives items were showing item-level availability on item detail pages

  • Blocked a list of IP addresses to mitigate a bot attack that was slowing down performance

  • Updated the list of collection codes to include new locations like Bishop’s House

  • Analyzed 1.3M records in the catalog data stream that haven’t been updated since go-live, as those records can’t be coming from Alma. Almost all are duplicate records left over from 360KB. The remaining 2,035 have been reviewed and all but 2 can be deleted from the common data stream.

  • Implemented a “guest request” link for visitors to request LSC items

Upcoming Books and Media Catalog work

  • Improve our automated testing to make it easier to see if code changes have broken anything else

  • Support searching by the new Alma ID and Duke’s TRLN ID

  • Improve real-time availability checking for items at temporary locations

  • Continue to adjust display of bound-with items

  • Fix the Ford Kindle e-books collection, which is not displaying properly

  • Complete a major underlying software upgrade from Blacklight 7 to 8

Catalog Request System

Context

The Catalog Request System is the system that drives the real-time availability updates in the catalog UI, and it is also where users go when they click on the big green Request button.

Recent updates to the Catalog Request System

  • Changed how items are identified to prevent collapsing of items with identical enumeration

  • Fixed an issue where Rubenstein items weren’t showing up if there were also LSC items

  • Fixed issues where Rubenstein and other items were requestable when they shouldn’t be (“In process”, “Technical-Migration”, “Acquisition”)

  • Fixed the ability to request bound-withs

  • Fixed an issue with the logic for renewing items

  • Improved performance of the Request app with large sets of items (greater than 50)

  • Fixed an issue with the display of the Interlibrary Loan Request option (it should appear if at least one item has a status other than “On order” and the patron has ILR permissions)

  • Updated the Request app to show Bishop’s House as a pick-up location

Upcoming Catalog Request System work

  • Sort items by holdings, then enum/chronology

  • Fix problems with availability and requesting for items with relations (i.e., related physical and electronic items)

  • Explore problems with title-level requesting that cause a loop or a failed request