Where Do We Meet? Perspectives from Software Developers and Subject Specialists on Creating Machine Learning Projects [webinar]
The IFLA Subject Analysis and Access Section’s Working Group on Automated Indexing invites you to the following webinar:
Where Do We Meet? Perspectives from Software Developers and Subject Specialists on Creating Machine Learning Projects
When: Wednesday, November 9, 2022, 14:00 – 18:00 CET
- Supporting Subject Librarians with AI Solutions, Osma Suominen, National Library of Finland
- A Case Study on Applying Machine Learning Methods in Automated Subject Headings to Dataset Records, Mingfang Wu, Australian Research Data Commons
- Comparing Methods for Automated Keyword Extraction: Insights and Pitfalls in Setting Up an Evaluation Plan for Method Selection, Maximilian Kähler, German National Library
- Interdisciplinary Teamwork as a Success Factor: Competencies Needed for Adaptation and Implementation of a Subject Indexing Support Software at the Leibniz Information Center for Economics (ZBW), Claudia Liebetruth, ZBW
- Autocategorization Projects: A Taxonomist’s Perspective, Bob Kasenchak, Factor
- Panel Discussion featuring Diane Rasmussen Pennington, University of Strathclyde
About the speakers and presentations
Osma Suominen is an information systems specialist at the National Library of Finland. He is currently working on automated subject indexing, in particular the Annif tool and the Finto AI service, as well as the publishing of bibliographic data as Linked Data. He is also one of the creators of the Finto.fi thesaurus and ontology service and is leading the development of the Skosmos vocabulary browser used in Finto. Osma Suominen earned his doctoral degree at Aalto University while doing research on semantic portals and the quality of controlled vocabularies within the FinnONTO series of projects.
Abstract: This keynote presentation will discuss many aspects of implementing AI solutions that support subject librarians in their daily work of classifying and/or indexing documents. The topics include how to prepare for AI projects, how the choice and structure of vocabularies affects the feasibility of AI solutions, what kind of training and evaluation data is required and how to interface between AI developers and librarians who will eventually use the system. The presentation is informed by the speaker’s experiences building the open source Annif automated subject indexing toolkit and the Finto AI automated subject indexing service, which is built on Annif.
Dr. Mingfang Wu is a senior research data specialist at the Australian Research Data Commons (ARDC). She has conducted research in the areas of interactive information retrieval, search log analysis, interfaces supporting exploratory search, and enterprise search. Her recent research focuses on data discovery paradigms as part of the Research Data Alliance initiative and on improving the data discovery service of an Australian national research data catalogue.
Abstract: This presentation will focus on an experiment in applying machine learning (ML) approaches to annotate metadata records with subject headings, and on the lessons learned from it. The metadata records are from a national research data catalogue, Research Data Australia (RDA). RDA is a data registry and multidisciplinary data discovery portal; it currently harvests dataset metadata from over 100 data repositories at Australian universities, government agencies, and research and cultural institutions, among others. The RDA metadata schema enables the capture of subject headings from standard or community-adopted subject heading thesauri or classification codes. Since approximately half of RDA metadata records do not include a subject heading at the time of harvest, the team experimented with automatic subject metadata annotation approaches to enrich these records in order to enhance data integration and discovery. The experiment results indicate uneven quality across the annotated subject headings: some headings have plenty of instances while others have too few, and metadata records for well-performing subject headings tend to have features that distinguish them from records under other headings. The experiment raises questions including where and how to get quality training data for ML, how to effectively apply the trained ML models to a catalogue whose content may be evolving, how to treat the subject headings suggested by ML models when users are presented with enriched metadata records, and how well trained models generalise to other data categories.
Maximilian Kähler acquired degrees in mathematical sciences from the universities of Göttingen, Durham (UK) and Leipzig. After completing his studies, he specialized as a Data Scientist and Research Software Engineer. Prior work led him to the Federal Institute for Quality Assurance and Transparency in Health Care (IQTIG) in Berlin and the Helmholtz Center for Environmental Science (UFZ) in Leipzig, before he joined the German National Library in October 2021. Mr. Kähler is part of the Department for Automatic Indexing and Online Publications. As a scientific employee, he is part of a research project that investigates how recent advances in natural language processing and novel machine learning approaches can be exploited for the task of automated subject indexing.
Abstract: Is automatic keyword extraction just another (extreme) multi-label learning problem, and are we done once we have optimized the F1 score? What does a good test set for automated indexing look like? What dimensions of content need to be addressed? In the available zoo of evaluation metrics, which are useful for diagnosing strengths and weaknesses in available methods? While there is, obviously, no universal answer to any of these questions, this talk will look at the team’s experiences and illustrate their design choices for an evaluation plan of keyword extraction methods at the German National Library.
Claudia Liebetruth is a subject librarian at the ZBW Leibniz Information Centre for Economics, Germany. As a project manager for the “Digital Assistant”, she takes part in the ZBW’s automation efforts. She holds a master’s degree in international management. She started her career as a learning and development specialist.
Abstract: How do we keep up with quality standards in intellectual classification in the face of ever-increasing publishing output? To answer this question, a web-based tool was developed to assist intellectual classification and subject indexing in a cooperative project between several libraries in German-speaking regions and a commercial software developer. The adaptation and implementation of the software at the ZBW required distinct expertise during each project phase: technical and library-specific competencies as well as project management and training skills. This presentation will describe how the different competencies were brought together over the course of the project.
Bob Kasenchak is an information architect at Factor. A taxonomist and ontologist with an interest in knowledge graphs and Linked Data, he has worked for over a decade building and implementing taxonomy projects for publishing, enterprise, technology, and e-commerce clients. He brings experience with information modeling and semantic software to client-focused metadata and vocabulary projects. Bob holds an MM in Theoretical Studies from the New England Conservatory of Music and a BA in Liberal Arts from St. John’s College, Santa Fe. A frequent writer and presenter on semantic topics at conferences and in journals, Bob’s ongoing research interests include ontologies, knowledge graphs, and automatic text classification.
Abstract: Automatic text classification (or autocategorization) projects bring together people, processes, and technologies to provide a capability: accurate application of subject metadata to assets to improve discovery. Building out this capability for a particular organization or business need requires collaboration between domain experts, information specialists (librarians, taxonomists, archivists), search engine specialists, project managers, developers, computer and data scientists, content creators and managers, and other stakeholders with complementary points of view and expertise. This presentation will focus on the current state of the art of autocategorization and discuss the ways people, processes, and technology interact. The talk will cover autoclassification methodologies as well as the role of taxonomies (and taxonomists) in text classification projects.
Diane Rasmussen Pennington is a Senior Lecturer (Associate Professor) in Information Science at the University of Strathclyde in Glasgow, Scotland. She teaches courses in organization of knowledge, cataloging, and library systems. Her research areas include library linked data, tagging, classifying user engagement on social media, and ethical cataloging. She is the Chair of CILIP’s Metadata and Discovery Group as well as an Honorary Member and a Trustee of CILIP Scotland. She is a member of IFLA’s Standing Committee on Education and Training (SET), TESA, and the Building Strong Library and Information Science Education (BSLISE) working group.