and
Tommy Schomacker,
Head of Computer Department,
Danish Library Centre
DanBib is a union catalogue for all libraries in Denmark. This paper presents an approach to applying data mining and fuzzy logic in the evaluation of queries to this database. The approach uses a fuzzy semantic network relating keywords from the database. The network exploits the actual content of the database and is obtained by purely statistical means. Even so, the relations between words resemble human associations sufficiently to extend answers in an intuitive manner. The approach has been implemented as a web interface to DanBib, and an experiment has been carried out.
As a result of this comprehension a new project was formulated, now to construct a prototype of a user-friendly interface to DanBib. The prototype is made as a plug-in module to the DanBib system. The project team consists of scientists and programmers from ISL and librarians and programmers from DBC.
In the prototype a query is evaluated by flexible interpretation in two respects: partly by applying fuzzy evaluation of each single criterion stated, and partly by applying a tolerant interpretation of the compound query, adjusting the connection of all criteria within a range that takes disjunction as the least and conjunction as the most restrictive connection.
The Danish Library Centre (DBC) was responsible for BASIS and The Computer Department for Danish Research Libraries (FEK) was responsible for ALBA.
The main part of the Danish bibliographic records in BASIS was created and maintained by the Danish Library Centre as an improvement of the Danish national bibliography, while the majority of the foreign records were delivered by the public libraries. Re-use of these data took place through weekly standardized data deliveries on floppy discs to public and school libraries for use in their purchase of materials - with subsequent reporting to BASIS, so that the database could be used for localization. The research libraries' re-use of the national bibliographic records took place through weekly transfers from BASIS to ALBA.
ALBA was maintained with bibliographic records from Danish research libraries and national bibliographic data from Denmark and foreign countries. ALBA could thereby be used for localization purposes as well as serve as a reservoir of records for re-use in cataloguing.
As producer of the Danish national bibliography the Danish Library Centre had a close cooperation with the two legal deposit libraries - the Royal Library and the State and University Library. This cooperation, however, was difficult due to the fact that the three institutions were cataloguing into their separate databases - with partly different cataloguing practices.
It seemed evident that the registration process - both for the national bibliography and the legal deposit administration and for the re-use in the libraries - could happen more quickly and more rationally if the work took place in one common database.
A working group established for evaluation of the legal deposit administration also came to this conclusion in its final report in April 1991.
At a common library meeting in September 1990 this idea was specifically formulated and was generally supported. The work then started with formulating a data model for the common database, i.e. in reality a common reference framework for the libraries. An important element was the adoption of the principle "one title - one record": in contrast to earlier practice, where each title had a record for each library, there should be only one bibliographic record for a given title. This main principle was kept in the implementation, but the chosen data model still allows the participating libraries to add local data to the title record.
Parallel to the considerations of the data model, which was of great importance to the internal work and cooperation between the libraries, a purely edp-technical evaluation of the demands on the new common system took place. The basic idea in these considerations was open standards.
The search for a suitable system began as an invitation to submit tenders, but after a long and complicated process it was ascertained in early 1993 that none of the offered systems was suitable for this specific task. It was thereafter decided to develop the system on the "building block model", i.e. piecing the system together from existing finished modules combined with in-house development.
The third element of the solution - next to the data model and the edp system - was the organizational part. It was decided to merge DBC and FEK and to assign the merged organization the task of developing and running the common system, DanBib.
After an intense development effort in 1993, DanBib opened in January 1994 - at first with only ALBA included. Version 2 - with BASIS included - opened in July 1995. A long line of developments of DanBib is ongoing and planned.
DanBib contains approximately 10 million records and serves a total of 2300 users (up to 500 concurrent). Half a million interlibrary loan orders are transmitted per year.
Central to all tasks is the search facility, which today is based on the international standard ISO 8777:1993 Information and documentation -- Commands for interactive text searching - also called Common Command Language (CCL). The basis for this is Boolean logic, which can be an efficient tool for the experienced user of the system. For the less experienced user, however, there is a need for more flexible search tools, which is the reason for our interest in the opportunities offered by fuzzy logic.
In CCL the search expression is FIND followed by a search argument. The search argument may consist of search elements combined with Boolean operators, e.g. "FIND gold or silver". In this paper we will use the term "compound query" instead of "search expression" and "constraints" instead of "search elements".
In natural language we often use words and concepts that are better modeled with fuzzy logic than with classical two-valued logic. For instance, to express that a book is new using only two-valued logic, it is necessary to decide on an exact age limit to distinguish "new book" - say, books published within the latest 3 years are new, and books older than that are not. Using fuzzy logic it is possible to get much closer to the intention of the concept. We could decide on a simple function, as shown in figure 1, expressing that books published within the latest two years are new, books published between two and four years ago are new to some degree, and books older than that are not new.
Apart from introducing new concepts, we can use fuzzy logic to generalize equality into similarity. For instance, the expression "equal to 3" can be relaxed into "around 3", and the expression "between 2 and 4" can be relaxed into "around 2 to 4", as shown in the two membership functions below.
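Such piecewise-linear membership functions are easy to state directly. The following is a minimal sketch in Python; the age limits follow the "new book" example above, while the function names and the spread of "around" are our own assumptions:

```python
def new_book(age_years: float) -> float:
    """Degree to which a book is "new": 1 within two years,
    falling linearly to 0 at four years (cf. figure 1)."""
    if age_years <= 2:
        return 1.0
    if age_years >= 4:
        return 0.0
    return (4 - age_years) / 2

def around(x: float, center: float, spread: float = 1.0) -> float:
    """Triangular membership for "around center": 1 at the center,
    falling linearly to 0 at distance spread (spread is assumed)."""
    return max(0.0, 1.0 - abs(x - center) / spread)

print(new_book(1))     # 1.0 (new)
print(new_book(3))     # 0.5 (new to some degree)
print(around(3.5, 3))  # 0.5 (partly "around 3")
```

Any piecewise-linear shape will do; the essential point is that membership is graded rather than two-valued.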
In fuzzy querying one important logical aspect is this kind of similarity for elements of a given domain.
For domains with a natural order it is straightforward to define similar values for a given value; these are the values closest to the given value according to the order.
Problems arise when the domain in question is without ordering. In that case similarity must be explicitly defined in some kind of network relating similar elements. Of major interest in our approach are domains of words used in describing properties of objects, i.e. bibliographic entities. In a bibliographic database these may be authors' keywords, cataloguers' keywords, or words appearing in titles, notes, abstracts or other textual attributes attached to objects.
Examples of explicitly defined similarity on domains of words are common natural language thesauri or word (and concept) taxonomies; special cases of the latter are commonly used in the library cataloguing process. Such similarity on domains of words or concepts can, just as for simple numeric domains, be relaxed into a kind of graded similarity. For instance a relaxed thesaurus could express:
In the approach described here we mainly focus on a third kind of similarity that gives graded relations between words of the domain expressing associations between words.
While a thesaurus relates words that are similar – to some degree – with respect to meaning, an association structure relates words that are similar with respect to context:
We call explicitly defined similarity structures semantic networks and, when grades are attached to the relations, fuzzy semantic networks. In our approach we apply a statistically based fuzzy semantic network, denoting a special interpretation of associations.
That is, bibliographic objects (bibliographic references) described with 6 attributes, where 3 of these can take multiple values (indicated with *). Of the attributes mentioned, especially the multi-value attribute Keyword is of interest. This attribute holds one or more words selected from a classification system or a controlled list of keywords and attached by the cataloguer.
Our approach to fuzzy querying extends conventional Boolean querying in two respects:
The former leads to a tolerant aggregation of constraints, whereas the latter relaxes a single constraint into a more tolerant counterpart based on predefined parameters delimiting the fuzzification.
Below we first present the fuzzy interpretation of a compound query. We then introduce a special semantic network applied in the prototype, which is based purely on statistics over the state of the database. Finally we describe the overall principle of query evaluation and give examples of queries evaluated against the small database used in the experiment with the implemented prototype.
The principle of aggregation of a compound query is as follows.
Thus the truth value of "O satisfies Q" is the average of the n single truth values of "O satisfies Ci".
Notice that this principle of query evaluation may be applied without any other kind of weakening. If all single constraints are crisp (can take only the values true or false, that is, 1 or 0), then the interpretation becomes: n satisfied constraints are considered better (are graded with a better truth value) than n-1 satisfied constraints, which in turn are considered better than n-2 satisfied constraints, etc.
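For purely crisp constraints this weakening can be sketched in a few lines; the representation of objects as keyword sets and all names here are our own assumptions:

```python
def query_grade(object_keywords: set, constraints: list) -> float:
    """Average aggregation: the truth value of "O satisfies Q" is the
    mean of the truth values of the single crisp constraints."""
    truths = [1.0 if c in object_keywords else 0.0 for c in constraints]
    return sum(truths) / len(truths)

q = ["death", "grief", "childhood", "school"]
print(query_grade({"death", "grief", "childhood", "school"}, q))  # 1.0
print(query_grade({"death", "grief", "childhood"}, q))            # 0.75
print(query_grade({"death", "grief"}, q))                         # 0.5
```

An object satisfying 3 of 4 constraints is thus graded above one satisfying only 2, exactly the ordering described above.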
Even though this principle is simple, it can be quite powerful in practice and provides users with a new and better means of querying the database. It is no longer necessary to minimize the set of constraints posed in a query, as users who prefer non-empty answers normally do; instead it becomes fruitful to express all potentially relevant constraints describing the user's intention, since the manual minimization is, so to speak, replaced by the simple weakening performed by the system.
For this purpose the most interesting domains in the bibliographic database are the sets of words used in the different descriptions of objects in the database, e.g. in titles, abstracts, notes, related keywords and even author names.
For the experiment discussed in this paper we have chosen to build a network based on cataloguers' keywords, with no influence on associations from words as used in other fields describing objects. Thus even titles are ignored when associating words. We have chosen a simple function that relates two words A and B based on the frequencies with which they appear in descriptions of objects in the database. The function is asymmetric and takes as arguments the frequencies #(A) and #(B) of the two words and their co-occurrence frequency #(A and B).
The word A relates to B to the degree:

    assoc(A -> B) = #(A and B) / (#(A) + 1)

B relates to A similarly by:

    assoc(B -> A) = #(A and B) / (#(B) + 1)
We should emphasize here that this function has been chosen as an ad hoc first solution and that further studies have to be done to investigate the problem of statistical association.
Thus with the frequencies of the words A and B as shown in figure 4a, that is, #(A) = 7, #(B) = 19 and #(A and B) = 4, we obtain the association from A to B as #(A and B) / (#(A) + 1) = 4 / (7 + 1) = 0.5 and the association from B to A as #(A and B) / (#(B) + 1) = 4 / (19 + 1) = 0.2.
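This computation can be sketched as a simple count over object descriptions. The tiny database below is hypothetical, constructed only to reproduce the frequencies of figure 4a:

```python
def association(descriptions, a, b):
    """Degree to which keyword a associates to keyword b:
    #(a and b) / (#(a) + 1), counted over the descriptions."""
    n_a = sum(1 for kws in descriptions if a in kws)
    n_ab = sum(1 for kws in descriptions if a in kws and b in kws)
    return n_ab / (n_a + 1)

# Hypothetical descriptions: 4 objects with both A and B, 3 with only A,
# 15 with only B, giving #(A) = 7, #(B) = 19, #(A and B) = 4.
db = [{"A", "B"}] * 4 + [{"A"}] * 3 + [{"B"}] * 15
print(association(db, "A", "B"))  # 0.5 = 4 / (7 + 1)
print(association(db, "B", "A"))  # 0.2 = 4 / (19 + 1)
```

The asymmetry is visible directly: a word that is rare overall (A) associates strongly to a frequent co-occurring word (B), but not vice versa.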
As an example of what this means of deducing associations between words might lead to, consider the following small section of a semantic network, derived from a real world database used in the experiment:
The approach described in this paper is, as far as the softening of words as query constraints is concerned, based solely on semantic networks as described above. So an important question is of course: what are these kinds of associations, and how can their meaning be described?
The intention is to capture relations that resemble aspects of human association. Compared to the other extreme for achieving this - knowledge acquisition by interviewing an expert in the domain in question - one major drawback is obvious. Words that are closely related by association may also be closely related by meaning, as for instance parent and mother. When two words are closely related by meaning, the typical description will include only one of them, and we therefore obtain no or only a weak association between the two words in this case.
However, since our semantic networks are based only on the set of attached keywords, and since each word in this set is carefully chosen to have a distinct meaning, this drawback has only little influence on the quality of the resulting network.
A query is restricted to be conjunctive, thus of the form
The general form of a simple query constraint Qi is Ai Fi Vi, where Ai is a database attribute, Fi is a comparison operator and Vi is a value. However, to keep things simple, we shall ignore constraints other than those on the attribute "keyword" using the "="-operator, and then take each Qi to be just a value for keyword.
A query, such as
is evaluated under two threshold values: a threshold TA that delimits the extent to which the aggregation can be relaxed, and a threshold TC that delimits the extent to which a single constraint can be relaxed. With TA=1 and TC=1 the query is interpreted as a Boolean query. Keeping TC=1 and lowering TA to, say, 0.75, matching objects will include those that meet only 3 of the 4 constraints, while TA=0.5 accepts 2 of the 4 constraints. The influence of lowering TC is as follows. If we set TC=a, then each constraint Vi is relaxed to be matched by the set r(Vi), listing pairs of ("matching keyword", "grade"), r(Vi) = {(Vi,1), (Vi1,s1), ..., (Vik,sk)}, that includes all values Vij that Vi associates to with a grade sj >= a.
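The relaxation r(Vi) amounts to a thresholded lookup in the fuzzy semantic network. A sketch follows; the network fragment and its grades are hypothetical - in the prototype they are derived statistically as described above:

```python
# Hypothetical fragment of a fuzzy semantic network:
# network[word] maps each associated word to its association grade.
network = {
    "death":     {"grief": 0.8, "parents": 0.6},
    "childhood": {"children": 0.8, "school": 0.5},
}

def relax(v: str, tc: float) -> dict:
    """r(V): V itself at grade 1, plus every word that V associates
    to with a grade of at least the threshold TC."""
    matches = {v: 1.0}
    for word, grade in network.get(v, {}).items():
        if grade >= tc:
            matches[word] = grade
    return matches

print(relax("death", 0.8))  # {'death': 1.0, 'grief': 0.8}
print(relax("death", 0.6))  # {'death': 1.0, 'grief': 0.8, 'parents': 0.6}
```

Lowering TC thus widens each constraint to a graded set of acceptable keywords without touching the aggregation.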
Thus for the query
we get, when referring to figure 5 with TC=0.8
while TC=0.6 leads to
We use as mentioned average for aggregation, thus, when still referring to figure 5, we get objects satisfying Q to the grades:
Objects including one of the subsets as keywords     match to grade
{Death, Childhood}                                   1.00
{Death, Children}, {Grief, Childhood}                0.90
{Grief, Children}, {Children, Childhood}             0.85
{Parents, Childhood}                                 0.80
etc.
{Death}, {Childhood}                                 0.50
{Grief}, {Children}                                  0.45
etc.

Fig. 6. Average aggregation over objects described by keywords including one of the subsets shown.
Now, setting the thresholds TA=1 and TC=1 gives the conventional Boolean query and results only in objects including {Death,Childhood} as keywords. Keeping TA=1 and lowering TC does not change anything, since TA delimits the aggregation, which is the average and thus falls below 1 when one constraint is satisfied below 1. Keeping TC=1 and setting TA=0.5 corresponds to the simple weakening principle described above and adds to the answer objects that include either {Death} or {Childhood} as keywords. Setting TA=0.85 and TC=0.7 or lower leads to an answer with objects including as keywords any of the subsets listed in the first 3 lines of figure 6.
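Combining the two thresholds, the full evaluation can be sketched as below. The network grades are hypothetical, chosen so that the resulting grades mirror the first lines of figure 6; in the prototype they come from the statistical network:

```python
# Hypothetical association grades (word -> associated word with grade):
network = {
    "death":     {"grief": 0.8},
    "childhood": {"children": 0.8},
}

def constraint_grade(v, keywords, tc):
    """Best grade with which any keyword of the object matches the
    constraint v: v itself counts 1.0, associations >= TC count
    with their grade."""
    best = 1.0 if v in keywords else 0.0
    for word, grade in network.get(v, {}).items():
        if grade >= tc and word in keywords:
            best = max(best, grade)
    return best

def evaluate(query, keywords, ta, tc):
    """Average aggregation of the relaxed constraints; objects graded
    below the aggregation threshold TA are rejected (grade 0)."""
    grade = sum(constraint_grade(v, keywords, tc) for v in query) / len(query)
    return grade if grade >= ta else 0.0

q = ["death", "childhood"]
print(evaluate(q, {"death", "childhood"}, 0.85, 0.7))  # 1.0
print(evaluate(q, {"grief", "childhood"}, 0.85, 0.7))  # 0.9
print(evaluate(q, {"grief", "children"}, 0.85, 0.7))   # 0.0 (0.8 < TA)
```

With TA=1 and TC=1 the same code reduces to the conventional Boolean query, as only exact keyword matches yield grade 1.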
To give an impression of how this works in practice, an answer to the query Q = Death and Childhood is shown in figure 7 below. The query is evaluated on the database of about 6000 objects which is also the basis for the fuzzy semantic network shown in figure 5. Without relaxing the query, that is, with TA=1 and TC=1, the answer is empty. For the answer shown, the chosen thresholds are TA=0.80 and TC=0.80. Only titles and query grades for the books are listed.
 1  0.95  Only a broken heart knows
 2  0.95  Baby's death – "life's always right"
 3  0.95  We have lost a child
 4  0.95  Losing a child
 5  0.95  Cot death
 6  0.95  When parents die
 7  0.95  To say good-bye
 8  0.95  We have lost a child
 9  0.95  A year without Stig
10  0.95  Taking leave
11  0.95  Grief and care in school
12  0.89  Can I die during the night
13  0.89  Children's grief
14  0.89  Children and grief

Fig. 7. The answer to the query Q = Death and Childhood, when evaluated on the database in the experiment.
We should emphasize that the database used in the experiment is not fake. The content is descriptions of "real world" books on issues related to children, school and education. All descriptions are made by cataloguers who have carefully examined the books and chosen distinct, descriptive keywords from a controlled keyword list and according to generally accepted international guidelines (ISO 2788 and ISO 5963).
The resulting network is thus the implicit work of the cataloguers.
Of other ideas under current discussion we can mention the following.
First of all we consider domain knowledge of different forms as a basis for advanced query processing against the bibliographic database. Fuzzy semantic networks as exemplified above may be complemented with thesaurus structures that capture narrower and broader terms and with structures that cluster synonyms for uniform interpretation in query specifications. Further, we intend to include morphological conversion of terms to the standard forms used in the chosen vocabularies. We are doing experiments with semantic networks based also on titles, notes and authors from bibliographic objects, and especially for networks based on these attributes, the handling of synonyms and morphological conversions is needed.
As a second type of knowledge to be applied as a basis for advanced query processing we are considering user profiles. A user profile represents what the system knows about the user and may be applied by the system to customize the answer. The profile may in part be given by the user and in part be adapted by the system from the user's behavior when using the system. A profile can include knowledge about topics of interest, preferred settings of for instance display formats and default constraints for queries, measures for determination of relevance, and assumptions that become interesting when violated.
Based on the associations drawn from data catalogued under two or more systems we believe we can construct cross-system relations that represent vague but useful conversions among the systems. Thus, when a sufficiently large data set is available we can build a concordance for the classification systems used. It is by no means possible in practice to construct a complete concordance, and we should stress here that an automatically generated concordance must be expected to be "more incomplete" than what could be expected as a result of the work of experts on the systems involved in the concordance. Even so, an automatically generated concordance can be used as a guide and - maybe its most important property - when the concordance is available, keywords from one system can be used in queries to data that is not catalogued in that system, because translations can be performed based on the available concordance.
As a further refinement of query specifications we are experimenting with importance weighting of the individual constraints in a compound query. While importance weights are of course a nuisance when it is difficult or even irrelevant to decide on them, they can be very helpful in cases where, for instance, a few constraints are almost absolute while a number of additional constraints may be quite relevant to describe the need, but only when they can be subject to relaxation.
To handle answers comprising too many objects we suggest methods for clustering the objects into groups, each representing a topic. Two approaches are considered here, both based on cataloguers' keywords attached to the descriptions of objects. One approach is based on generalization to broader terms from the union of keywords in the answer; the other is based on eliciting the most frequently associated words, also from the union of keywords.
Finally, we can mention that we are experimenting with the generation of explanations for small or empty answers to queries. Such explanations could be given as a guide to the user and include, for instance, information about single constraints in the posed query that are unsatisfied, or sets of constraints that conflict in the current state of the database.
The functions used in this process are deliberately chosen to be simple and naive, partly because we want the principles to be as easily comprehensible as possible, and partly because further research has to be done to reveal and identify the properties that make functions suitable and more accurate.
Even so, we conclude that the derived associations in general appear to be meaningful and that, when used in the described processing of queries, they lead to meaningful and quite useful extensions of answers to queries.
The derived associations are based on cataloguers' keywords and are therefore a special indirect result of the cataloguing process - a kind of inversion of the cataloguing data.
The keywords are specifically chosen as "descriptors" from a continuously controlled vocabulary of approved words that have a very low degree of ambiguity (most words/concepts have a unique meaning).
We must give the credit for the useful derived associations to the high quality of the available cataloguing data.
Andreasen T.: On flexible query answering from combined cooperative and fuzzy approaches. In Proc. VI IFSA World Congress, Sao Paulo, Brazil, 1995.
Andreasen T. and Pivert O.: Improving answers to failing fuzzy queries. In Proc. VI IFSA World Congress, Sao Paulo, Brazil, 1995.
Andreasen T. and Pivert O.: On the weakening of fuzzy relational queries. In Proc. ISMIS '94, Eighth International Symposium on Methodologies for Intelligent Systems, Charlotte, North Carolina. Springer Verlag, Lecture Notes in Artificial Intelligence, 1994.
Christiansen H., Larsen H. L. and Andreasen T.: Proceedings of the second workshop on Flexible Query-Answering Systems. Datalogiske Skrifter no. 62. 1996, Roskilde University.
Emneord på bøger og artikler: indekseringsregler. Version 2.0. Dansk BiblioteksCenter, 1996 (internal guidelines)
Friis-Hansen J. B.: Vejledning i indeksering (informal translation of ISO 5963:1985)
ISO 5963:1985 Documentation -- Methods for examining documents, determining their subjects, and selecting indexing terms
ISO 8777:1993 Information and documentation -- Commands for interactive text searching
ISO/DIS 2788:1986 Documentation -- Guidelines for the establishment and development of monolingual thesauri
Klir G. J. and Folger T. A.: Fuzzy Sets, Uncertainty, and Information. Prentice Hall, 1988.
Larsen H. L.: Recognition of implicit concepts in unstructured queries. Proc. VI IFSA World Congress, Sao Paulo, Brazil, 1995, Vol. 1, pp. 233-236.
Larsen H. L. and Andreasen T.: Proceedings of the first workshop on Flexible Query-Answering Systems. Datalogiske Skrifter no. 58. 1995, Roskilde University.
Larsen H. L. and Yager R. R.: An approach to customized end user views in information retrieval. In J. Kacprzyk and M. Fedrizzi, Eds., Multiperson Decision Making Using Fuzzy Sets and Possibility Theory, Kluwer Academic Publishers, pp. 128–139, 1990.
Larsen H. L. and Yager R. R.: Query fuzzification for Internet information retrieval. Datalogiske Skrifter No. 60, 1996. To appear in Fuzzy-Set Methods in Information Engineering: A Guided Tour of Applications, John Wiley & Sons.
Larsen H. L. and Yager R. R.: The use of fuzzy relational thesauri for classificatory problem solving in information retrieval and expert systems. IEEE Transactions on Systems, Man, and Cybernetics 23(1):31–41, 1993.