A note on knowledge organisation

http://www.lucis.me.uk/knowlorg.htm#start

Brian Vickery

What is now grandly known as 'knowledge organisation' has a long history. The simplest forms of a knowledge organisation system (KOS) are, after all, the contents list and the index of a textbook. The knowledge is in the text; the KOS is a supplementary tool that helps the reader to find his way around the text. But as such finding aids have become more complex, and taken on wider functions, they have acquired grander names, such as retrieval languages, taxonomies, categorisations, lexicons, thesauri, or ontologies. They are now seen as schemes that organize, manage, and retrieve information.

The basis of any modern KOS is an assembly and display of words or phrases (I will refer to them both as 'terms') with some indication of semantic relations between them. This is a definition that would cover dictionaries, glossaries, subject heading lists, linked lists (in the computer sense), semantic nets, frame and slot structures, concept or topic maps, and other 'term stores' as well as those mentioned above. So it is not surprising that contributions to the ideas of KOS come from a variety of fields - indexers, subject cataloguers, linguists, lexicographers, taxonomists, logicians, computer programmers, artificial intelligence workers, even philosophers. The writings of Sowa provide examples of all these influences. If the definition above be accepted, a book index (without any cross-references) cannot be called a KOS, since it displays no semantic relations. Nor can a search engine index constructed by extracting single words from text, though the word positions it records enable the display to the user of a 'snippet' of text that does provide words-in-relation.

A KOS is something more than its basic term list. As the snippet example shows, we must also take into account the way that the terms are used in the whole system. In many retrieval systems, for example, terms are not used separately, but are linked in various ways to form compounds that, either explicitly or implicitly, display relations between the linked terms. The following section sets out the kinds of semantic relation displayed in different systems used in the last century.

Traditional types of KOS

(1) Alphabetical subject headings: the list may have see references that link terms to others that, in the KOS concerned, are treated as synonyms; further, it may have see also references indicating indeterminate semantic relations to other terms. When the terms in the list are combined to form compound index headings, the relations between the combined terms are not explicit, though they may be understood by the knowledgeable user.

(2) Enumerative, precoordinated classifications: ostensibly, the hierarchical links represent the generic relation between a class and its subclasses, but in practice they may also be used for the class-membership relation. The nature of the link becomes somewhat indeterminate when, for example, a part or attribute is shown as a subclass of an entity. The schedule may include see references of indeterminate nature from one class to another, and the alphabetical index to the classification may use them to link synonyms.

(3) A faceted classification subdivides a subject field or domain into explicitly named facet categories before introducing hierarchy. Various suggestions for a set of 'fundamental' or generally applicable categories have been put forward, such as entity, part, attribute, operation, place, time. Compound index headings are created by linking terms from different facets. When a term from, say, an entity facet is linked to one from an attribute facet, the link in fact represents the entity/attribute relation. The linkage of categories is thus a way of expressing semantic relations. Within a facet, hierarchy may express either the generic or the class-membership relation; sometimes, because facets may be 'telescoped' to produce a more compact schedule, the nature of the hierarchical link may not be made explicit. A given class may be subdivided in more than one way, e.g. machines (entities) according to the operation they perform, or according to the material on which they work, or according to the nature of their end-product, thus introducing categories associated with entities as characteristics of division. Faceted classifications (or indeed any classifications) may also use 'relational indicators' to express, for example, the 'influence' of one subject on another. See references are used as in enumerative schemes.

(4) Alphabetical thesauri explicitly indicate broader/narrower links which express either the generic, the class-membership (instance) or the whole/part relation, and use/use for to link terms treated as synonyms in the KOS. The associative 'related term' link is indeterminate. The various meanings of homonyms may be differentiated by subject field markers (qualifiers) or scope notes. Compounds formed by the combination of terms have the same characteristics as in alphabetical subject headings.

(5) Thesaurofacet schemes: these explicitly link a faceted classification with an alphabetical thesaurus. The alphabetical list can display broader/narrower links not explicit in the classification schedule, as well as associative relations and synonyms.

(6) Some post-coordinate indexing systems have used 'role indicators', attachable to terms when they are combined into a compound. Thus we might have 'surface/4 - cleaning/8 - sandblasting/9', where role 4 = entity, 8 = operation, and 9 = agency. In effect, at the time of combination, each term is assigned to a category. One system was even more explicit, expressing, for example, the entity/operation relation by adding a reciprocal role indicator to each term; so we would have 'surface/A - cleaning/B,C - sandblasting/D', where A = entity operated on, B = operation on entity, C = operation effected by, D = agency effecting operation. Among the relations used in such systems were entity/attribute, entity/operation, operation/agency, operation/product, entity/component, property/measure (see page 145 in Vickery, 1973).

(7) Online search systems often use alphabetical thesauri as part of their overall system, employing terms extracted from them as 'descriptors', but they may add further 'subject indication' in the form of a summary or abstract of the text, from which more indexing terms are extracted. When displayed as part of the output of a search, the abstract serves the same purpose as a 'snippet' in showing words-in-relation.

(8) Term lists used in online search interfaces: here it has been found useful (Vickery and Vickery, 1993) to attach one or more of the following kinds of information to each term: part(s) of speech, semantic category, subject area marker, classification code, scope note, definition, links to semantically related terms, rules for disambiguating multiple meanings.
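The role indicators described under (6) can be sketched in a few lines. The fragment below is only an illustration: the role codes are those of the example in (6), while the function names, the small relation table, and the decision to compare only adjacent terms are assumptions made for the sketch and do not reproduce any actual indexing system.

```python
# Role codes from the example under (6); a real system defined more.
ROLES = {"4": "entity", "8": "operation", "9": "agency"}

# Relations implied by a pair of categories (two of those listed under (6)).
RELATIONS = {
    ("entity", "operation"): "entity/operation",
    ("operation", "agency"): "operation/agency",
}


def parse_compound(compound):
    """Split 'term/code - term/code ...' into (term, category) pairs."""
    pairs = []
    for part in compound.split(" - "):
        term, code = part.rsplit("/", 1)
        pairs.append((term.strip(), ROLES[code.strip()]))
    return pairs


def implied_relations(pairs):
    """Name the relation between each adjacent pair of terms, where the
    table has an entry for their two categories (an assumption of this
    sketch: only neighbouring terms are compared)."""
    found = []
    for (term_a, cat_a), (term_b, cat_b) in zip(pairs, pairs[1:]):
        relation = RELATIONS.get((cat_a, cat_b))
        if relation is not None:
            found.append((term_a, relation, term_b))
    return found


pairs = parse_compound("surface/4 - cleaning/8 - sandblasting/9")
print(pairs)
# [('surface', 'entity'), ('cleaning', 'operation'), ('sandblasting', 'agency')]
print(implied_relations(pairs))
# [('surface', 'entity/operation', 'cleaning'), ('cleaning', 'operation/agency', 'sandblasting')]
```

The point of the sketch is that once each term carries a category, the relation between two linked terms no longer has to be guessed by the user: it can be read off from the pair of categories.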

We have seen a development from the generally indeterminate semantic relations of alphabetical indexes to various methods of spelling out ever more explicitly the semantic relations used within the KOS. This appeared to be swept away by the arrival of Internet search engines, which rely on single-word indexing with no semantic attachments. But in recent years the trend towards semantically more complex term lists used in KOS has been resumed. There has been renewed interest in facet analysis, and a new interest: the transformation of thesauri into ontologies.

Ontologies

One definition of an ontology is 'a systematic formalization of concepts, definitions, relationships, and rules that captures the semantic content of a domain in a machine-readable format'. One important aspect of an ontology is that it is a KOS designed, not only to be in machine-readable format, but also to be usable by computer software in automated knowledge management within the subject domain.

As an example of current ontologies, I will describe those developed by Teknowledge (Nichols and Terry). There are a number of basic types of term: class, individual, attribute, relation (predicate or function). By combination of terms, assertions are created and entered into the KOS.

Classes 'are like generic nouns that can be applied to distinct, named or nameable, individuals (examples of classes are Human, Dog, Company, Assassination, Cleaning)'. There are classes of entities and of events. To each class is attached a clear definition that captures its meaning. Classes in a domain are arranged hierarchically. The generic relation between a class and a subclass is set out explicitly in an assertion by using the predicate subclass, as in the example 'subclass Terrier Dog'. The instance relation between a class and an individual is also explicitly asserted, e.g. 'instance Blackie Terrier', 'instance KennedyAssassination Assassination'.

Attributes are the qualities or properties of classes or individuals, and these too are explicitly asserted in the KOS: 'attribute Terrier Furry'. Attributes too can be arranged hierarchically by assertions such as 'subattribute Red Colour', 'subattribute Scarlet Red'. Any individual or subclass 'inherits' all the characteristics of its parent class.

Predicates explicitly display relations between classes or individuals, entities or events. I have already noted the special cases of subclass, instance, attribute. But any relation can be used as a predicate, e.g. 'father Brian Adam', 'employee Newspaper Adam', 'belongs Brian Blackie'. A predicate would also be used to assert synonymity: 'identical Buonaparte Napoleon'.

As well as assertions, such an ontology contains inference rules using an 'if-then' operator, for example: 'if (instance X Dog) then (chases X Cats)', where X = any individual. Coupling this with 'instance Blackie Terrier' and 'subclass Terrier Dog' leads us to conclude that 'chases Blackie Cats'.
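The deduction just described can be followed step by step in a small sketch. It is not the Teknowledge software or its syntax: the tuple representation, the function name infer, and the built-in step that passes instances up the subclass hierarchy are all assumptions made for illustration.

```python
def infer(assertions, rules):
    """Forward-chain until no new assertion can be derived.

    assertions: set of (predicate, arg1, arg2) tuples.
    rules: list of ((if_pred, if_arg), (then_pred, then_arg)) pairs, each
           standing for 'if (if_pred X if_arg) then (then_pred X then_arg)'.
    """
    facts = set(assertions)
    while True:
        new = set()
        # Built-in step: an instance of a subclass is also an instance
        # of the parent class.
        for (p1, x, c) in facts:
            if p1 != "instance":
                continue
            for (p2, sub, sup) in facts:
                if p2 == "subclass" and sub == c:
                    new.add(("instance", x, sup))
        # Rules supplied by the caller, applied to every matching fact.
        for (if_pred, if_arg), (then_pred, then_arg) in rules:
            for (p, x, a) in facts:
                if p == if_pred and a == if_arg:
                    new.add((then_pred, x, then_arg))
        if new <= facts:
            return facts
        facts |= new


assertions = {
    ("subclass", "Terrier", "Dog"),
    ("instance", "Blackie", "Terrier"),
}
rules = [(("instance", "Dog"), ("chases", "Cats"))]

derived = infer(assertions, rules)
print(("instance", "Blackie", "Dog") in derived)  # True
print(("chases", "Blackie", "Cats") in derived)   # True
```

Two passes are needed: the first establishes 'instance Blackie Dog' from the subclass assertion, and only then does the if-then rule apply.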

The main characteristics of an ontology, compared to traditional KOS, are therefore (a) every semantic relation between terms, including generic and class-membership relations and synonyms, is explicitly asserted, and (b) inference rules link assertions so that deductions can be made from explicit assertions to others that are logically implied by them. In this way, 'the semantic content of a domain' is captured.

Developing thesauri

The limitations of existing KOS have been summarized by Soergel et al. as follows:

Lack of conceptual abstraction: thesauri and other traditional KOS are collections of terms (generic or domain-specific), ordered in a polyhierarchic lattice structure or a monohierarchic tree structure and interlinked with some very broad and basic relationships. The distinction between a concept (meaning) and its lexicalizations (words) is not made consistently, if at all, in such a system, and as such it does not reflect the ways humans understand the world in terms of meaning and language.
Limited semantic coverage: most thesauri do not differentiate concepts into types or categories (such as living organism, substance, or process) and have a very limited set of relationships between concepts, distinguishing only between hierarchical relationships, i.e. NT/BT, and associative relationships, i.e. RT. These very rudimentary relationships are not powerful enough to guide a user in meaningful information discovery on the Web or to support inference. They do not reflect the conceptual relationships that people know and that can be used by a system to suggest concepts for expanding the query or making it more specific.
Lack of consistency: since the relationships in thesauri lack precise semantics, they are applied inconsistently, both creating ambiguity in the interpretation of the relationships and resulting in an overall internal semantic structure that is irregular and unpredictable. Many of the NT/BT hierarchical relationships could, for example, be resolved to the non-hierarchical RT relationship, and vice versa.
Limited automated processing: traditionally thesauri were designed for indexing and query formulation by people and not for automated processing. The ambiguous semantics that characterizes many thesauri makes them unsuitable for automated processing.
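
The structure being criticised can be made concrete with a minimal sketch of a traditional thesaurus. The class and method names are illustrative assumptions, not those of any thesaurus software. Note that BT/NT and RT links are stored with no further indication of what kind of relation they stand for, which is the 'limited semantic coverage' described above.

```python
from collections import defaultdict


class Thesaurus:
    """Toy thesaurus holding BT/NT, USE and RT links."""

    def __init__(self):
        self.broader = defaultdict(set)   # term -> broader terms (BT)
        self.narrower = defaultdict(set)  # term -> narrower terms (NT)
        self.related = defaultdict(set)   # term -> related terms (RT)
        self.use = {}                     # non-preferred term -> preferred term

    def add_bt(self, term, broader_term):
        # BT and NT are reciprocal, so both directions are recorded.
        # Nothing says whether the link is generic, instance or whole/part.
        self.broader[term].add(broader_term)
        self.narrower[broader_term].add(term)

    def add_rt(self, term_a, term_b):
        # RT is symmetric but says nothing about the kind of relation.
        self.related[term_a].add(term_b)
        self.related[term_b].add(term_a)

    def add_use(self, non_preferred, preferred):
        self.use[non_preferred] = preferred

    def preferred(self, term):
        # Follow a USE reference if there is one.
        return self.use.get(term, term)

    def all_narrower(self, term):
        """All terms reachable downwards through NT links."""
        found, stack = set(), [term]
        while stack:
            for nt in self.narrower[stack.pop()]:
                if nt not in found:
                    found.add(nt)
                    stack.append(nt)
        return found


t = Thesaurus()
t.add_bt("terrier", "dog")
t.add_bt("dog", "mammal")
t.add_use("canine", "dog")
t.add_rt("dog", "kennel")

print(t.preferred("canine"))            # dog
print(sorted(t.all_narrower("mammal")))  # ['dog', 'terrier']
print(sorted(t.related["kennel"]))       # ['dog']
```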

The indeterminate nature of the 'related term' link in thesauri has long given rise to discussion, and the ANSI/NISO Guidelines for monolingual thesauri sets out a series of possible semantic relations that an RT link might represent, for example: field of work/practitioner; operation/instrument; process/agent; process/counteragent; action/product; action/target; object/special attribute; property/measure; measure/instrument. Other relations proposed are: source/product; action/property of action; operation/method; object/use.

Tudhope et al. urge that the RT link in thesauri be replaced by a set of more specific relational links, and argue that a core set of associative relations should therefore be incorporated in future guidelines.

In the light of the limitations of existing KOS (listed above), Soergel et al. have been exploring the conversion of a traditional (agricultural) thesaurus into ontology format. This would involve representing BT, NT and RT links in the form of specific predicates, for example:

BT replaced by (memberOf) or (isa) or (componentOf) or (spatiallyIncludedIn)

NT replaced by (hasMember) or (includesSpecific) or (hasComponent) or (spatiallyIncludes)

RT links would be replaced by more specific relations, such as those set out below:

X (causes) Y / Y (causedBy) X

X (instrumentFor) Y / Y (performedByInstrument) X

X (processFor) Y / Y (usesProcess) X

X (beneficialFor) Y / Y (benefitsFrom) X

X (treatmentFor) Y / Y (treatedWith) X

X (harmfulFor) Y / Y (harmedBy) X

X (hasPest) Y / Y (afflicts) X

X (growsIn) Y / Y (growthEnvironmentFor) X

X (hasProperty) Y / Y (propertyOf) X

X (hasSymptom) Y / Y (indicates) X

X (similarTo) Y / Y (similarTo) X

X (oppositeTo) Y / Y (oppositeTo) X

X (hasPhase) Y / Y (phaseOf) X

X (growsIn) Y / Y (EnvironmentForGrowing) X

X (ingests) Y / Y (ingestedBy) X
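
Each relation in the list above comes with its inverse, so that an assertion made in one direction implies the reciprocal assertion in the other. A minimal sketch of this pairing follows. The relation names are taken from the list, but the example terms (wheat, aphid) and the function name are hypothetical and are not drawn from the thesaurus under conversion.

```python
# A few of the relation pairs listed above, as (forward, inverse).
PAIRS = [
    ("causes", "causedBy"),
    ("treatmentFor", "treatedWith"),
    ("hasPest", "afflicts"),
    ("hasProperty", "propertyOf"),
    ("similarTo", "similarTo"),  # symmetric: its own inverse
]

# Allow the inverse to be looked up from either direction.
INVERSE = {}
for forward, inverse in PAIRS:
    INVERSE[forward] = inverse
    INVERSE[inverse] = forward


def reciprocal(assertion):
    """Given (X, relation, Y), return the matching (Y, inverse, X)."""
    x, relation, y = assertion
    return (y, INVERSE[relation], x)


print(reciprocal(("wheat", "hasPest", "aphid")))
# ('aphid', 'afflicts', 'wheat')
```

Unlike an RT link, which records only that two terms are somehow connected, each of these predicates states what the connection is and in which direction it runs.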

Clearly, we are at the start of a long period of experimentation and development in the evolution of knowledge organisation systems.

References

Nichols, D. and Terry, A. User's guide to Teknowledge ontologies (http://ontology.teknowledge.com).

Soergel, D. et al. Journal of digital information, vol 4, issue 4, article 257 (http://jodi.tamu.edu).

Sowa, J.F. Conceptual structures, Addison-Wesley, 1984.

Tudhope, D. et al. Augmenting thesaurus relationships, Journal of digital information, vol 1, issue 8, article 41 (http://jodi.tamu.edu).

Vickery, B.C. Classification and indexing in science, 3rd edition, Butterworths 1973.

Vickery, B. and A. Online search interface design, in Journal of documentation, vol 49, pp 103-87, 1993.
