Polythetic Classification Definition Essay


Family resemblance (German: Familienähnlichkeit) is a philosophical idea made popular by Ludwig Wittgenstein, with the best known exposition given in his posthumously published book Philosophical Investigations (1953).[1] It argues that things which could be thought to be connected by one essential common feature may in fact be connected by a series of overlapping similarities, where no one feature is common to all of the things. Games, which Wittgenstein used as an example to explain the notion, have become the paradigmatic example of a group that is related by family resemblances. It has been suggested that Wittgenstein picked up the idea and the term from Nietzsche, who had been using it, as did many nineteenth century philologists, when discoursing about language families.[2]

The first occurrence of the term "Family resemblance" is found in a note from 1930, commenting on Spengler's ideas.[3] The notion itself features widely in Wittgenstein's later work, and in the Investigations it is introduced in response to questions about the general form of propositions and the essence of language – questions which were central to Wittgenstein throughout his philosophical career. This suggests that family resemblance was of prime importance for Wittgenstein's later philosophy; however, as with many of his ideas, the secondary literature shows little precise agreement either on its place within Wittgenstein's later thought or on its wider philosophical significance.

Since the publication of the Investigations, the notion of family resemblance has been discussed extensively not only in the philosophical literature, but also, for example, in works dealing with classification where the approach is described as "polythetic", distinguishing it from the traditional approach known now as "monothetic". Prototype theory is a recent development in cognitive science where this idea has also been explored. As the idea gains popularity, earlier instances of its occurrence are rediscovered e.g. in 18th century taxonomy,[4] in the writings of Vygotsky[5] or Tatarkiewicz.[6]

Philosophical context

The local context where the topic of family resemblances appears is Wittgenstein's critique of language. In Philosophical Investigations §65-71 the plurality of language uses is compared to the plurality of games. Next it is asserted that games have common features but no one feature is found in all of them. The whole argument has become famous under the heading 'language games'.

The larger context in which Wittgenstein's philosophy is seen to develop considers his uncompromising opposition to essences, mental entities and other forms of idealism which were accepted as a matter of fact in continental philosophy at the turn of the preceding century. In his view, the main cause for such errors is language and its uncritical use. In the received view, concepts, categories or classes are taken to rely on necessary features common to all items covered by them. Abstraction is the procedure which acknowledges this necessity and derives essences, but in the absence of a single common feature, it is bound to fail.


The term "Family resemblance" as a feature of Wittgenstein's philosophy owes much to its translation into English. Wittgenstein, who wrote mostly in German, used the compound word 'Familienähnlichkeit', but when he lectured and conversed in English he used 'family likeness' (e.g. The Blue Book, pp. 17, 33; The Brown Book, §66). However, in the Philosophical Investigations the separate word 'Ähnlichkeit' has been translated as 'similarity' (§§11, 130, 185, 444) and on two occasions (§§9, 90) it is given as 'like'. The German family-word is common and it is found in Grimm's dictionary; a rare occurrence of 'family likeness' has been noted in a lecture by J. F. Moulton in 1877.[7]

Examples and quotes

Games are the main example considered by Wittgenstein in his text, where he also mentions numbers and makes an analogy with a thread. He develops his argument further by insisting that in such cases there is no clear-cut boundary, although it remains ambiguous whether this indefiniteness can be separated from the main point.

In §66 Wittgenstein invites us to

consider for example the proceedings that we call "games"...[to] look and see whether there is anything common to all.

The section mentions card games, board games, ball games, games like ring-a-ring-a-roses and concludes:

And we can go through the many, many other groups of games in the same way; we can see how similarities crop up and disappear.

And the result of this examination is: we see a complicated network of similarities overlapping and criss-crossing: sometimes overall similarities.

The following §67 begins by stating:

I can think of no better expression to characterize these similarities than "family resemblances"; for the various resemblances between members of a family: build, features, colour of eyes, gait, temperament, etc. etc. overlap and criss-cross in the same way. – And I shall say: "games" form a family.

and extends the illustration

for instance the kinds of number form a family in the same way. Why do we call something a "number"? Well, perhaps because it has a direct relationship with several things that have hitherto been called number; and this can be said to give it an indirect relationship to other things we call the same name. And we extend our concept of number as in spinning a thread we twist fibre on fibre. And the strength of the thread does not reside in the fact that some one fibre runs through its whole length, but in the overlapping of many fibres.

The problem of boundaries begins in §68

I can give the concept 'number' rigid limits ... that is, use the word "number" for a rigidly limited concept, but I can also use it so that the extension of the concept is not closed by a frontier. And this is how we do use the word "game". For how is the concept of a game bounded? What still counts as a game and what no longer does? Can you give the boundary? No. You can draw one; for none has so far been drawn. (But that never troubled you before when you used the word "game".)

Formal models

There are some simple models[5][8] which can be derived from the text of §66-9. The simplest one, which fits Wittgenstein's exposition, seems to be the sorites type. It consists of a collection of items Item_1, Item_2, Item_3... described by features A, B, C, D, ...:

Item_1: A B C D
Item_2: B C D E
Item_3: C D E F
Item_4: D E F G
Item_5: E F G H
......... . . . .

In this example, which presents an indefinitely extended ordered family, resemblance is seen in shared features: each item shares three features with its neighbors, e.g. Item_2 is like Item_1 in respects B, C, D, and like Item_3 in respects C, D, E. Obviously what we call 'resemblance' involves different aspects in each particular case. It is also seen to be of a different 'degree', and here it fades with 'distance': Item_1 and Item_5 have nothing in common.
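The sorites-type model can be checked mechanically. The following is a minimal sketch in Python, assuming each item is represented as a set of single-letter features; the helper names `sorites_family` and `shared` are illustrative and do not come from the literature cited here.

```python
from string import ascii_uppercase


def sorites_family(n_items, width=4):
    """Build Item_1..Item_n, where Item_k carries `width` consecutive
    features starting at the k-th letter of the alphabet."""
    return [set(ascii_uppercase[k:k + width]) for k in range(n_items)]


def shared(items, i, j):
    """Features common to Item_i and Item_j (1-based, as in the text)."""
    return items[i - 1] & items[j - 1]


family = sorites_family(5)

# Resemblance to Item_1 fades with distance along the chain.
for j in range(2, 6):
    print(f"Item_1 vs Item_{j}: {sorted(shared(family, 1, j))}")

# No single feature runs through the whole family.
print("common to all:", sorted(set.intersection(*family)))
```

With five items the overlap with Item_1 shrinks from three features to none, and the intersection over the whole family is empty, which is the point the model is meant to make.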

Another simple model is described as:

Item_1: A B C
Item_2: B C D
Item_3: A C D
Item_4: A B D

It exhibits the presence of a constant degree of resemblance and the absence of a common feature without extending to infinity.
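Both properties claimed for this second model can be verified directly. The sketch below is illustrative only; it assumes the four items are encoded as Python sets, and the variable names are not taken from the sources.

```python
from itertools import combinations

family = {
    "Item_1": {"A", "B", "C"},
    "Item_2": {"B", "C", "D"},
    "Item_3": {"A", "C", "D"},
    "Item_4": {"A", "B", "D"},
}

# Degree of resemblance for every pair: the number of shared features.
pairwise_overlap = {
    (a, b): len(family[a] & family[b]) for a, b in combinations(family, 2)
}

# Features present in every item of the family.
common_to_all = set.intersection(*family.values())

print(pairwise_overlap)
print("common to all:", common_to_all)
```

Every pair shares exactly two features, while the intersection over all four items is empty.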

Wittgenstein rejects the disjunction of features or 'properties', i.e. the set {A,B,C,D,..}, as something shared by all items. He admits that a 'sharing' is common to all but deems that it is only verbal:

if someone wished to say: "There is something common to all these constructions – namely the disjunction of all their common properties" – I should reply: Now you are only playing with words. One might as well say: "Something runs through the whole thread – namely the continuous overlapping of those fibres".

Notable applications

  • Thomas Kuhn uses Wittgenstein's concept in chapter V ('The Priority of Paradigms') of his famous The Structure of Scientific Revolutions (1962). Paradigms are not reducible to single discoverable sets of scientific rules, but consist of assumptions that relate to other rules that are recognized by parts of a scientific community.[9]
  • Morris Weitz first applied family resemblances in an attempt to describe art,[10] which opened a still-continuing debate.[11]
  • Renford Bambrough proposed that 'Wittgenstein solved what is known as "the problem of universals"' and said of his solution (as Hume said of Berkeley's treatment of the same topic) that it is "one of the greatest and most valuable discoveries that has been made of late years in the republic of letters".[12] His view provided the occasion for numerous further comments.[13]
  • Rodney Needham explored family resemblances in connection with the problem of alliance and noted their presence in taxonomy where they are known as a polythetic classification.[5]
  • Eleanor Rosch used family resemblances in her cognitivist studies.[14] Other cognitive research[15] has shown that children and even rhesus monkeys tend to use family resemblance relationships rather than explicit rules[16] when learning categories.

Game studies

Wittgenstein's suggestion (PI, §66) that no definition of games can be formulated poses a predicament for disciplines that take games as their subject matter, because it seems to deny the possibility of knowing what games are. One possible solution is to point out that Wittgenstein merely acts out a failed attempt to define the concept of game because he wanted to demonstrate a mechanism of language. He was not particularly concerned with games, nor with the concept of 'game', but with the consequences of a definitional failure. The demonstration aims to show that there is no reason to search for real definitions, which describe the essential attributes of things, but rather for nominal definitions, which describe the use of a term in a community. He connected this idea to language games (linguistic expressions combined with action) as a more adequate way to explain the function of language. His choice to call the approach 'language games' (PI, §7) adds to the confusion, further fueling the impression that he provides insights about the concept of game. Wittgenstein was interested not in games but in language; his theories and examples are therefore only superficially related to academic disciplines with games as their subject matter.

Criticism and comments

Philosophical Investigations is the primary text used in discussing family resemblances, even though the topic appears also in other works by Wittgenstein, notably The Brown Book.[17] Many contributions to the discussion are by people involved in philosophical research but concerned with more pragmatic questions such as taxonomy[4] or information processing.[18] Hans Sluga has observed that "the notion of family resemblance... draws on two quite different sets of ideas, two different vocabularies, but treats them as if they were one and the same. The first is the vocabulary of kinship, of descent, of some sort of real and causal connection... the second is that of similarity, resemblance, affinity and correspondence."[19]

The main focus of criticism is the notion of similarity, which is instrumental for family resemblance. A similarity can always be found between two arbitrarily selected objects, or a series of intermediaries can link them into a family. This problem has been known as underdeterminacy or open-ended texture. Admittedly infinity is only potential, but for any finite family some common element can be pointed out, especially if relational properties are taken into consideration. Wittgenstein's insistence that boundaries do not really exist but can be drawn arbitrarily has been described as conventionalism, and more generally the acceptance of his conception has been seen to present a refined nominalism.[20]

Notes


  1. Wittgenstein, Ludwig (2001) [1953]. Philosophical Investigations. Blackwell Publishing. ISBN 0-631-23127-7.
  2. Sluga H., Family Resemblance, Grazer Philosophische Studien 71 (2006) 1; see also A Wittgenstein Dictionary, ed. H.-J. Glock, London: Blackwell 1996
  3. Wittgenstein L. (1998), Culture and Value, London: Blackwell, p. 14. Spengler's influence in this and other forms has been considered in papers published after this collection of notes became available, see e.g. DeAngelis W., "Wittgenstein and Spengler," Dialogue 33 (1994): 41–61
  4. Winsor M., 2003, Non-essentialist methods in pre-Darwinian taxonomy, Biology and Philosophy 18 (2003) 387–400
  5. Needham R., 1975, Polythetic classification: Convergence and consequences, Man 10 (1975) 349
  6. Tatarkiewicz W., Postawa estetyczna, literacka i poetycka (1933) where it is called 'domino resemblance'.
  7. See Griffin, N.: 1974, Wittgenstein, Universals and Family Resemblance, Canadian Journal of Philosophy III, 635–651
  8. Andersen H.: 2000, Kuhn's account of family resemblance, Erkenntnis 52: 313–337
  9. Kuhn, T. (2012) 'The Structure of Scientific Revolutions', p. 45. Fourth (Anniversary) Edition, Chicago: University of Chicago Press 2012.
  10. Weitz M., The Role of Theory in Aesthetics, Journal of Aesthetics and Art Criticism 62 (1953): 27.
  11. Kaufman D., Family resemblances Relationism and the meaning of "art", British Journal of Aesthetics, vol. 47, No. 3, July 2007 doi:10.1093/aesthj/aym008
  12. Bambrough, R.: 1961, Universals and Family Resemblance, Proc. Arist. Soc. 61, 207–22
  13. A recent summary in Blair D. (2006), Wittgenstein, Language and Information, p. 118 (note 117); see also Dilman, I.: Universals: Bambrough on Wittgenstein, Proc. Arist. Soc. 79 (1978): 35–58; reprinted in John V. Canfield (ed.), The Philosophy of Wittgenstein, Vol. 5, Method and Essence, pp. 305–328. New York: Garland Publishing, 1986.
  14. Rosch E. and Mervis, C. (1975) Family resemblances: studies in the internal structure of categories, Cognitive Psychology 7, 573–605;
    Rosch, E. (1987), Wittgenstein and categorization research in cognitive psychology, in M. Chapman & R. Dixon (Eds.), Meaning and the growth of understanding. Wittgenstein's significance for developmental Psychology, Hillsdale, NJ: Erlbaum.
  15. Couchman, Justin J.; Coutinho, M. V. C.; Smith, J. D. (2010). "Rules and Resemblance: Their Changing Balance in the Category Learning of Humans (Homo sapiens) and Monkeys (Macaca mulatta)". Journal of Experimental Psychology: Animal Behavior Processes. 36 (2): 172–183. doi:10.1037/a0016748. PMC 2890302. PMID 20384398.
  16. The connection between rule following and applying or extending a concept has been noted early in the discussion of family resemblances, see e.g. Pompa L., 'Family resemblance: a reply', The Philosophical Quarterly, 18 (1968) 347
  17. Wittgenstein L., The Blue and Brown Books, London: Blackwell (1958); I 68, 17, 73
  18. Blair D., Wittgenstein, Language and Information, Berlin: Springer, 2006, ISBN 978-1-4020-4112-9
  19. Sluga H., Family Resemblance, Grazer Philosophische Studien 71 (2006) 14
  20. Resemblance Nominalism. In N. Bunnin & J. Yu [Eds.] (2004). The Blackwell Dictionary of Western Philosophy. Accessed Online at: http://www.blackwellreference.com/subscriber/uid=2241/tocnode?id=g9781405106795_chunk_g978140510679519_ss1-101


  • Wittgenstein, Ludwig (2001) [1953]. Philosophical Investigations. Blackwell Publishing. ISBN 0-631-23127-7. 
  • Andersen H.,: 2000, Kuhn's account of family resemblance, Erkenntnis 52: 313–337
  • Bambrough, R.: 1961, Universals and Family Resemblance, Proc. Arist. Soc. 61, 207–22
  • Beardsmore, R. W.: 1992, The Theory of Family Resemblance, Philosophical Investigations 15, 131–146
  • Bellaimey, J. E.: 1990, Family Resemblances and the Problem of the Under-Determination of Extension, Philosophical Investigations 13, 31–43.
  • Drescher, F.: 2017, Analogy in Thomas Aquinas and Ludwig Wittgenstein. A comparison. New Blackfriars. doi:10.1111/nbfr.12273
  • Ginzburg C.,: 2004, Family Resemblances and Family Trees: Two Cognitive Metaphors, Critical Inquiry, Vol. 30, No. 3 (Spring 2004), pp. 537–556
  • Griffin, N.: 1974, Wittgenstein, Universals and Family Resemblance, Canadian Journal of Philosophy III, 635–651.
  • Gupta, R. K.: 1970, Wittgenstein's Theory of "Family Resemblance", in his Philosophical Investigations (Secs. 65–80), Philosophia Naturalis 12, 282–286
  • Huff D.:(1981), Family Resemblances and rule governed behavior, Philosophical Investigations 4 (3) 1–23
  • Kaufman D.: 2007, Family resemblances Relationism and the meaning of "art", British Journal of Aesthetics, vol. 47, No. 3, July 2007, doi:10.1093/aesthj/aym008
  • Prien B.: Family Resemblances-A Thesis about the Change of Meaning over Time, Kriterion 18 (2004), pp. 15–24.
  • Raatzsch R., Philosophical Investigations 65ff. :On Family Resemblance, in Essays on Wittgenstein by P. Philipp and R. Raatzsch, Working papers from the Wittgenstein Archives at the University of Bergen #6 (1993), pp. 50–76
  • Wennerberg, H.: 1967, The Concept of Family Resemblance in Wittgenstein's Later Philosophy, Theoria 33, 107–132.



Information retrieval is a wide, often loosely-defined term but in these pages I shall be concerned only with automatic information retrieval systems. Automatic as opposed to manual and information as opposed to data or fact. Unfortunately the word information can be very misleading. In the context of information retrieval (IR), information, in the technical meaning given in Shannon's theory of communication, is not readily measured (Shannon and Weaver[1]). In fact, in many cases one can adequately describe the kind of retrieval by simply substituting 'document' for 'information'. Nevertheless, 'information retrieval' has become accepted as a description of the kind of work published by Cleverdon, Salton, Sparck Jones, Lancaster and others. A perfectly straightforward definition along these lines is given by Lancaster[2]: 'Information retrieval is the term conventionally, though somewhat inaccurately, applied to the type of activity discussed in this volume. An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.' This specifically excludes Question-Answering systems as typified by Winograd[3] and those described by Minsky[4]. It also excludes data retrieval systems such as used by, say, the stock exchange for on-line quotations.

To make clear the difference between data retrieval (DR) and information retrieval (IR), I have listed in Table 1.1 some of the distinguishing properties of data and information retrieval.

Table 1.1

                      Data Retrieval (DR)    Information Retrieval (IR)
Matching              Exact match            Partial match, best match
Inference             Deduction              Induction
Model                 Deterministic          Probabilistic
Classification        Monothetic             Polythetic
Query language        Artificial             Natural
Query specification   Complete               Incomplete
Items wanted          Matching               Relevant
Error response        Sensitive              Insensitive

One may want to criticise this dichotomy on the grounds that the boundary between the two is a vague one. And so it is, but it is a useful one in that it illustrates the range of complexity associated with each mode of retrieval.

Let us now take each item in the table in turn and look at it more closely. In data retrieval we are normally looking for an exact match, that is, we are checking to see whether an item is or is not present in the file. In information retrieval this may sometimes be of interest but more generally we want to find those items which partially match the request and then select from those a few of the best matching ones.
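The contrast between the two kinds of matching can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy collection, the whitespace tokenisation and the scoring by number of shared words are all hypothetical choices made for the example, not a method described in the text.

```python
def exact_match(file, key):
    """Data retrieval: the item either is or is not present in the file."""
    return key in file


def best_match(documents, request, k=2):
    """Information retrieval: score each item by how many request words it
    shares, drop items with no overlap, and keep the k best-matching ones."""
    request_terms = set(request.lower().split())
    scored = [
        (len(request_terms & set(text.lower().split())), doc_id)
        for doc_id, text in documents.items()
    ]
    # Highest score first; ties are broken by document id for determinism.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]


# Hypothetical toy collection.
documents = {
    "d1": "automatic classification of documents",
    "d2": "file structures for document retrieval",
    "d3": "evaluation of retrieval effectiveness",
    "d4": "stock exchange quotations",
}

print(exact_match({"AB-1", "AB-2"}, "AB-2"))
print(best_match(documents, "automatic retrieval evaluation"))
```

Three of the four documents partially match the request, and only the two best are returned; the exact-match lookup, by contrast, has nothing to rank.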

The inference used in data retrieval is of the simple deductive kind, that is, if aRb and bRc then aRc. In information retrieval it is far more common to use inductive inference; relations are only specified with a degree of certainty or uncertainty and hence our confidence in the inference is variable. This distinction leads one to describe data retrieval as deterministic but information retrieval as probabilistic. Frequently Bayes' Theorem is invoked to carry out inferences in IR, but in DR probabilities do not enter into the processing.
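As a rough illustration of how Bayes' Theorem turns uncertain evidence into a degree of confidence, the sketch below computes the probability that a document is relevant given that it contains a particular index term. The function name and all of the numbers are hypothetical, chosen only to make the arithmetic visible.

```python
def posterior_relevance(p_term_given_rel, p_term_given_nonrel, prior_rel):
    """P(relevant | term present) by Bayes' Theorem.

    Assumes the evidence term is non-zero, i.e. the term occurs with
    positive probability in at least one of the two classes.
    """
    evidence = (p_term_given_rel * prior_rel
                + p_term_given_nonrel * (1.0 - prior_rel))
    return p_term_given_rel * prior_rel / evidence


# Hypothetical figures: the term occurs in 80% of relevant documents,
# in 10% of non-relevant ones, and 5% of the collection is relevant.
posterior = posterior_relevance(0.8, 0.1, 0.05)
print(round(posterior, 3))
```

Observing the term raises the estimate of relevance above the prior but does not settle the question, which is the sense in which the inference is probabilistic rather than deterministic.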

Another distinction can be made in terms of classifications that are likely to be useful. In DR we are most likely to be interested in a monothetic classification, that is, one with classes defined by objects possessing attributes both necessary and sufficient to belong to a class. In IR such a classification is on the whole not very useful; in fact, more often a polythetic classification is what is wanted. In such a classification each individual in a class will possess only a proportion of all the attributes possessed by all the members of that class. Hence no attribute is either necessary or sufficient for membership of a class.
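The two kinds of class definition can be contrasted in code. This is a minimal sketch, assuming that objects are sets of attributes and that polythetic membership is decided by a simple threshold on the number of class attributes possessed; the threshold `min_shared` is an assumption made for the example, since the text says only that members possess "a proportion" of the attributes.

```python
def monothetic_member(item, defining_attributes):
    """Member only if the item possesses every defining attribute."""
    return defining_attributes <= item


def polythetic_member(item, class_attributes, min_shared):
    """Member if the item possesses at least `min_shared` of the class
    attributes; no single attribute is required."""
    return len(item & class_attributes) >= min_shared


class_attributes = {"A", "B", "C", "D"}
items = {
    "Item_1": {"A", "B", "C"},
    "Item_2": {"B", "C", "D"},
    "Item_3": {"A", "C", "D"},
    "Item_4": {"A", "B", "D"},
    "Outsider": {"A", "X", "Y"},
}

for name, attrs in items.items():
    print(name,
          monothetic_member(attrs, class_attributes),
          polythetic_member(attrs, class_attributes, min_shared=3))
```

None of the five objects passes the monothetic test, yet the first four form a polythetic class even though each of the attributes A to D is missing from one of them. The outsider possesses A and is still excluded, so possessing a single attribute is not sufficient either.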

The query language for DR will generally be of the artificial kind, one with restricted syntax and vocabulary; in IR we prefer to use natural language, although there are some notable exceptions. In DR the query is generally a complete specification of what is wanted; in IR it is invariably incomplete. This last difference arises partly from the fact that in IR we are searching for relevant documents as opposed to exactly matching items. The extent of the match in IR is assumed to indicate the likelihood of the relevance of that item. One simple consequence of this difference is that DR is more sensitive to error, in the sense that an error in matching will not retrieve the wanted item, which implies a total failure of the system. In IR small errors in matching generally do not affect performance of the system significantly.

Many automatic information retrieval systems are experimental. I only make occasional reference to operational systems. Experimental IR is mainly carried on in a 'laboratory' situation whereas operational systems are commercial systems which charge for the service they provide. Naturally the two systems are evaluated differently. The 'real world' IR systems are evaluated in terms of 'user satisfaction' and the price the user is willing to pay for its service. Experimental IR systems are evaluated by comparing the retrieval experiments with standards specially constructed for the purpose. I believe that a book on experimental information retrieval, covering the design and evaluation of retrieval systems from a point of view which is independent of any particular system, will be a great help to other workers in the field and indeed is long overdue.

Many of the techniques I shall discuss will not have proved themselves incontrovertibly superior to all other techniques, but they have promise and their promise will only be realised when they are understood. Information about new techniques has been so scattered through the literature that to find out about them you need to be an expert before you begin to look. I hope that I will be able to take the reader to the point where he will have little trouble in implementing some of the new techniques. Also, that some people will then go on to experiment with them, and generate new, convincing evidence of their efficiency and effectiveness.

My aim throughout has been to give a complete coverage of the more important ideas current in various special areas of information retrieval. Inevitably some ideas have been elaborated at the expense of others. In particular, emphasis is placed on the use of automatic classification techniques and rigorous methods of measurement of effectiveness. On the other hand, automatic content analysis is given only a superficial coverage. The reasons are straightforward: firstly, the material reflects my own bias, and secondly, no adequate coverage of the first two topics has been given before, whereas automatic content analysis has been documented very well elsewhere. A subsidiary reason for emphasising automatic classification is that little appears to be known or understood about it in the context of IR, so that research workers are loath to experiment with it.

The structure of the book

The introduction presents some basic background material, demarcates the subject and discusses loosely some of the problems in IR. The chapters that follow cover topics in the order in which I would think about them were I about to design an experimental IR system. They begin by describing the generation of machine representations for the information, and then move on to an explanation of the logical structures that may be arrived at by clustering. There are numerous methods for representing these structures in the computer, or in other words, there is a choice of file structures to represent the logical structure, so these are outlined next. Once the information has been stored in this way we are able to search it, hence a discussion of search strategies follows. The chapter on probabilistic retrieval is an attempt to create a formal model for certain kinds of search strategies. Lastly, in an experimental situation all of the above will have been futile unless the results of retrieval can be evaluated. Therefore a large chapter is devoted to ways of measuring the effectiveness of retrieval. In the final chapter I have indulged in a little speculation about the possibilities for IR in the next decade.

The two major chapters are those dealing with automatic classification and evaluation. I have tried to write them in such a way that each can be read independently of the rest of the book (although I do not recommend this for the non-specialist).


Chapter 2: Automatic Text Analysis

- contains a straightforward discussion of how the text of a document is represented inside a computer. This is a superficial chapter but I think it is adequate in the context of this book.

Chapter 3: Automatic Classification

- looks at automatic classification methods in general and then takes a deeper look at the use of these methods in information retrieval.

Chapter 4: File Structures

- here we try and discuss file structures from the point of view of someone primarily interested in information retrieval.

Chapter 5: Search Strategies

- gives an account of some search strategies when applied to document collections structured in different ways. It also discusses the use of feedback.

Chapter 6: Probabilistic Retrieval

- describes a formal model for enhancing retrieval effectiveness by using sample information about the frequency of occurrence and co-occurrence of index terms in the relevant and non-relevant documents.

Chapter 7: Evaluation

- here I give a traditional view of the measurement of effectiveness followed by an explanation of some of the more promising attempts at improving the art. I also attempt to provide foundations for a theory of evaluation.

Chapter 8: The Future

- contains some speculation about the future of IR and tries to pinpoint some areas of research where further work is desperately needed.

Information retrieval

Since the 1940s the problem of information storage and retrieval has attracted increasing attention. It is simply stated: we have vast amounts of information to which accurate and speedy access is becoming ever more difficult. One effect of this is that relevant information gets ignored since it is never uncovered, which in turn leads to much duplication of work and effort. With the advent of computers, a great deal of thought has been given to using them to provide rapid and intelligent retrieval systems. In libraries, many of which certainly have an information storage and retrieval problem, some of the more mundane tasks, such as cataloguing and general administration, have successfully been taken over by computers. However, the problem of effective retrieval remains largely unsolved.

In principle, information storage and retrieval is simple. Suppose there is a store of documents and a person (user of the store) formulates a question (request or query) to which the answer is a set of documents satisfying the information need expressed by his question. He can obtain the set by reading all the documents in the store, retaining the relevant documents and discarding all the others. In a sense, this constitutes 'perfect' retrieval. This solution is obviously impracticable. A user either does not have the time or does not wish to spend the time reading the entire document collection, apart from the fact that it may be physically impossible for him to do so.

When high speed computers became available for non-numerical work, many thought that a computer would be able to 'read' an entire document collection to extract the relevant documents. It soon became apparent that using the natural language text of a document not only caused input and storage problems (it still does) but also left unsolved the intellectual problem of characterising the document content. It is conceivable that future hardware developments may make natural language input and storage more feasible. But automatic characterisation in which the software attempts to duplicate the human process of 'reading' is a very sticky problem indeed. More specifically, 'reading' involves attempting to extract information, both syntactic and semantic, from the text and using it to decide whether each document is relevant or not to a particular request. The difficulty is not only knowing how to extract the information but also how to use it to decide relevance. The comparatively slow progress of modern linguistics on the semantic front and the conspicuous failure of machine translation (Bar-Hillel[5]) show that these problems are largely unsolved.

The reader will have noticed that already, the idea of 'relevance' has slipped into the discussion. It is this notion which is at the centre of information retrieval. The purpose of an automatic retrieval strategy is to retrieve all the relevant documents at the same time retrieving as few of the non-relevant as possible. When the characterisation of a document is worked out, it should be such that when the document it represents is relevant to a query, it will enable the document to be retrieved in response to that query. Human indexers have traditionally characterised documents in this way when assigning index terms to documents. The indexer attempts to anticipate the kind of index terms a user would employ to retrieve each document whose content he is about to describe. Implicitly he is constructing queries for which the document is relevant. When the indexing is done automatically it is assumed that by pushing the text of a document or query through the same automatic analysis, the output will be a representation of the content, and if the document is relevant to the query, a computational procedure will show this.
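The stated purpose of a retrieval strategy, to retrieve all the relevant documents while retrieving as few non-relevant ones as possible, is conventionally quantified by two ratios, recall and precision. The sketch below is a minimal illustration; the document identifiers and relevance judgements are hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    recall    = relevant documents retrieved / all relevant documents
    precision = relevant documents retrieved / all retrieved documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical outcome: four documents retrieved, three judged relevant.
recall, precision = recall_precision(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d7"],
)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Reading the whole store, the 'perfect' retrieval described earlier, would give a recall of 1 at the cost of a very low precision, which is why the two figures are usually reported together.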

Intellectually it is possible for a human to establish the relevance of a document to a query. For a computer to do this we need to construct a model within which relevance decisions can be quantified. It is interesting to note that most research in information retrieval can be shown to have been concerned with different aspects of such a model.

An information retrieval system

Let me illustrate by means of a black box what a typical IR system would look like. The diagram shows three components: input, processor and output. Such a trichotomy may seem a little trite, but the components constitute a convenient set of pegs upon which to hang a discussion.

Starting with the input side of things. The main problem here is to obtain a representation of each document and query suitable for a computer to use. Let me emphasise that most computer-based retrieval systems store only a representation of the document (or query) which means that the text of a document is lost once it has been processed for the purpose of generating its representation. A document representative could, for example, be a list of extracted words considered to be significant. Rather than have the computer process the natural language, an alternative approach is to have an artificial language within which all queries and documents can be formulated. There is some evidence to show that this can be effective (Barber et al.[6]). Of course it presupposes that a user is willing to be taught to express his information need in the language.

When the retrieval system is on-line, it is possible for the user to change his request during one search session in the light of a sample retrieval, thereby, it is hoped, improving the subsequent retrieval run. Such a procedure is commonly referred to as feedback. An example of a sophisticated on-line retrieval system is the MEDLINE system (McCarn and Leiter[7]). I think it is fair to say that it will be only a short time before all retrieval systems will be on-line.

Secondly, the processor, that part of the retrieval system concerned with the retrieval process. The process may involve structuring the information in some appropriate way, such as classifying it. It will also involve performing the actual retrieval function, that is, executing the search strategy in response to a query. In the diagram, the documents have been placed in a separate box to emphasise the fact that they are not just input but can be used during the retrieval process in such a way that their structure is more correctly seen as part of the retrieval process.

Finally, we come to the output, which is usually a set of citations or document numbers. In an operational system the story ends here. However, in an experimental system it leaves the evaluation to be done.

IR in perspective

This section is not meant to constitute an attempt at an exhaustive and complete account of the historical development of IR. In any case, it would not be able to improve on the accounts given by Cleverdon[8] and Salton[9]. Although information retrieval can be subdivided in many ways, it seems that there are three main areas of research which between them make up a considerable portion of the subject. They are: content analysis, information structures, and evaluation. Briefly, the first is concerned with describing the contents of documents in a form suitable for computer processing; the second with exploiting relationships between documents to improve the efficiency and effectiveness of retrieval strategies; the third with the measurement of the effectiveness of retrieval.

Since the emphasis in this book is on a particular approach to document representation, I shall restrict myself here to a few remarks about its history. I am referring to the approach pioneered by Luhn[10]. He used frequency counts of words in the document text to determine which words were sufficiently significant to represent or characterise the document in the computer (more details about this in the next chapter). Thus a list of what might be called 'keywords' was derived for each document. In addition the frequency of occurrence of these words in the body of the text could also be used to indicate a degree of significance. This provided a simple weighting scheme for the 'keywords' in each list and made available a document representative in the form of a 'weighted keyword description'.
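A 'weighted keyword description' of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Luhn's actual procedure: the stop list `STOP_WORDS` and the single lower threshold `min_count` are hypothetical choices made only to keep the example short.

```python
from collections import Counter
import re

# Hypothetical stop list; a real one would be far longer.
STOP_WORDS = {"the", "of", "a", "and", "in", "to", "is", "for"}


def weighted_keywords(text, min_count=2):
    """Return {keyword: frequency} for words occurring at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # The frequency of each surviving word doubles as its weight.
    return {w: n for w, n in counts.items() if n >= min_count}


doc = ("Retrieval of documents depends on the representation of documents. "
       "A retrieval system stores a representation, not the text.")
print(weighted_keywords(doc))
```

Here the frequency count serves two purposes, as in the description above: it decides which words are significant enough to keep, and it supplies the weight attached to each keyword.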

At this point, it may be convenient to elaborate on the use of 'keyword'. It has become common practice in the IR literature to refer to descriptive items extracted from text as keywords or terms. Such items are often the outcome of some process such as, for example, the gathering together of different morphological variants of the same word. In this book, keyword and term will be used interchangeably.

The use of statistical information about distributions of words in documents was further exploited by Maron and Kuhns[11] and Stiles[12] who obtained statistical associations between keywords. These associations provided a basis for the construction of a thesaurus as an aid to retrieval. Much of this early research was brought together with the publication of the 1964 Washington Symposium on Statistical Association Methods for Mechanized Documentation (Stevens et al. [13]).

Sparck Jones has carried on this work using measures of association between keywords based on their frequency of co-occurrence (that is, the frequency with which any two keywords occur together in the same document). She has shown[14] that such related words can be used effectively to improve recall, that is, to increase the proportion of the relevant documents which are retrieved. Interestingly, the early ideas of Luhn are still being developed and many automatic methods of characterisation are based on his early work.
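The frequency of co-occurrence itself is easy to compute once each document is reduced to a set of keywords. The sketch below counts only raw co-occurrences; it is an assumption-laden illustration and does not reproduce the particular association measures used in the work cited, which would normally normalise these counts in some way.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered keyword pair, the documents containing both."""
    pairs = Counter()
    for keywords in documents:
        # Sorting gives every pair a canonical (alphabetical) order.
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs


docs = [
    {"retrieval", "indexing", "thesaurus"},
    {"retrieval", "indexing"},
    {"thesaurus", "classification"},
]
counts = cooccurrence_counts(docs)
print(counts[("indexing", "retrieval")])
```

Pairs with high counts are candidates for being treated as related words, for instance when a query keyword is to be supplemented by its associates in order to improve recall.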

The term information structure (for want of better words) covers specifically a logical organisation of information, such as document representatives, for the purpose of information retrieval. The development in information structures has been fairly recent. The main reason for the slowness of development in this area of information retrieval is that for a long time no one realised that computers would not give an acceptable retrieval time with a large document set unless some logical structure was imposed on it. In fact, owners of large data-bases are still loath to try out new organisation techniques promising faster and better retrieval. The slowness to recognise and adopt new techniques is mainly due to the scantiness of the experimental evidence backing them. The earlier experiments with document retrieval systems usually adopted a serial file organisation which, although it was efficient when a sufficiently large number of queries was processed simultaneously in a batch mode, proved inadequate if each query required a short real time response. The popular organisation to be adopted instead was the inverted file. By some this has been found to be restrictive (Salton[15]). More recently experiments have attempted to demonstrate the superiority of clustered files for on-line retrieval.
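The difference between a serial file and an inverted file can be seen in a small sketch. The data and function names are hypothetical, and the dictionaries stand in for whatever storage a real system would use; the point is only that the serial search inspects every document representative, whereas the inverted file goes directly from a keyword to the documents containing it.

```python
from collections import defaultdict


def serial_search(documents, keyword):
    """Serial file: scan every document representative in turn."""
    return {doc_id for doc_id, keywords in documents.items() if keyword in keywords}


def build_inverted_file(documents):
    """Inverted file: map each keyword to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return index


def inverted_search(index, keyword):
    """Look the keyword up directly, without touching unrelated documents."""
    return set(index.get(keyword, set()))


documents = {
    1: {"retrieval", "indexing"},
    2: {"classification"},
    3: {"retrieval", "clustering"},
}
index = build_inverted_file(documents)
print(serial_search(documents, "retrieval"), inverted_search(index, "retrieval"))
```

Both searches return the same documents; the cost of the serial scan grows with the size of the collection for every query, which is why it suits batch processing better than short real-time responses.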

The organisation of these files is produced by an automatic classification method. Good[16] and Fairthorne[17] were among the first to suggest that automatic classification might prove useful in document retrieval. Not until several years later were serious experiments carried out in document clustering (Doyle[18]; Rocchio[19]). All experiments so far have been on a small scale. Since clustering only comes into its own when the scale is increased, it is hoped that this book may encourage some large scale experiments by bringing together many of the necessary tools.

Evaluation of retrieval systems has proved extremely difficult. Senko[20] in an excellent survey paper states: 'Without a doubt system evaluation is the most troublesome area in ISR ...', and I am inclined to agree. Despite excellent pioneering work done by Cleverdon et al.[21] in this area, and despite numerous measures of effectiveness that have been proposed (see Robertson[22, 23] for a substantial list), a general theory of evaluation has not emerged. I attempt to provide foundations for such a theory in Chapter 7 (page 168).

In the past there has been much debate about the validity of evaluations based on relevance judgments provided by erring human beings. Cuadra and Katter[24] supposed that relevance was measurable on an ordinal scale (one which arises from the operation of rank-ordering) but showed that the position of a document on such a scale was affected by external variables not usually controlled in the laboratory. Lesk and Salton[25] subsequently showed that a dichotomous scale on which a document is either relevant or non-relevant, when subjected to a certain probability of error, did not invalidate the results obtained for evaluation in terms of precision (the proportion of retrieved documents which are relevant) and recall (the proportion of relevant documents retrieved). Today effectiveness of retrieval is still mostly measured in terms of precision and recall or by measures based thereon. There is still no adequate statistical treatment showing how appropriate significance tests may be used (I shall return to this point in the Chapter on Evaluation, page 178). So, after a few decades of research in this area we basically have only precision and recall, and a working hypothesis which states, quoting Cleverdon[26]: 'Within a single system, assuming that a sequence of sub-searches for a particular question is made in the logical order of expected decreasing precision, and the requirements are those stated in the question, there is an inverse relationship between recall and precision, if the results of a number of different searches are averaged.'

Effectiveness and efficiency

Much of the research and development in information retrieval is aimed at improving the effectiveness and efficiency of retrieval. Efficiency is usually measured in terms of the computer resources used such as core, backing store, and C.P.U. time. It is difficult to measure efficiency in a machine independent way. In any case, it should be measured in conjunction with effectiveness to obtain some idea of the benefit in terms of unit cost. In the previous section I mentioned that effectiveness is commonly measured in terms of precision and recall. I repeat here that precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved, and recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents (both retrieved and not retrieved). The reason for emphasising these two measures is that frequent reference is made to retrieval effectiveness but its detailed discussion is delayed until Chapter 7. It will suffice until we reach that chapter to think of retrieval effectiveness in terms of precision and recall. It would have been possible to give the chapter on evaluation before any of the other material but this, in my view, would have been like putting the cart before the horse. Before we can appreciate the evaluation of observations we need to understand what gave rise to the observations. Hence I have delayed discussing evaluation until some understanding of what makes an information retrieval system tick has been gained. Readers not satisfied with this order can start by first reading Chapter 7 which in any case can be read independently.
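The two ratios just defined translate directly into code. The following is a minimal sketch in which document identifiers are plain integers and relevance is dichotomous; returning 0.0 when nothing is retrieved, or when no document is relevant, is a convention assumed here and not one stated in the text.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, five relevant in the collection, two in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7})
print(p, r)
```

With two relevant documents among the four retrieved, precision is 2/4; with two of the five relevant documents retrieved, recall is 2/5.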

Bibliographic remarks

The best introduction to information retrieval is probably got by reading some of the early papers in the field. Luckily many of these have now been collected in book form. I recommend for browsing the books edited by Garvin[27], Kochen[28], Borko[29], Schecter[30] and Saracevic[31]. It is also worth noting that some of the papers cited in this book may be found in one of these collections and therefore be readily accessible. A book which is well written and can be read without any mathematical background is one by Lancaster[2]. More recently, a number of books have come out entirely devoted to information retrieval and allied topics: Doyle[32], Salton[33], Paice[34], and Kochen[35]. In particular, the latter half of Doyle's book makes interesting reading since it describes what work in IR was like in the early days (the late 1950s to early 1960s). A critical view of information storage and retrieval is presented in the paper by Senko[20]. This paper is more suitable for people with a computer science background, and is particularly worth reading because of its healthy scepticism of the whole subject. Readers more interested in information retrieval in a library context should read Vickery[36].

One early publication worth reading which is rather hard to come by is the report on the Cranfield II project by Cleverdon et al.[21]. This report is not really introductory material but constitutes, in my view, one of the milestones in information retrieval. It is an excellent example of the experimental approach to IR and contains many good ideas which have subsequently been elaborated in the open literature. Time spent on this report is well spent.

Papers on information retrieval have a tendency to get published in journals on computer science and library science. There are, however, a few major journals which are largely devoted to information retrieval. These are the Journal of Documentation, Information Storage and Retrieval, and the Journal of the American Society for Information Science.

Finally, every year a volume in the series Annual Review of Information Science and Technology is edited by C. A. Cuadra. Each volume attempts to cover the new work published in information storage and retrieval for that year. As a source of references to the current literature it is unsurpassed. But these volumes are mainly aimed at the practitioner and as such are a little difficult to read for the uninitiated.


1. SHANNON, C.E. and WEAVER, W., The Mathematical Theory of Communication, University of Illinois Press, Urbana (1964).

2. LANCASTER, F.W., Information Retrieval Systems: Characteristics, Testing and Evaluation, Wiley, New York (1968).

3. WINOGRAD, T., Understanding Natural Language, Edinburgh University Press, Edinburgh (1972).

4. MINSKY, M., Semantic Information Processing, MIT Press, Cambridge, Massachusetts (1968).

5. BAR-HILLEL, Y., Language and Information. Selected Essays on their Theory and Application, Addison-Wesley, Reading, Massachusetts (1964).

6. BARBER, A.S., BARRACLOUGH, E.D. and GRAY, W.A., 'On-line information retrieval as a scientist's tool', Information Storage and Retrieval, 9, 429-44- (1973).

7. McCARN, D.B. and LEITER, J., 'On-line services in medicine and beyond', Science, 181, 318-324 (1973).

8. CLEVERDON, C.W., 'Progress in documentation. Evaluation of information retrieval systems', Journal of Documentation, 26, 55-67 (1970).

9. SALTON, G., 'Automatic text analysis', Science, 168, 335-343 (1970).

10. LUHN, H.P., 'A statistical approach to mechanised encoding and searching of library information', IBM Journal of Research and Development, 1, 309-317 (1957).

11. MARON, M.E. and KUHNS, J.L., 'On relevance, probabilistic indexing and information retrieval', Journal of the ACM, 7, 216-244 (1960).

12. STILES, H.F., 'The association factor in information retrieval', Journal of the ACM, 8, 271-279 (1961).

13. STEVENS, M.E., GIULIANO, V.E. and HEILPRIN, L.B., Statistical Association Methods for Mechanised Documentation, National Bureau of Standards, Washington (1964).

14. SPARCK JONES, K., Automatic Keyword Classification for Information Retrieval, Butterworths, London (1971).

15. SALTON, G., Paper given at the 1972 NATO Advanced Study Institute for on-line mechanised information retrieval systems (1972).

16. GOOD, I.J., 'Speculations concerning information retrieval', Research Report PC-78, IBM Research Centre, Yorktown Heights, New York (1958).

17. FAIRTHORNE, R.A., 'The mathematics of classification', Towards Information Retrieval, Butterworths, London, 1-10 (1961).

18. DOYLE, L.B., 'Is automatic classification a reasonable application of statistical analysis of text?', Journal of the ACM, 12, 473-489 (1965).

19. ROCCHIO, J.J., 'Document retrieval systems - optimization and evaluation', Ph.D. Thesis, Harvard University. Report ISR-10 to National Science Foundation, Harvard Computation Laboratory (1966).

20. SENKO, M.E., 'Information storage and retrieval systems'. In Advances in Information Systems Science, (Edited by J. Tou) Plenum Press, New York (1969).

21. CLEVERDON, C.W., MILLS, J. and KEEN, M., Factors Determining the Performance of Indexing Systems, Vol. I, Design, Vol. II, Test Results, ASLIB Cranfield Project, Cranfield (1966).

22. ROBERTSON, S.E., 'The parameter description of retrieval tests', Part 1; The basic parameters, Journal of Documentation, 25, 11-27 (1969).

23. ROBERTSON, S.E., 'The parameter description of retrieval tests', Part 2; Overall measures, Journal of Documentation, 25, 93-107 (1969).

24. CUADRA, C.A. and KATTER, R.V., 'Opening the black box of "relevance"', Journal of Documentation, 23, 291-303 (1967).

25. LESK, M.E. and SALTON, G., 'Relevance assessments and retrieval system evaluation', Information Storage and Retrieval, 4, 343-359 (1969).

26. CLEVERDON, C.W., 'On the inverse relationship of recall and precision', Journal of Documentation, 28, 195-201 (1972).

27. GARVIN, P.L., Natural Language and the Computer, McGraw-Hill, New York (1963).

28. KOCHEN, M., The Growth of Knowledge - Readings on Organisation and Retrieval of Information, Wiley, New York (1967).

29. BORKO, H., Automated Language Processing, Wiley, New York (1967).

30. SCHECTER, G., Information Retrieval: A Critical View, Academic Press, London (1967).

31. SARACEVIC, T., Introduction to Information Science, R.R. Bowker, New York and London (1970).

32. DOYLE, L.B., Information Retrieval and Processing, Melville Publishing Co., Los Angeles, California (1975).

33. SALTON, G., Dynamic Information and Library Processing, Prentice-Hall, Englewood Cliffs, N.J. (1975).

34. PAICE, C.D., Information Retrieval and the Computer, Macdonald and Jane's, London (1977).

35. KOCHEN, M., Principles of Information Retrieval, Melville Publishing Co., Los Angeles, California (1974).

36. VICKERY, B.C., Techniques of Information Retrieval, Butterworths, London (1970).
