The opinions expressed in this report are those of the authors and not necessarily those of the British Library.
RIC/G/342
ISBN 0 7213 9710 8
ISSN 1366-8218
British Library Research and Innovation Reports are published
by the British Library Research and Innovation Centre and may be purchased
as photocopies or microfiche from the British Thesis Service, British Library
Document Supply Centre, Boston Spa, Wetherby, West Yorkshire LS23 7BQ,
UK.
2.1. Related Items and Bibliographic Relationships
2.2. Large retrievals and Complex retrievals
3. BOPAC2 Design and Development
3.1. Use of Existing Z39.50 Packages
3.3. Overall System Architecture
3.4. The Europagate WWW-to-Z39.50 Gateway
3.5. Availability and Capability of Z39.50 Targets
3.6. Java Applet Design and Development
4. Large and Complex Sets in Z39.50
4.1. The Networked Environment
4.4. Choosing Specific Records with Z39.50
4.5. Other Facilities for Reducing the Quantity of Material Transmitted
4.6. Modelling the BOPAC2 approach in Z39.50
5.1. Search and Retrieval (Europagate software)
5.2. Viewing Retrievals with the Java applet
6. Evaluation and User Feedback
6.1. Comparison with other OPAC's
6.3. Evaluation with Testers and Feedback Questionnaire
The Z39.50 protocol for communicating with databases provides a uniform means of querying remote databases. This is a general purpose protocol, but its use with bibliographic databases for the transfer of MARC records has been the leading application area in the protocol's development. It has commonly been seen as a means by which one can query a remote database while using a more familiar local interface. Z39.50 server software is now often provided with modern library database systems.
This means that many catalogues are now much more easily accessible. The problem remains of finding an effective way to search them.
The system was originally planned to be a PC based Z39.50 client with a graphical front end, developed from that built for BOPAC1. The growth of the World Wide Web and Java as a powerful programming language for Web applications led the Project to utilise the general design features and lessons of BOPAC1 but use them within a Web based application to make it more widely available.
Making BOPAC2 a World Wide Web application has opened it up to a wider audience than is possible with many research projects. The system has been announced on a number of mailing lists and via a number of different Web sites. It has also been demonstrated to a number of groups. The system will remain available for use after the completion of the Project and a version of this document will also be available on the WWW.
These are available as links from the BOPAC Home Page [1].
The Project's aim was to investigate the issues of large and complex retrievals from Z39.50 searches. The Project's workplan was based on developing work and ideas from the BOPAC1 project. The original workplan envisaged a sequence of: surveying existing Z39.50 clients; system design and development for retrievals from a single target; testing and evaluation of single target retrievals; system design and development for retrievals from multiple targets; testing and evaluation of multiple target retrievals. The original workplan was aimed at the development of a PC based client system with a graphical user interface, on the lines of that from BOPAC1, although a subsidiary project task was the investigation of alternative architectures for the system.
In the first months of the Project, whilst evaluating existing Z39.50 software, it became clear that a number of important developments were taking place that the Project needed to take account of. These were the growth of the WWW and, in particular, the availability of library catalogues and other databases (such as search engines) on the Web. Allied to this was the appearance of Java as a powerful tool for developing Web based applications. A specific development was the release of the Z39.50 - Web gateway software from the Europagate project which supported retrievals from multiple targets. Previously multiple target support had not been easily available and had hence dictated the workplan.
In light of the developments above, which are explained in detail in other sections of the report, a revised workplan was developed. This entailed modification of the Europagate software, so that single and multiple targets would be supported from the start. The system was also to be built in Java, which would enable a system with the functionality originally envisaged to be available via the WWW. This would mean that the system could be made much more widely available for testing. These revised plans and progress on them were presented to the advisory committee in early 1997.
By Spring 1997 a working version of the system was made available on the WWW, but its URL was not made public. This enabled the Project to elicit feedback from a number of interested experts with particular library and cataloguing expertise. This was input into the system development process until Autumn 1997. From Autumn 1997 till the end of the Project, further system development, apart from bug fixes, was suspended to provide a fixed basis for testing and evaluation. In this phase, the system was made public by announcing it on a number of mailing lists, newsgroups and Web sites, and an online questionnaire was provided. During the course of the Project we had been experimenting with and monitoring a number of Z39.50 targets.
The set of targets was also kept constant during the test period. One important factor in the original workplan had been testing with users at Bradford. To allow this to take place, Z39.50 software had been installed on the Bradford University Library system at the end of 1996. However, this system was not in a satisfactory operational state till Summer 1997. The Project extension allowed us to have an extended period of testing with Bradford users. This was done with a tailored front end making the Bradford and Leeds University libraries and British Library Document Supply Centre catalogues available. A similar front end was also installed for Leeds users.
The system has been widely used in the course of the Project, with user feedback from USA, Australia, and Japan as well as across Europe and will remain in place and operational after the end of the Project. In the course of the Project, the system was demonstrated at a number of events including the Conference on Principles and Development of AACR in Toronto. The Project team also met and corresponded with a number of other projects working in similar areas such as ONE, UNIVerse and Riding projects.
There is also the argument that the general idea of the second objective (i.e. the bringing together of related items) is even more important now than it was when it was first posited. The ever-growing size of catalogues, the globalisation and combining of union catalogues, new media, and the networked information infrastructure which facilitates distributed electronic documents: all these factors make it all the more important for the user to be able to identify related items [8]. A number of different bibliographic relationships have been identified [9], any of which may be of relevance to the end-user looking for related items. These relationships have been expressed within the cataloguing rules in various different ways [10]. In addition, there is a great deal of bibliographic relationship information buried in the general notes tag (500). Thus the problem of machine-extraction of bibliographic relationship information from the current stock of MARC records is a difficult one. This has led to calls for a fundamental re-appraisal of the catalogue and the cataloguing process with a view to the incorporation of effective linking information into the catalogue record ([11], [12], [13], [15], [16] and [17]). The ambiguities and changes in cataloguing practice with respect to collocation make it difficult for existing catalogues to fulfil the second objective as it was surely intended [14].
Meanwhile the ground beneath our feet is shifting as the Z39.50 information retrieval protocol begins to encroach into the world of library catalogues. Until recently Europe (including the UK) has lagged behind the USA in the implementation of Z39.50 [53] but it is beginning to catch up now and there are a number of Projects taking place [55]. In many cases Z39.50 is being installed as a means of inter-operating catalogues to create a distributed library in which the constituent catalogues can be searched in tandem [54]. The current generation of Z39.50 targets operating in the UK have few if any facilities for collocating their retrieval set. Version 3 of Z39.50 does allow the retrieval to be sorted, which at least brings it up to roughly the same level as most conventional OPAC's, but no further.
Analysis of OPAC usage has shown that large numbers of hits are frequently generated [18] and that this causes problems for users. It is a problem not just for subject-based searches but also for title/author searches or known item searches. In general, conventional OPAC's lack good facilities to process large retrievals. Users can limit the retrieval by some secondary filter criteria such as language or date of publication, but often this part of the interface is not very friendly. In particular, they are not usually told enough about the retrieval to be able to formulate a sensible filter. For example a user may want to filter the retrieval by publication date but he/she cannot find out what dates are represented in the retrieval. Experimental systems such as OASIS [19] provide the user with information about their retrieval and have a feedback mechanism to enable them to reduce it better.
Martha Yee [4] points out that "A number of cataloguing theorists including Lubetzky have argued that known-item searches are looking for a work rather than a particular edition of a work". Yet retrievals (even from a known item author/title search) may contain many works, only some of which are closely related to the required item [14]. Therefore the obvious way to reduce the size of a known-item retrieval is by organising or restricting it according to how closely the items relate to the required item, so that the user can home in on what they want. Indeed Hickey described a system which assists the user by outlining the retrieval's overall structure into a hierarchical tree and found (not surprisingly) that it works best for bibliographically closely related items [20]. Svenonius also outlined how an OPAC display might cluster equivalent records [52].
If a record has several values in a MARC tag (i.e. a repeating subtag) then each value has its own cluster. For example a record with two publishers will have two clusters pointing to it: one for one publisher and one for the other. Sometimes clusters may be constructed by spanning across several MARC subtags. For example, a record with a main entry author (100) and an added entry author (700) may have two author clusters pointing to it: one from the main entry and one for the added entry (see below). These examples show one of the key features of this clustering approach: it can provide several different routes to the same record.
BOPAC2 draws together records where the contents of the fields match. In comparing the fields only alphanumeric characters are significant; spaces and punctuation marks are ignored, as is the case of the letters. This is not a particularly sophisticated matching technique but it does ride over many minor formatting anomalies. It must be stressed that for this Project clusters are derived from the bibliographic records by algorithms. In BOPAC2 clusters are a display entity, not a cataloguing entity. There is a strong argument that bibliographic records should be linked more effectively, and that such links would improve the interface on systems such as BOPAC2 which could use the links, but this was not the prime purpose of the BOPAC2 Project. BOPAC2 clusters are not always perfect. Clusters convey information about the retrieval and provide a way to navigate through it. Often, the process of building clusters of records which share a common value serves to highlight the odd ones which ought to be in the cluster but are not (perhaps because of a breakdown in authority control).
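The matching rule described above can be sketched in a few lines of Java. This is an illustration only, not the Project's code: the class and method names (FieldMatcher, matchKey, cluster) and the record identifiers are hypothetical, and it uses present-day Java collections rather than the JDK 1.0.2 facilities available to BOPAC2. A record with several values in a field appears in one cluster per value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds clusters of record ids that share a field value.
class FieldMatcher {

    // Reduce a field value to its match key: only letters and digits are
    // significant; spaces, punctuation and letter case are ignored.
    static String matchKey(String value) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                key.append(Character.toLowerCase(c));
            }
        }
        return key.toString();
    }

    // Input: record id -> all values of one field (e.g. every publisher).
    // Output: match key -> ids of the records carrying a matching value.
    static Map<String, List<String>> cluster(Map<String, List<String>> valuesByRecord) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : valuesByRecord.entrySet()) {
            for (String value : entry.getValue()) {
                List<String> members =
                        clusters.computeIfAbsent(matchKey(value), k -> new ArrayList<>());
                if (!members.contains(entry.getKey())) {
                    members.add(entry.getKey());
                }
            }
        }
        return clusters;
    }
}
```

With this rule "Penguin" and "PENGUIN." fall into the same cluster, while a record with two publishers is reachable from two clusters.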
BOPAC2 does make use of the main entry in creating its work clusters (see below).
These algorithms make use of the implicit prioritisation of the main entry/added entry corresponding fields in the MARC record. An unusual feature is that work author and title are derived from the bibliographic record using the title and author query terms. The display is adapted to some extent to match the users' query. This means that the work title and author technically apply only to one particular retrieval. In another retrieval the same record could (theoretically) produce a different work title or author. In practice, this tends to happen if the record is retrieved in one search under uniform title, and then in another search under title main entry - i.e. where either one or other title meets the user's criteria.
Both work title and author fields are derived in two steps:
Step 2 takes the first available tag from uniform title then main entry title.
Thus uniform title takes priority over main entry title unless the main entry title matches the title query term.
Step 2 takes the first available tag from main entry personal author, corporate author then conference, followed by added entry personal, corporate author then conference.
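The two-step derivation can be sketched as follows. Only step 2 (the priority order of the fields) is taken directly from the text. Step 1 is an assumption made for illustration: the candidate fields are tested in the same priority order, and a field "matches" when its normalised value contains the normalised query term. The class name WorkHeading, the field names used as map keys and the matching test are all hypothetical.

```java
import java.util.Map;

// Hypothetical sketch of deriving a work title or work author from one record.
class WorkHeading {

    // Step 2 priority order for titles, as described in the text.
    static final String[] TITLE_FIELDS = { "uniform title", "main entry title" };

    // Step 2 priority order for authors, as described in the text.
    static final String[] AUTHOR_FIELDS = {
        "main entry personal author", "main entry corporate author", "main entry conference",
        "added entry personal author", "added entry corporate author", "added entry conference"
    };

    // Keep letters and digits only, lower-cased.
    static String key(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isLetterOrDigit(c)) sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }

    static String derive(Map<String, String> record, String[] fields, String queryTerm) {
        // Step 1 (assumed): prefer the first field that matches the query term.
        if (queryTerm != null && !key(queryTerm).isEmpty()) {
            String q = key(queryTerm);
            for (String field : fields) {
                String value = record.get(field);
                if (value != null && key(value).contains(q)) return value;
            }
        }
        // Step 2: otherwise take the first available field in priority order.
        for (String field : fields) {
            String value = record.get(field);
            if (value != null) return value;
        }
        return null; // no usable field in this record
    }
}
```

Under these assumptions the same record can yield different work titles in different retrievals: a title search for "hard times" selects a matching main entry title, while a search with no matching title term falls back to the uniform title.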
If there is a title query term the following subtags are tested in sequence until one matches:
---------------------- heading selection ---------------------
The following titles have been found (total 8):

No.  works  hdg
1.   1      [T] Chemical engineering
2.   18     [T] Chemical engineering
3.   1      [T] Chemical engineering : a universal code as an aid to chemical systematics
4.   1      [T] Chemical engineering
5.   1      [T] Chemical engineering : introductory aspects
6.   1      [T] Chemical engineering : an introduction
7.   1      [T] Chemical engineering
8.   7      [T] Chemical engineering

It is far from clear what the user is supposed to make of this. In fact, upon examining the full records, it turns out that heading 2 contains manifestations of Coulson and Richardson's "Chemical Engineering". Heading 7 also contains a manifestation of this work.
BOPAC2 differs from these systems, not only from the user's perspective (in terms of the information that is displayed) but more fundamentally in terms of the purpose of the clusters and the way they are constructed. The "collapsed headings" displays shown by other OPAC's go some way towards meeting the display collocation objective, but this is not their only or primary purpose. Such a display (which is in fact quite rare) is essentially a development of the browsable headings display (which is more common). In a list of browsable headings the user is shown a relevant extract from the index of titles or authors (depending on the query) along with numbers of entries under each heading. An example is shown below for a search for title "Hard Times" at Trinity College Dublin.
     Title                                                            Number of Titles
     1.  Hard tennis courts : Hints for groundsmen (Title-Gen)        1
     2.  Hard terms : unemployment and supplementary b... (Title-Gen) 1
     3.  Hard Texas trail (Title-Gen)                                 1
     4.  Hard things and soft things (Title-Gen)                      1
     5.  The hard thorn (Title-Gen)                                   1
     6.  The hard time bunch (Title-Gen)                              2
>>>  7.  Hard times (Title-Gen)                                       33
     8.  Hard times : an authoritative text, backgroun... (Title-Gen) 1
     9.  Hard times: an oral history of the Great Depr... (Title-Gen) 1
     10. 'Hard Times' and culture (Title-Gen)                         1
     11. Hard times and easy terms : and other tales b... (Title-Gen) 1
     12. Hard times at Batwing Hall (Title-Gen)                       1
     13. Hard times by Charles Dickens (Title-Gen)                    1

Looking at these browsable lists of headings, the legacy of the card catalogue is obvious. What is shown here depends heavily on the cataloguer's choice of main entry and other decisions about which headings to list an item under. In other words, these displays are governed by the filing rules which are of little relevance in the OPAC world. They may assist with retrieval, but are still essentially descriptive and relate to the catalogue as a whole rather than to the retrieval set itself.
BOPAC2's purpose is not to describe the catalogue's filing sequence but to allow the user to investigate the retrieval. Its clusters are created and displayed within the context of the retrieval set itself, rather than the whole catalogue. The information displayed to the user is adjusted to reflect the size and content of the retrieval. The interactive sorting and keyword search make it possible for the user to discover relationships between the records within the retrieval. The clusters and the display are matched to the user's search terms. In a distributed Z39.50 environment, in which the retrieval may contain items from several different databases, this approach is an essential step towards a display which truly fulfils the collocation function of the catalogue.
The Project began with a short survey of public-domain Z39.50 component packages, in an attempt to identify one which could be used as the basis for the BOPAC2 application. A survey of Z39.50 clients has been carried out by Andrew Wood [22], although this is more from the perspective of the end-user than the developer. From this survey, and other information gathered from the Z39.50 Maintenance Agency Page [23], several likely-looking Z39.50 packages were downloaded (mostly from Juha Hakala's archive of public-domain Z39.50 packages [51]) and installed. The packages examined were:
Although the fundamental concepts explored in BOPAC 1 remain in BOPAC2, the environment and the purpose of the two projects are quite different. Whereas BOPAC 1 contained its own rudimentary search and retrieve mechanism, BOPAC2 uses Z39.50. BOPAC 1 was a prototype demonstrator system. It was decided therefore not to attempt to re-use the prototype software directly. The display features and clustering algorithms developed in BOPAC 1 were the starting point for the design of BOPAC2.
Of the remaining options, option 3 was the preferred choice. Other options (a standalone Windows program, a browser plug-in, or the use of Active X) would have limited BOPAC2 to one platform (Windows) or one web browser. This would have limited accessibility given that UNIX is used widely on the Bradford campus and at many other universities. The Java option would encompass the PC, UNIX and the Mac.
Also, and perhaps more importantly, with the standalone program or the browser plug-in, users would have had to download and install the software for themselves. This would have entailed formally releasing the software and would have made it difficult or impossible to make minor corrections to the code afterwards. The advantage of the Java route is that there is no need for users to download any software by hand. Every time the code is updated, it is automatically refreshed on the user's workstation the next time he/she runs a search. Thus there is no need to "release" the software as such. The software can evolve whilst it is being tested, and problems can be continuously corrected.
This arrangement was also convenient for the Project team, one of whom (Fred Ayres) was employed on a part-time basis and was working from home. Equipped with a modem and web browser he was able to participate fully in the development process and to give his reaction to changes to the software, as well as experimenting with the Z39.50 targets at weekends when the Internet was more responsive.
When the Project was first proposed early in 1996, Java had support from every corner of the computing industry and seemed almost certain to become ubiquitous. It was assumed that during the lifetime of the Project, many libraries and campuses would upgrade their Windows 3.1 systems so that they were capable of running Java. In fact, at the end of the Project, Windows 3.1 is still apparently very common. The future of Java on the PC is rather less certain, with Microsoft and Sun Microsystems wrestling for control of the language, but at the moment it remains an effective cross-platform technology. BOPAC2 was the first Z39.50 origin in the world to use Java. As the Project proceeded, other larger projects such as Willow [24] also developed Java interfaces. CIMI [25] are demonstrating a Java-based target created by Blue Angel Technologies [26] who also have their own Java product, MetaStar. Librarians are even using traditional "green screen" Telnet-based OPAC's via Java [27] which suggests that Java-capable browsers are spreading onto library desktops.
It is also worth pointing out that, unlike conventional Z39.50 clients such as BookWhere [28] or ZNavigator [29], the Java-based architecture of BOPAC2 is ideally suited to an intranet. Network bandwidth, always a problem on the Internet, is a more controllable problem over a localised intranet. For libraries, there are strong arguments in favour of using Network Computers rather than a bank of expensive PC's [30].
The Europagate WWW-to-Z39.50 gateway was installed on a separate web server in the Department of Computing. Installation involved changing several lines in the main Makefile to adapt it to the department's UNIX environment. After a few attempts, the gateway was up and running. Initially, the only modification made to the system was to change the list of available Z39.50 targets to include some known UK targets.
Although the simultaneous search component of the software was only experimental, it proved reasonably stable in use. A few bugs in the CGI scripts caused problems but these were fixed with the help of Adam Dickmeiss at Index Data. The Europagate CGI scripts have been created by Index Data using their other toolkits, YAZ and IrTcl [32]. The Tcl scripts use standard Tcl, IrTcl (an extension of Tcl to handle Z39.50 z-associations) plus some extra custom Tcl commands. Modifying these scripts to suit the purposes of the Project was relatively difficult because they were not documented and had to be "reverse-engineered". These difficulties were complicated by the way in which the gateway joins stateless HTTP to stateful Z39.50. It uses a set of concurrent interacting UNIX processes and FIFO's (named pipes) which occasionally refused to "die" if the Tcl scripts were buggy. This meant that bugs in the Tcl scripts could leave rogue processes running on the departmental web server. Following two occasions where these rogue processes crashed the web server machine, it was decided to isolate the Europagate gateway onto its own web server. Since then, it has proved fairly stable.
One problem which was anticipated concerned the evaluation of the system. Users would find it difficult to distinguish those components of the interface which were derived from the Europagate software and those created as part of BOPAC2. The Project would have to evaluate both the Europagate and its own software together since it would be virtually impossible for users to separate these two in their own minds. Users would inevitably respond to the system as a whole. This did indeed prove to be a problem, and some of the feedback that we received was about the Europagate software.
Related to this was the question of the overall "look and feel" of the interface. The Europagate software, as supplied, uses standard HTML forms to control both search and display of results. The Project's assertion was that the display side needed to be genuinely interactive, but that such interactivity cannot be achieved with HTML forms and that a Java applet was more suitable. The result was that the interface was divided into two distinct parts: the first part derived from the Europagate system and the second part the Java applet created within the Project. These two parts remain fairly distinct, and although it is clear how to proceed from one to the other the interface does not seem to "flow" smoothly.
Since BOPAC2 is not specifically about Z39.50 as such, it seemed logical to re-use as much of the existing Z39.50 resources as possible, even if the result was apparently disjointed. Europagate provided all the required Z39.50 communications subsystem plus the search interface. This left the Project free to concentrate its limited time on the core area: the processing and display of the results.
Having made the decision to use Europagate, and to use as much of it as possible, the overall structure became clear. It would start as a standard WWW-to-Z39.50 gateway installed and running in the Department of Computing at the University of Bradford. It would communicate with users via the Web and a Java applet, and with Z39.50 targets via Z39.50.
The Europagate software, and most of the targets used, operated according to Z39.50-1992 rather than Z39.50-1995. Some facilities of Z39.50-1995 were available on some targets but since the Project was concerned with searches across several targets, and since the Europagate software offers few of the new features from the 1995 standard, Z39.50-1992 was the lowest common denominator. Whilst some targets in the USA (e.g. the Library of Congress) offer a spectacular range of access points, others are more limited. The Z39.50 target at Bradford University, for example, can search only personal authors, not corporate authors or conferences.
The Europagate software, when working across multiple targets, chooses one of them and presents the user with its access points (Bib1 Use attributes). The hope is that these access points can be applied to the other targets in the search, but of course this is not always the case. Faced with a search across two targets, X and Y, the Europagate software may mask out access points from target X if they are not supported by target Y. Or it may offer access points from target X even if they are not supported by target Y. This is a pragmatic if haphazard compromise; hopefully the Explain facility (which was introduced in Z39.50 version 3) will improve matters in the future.
Europagate can map the same access point name (e.g. "personal author") onto different Bib-1 attribute combinations on different targets. This is a useful facility which makes it possible to present the user with standard names for access points across targets which implement those access points using different combinations of Bib-1 attributes. Thus it was possible to present a fairly standard set of easily-understood named access points across all the targets; these are discussed in detail below.
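The idea of one user-facing name standing for different attribute combinations can be sketched as a two-level lookup table. This is not how Europagate stores its configuration, which the report does not describe; the class name AccessPointMap, the target names and the "type=value" notation for attributes are assumptions chosen for illustration. The values used are Bib-1 Use attributes (type 1): 1 for personal name, 1003 for author and 4 for title.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical table: target -> (access point name -> Bib-1 attribute combination).
class AccessPointMap {

    private final Map<String, Map<String, String>> byTarget = new HashMap<>();

    // Register how one target implements a named access point.
    void define(String target, String accessPoint, String attributes) {
        byTarget.computeIfAbsent(target, t -> new HashMap<>()).put(accessPoint, attributes);
    }

    // Returns the attribute combination to send to this target,
    // or null if the target does not support the access point.
    String attributesFor(String target, String accessPoint) {
        Map<String, String> points = byTarget.get(target);
        return points == null ? null : points.get(accessPoint);
    }
}
```

For example, after define("targetA", "personal author", "1=1") and define("targetB", "personal author", "1=1003"), a single "personal author" choice in the search form is translated differently for each target.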
Among the UK targets in particular, many had only just been installed when they were added in to BOPAC2 and it was necessary to rely on software vendors' information about their systems, which was not always accurate. In these cases the only way to find out precisely how a particular target responded was by experiment, but there was not enough time to conduct exhaustive investigations of every target.
One or two targets offered more specific title searches such as uniform title or collective title but these were not usually included for the sake of simplicity; in the BOPAC2 model such precise distinctions can be added in after the retrieval phase. In the case of the British Library Document Supply Centre, Series Title was included as a separate access point for precise known-item searches.
BOPAC2's approach incorporates knowledge about the targets' search semantics into the origin to generate appropriate access points. This is a function of the Europagate software, and the Europagate gateway itself [31] works the same way. Other systems, such as the Library of Congress Z39.50-WWW gateway [35], use a similarly adaptive interface. This is obviously a better approach but it involves extra administration in updating the origin every time the targets add new access points.
The Explain facility solves the problem by forcing the origin to interrogate the target to find out what attributes (and thus access points) are available. Explain is a version 3 facility and was not used in BOPAC2. Explain encapsulates the semantics of the target, and different targets will have different semantics. Under Z39.50 version 3, therefore, the origin should interrogate the relevant targets using Explain, identify the set of available attributes on each target, and then compute the intersection of those sets so as to present the user with suitable access points which will be meaningful across all the targets (or at least most of them).
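The intersection step can be sketched as below. The Explain exchange itself is not shown: the per-target sets of Bib-1 Use attribute numbers are simply supplied by the caller, and the class name AccessPointIntersection is hypothetical. The sketch covers only the strict case (attributes supported by all targets), not the looser "most of them" variant.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: finds the Use attributes that every target supports.
class AccessPointIntersection {

    static Set<Integer> common(Collection<Set<Integer>> attributesPerTarget) {
        Set<Integer> result = null;
        for (Set<Integer> attributes : attributesPerTarget) {
            if (result == null) {
                // Start from a copy of the first target's attributes.
                result = new LinkedHashSet<>(attributes);
            } else {
                // Keep only what this target also supports.
                result.retainAll(attributes);
            }
        }
        return result == null ? new LinkedHashSet<Integer>() : result;
    }
}
```

If one target offers personal name (1), corporate name (2), conference name (3) and title (4), and another offers only personal name and title, the user would be shown just the two shared access points.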
The approach taken in the Project, which was less dependent than other Z39.50 origins on the Z39.50 search mechanism itself, was more robust and better at smoothing out the differences between different targets. The user is shielded from the complexity of Z39.50's search and retrieve mechanism. The result is a friendlier and more accessible distributed search interface.
Nevertheless there were lessons to be learned from the design of the BOPAC1 interface sequence and other GUI OPAC's. Hildreth [37] suggests three facilities that would be useful in a GUI OPAC:
This meant that the workstation's memory capacity would limit the size of the retrieval set which could be processed. Network delays would also cause problems for users in downloading large retrieval sets; but it was impossible to predict how serious these problems were going to be.
BOPAC2 took a more object-oriented approach by defining a series of classes. First, the base class BibliographicRecord was defined. The generic MARC class was a subclass of BibliographicRecord, and inherited from MARC were the specific classes UKMARC and USMARC. MARC tags were represented by a name (e.g. "main entry title") rather than a tag number ("245a") so that they were not tied to any one particular MARC format. Other MARC formats can be incorporated as subclasses of the generic MARC class. Other bibliographic formats (not based on MARC) can be incorporated as subclasses of BibliographicRecord. Whether this approach is better or worse than simple format conversion is not clear, but it does demonstrate another way of tackling the problem which could become popular as more and more projects use object-oriented software development.
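A minimal sketch of this class hierarchy follows. The class names BibliographicRecord, MARC, UKMARC and USMARC come from the text; the methods (get, put, tagFor) and the internal map are assumptions, since the report does not give the actual interfaces. Only the "main entry title" to "245a" mapping quoted above is shown; genuine differences between the two MARC formats would live in each subclass's tagFor method and are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Base class: callers ask for fields by name, never by tag number.
abstract class BibliographicRecord {
    abstract String get(String fieldName);
}

// Generic MARC record: stores values by tag plus subfield code (e.g. "245a").
abstract class MARC extends BibliographicRecord {
    private final Map<String, String> subfields = new HashMap<>();

    void put(String tagAndSubfield, String value) {
        subfields.put(tagAndSubfield, value);
    }

    // Each MARC format supplies its own mapping from field name to tag.
    protected abstract String tagFor(String fieldName);

    @Override
    String get(String fieldName) {
        String tag = tagFor(fieldName);
        return tag == null ? null : subfields.get(tag);
    }
}

class USMARC extends MARC {
    @Override
    protected String tagFor(String fieldName) {
        if ("main entry title".equals(fieldName)) return "245a";
        return null; // further USMARC mappings would be added here
    }
}

class UKMARC extends MARC {
    @Override
    protected String tagFor(String fieldName) {
        if ("main entry title".equals(fieldName)) return "245a";
        return null; // further UKMARC mappings would be added here
    }
}
```

Display and clustering code written against BibliographicRecord then works unchanged whichever format a target returns.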
The BOPAC2 interface consists of the combined retrieval set (combined from the various Z39.50 targets) and of views of that set. Each view represents a subset of the records in the combined retrieval set displayed in a specific format. Each time the user interacts with the system, a new view is generated. The new view contains a subset of the records of the current view. So for example the first view the user sees contains the whole combined retrieval set and may consist of several works. If the user chooses one of these works, a new view is generated containing the subset of records which pertain to that work, displayed in full or partial record format.
There are many different classes for the different views; all are ultimately derived from a single base class (SetView). Another class (ClusterView) acts as a base class for viewing clusters. Specific subclasses (e.g. AuthorClusterView, PublisherClusterView) are then derived from these.
The interface contains a linked list of views. As the user works with the system they generate new views which are added to this list. The "back" and "forward" buttons move the user back and forth within the list. The interface is stateful and each view records a snapshot of the state. As the user retraces their steps back through previous views, the previous states are restored.
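The navigation behaviour can be sketched as a list with a current position. The names (ViewHistory, show, back, forward) are hypothetical, an ArrayList with an index stands in for the linked list mentioned above, and discarding the "forward" views when a new view is generated after going back is an assumption modelled on web browsers, not something the report states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical history of views; V would be a view object holding its own state snapshot.
class ViewHistory<V> {

    private final List<V> views = new ArrayList<>();
    private int position = -1; // index of the view currently displayed

    // Called whenever a user action generates a new view.
    void show(V view) {
        // Assumption: views ahead of the current position are dropped.
        while (views.size() > position + 1) {
            views.remove(views.size() - 1);
        }
        views.add(view);
        position = views.size() - 1;
    }

    // Step back one view if possible and return the view now displayed.
    V back() {
        if (position > 0) position--;
        return current();
    }

    // Step forward one view if possible and return the view now displayed.
    V forward() {
        if (position < views.size() - 1) position++;
        return current();
    }

    V current() {
        return position < 0 ? null : views.get(position);
    }
}
```

Because each stored view carries its own snapshot, returning to an earlier entry restores the display exactly as the user left it.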
There is a class for the rather abstract concept of a cluster. A cluster object consists of a label (which names the cluster) and a set of contained records. This is used in different ways throughout the system. A cluster might for example be labelled with an author name and may contain records of items written by that author. Another cluster may be labelled with the date range (e.g. 1900-1990) and contain records with publication dates in that range.
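A cluster of this kind needs very little structure, as the following sketch suggests. The class name Cluster matches the concept in the text, but its fields, the byDateRange factory method and the treatment of the range as inclusive at both ends are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A label plus the records it contains; records are represented here by their ids.
class Cluster {
    final String label;
    final List<String> recordIds = new ArrayList<>();

    Cluster(String label) {
        this.label = label;
    }

    // Build a date-range cluster from a map of record id -> publication year.
    // Assumption: both ends of the range are inclusive.
    static Cluster byDateRange(int from, int to, Map<String, Integer> yearByRecord) {
        Cluster cluster = new Cluster(from + "-" + to);
        for (Map.Entry<String, Integer> entry : yearByRecord.entrySet()) {
            int year = entry.getValue();
            if (year >= from && year <= to) {
                cluster.recordIds.add(entry.getKey());
            }
        }
        return cluster;
    }
}
```

An author cluster would use the same class, with the author's name as the label and the ids of that author's records as its contents.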
Figure 1 Some typical routes through the BOPAC2 interface
Java's cross-platform promise turned out to be rather wide of the mark. Sun Microsystems (the inventors of Java) promise "write once, run anywhere" but the reality was more a case of "write once, debug everywhere". It seemed that each time the system was tested on a new platform, a new web browser, or a new version of a web browser, it had to be modified. The problems were nearly always to do with the display and the AWT peer classes. Both Netscape and Internet Explorer suffer from bugs in their early implementations of the AWT peer classes. Internet Explorer, for example, could not handle scroll bars correctly. There were also problems with X-Windows in monochrome and differences between X-Windows and Windows 95/NT.
Symantec Café was purchased with the original Java Development Kit (JDK 1.0.2). JDK 1.0.2's Abstract Windowing Toolkit (AWT) was notoriously weak, and substantial effort was required to make it display text in a useful way. During the development phase a new JDK (JDK 1.1) was released by Sun Microsystems, with an improved AWT. However it took some time for Symantec to incorporate the new JDK into Café, and the older Java-capable web browsers (e.g. Netscape version 3) can only handle JDK 1.0.2. Later browsers can also handle some parts of the new JDK but it was decided, in the interests of accessibility, to stick to JDK 1.0.2 as the lowest common denominator. The only facility from JDK 1.1 that was used was JAR archiving. Some web browsers can use this feature to cache the Java applet on the workstation, so improving performance.
Searching for the keywords "Java" and "applet" in the online proceedings of ACM CHI 96 and 97, one of the premier conferences on human computer interaction, gives an indication of how much, or how little, work has been done on HCI aspects of Java based systems. In the 96 proceedings searching for applets produced no hits and searching for Java, only one. That article in fact only discussed Java as a future option, saying that had it been available earlier it would have been a way round some HTML limitations. By contrast searching the 97 proceedings produces 4 hits for applets and 6 for Java.
From the 97 proceedings it is clear that some general lessons for Java development are starting to emerge, many of which are related to general WWW issues. In general designers must be conscious that they "don't know the capabilities or configuration of the applet user's machine" [38]. This parallels the general situation with the WWW where you don't know the user's browser's capabilities, i.e. whether they have images on, what colours they are using and what screen size. Another important factor in Web applications in general and one that is being studied by HCI practitioners is that the time for responses and interactions is often not under control (or the "World Wide Wait" issue as it is often known).
Although there are a number of tools such as validators and "linters" for web pages, similar tools for applets are still developing. The Project tackled these issues by testing the system on as large a number of browsers (including different versions of the same browser) and operating systems as possible. This highlighted problems with the Java AWT in some situations and led us to make sure that features such as the use of colour were duplicated by the use of different fonts so that a similar effect would be seen on colour and monochrome screens. This was important to us if we were to realise the cross platform potential of Java, so we did not want to specify any system constraints, such as particular browser or screen size, other than Java support. This strategy would seem to have been successful: one platform that we did not have facilities to test on was the Macintosh, but one questionnaire respondent commented that he was pleased to see it work first time on a Mac.
The policy undertaken by the Project was to use a fairly simple single Java applet of a modest screen size and concentrate the controls at the top of the applet. This meant that most users would see all the controls on start up and have a standard view of the application. One area of interface design that we were conscious of, and was commented on by users, was the lack of consistency between the Java applet and the HTML pages derived from the Europagate software that controlled the searching part of the system. The Europagate software has a standard "HTML forms and CGI" look and will look familiar to users who have used other non Java based Web interfaces to databases. Rewriting this part of the system was beyond the terms of reference of this project, but would be an obvious step forward in creating a complete easy to use system with a coherent "look and feel".
Once the first prototype version had been largely debugged, a list of potential evaluators was drawn up to test the system.
Z39.50 has a number of facilities for reducing a large result set. Most of these are concerned with displaying the results in brief or reduced form so that the user can select which records to retrieve in full. None of these facilities, however, can successfully fulfil the second objective of the catalogue. The main reason for this is that the model of the result set, as a group of unrelated items, does not reflect the real situation in which many of the retrieved records will be related in some way. Were these relationships to be modelled in the result set, they would provide the best mechanism for selecting relevant records to retrieve. This section describes how such relationships can be represented in the Z39.50 result set as they are in the BOPAC2 retrieval, using a cluster model.
In cases where the origin searches just one target, the use of such clusters would be more efficient than the traditional brief record/full record approach suggested by Z39.50. It would also lead to a more useful and informative user interface at the origin. In the case of a distributed search, a "meta-target" model is suggested as a way of clustering retrievals from distinct, remote catalogues.
Currently the Z39.50-1995 standard has emerged as the standard for interoperability between library catalogues. Whilst Z39.50 is by no means limited to bibliographic applications, such applications form a major part of current development work. The following discussion will focus on Z39.50 purely in the bibliographic domain, from the point of view of librarians and (most importantly) remote catalogue users. The aspects of Z39.50 discussed here do not necessarily apply to the whole Z39.50 standard in general.
Z39.50-1995 is "within the application layer of the OSI network model" [39]. As such, pedantically speaking, it is not really concerned with physical networking issues such as bandwidth and transmission rates. However, as with many similar protocols in the application layer (FTP, HTTP) some aspects of the standard have clearly been developed with a view to the unpredictable and limited bandwidth of the underlying network layer, which in practice is nearly always TCP/IP. In Z39.50 terms, bandwidth affects the quantity of APDU's transmitted between origin and target, and the size of individual APDU's.
Z39.50 does not specifically address the question of how the origin decides what to retrieve from the target. Usually, the origin consults the user who makes a decision based on information supplied by the target as part of previous operations. This is a crucial step in the whole search and retrieve cycle. The user is interacting with the process, deciding what to retrieve, and he/she must have enough information to make this decision. So how does Z39.50 help?
The major disadvantage of this approach is that the user is asked to reformulate their search, perhaps add terms to the search, without knowing anything about the large result set they have generated. For example, how can the user be expected to filter the results by publication date when he/she may have no idea of the range of dates in the result set? This may explain why, on conventional OPAC's, few people bother to reformulate their query to reduce the number of hits [18]. In the Z39.50 environment the user is potentially working with an unfamiliar catalogue, which may or may not have the capability to filter the result set, and he/she is even less likely to attempt to reduce the result set by reformulating the query.
Targets that handle MARC records mostly seem to include the same core elements (author, title, and publication details) in their Brief records. Those that use unstructured records such as SUTRS (see later) are less predictable and their Brief records are unstructured. Potentially, these variations can cause problems when conducting a search across several targets. The Z39.50 ATS1 profile [40] could have pinned down the content of the Brief element set, but it didn't.
The content of Brief records is determined purely syntactically. They are simply a subset of the elements of the "Full" record. If there is vital information outside the scope of the Brief element set, the user will not see it. This is the cause of a common problem of most Z39.50 systems, where the target hits items which do not appear to match the user's original query. Brief records may not necessarily contain the elements which explain to the user why the record was hit. Examples of this are shown in Appendix 1.
Appendix 1 again shows examples of such repetition. The problem is particularly bad where the search is for famous works which have been published in several different editions. It is a well-known problem on conventional OPAC's [14]. The extent of the repetition will depend on the scope and content of the database, the type of search, the search terms, and other factors.
SUTRS is a Full record format, so the records generally contain more information than Brief records. Yet SUTRS records are smaller and less detailed than MARC records, so they can be transmitted more rapidly over a limited bandwidth network. Potentially, SUTRS records could be regarded (from the user's point of view) as a compact compromise between Brief records and MARC records. In addition, a clever target could ensure that each record contains one element which explains why the record was retrieved.
Unfortunately, few if any origins or targets at the moment work in this way. SUTRS records are typically used as a substitute for MARC records rather than a step towards them. They usually resemble the back-end catalogue's full record display, loosely modelled on the ISBD. The format of full record displays varies quite considerably among conventional OPAC's [42]. Further research is needed to determine how SUTRS records vary across Z39.50 targets, but the main point of SUTRS is that the format is left up to the target. It is unlikely that the format of SUTRS will ever be standardised sufficiently that it can be used for distributed searching.
SUTRS is a record syntax whereas Brief is an element set. Thus it is possible to have Brief records derived from SUTRS. SUTRS-based targets can return brief records, but the indications are that this makes the problems even worse (see the COPAC example in Appendix 1).
There is considerable semantic overlap between the elements of the Summary syntax and the tags of a MARC record. It seems almost certain that if any bibliographic targets were to use Summary syntax, the records would be machine-generated by selecting specific tags from corresponding MARC records.
This is a very flexible and powerful facility which enables the origin, rather than the target, to decide which elements will be included in the records returned by the PresentRequest. It gives the origin more control over the content of the records and enables a standardised retrieval pattern across different targets.
However, the origin is still working "blind" because the contents of the elements cannot be determined precisely; they depend ultimately on the cataloguing. Element specifications do not solve the problem of data repetition either, because the origin has to download the elements before it can tell that they contain repeated data. Currently few if any targets or origins implement eSpec1.
The introduction of an increasing number of generic syntax-oriented facilities to the standard has to some extent diluted the original idea of a pure machine-to-machine protocol. Nevertheless, many origins such as BookWhere [28] and Willow [24] attempt to summarise the results for the user by retrieving a full MARC record and extracting from it appropriate tags to generate something like the conventional OPAC's short title display. Other systems such as ZNavigator [29] and the Library of Congress WWW-to-Z39.50 gateway [35] attempt the same operation using Brief records. This is more risky (because the content of Brief records is not defined within the Z39.50 standard) but it saves network bandwidth and hence improves response time. Fretwell Downing's OLIB system also post-processes the retrieval [44].
In generating the short title display most origins carry out the same familiar transformation. They simply extract a subset of tags from the MARC record. As with Brief and SUTRS formats, this subset of tags may or may not explain why the record was hit. Creating "short titles" at the origin, rather than the target, can deliver a prettier display format to the end user, but in terms of content, there is the same potential for confusion over spurious records.
Firstly the keys are not clearly defined. The Sort operation may involve a set of attributes, an element specification, or some other target-specific aliases or database-specific keys.
Secondly, there are loopholes. The sort key may contain multiple Use attributes, in which case the target must decide which one to use. Even if there is only one Use attribute the situation is not clear. For example, bib-1 maps Use attribute 1003 (Author) onto 12 different MARC tags. From these tags, and potentially from repeating subtags within the tags, the target extracts (presumably) one single sort key. The origin cannot predict with any certainty how the target will make this selection. Perhaps the target will choose corporate author in preference to personal author, or perhaps the other way round. This is less of a problem with element specifications, where there is an implied priority, but there is still the question of repeating values. There is no profile detailing how Sort should be used in the bibliographic context, and currently few targets provide the Sort facility.
Conventional OPAC's usually make some attempt at collocation, or at least provide a facility to collocate the output, although the results are sometimes far from satisfactory [14]. If the result set is large and unordered (or badly ordered) the user is unlikely to plough on right to the end of it. He/she will probably give up before the end, satisfied with perhaps one or two relevant hits, and may never see other related records further on in the result set.
Currently most Z39.50 origins and targets produce apparently unordered retrievals which exhibit no collocation at all. The Sort facility should help to improve matters, but it is ambiguous and is unlikely to collocate the results any better than a conventional OPAC. In addition, in the Z39.50 environment, the origin typically retrieves records from the target in groups of 20-30 records at a time. If bandwidth is limited, the user has to wait each time to retrieve the next group of records. If the number of records is large, he/she is even more likely to give up early, and may miss relevant records towards the end of the result set because they take too long to download.
Version 4 of Z39.50 will include the new Type 102 Ranked List query which ranks the result set according to a relevancy score, so that the most relevant records will be those at the start of the result set. Such an approach may work for subject-based searching, but is not so applicable to known item author/title searches which are by far the most common type on many OPAC's (see for example Appendix 5).
At best, the user searching remote networked catalogues with Z39.50 will face the same problems as he/she would face searching his or her own local catalogue. More likely is that the display deficiencies will be exacerbated by the user's lack of familiarity with the remote catalogues (although not nearly as much as if he/she were searching the catalogue through its own native interface). At worst, however, the user will have to cope with substantial variation (in terms of search formulation and record formatting) between Z39.50 targets.
It could be argued that a more tightly-defined profile (or full adherence to the profile) will improve inter-operability and enable distributed library projects such as MODELS [46] to come to full fruition. However, Z39.50 seems to be moving in the direction of generalisation rather than specialisation. Lynch, in his paper on the future of Z39.50 [47], points out that it is difficult to reconcile the future (apparent) direction for Z39.50, with the inter-operability and shared semantics required for effective distributed searching. In this Project, Z39.50 is used as a semantically rich protocol.
Z39.50-1995 [39] defines both the GRS-1 syntax and TagSet-M (the meta-data tagSet [48]). TagSet-M includes an element Record which is designed "to present nested or subordinate records". Each subordinate record may have a schema specified (with the tagSet-M schemaIdentifier) which may be independent of the schema of the record it is contained in, and independent of the other subordinate records. Subordinate records may themselves contain their own subordinate records. Thus, with TagSet-M and GRS-1, it is theoretically possible to model any hierarchical heterogeneous relationships within the target datasets.
The work clusters in BOPAC2 can be seen as a special case of this general model. Currently they are based on MARC records, which have a well-defined schema which is shared between target and origin. However the elements used in a work cluster could be described by tags in TagSet-G (the general tagSet), which separates the cluster from any one particular schema. [49] suggests that TagSet-G elements should have generic semantics, which can be made more specific within a particular schema. Thus work clusters could be either generic or specific to one schema.
The work clusters might be modelled in four parts as follows:
GRS-1 makes provision for any element to have a hit vector which can be attached to the author/title or the descriptor parts to show where these elements were hit. This could be used, for example, to highlight the position of keywords within the matching title or statement of responsibility.
How would this work in practice? Typically an origin would issue a SearchRequest. The target would either (a) search a database of cluster records, or, (b) search the underlying bibliographic database and generate cluster records on the fly. The origin would initiate a Present operation to retrieve the cluster records which would be displayed to the user. This would look similar to the first screen in BOPAC2. The user would then select appropriate clusters. The origin would then retrieve the records contained in the selected clusters and display them. Depending on what these records were, there could be further dialogue with the user, and more PresentRequests to retrieve other contained records. The user might potentially end up with a varied set containing record clusters, bibliographic or meta-data records, complete documents, abstracts/tables of contents, or other kinds of data.
Yet the BOPAC2 cluster model enables these disparate types of data to be presented within an easily-navigable, hierarchical display model. Navigating the BOPAC2 cluster model can be thought of as "opening up" the cluster "to see what's inside". It is an iterative process: successively narrowing scope and increasing detail. The target decides the level of detail that it supplies to the origin; the origin decides how much to show the user.
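The idea of "opening up" a cluster can be illustrated with a small nested structure. The sketch below is a plain Java model and not a GRS-1 or TagSet-M implementation: the ClusterNode class, its methods, and the convention that a node without children stands for a single record are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// A node is either a cluster (it has children) or a single record (no children).
class ClusterNode {
    final String label;
    final List<ClusterNode> children = new ArrayList<ClusterNode>();

    ClusterNode(String label) {
        this.label = label;
    }

    // Returns this node so that calls can be chained while building a tree.
    ClusterNode add(ClusterNode child) {
        children.add(child);
        return this;
    }

    boolean isLeaf() {
        return children.isEmpty();
    }

    // Number of individual records reachable from this node.
    int recordCount() {
        if (isLeaf()) return 1;
        int n = 0;
        for (ClusterNode c : children) {
            n += c.recordCount();
        }
        return n;
    }

    // "Opening up" the cluster: one more level of detail, nothing deeper.
    List<String> open() {
        List<String> lines = new ArrayList<String>();
        for (ClusterNode c : children) {
            lines.add(c.isLeaf() ? c.label : c.label + " (" + c.recordCount() + " records)");
        }
        return lines;
    }
}
```

Each call to open() corresponds to one step of the dialogue: the user sees the labels and sizes of the contained clusters and chooses which to open next.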
It must be stressed that the above is not intended to be an exhaustive or optimal proposal. It simply demonstrates that the clustering approach can be modelled generically and fairly easily in Z39.50. An origin and target capable of processing cluster records could engage in a dialogue with the user which would resemble that of BOPAC2, but would be more efficient than the existing Search/Present cycle. The elimination of repetition and redundancy in the retrieval has two benefits. It reduces the total amount of data transmitted, and hence makes better use of limited bandwidth. And it reduces the amount of extraneous material for the user to look through.
Since there is currently no way to handle duplication in Z39.50, BOPAC2 removes duplicates in the retrieval at the origin. The proposed Z39.50-based solution appears to have some disadvantages in that it is difficult to see how clusters of duplicates could be assembled in a distributed search environment. In fact the same argument applies to the general clustering model described here (see below). The other problem with duplicate handling at the target is that although the origin will be able to have some influence on how duplicates are detected, the actual detection algorithms will not form part of the standard and are likely to differ from one target to another; another pothole on the road towards effective interoperability.
At first sight the above proposal might appear awkward in a distributed search context where the origin attempts to fuse together the results from different servers. The merging or fusing together of clusters would require the origin to record where the constituent records within the clusters originally came from. A similar problem arises with the Scan and Sort facilities which do not map well onto a distributed search environment. Europagate does not allow scan across multiple targets.
Another problem is one of performance. The origin must maintain all the connections with the targets whilst the user interacts with them. Across the Internet the reliability of the network may not allow such connections to be maintained for long periods of time. Even if the connections can be maintained their throughput is likely to vary unpredictably from time to time. So a target which responds quickly during initialisation may suddenly slow down or time out for no apparent reason, a phenomenon which has been observed many times during the course of the Project while experimenting with remote Z39.50 targets. It could be argued that BOPAC2's current technique of downloading the result set onto the client workstation may deliver more consistent performance than leaving it on the target.
The simplest arrangement for distributed search is an origin which initiates multiple z-associations to different targets. But this is only one possible arrangement. The MODELS project [46] suggests using a "search broker" which mediates between the origin and the remote targets. With this architecture there would be (at least) two ways to cluster the retrievals. The search broker could fuse together the remote target's clusters to create "virtual" clusters, which is what the origin would retrieve. Alternatively it could add an extra layer of clusters over the top, creating clusters of clusters. The same approach can be applied to WWW-to-Z39.50 gateways. Because such a mediator system would participate in both search and retrieval, the term "search broker" does not accurately describe it and "meta-target" is suggested as a better term.
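The first option, fusing remote clusters into "virtual" clusters, could be sketched as below. This is only an illustration under stated assumptions: clusters from different targets are matched by identical label, which a real meta-target would need to replace with proper matching, and the MetaTarget and SourcedRecord names are hypothetical. Each fused record remembers the target it came from, so that it can still be retrieved later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A record id together with the target that supplied it.
class SourcedRecord {
    final String target;
    final String recordId;

    SourcedRecord(String target, String recordId) {
        this.target = target;
        this.recordId = recordId;
    }
}

class MetaTarget {
    // Input:  target name -> (cluster label -> record ids held at that target).
    // Output: cluster label -> records from every target, tagged with their source.
    static Map<String, List<SourcedRecord>> fuse(
            Map<String, Map<String, List<String>>> clustersByTarget) {
        Map<String, List<SourcedRecord>> fused =
                new LinkedHashMap<String, List<SourcedRecord>>();
        for (Map.Entry<String, Map<String, List<String>>> target
                : clustersByTarget.entrySet()) {
            for (Map.Entry<String, List<String>> cluster
                    : target.getValue().entrySet()) {
                // Assumption: clusters with the same label describe the same thing.
                List<SourcedRecord> merged = fused.get(cluster.getKey());
                if (merged == null) {
                    merged = new ArrayList<SourcedRecord>();
                    fused.put(cluster.getKey(), merged);
                }
                for (String id : cluster.getValue()) {
                    merged.add(new SourcedRecord(target.getKey(), id));
                }
            }
        }
        return fused;
    }
}
```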
There are other interesting possibilities here. Explain databases for stable targets could be cached at the meta-target, reducing the overhead at the start of each z-association. It might be possible to take several origins which are simultaneously searching the same target and multiplex them (using concurrency and the Reference Id) so that they share the same outgoing z-association. Some databases might even be located on the meta-target leading to the possibility of a mixed architecture: a mixture of union and distributed catalogues.
The user also specifies the maximum number of records to retrieve from each target. Clearly the more records are requested, the longer the retrieval will take. A default value of 80 was chosen which works well for UK targets though this is often too high for US sites. It is possible to retrieve 1 record from each target which is useful to see whether or not an item is held there.
As described earlier Z39.50 targets often respond to terms in unexpected ways and it was clear from the evaluators' feedback that this was causing problems. In an effort to help users to specify their query correctly, some examples and hints were added to this search page.
The interface is based on the idea that the user sees more and more about less and less. The user traces a path through the interface, successively narrowing down the subset of records whilst looking at them in more detail. Ultimately, all paths through the interface end up at the Full View. Although there are many different views, most of the interface components remain in the same position throughout. Each view is essentially a list of choices, and at each stage the user can highlight the item(s) they want and press the [Look At These] button to see them in more detail. The interface is regular, orthogonal, and consequently easy to learn.
The Work Cluster View toolbar has the following functions:
The toolbar for Partial View has the following functions:
Unlike more line-based OPAC's, Full View can display more than one full record at a time. Full records are automatically displayed whenever the total number of records selected by the user is below a configurable threshold (the FullRecordThreshold), currently set to 4. If at any stage the user selects a cluster with fewer than 4 records in it, these will be displayed as full records.
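The threshold rule amounts to a single comparison, sketched below. The class and method names are hypothetical; the default of 4 and the "below the threshold" test are taken from the description above.

```java
class FullViewRule {
    // Configurable; 4 is the setting reported in the text.
    static int fullRecordThreshold = 4;

    // True when the selection is small enough to show as full records.
    static boolean showFullRecords(int selectedRecordCount) {
        return selectedRecordCount < fullRecordThreshold;
    }
}
```

With the default setting a selection of 3 records goes straight to full records, while a selection of 4 does not.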
The toolbar in Full View has only one option, [Find], which is a free-text search the same as in Partial View.
The second thread was the attempt to compare BOPAC against other OPACs by some objective criteria as detailed below, and the final thread was the user testing done in late 1997 and January 1998 which is analysed below. As noted before, because of problems with the operation of Z39.50 software on the Bradford Library system, this testing was not as extensive as hoped; nonetheless useful results were obtained. Although rigorously controlled testing could not be undertaken, the system was maintained in a stable state to allow comparison of responses throughout this period. Minimal logging information was also kept which allowed us to investigate some of the comments and problems raised in users' responses to us.
The worst case queries included the following works:
This very brief test on two targets indicates that BOPAC2's clustering algorithm works as well as or better than conventional collocation. A more extensive study conducted over a wider range of catalogues would be able to measure the effectiveness of these algorithms more comprehensively and might also be a useful tool for refining them.
The chart below summarises the distribution of different sizes of these 1500 retrievals. Many retrievals are very small but there is a significant proportion of larger retrievals over 65 records.
Then a feedback questionnaire was designed (based on the British Library OPAC 97 feedback questionnaire) and added as a web form. The questionnaire is shown in Appendix 2. Users were asked to fill the questionnaire in at the end of the session. Library users in local academic libraries in Leeds and Bradford Universities were encouraged to try the system. To this end, each university was given its own special entry point with targets tailored specifically to those university libraries. In Bradford University library a dedicated PC was set up for access to BOPAC2. The URL for BOPAC2 was announced on LIS-LINK, AUTOCAT and the Z39.50 Implementers Workshop e-mail lists and a link was created from the BOPAC2 Project home page and from the NISS list of OPAC's.