The evaluation of the results achieved in the ENRICH project Work Packages (WP),
namely WP 3 – WP 6, started in April 2009 and was finished in early October.
WP 5 was evaluated in May – June and WP 4 in September. In total, 205 respondents
were involved in the evaluation process by filling in the evaluation forms online.
Each respondent had to express their own opinion on the specified result by
assigning a score, an integer ranging from 0 (poor or not available) to 4
(excellent). The double average of the scores, taken over respondents and over
the questions evaluated, ensured a more stable estimator of quality. The
structure of the respondents' profile and their general opinions on the results
achieved during 24 months of project work are illustrated in Fig. 1 and Fig. 2.
The results of the evaluation were analyzed in many aspects and are presented in
this report, illustrated by Fig. 1–25 with corresponding comments.
First of all, the results were considered across the different user groups in
order to investigate how well their needs were satisfied, see Fig. 2–19.
Secondly, the average scores assigned by all target groups of respondents to the
separate questions on the results achieved in the WPs were recalculated as
estimates of Categories and Quality Criteria over the investigated WPs; the
results are shown in Fig. 20–21. Approximate 0,95 confidence intervals were
fitted to the estimators of quality considered in all aspects (WPs, Criteria,
Categories and opinions of target groups) in order to make statistically
reliable inferences. It is shown that, at the 0,95 confidence level, the created
Processes, Tools and Objects are the strongest properties achieved in the ENRICH
results. Similarly, the properties of Interoperability and Adaptability are the
best (at the 0,95 confidence level) in the ENRICH results when compared with
other quality Criteria such as Multilinguality or Usability. The numerical
values of the estimated quality aspects are shown in diagrams in two ways: as
average scores ranging from 0 to 4 (Fig. 1–19) and as percentages (Fig. 22–25)
with the lower and the upper confidence limits.
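The double average mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the score matrix is hypothetical (it is not the ENRICH data), and the name double_average is introduced here for the example.

```python
from statistics import mean

# Hypothetical scores (NOT the ENRICH data): one row per respondent,
# one column per question, each an integer from 0 (poor or not
# available) to 4 (excellent).
scores = [
    [4, 3, 3],
    [3, 3, 2],
    [4, 4, 3],
    [2, 3, 3],
]


def double_average(rows):
    """Average over the questions of each respondent, then over respondents."""
    per_respondent = [mean(row) for row in rows]
    return mean(per_respondent)


quality = double_average(scores)
print(f"estimated quality on the 0..4 scale: {quality:.2f}")
```

When every respondent answers the same number of questions, this double average coincides with the plain mean of all individual scores.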
I. The Results Across the Different Target User Groups
The total number of respondents involved in the evaluation of the ENRICH results and in the testing activities was 205. Four target user groups were considered: content providers and information managers (106), technical personnel and supporting staff (57), scholars, i.e. researchers in historical documents and students (20), and general or end users having general interests (22). The structure of the respondents corresponds well to the aims of the project, which is oriented more towards experts in the area than towards users of general interest. Regrettably, the activity of scholars during the whole testing period was rather low; it is more comparable to that of general users than to that of experts. But it certainly reflects the real structure of users, because the ratio of target groups was rather stable during the evaluation process, as seen from Fig. 1, cases (a) and (b).
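The shares of the four groups follow directly from the counts quoted above. The short sketch below recomputes them; the dictionary keys are shortened labels chosen for this example, and only the four counts come from the report.

```python
# Respondent counts per target group, as quoted in the report.
group_sizes = {
    "content providers, information managers": 106,
    "technical personnel, supporting staff": 57,
    "scholars, researchers, students": 20,
    "general users": 22,
}

total = sum(group_sizes.values())
shares = {group: 100 * n / total for group, n in group_sizes.items()}

for group, share in shares.items():
    print(f"{group}: {share:.1f}% of {total} respondents")

# Combined share of content providers and technical staff.
expert_share = (
    shares["content providers, information managers"]
    + shares["technical personnel, supporting staff"]
)
print(f"content providers and technical staff together: {expert_share:.1f}%")
```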

Fig.1. (a) The distribution of the 205 respondents over the four target user groups: content providers and information managers; technical personnel and supporting staff; scholars (researchers in historical documents, students); and general or end users having general interests. (b) The distribution of 116 respondents at the 18th month of project work; the proportions of the different users evaluating the results were almost stable.

Fig.2. Project results in WP3, WP4, WP5, WP5n, WP6 evaluated by target users groups, compared to the maximum possible score 4.
The summary results shown in Fig. 2 demonstrate rather similar opinions of
scholars, technical staff and content providers, while general users seem to be
more satisfied with the ENRICH project results than the experts. The results are
rather similar to those derived at month 18 of the ENRICH work (6 months before
the project end). In the last part of this report we will consider the
confidence limits for each group of users in order to draw a statistically
correct conclusion.
The data from the questionnaires filled in online by the project partners are
shown in Fig. 3. What conclusions about quality can be drawn from such data?
It would be difficult to judge from the raw data alone; a detailed statistical
analysis was therefore carried out and is described in Section IV of this
report, enabling statistically correct inferences.

Fig.3. The investigated properties (19 questions asked) concerning the quality in WP 3, WP 4, WP 5, WP 5 new and WP 6, evaluated by target user groups (the individual scores ranged from 0 to 4). The average values are marked numerically on the blue line showing the average.

Fig.4. The sum of scores of 19 questions answered by 205 respondents.
The sum of scores in Fig. 4 is highest for WP 3b (15,02) and WP 5c (14,76). The spread of opinions among the user groups is rather even for the WP 3, WP 4 and WP 5 questions but shows much larger variation in the WP 6 quality evaluation. The smallest sum is for WP 6d (9,28); it also contains the smallest question score, 1,33, assigned by scholars. The last evaluation, from WP 5 new-a to WP 5 new-f, was difficult for general users and was done only by the expert groups.

Fig.5. The averages of the numerical values, assigned to each of the three questions, reflecting the quality of WP 3 results, evaluated by different users groups. The average of WP 3 (located at the right of this diagram) shows the spread of opinions on quality in WP 3 by target users. The total average of WP 3 is equal to 3,32.
The questions in WP 3 were rather technical and certainly difficult to assess for general users and sometimes for scholars (researchers in the historical documentary heritage). Examining the results in Fig. 5 more thoroughly, we see that the four extreme values equal to 4 were assigned to WP3-a and WP3-b by scholars and general users, and they can significantly affect the average of WP 3. Let us exclude the extreme values and apply stratified sampling, evaluating the WP 3 results only by real experts: content providers and information managers (15 respondents in WP 3) and technical personnel and supporting staff (17 respondents in WP 3). Fortunately, they formed the majority of this sample (compared to only 2 scholars and 4 general users).

Fig.6. The averages of the numerical values assigned to each of the three questions reflecting the quality of WP 3 results, evaluated only by the expert user groups. The averages at the right of the diagram show the total averages assigned by content providers and by technical staff, and the total average of WP 3 given by experts, which is equal to 3,26.
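The effect of excluding the non-expert respondents can be checked with simple arithmetic. The sketch below assumes that both reported WP 3 averages (3,32 for all 38 respondents and 3,26 for the 32 experts) are respondent-weighted means; the report does not state this explicitly, so the implied value is only indicative.

```python
# Figures quoted in the report for WP 3.
n_all, avg_all = 38, 3.32          # 15 + 17 experts, 2 scholars, 4 general users
n_experts, avg_experts = 32, 3.26  # content providers + technical staff

# ASSUMPTION: both averages are respondent-weighted means, so the total
# score mass splits additively between experts and non-experts.
n_others = n_all - n_experts
avg_others = (n_all * avg_all - n_experts * avg_experts) / n_others

print(f"non-expert respondents: {n_others}")
print(f"implied non-expert average: {avg_others:.2f}")
```

Under this assumption the six excluded respondents scored WP 3 noticeably higher than the experts, which is consistent with the extreme values described above.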

Fig.7. The averages of the numerical values, assigned to each of the three questions, reflecting the quality of WP 4 results, evaluated by target users groups. The average calculated across the WP 4 questions shows the total evaluation of a quality in WP 4. The average of WP 4 in total is equal to 3,27.

Fig. 8. The averages of the numerical values, assigned to each of the three questions, reflecting the quality of WP 5 results, evaluated by target users groups. The average calculated across the WP 5 questions shows the total evaluation of a quality in WP 5. The average of WP 5 in total is equal to 3,32.

Fig. 9. The questions in the WP 5 new evaluation were so specific that users of general interest and scholars were not able to answer them; the number of target groups was therefore reduced to the two expert groups, content providers and technical staff (instead of the previous four groups).

Fig.10. The averages of the numerical values, assigned to each of the six questions, reflecting the quality of WP 5 new evaluation results, evaluated by the expert users groups.
Fig. 10 shows the results provided by 31 respondents, content providers and technical staff, who participated in both versions of the evaluation held in the second evaluation. The two groups have almost the same opinion, as the average at the right of the diagram shows. The new evaluation resulted in a total average value of 3,135.

Fig.11. The averages of the numerical values assigned by users to the facilities created in WP 5 during the first (I) and the second (II) evaluation. Users of general interest were very optimistic during the first evaluation (resulting in a score of 3,78 out of the maximum possible 4).

Fig.12. The averages of the numerical values assigned to each
of the four questions reflecting the quality of WP 6 results, evaluated by
target user groups (37 respondents).
The average values located at the right of the diagram in Fig. 12 show the
total evaluation of quality in WP 6 by the different users. The total average of
WP 6 is 2,66. This is the lowest average obtained from all evaluations
performed in the WPs.
II. The Scores in Work Packages Assigned by All Users

Fig.13. The average scores of 3 questions on a quality in WP 3 evaluated by full sample (38 respondents) of users. The four extreme values equal to 4, assigned by scholars and general users, located in WP3(a) and WP3(b).

Fig.14. The average scores of 3 questions on quality in WP 3 evaluated by the stratified sample including only the expert users. The total number of respondents was 32. The extreme values affecting the final average were excluded. In the experts' evaluation the total average of WP 3 is 3,26 (compare to the previous result, 3,32).

Fig.15. The average scores of 3 questions on a quality in WP 4 evaluated by full sample (50 respondents) of users.

Fig.16. The average scores of 3 questions on a quality in WP 5 evaluated during the first evaluation (49 respondents) by all users of WP 5. The general users once more assigned rather high scores to all questions.

Fig.17. The average scores of 6 questions on a quality in WP 5 during the second evaluation (31 respondents) done by the expert users of WP 5.

Fig.18. The average scores of 4 questions on quality in WP 6 as evaluated by all users (37 respondents). The scholars have the most widely spread opinions, in contrast to the general users, who have an almost uniformly good opinion.

Fig.19. The average total scores reflecting quality in WP 3, WP 4, WP 5, WP 5 new and WP 6, evaluated by the different users in the pooled sample of 205 respondents.
It seems that the quality of the results achieved in WP 3 and WP 5 was evaluated higher than that in WP 6; this is the integrated opinion of the 205 respondents involved in the final evaluation, illustrated in Fig. 19. This conclusion is confirmed later by the confidence intervals fitted to the estimates and shown in Fig. 22.
Let us compare the total averages given by the various users to WP 3, WP 4, WP 5, WP 5 new and WP 6. Those quantities are 3,32 (or 3,26 when evaluated only by experts), 3,27, 3,32, 3,135 and 2,66, respectively. Are those estimates significantly different? This problem of statistical inference, for example testing the hypothesis that the WP 3 results are better than those of WP 6, will be addressed in Section IV of this report and demonstrated in Fig. 22.
III. The Scores in Categories and Main Criteria
Now let us derive the average scores across the four categories (digital objects, tools developed in the ENRICH project, processing, and the repository as a whole) and the five Main Criteria of quality from all the data collected from the 205 respondents who participated in the testing and evaluating activities and gave their opinions during several evaluation sessions. The evaluation of the results achieved in the ENRICH project Work Packages, namely WP 3 – WP 6, started in April 2009 and was finished in November 2009. WP 5 was evaluated in May – June and again in October – November 2009, WP 4 in September 2009, and the WP 5 new evaluation was finished by the end of November 2009.
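The recalculation of question scores into scores of Criteria can be sketched as below. Both the per-question averages and the question-to-criterion mapping are hypothetical placeholders: the real relationships are defined in the Methodology D-7.1 and are not reproduced in this report, and plain averaging over the mapped questions is an assumption made for the illustration.

```python
from collections import defaultdict
from statistics import mean

# HYPOTHETICAL average scores per question on the 0..4 scale.
question_scores = {"WP3-a": 3.4, "WP3-b": 3.5, "WP6-a": 2.8, "WP6-d": 2.3}

# HYPOTHETICAL mapping of questions to the Criteria they reflect;
# the real relationship matrix is defined in D-7.1.
question_criteria = {
    "WP3-a": ["Interoperability"],
    "WP3-b": ["Interoperability", "Adaptability"],
    "WP6-a": ["Multilinguality", "Usability"],
    "WP6-d": ["Multilinguality"],
}


def criterion_scores(scores, mapping):
    """Average the question scores attached to each criterion."""
    grouped = defaultdict(list)
    for question, criteria in mapping.items():
        for criterion in criteria:
            grouped[criterion].append(scores[question])
    return {criterion: mean(values) for criterion, values in grouped.items()}


by_criterion = criterion_scores(question_scores, question_criteria)
for criterion, score in sorted(by_criterion.items()):
    print(f"{criterion}: {score:.2f}")
```

The same function can be applied to a question-to-category mapping to obtain the scores of the four categories.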

Fig.20. The average scores of the four categories as the components of a quality in ENRICH project results achieved in WP 3, WP 4, WP 5, WP 5n, and WP 6 extracted from a pooled sample of 205 respondents.
The maximum possible score for the results shown in Fig. 20 is 4. Visually, the weakest point is the category Repository. Is this true? It will be checked in Section IV of this report by constructing the confidence intervals for each category.

Fig.21. The average scores of the five Main Criteria reflecting the quality of ENRICH project results, achieved in WP 3, WP 4, WP 5, WP 5n, and WP 6 extracted from a total sample of 205 respondents obtained during several evaluation sessions.
Interoperability and Adaptability in Fig. 21 seem to be rated better than Usability, Multilinguality or Security. Are those differences significant? The question of a correct comparison of the available evaluation results in WPs, Categories and Criteria will be answered in the following section.
IV. Statistical Inference on Derived Results – Confidence Limits of Estimators
In order to test correctly a statistical hypothesis, say that the WP 3
results are better than those of WP 6, let us fix the standard significance
level 0,05, corresponding to the 0,95 confidence level (and to the critical
value 1,96 in the normal approximation of the statistic used). Then the
following 0,95 confidence interval (N. Kligiene, 2009) is fitted:
$p^* \pm 1{,}96\,\left[\,p^*(1 - p^*) / (4nk)\,\right]^{1/2}$.
Here $p^*$ is the average quality estimator divided by 4 (its maximum possible
value), $n$ is the number of respondents, and $k$ is the number of questions
used for the evaluation. Applying this formula we obtain the confidence
intervals for the WPs investigated, summarized in the following
table.
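As an illustration of how the interval above is evaluated, the sketch below applies the formula to the WP 3 figures quoted in this report (average 3,32, 38 respondents, 3 questions). The function name confidence_interval and its parameters are introduced here for the example; the result is a recomputation from rounded averages and may differ slightly from the values in the report's own table.

```python
from math import sqrt

Z_95 = 1.96  # critical value of the normal approximation at the 0.95 level


def confidence_interval(avg_score, n, k, max_score=4, z=Z_95):
    """Return (lower, upper) limits of p* +/- z * sqrt(p*(1 - p*) / (4nk)).

    avg_score is the average on the 0..max_score scale, n the number of
    respondents and k the number of questions; limits are proportions.
    """
    p = avg_score / max_score
    half_width = z * sqrt(p * (1 - p) / (max_score * n * k))
    return p - half_width, p + half_width


# WP 3, full sample, as quoted in the report.
low, high = confidence_interval(3.32, n=38, k=3)
print(f"WP 3: p* = {3.32 / 4:.3f}, 0.95 interval = ({low:.3f}, {high:.3f})")
```

Multiplying the limits by 100 gives the percentages used in the diagrams with confidence limits.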
Because the confidence intervals for the WP 3 results (in both cases) and for WP 5 overlap, we can conclude that there is no significant difference among them. A different conclusion follows when considering the difference in quality between WP 3 (or WP 5) and WP 6: it is significant, and with probability 0,95 we can confirm that the results in WP 3 are better than those in WP 6. The same conclusion is evident from Fig. 18.

The conclusion that follows from this table, and that is evident from Fig. 22, where the lower and upper 0,95 confidence limits are displayed, is the following. Because the confidence intervals for the WP 3 results (in both cases: all respondents and experts only), WP 4 and WP 5 overlap, we can conclude that there is no significant difference among them. The WP 5n results are similar to those mentioned above. A different conclusion follows when considering the difference in quality between WP 6 and the other WPs: it is significant, and with probability 0,95 we can confirm that the results in WP 3 (or WP 4, WP 5) are rated better than those in WP 6. The approximate upper and lower confidence limits (CL) are derived for each case and displayed in Fig. 22. The results are related to those in Fig. 19 and show that the only significant difference, a lower quality, is found in WP 6. The other quality results are comparable to each other at the 0,95 confidence level.
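The overlap argument can be checked numerically. The sketch below recomputes the intervals from the averages and sample sizes quoted in the figure captions of this report and lists the pairs whose intervals do not overlap. It is an approximate recomputation from rounded values, not the report's original calculation, and the helper names ci and overlaps are introduced only for this example.

```python
from itertools import combinations
from math import sqrt


def ci(avg, n, k):
    """0.95 interval p* +/- 1.96 * sqrt(p*(1 - p*) / (4nk)) as proportions."""
    p = avg / 4
    half = 1.96 * sqrt(p * (1 - p) / (4 * n * k))
    return p - half, p + half


# (average score, respondents, questions) as quoted in the report.
work_packages = {
    "WP 3": (3.32, 38, 3),
    "WP 3 (experts)": (3.26, 32, 3),
    "WP 4": (3.27, 50, 3),
    "WP 5": (3.32, 49, 3),
    "WP 5n": (3.135, 31, 6),
    "WP 6": (2.66, 37, 4),
}
intervals = {name: ci(*args) for name, args in work_packages.items()}


def overlaps(a, b):
    """True when the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


separated = [
    (x, y)
    for x, y in combinations(intervals, 2)
    if not overlaps(intervals[x], intervals[y])
]
for name, (low, high) in intervals.items():
    print(f"{name}: {100 * low:.1f}% .. {100 * high:.1f}%")
print("non-overlapping pairs:", separated)
```

With these inputs every non-overlapping pair involves WP 6, in agreement with the conclusion stated above.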

Fig.22. The average evaluation of a quality in the ENRICH project results, achieved in WP 6, WP 5, WP 5n, WP 4, and WP 3 evaluated by all users (by experts only in WP 3 and WP 5n).

Fig.23. The percentages of the four categories as the components of a quality in the ENRICH project results, achieved in WP 3, WP 4, WP 5, WP 5n and WP 6 as they were evaluated by 205 respondents. The approximate upper and lower confidence limits (CL) are derived for each category considered.
The results displayed in Fig. 23 show that we can conclude with 0,95 confidence that the developed Processing, Tools and Objects are rated equally well among the ENRICH results, especially when compared with the Repository as a whole. The evaluation of the Repository is also rather good. In general, all these aspects were evaluated well: the points assigned to those categories (out of the maximum possible 100 points) are all good. According to the Methodology for Evaluation, described in D 7.1, a score falling in the interval 0 – 25 means that the result is low, 26 – 75 that it is rather good (satisfactory), and 76 – 100 that it is very good. Therefore the Tools, Objects and Processing received a very good evaluation from the ENRICH partners and other related institutions.
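The conversion of an average score into points and then into a verbal rating can be sketched as follows. The band limits are those quoted from D 7.1; how values falling between the bands (for example 25,5) are treated is not stated in the report, so the use of "<=" at the upper limits is an assumption, as are the function names.

```python
def to_percent(score, max_score=4):
    """Convert an average score on the 0..max_score scale to 0..100 points."""
    return 100 * score / max_score


def rating(percent):
    """Verbal rating by the D 7.1 bands: 0-25, 26-75, 76-100.

    ASSUMPTION: fractional values between the bands go to the lower band.
    """
    if percent <= 25:
        return "low"
    if percent <= 75:
        return "rather good (satisfactory)"
    return "very good"


# Total averages of WP 3 and WP 6 quoted in the report.
for name, score in [("WP 3", 3.32), ("WP 6", 2.66)]:
    points = to_percent(score)
    print(f"{name}: {points:.1f} points, {rating(points)}")
```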

Fig.24. The percentages of the five Main Criteria reflecting the quality of the ENRICH project results achieved in WP 3, WP 4, WP 5, WP 5n and WP 6, as evaluated by 205 respondents during several evaluation sessions.
The approximate upper and lower confidence limits (CL) are derived for each Criterion. Interoperability and Adaptability received very high scores, and those differ from the scores of the Multilinguality or Usability properties at the 0,95 confidence level. The results displayed in Fig. 24 show that we can conclude, at the 0,95 confidence level, that Interoperability and Adaptability are the best ENRICH results when compared with other Criteria such as Multilinguality, Security or Usability. In general, all the Criteria involved received very high estimates; even the lowest result, 69,75, assigned to Multilinguality (out of the maximum possible 100 points) is rather good. Interoperability, Adaptability and Usability received a very good evaluation, or the excellent mark, from the ENRICH partners and users from other related institutions.

Fig.25. The percentages of satisfaction of target groups on a quality of the ENRICH project results, achieved in WP3, WP 4, WP 5, WP5n and WP 6, as they were evaluated by 205 respondents during several evaluation sessions.
The approximate upper and lower confidence limits derived for each group of
users show that there is no significant difference among the opinions of the
expert users: almost all intervals overlap. A real difference can be
stated only between the opinions of the expert groups and the general users.
The general users are more satisfied with the ENRICH project results than the
experts (content providers, technical staff and scholars). The tendency of the
general users' estimates to be higher than those of the other user groups was
noticed in all aspects of the investigation performed, but the real reason for
this effect is unclear; one can only guess (and this lies outside the scope of
our Evaluation Report). We can only confirm that, at the first stages of the
evaluation process, it was not possible to draw similar conclusions on the
target groups of users; this became possible only when sufficient sample sizes
in the separate groups were attained at the final stage of the evaluation.
Finally, we have a sample of 205 respondents containing: content
providers and information managers (106), technical personnel and supporting
staff (57), scholars, i.e. researchers in historical documents and students
(20), and general or end users having general interests (22). This enables us
to make the statistically reliable inference that general users are more
satisfied with the quality of the digital repository than the more qualified
expert users.
V. Conclusions and Comments on the Obtained Results
Interoperability and Adaptability received very high scores, and those are significantly different from the scores of the Multilinguality, Usability and Security properties (Fig. 24 and Table II below).
Processes, Objects and Tools were evaluated better than the Repository as a whole (Fig. 23 and Table III).
The quality of the WP 6 results concerning multilingual access was evaluated as significantly lower than that of the other work packages: WP 3, WP 4, WP 5, WP 5n (Fig. 22 and Table I).
All results are evaluated rather highly: the most optimistic were the general users, the experts less so (Fig. 6). Statistically, their opinions differ with probability 0,95 (Fig. 25 and Table IV).
Table I. Work Packages. Estimated Scores and Their Percentages of the Maximum Possible

Therefore we do not focus on those estimates only. The main idea is to extract from those estimates the information related to the Main Quality Criteria and Categories. The relationships between the WP 3 – WP 6 items and the Quality Criteria were established in advance, in the Methodology D-7.1, and the estimates of Criteria and Categories were derived from the pooled sample from several evaluation actions performed in the framework of WP 7. The recalculated estimates, displayed in Fig. 23, show that we can conclude, at the 0,95 confidence level, that the developed Processes, Tools and Objects are rated equally well among the ENRICH results, especially when compared with the facilities in the Repository as a whole. But the evaluation of the Repository is also rather good. In general, all these aspects were evaluated well: the points assigned to those categories (out of the maximum possible 100 points, if we use percentages) are all good.
Similarly, the results displayed in Fig. 24 show that we can conclude, at the 0,95 confidence level, that Interoperability and Adaptability are the best ENRICH results when compared with other Criteria such as Multilinguality, Security or Usability. In general, all the Criteria involved received very high estimates; even the lowest result, 69,75, assigned to Multilinguality (out of the maximum possible 100 points) is rather good. Interoperability, Adaptability and Usability received a very good evaluation, or the excellent mark, from the ENRICH partners and users from other related institutions. Let us remember that, according to the Methodology for Evaluation (D-7.1), a score falling in the interval 0 – 25 means that the result is low, 26 – 75 that it is rather good (satisfactory), and 76 – 100 that it is very good.
Considering the satisfaction of the target user groups, the approximate upper and lower confidence limits derived for each group of users show that there is no significant difference among the opinions of the expert users: almost all intervals in Fig. 25 overlap. A real difference can be stated only between the opinions of the expert groups and the general users. The general users are more satisfied with the ENRICH project results than the experts (content providers, technical staff and scholars).
The results of the quality evaluation, performed on the basis of the total sample of 205 respondents expressing their opinion on the achievements of the ENRICH project, are summarized in the following three tables.
Table II. Quality Criteria – Estimated Scores and Their Percentages of the Maximum Possible Values

Table III. The Categories – Estimated Scores and Their Percentages of the Maximum Possible Values

Table IV. Satisfaction of Target Groups’ Users – Estimated Scores and Their Percentages of the Maximum Possible Values

The evaluation process was composed of several actions performed from April to December 2009, and the evaluation results kept changing as more data were obtained. The dynamics of those changes were recorded in the Progress Reports produced every half year. The last evaluation data show a real stabilization of the estimates concerning the Criteria or user satisfaction; they added only minor changes to the results obtained in the previous evaluations. This means that we have consistent estimates of quality in all aspects.
An original methodology for the evaluation of quality in a digital repository was developed, making it possible to obtain a universal estimator of quality that does not depend on the number of criteria used for evaluation or on the individual opinions of evaluators. The proposed structural model enables multifaceted aspects to be evaluated, relationships between Categories and Criteria to be established, and a matrix of those relationships across the set of Criteria / Categories to be obtained. The approximate confidence intervals of the proposed estimators of quality, fitted to each item under investigation, make it possible to draw reliable statistical inferences on the differences between estimated values, or on their comparability to each other, at a fixed confidence level, usually with probability 0,95.

References
[1] Manuscriptorium Digital Library, http://www.manuscriptorium.eu (www.manuscriptorium.com) [accessed on 5 December 2009].
[2] ENRICH - European Networking Resources and Information concerning Cultural Heritage project (2007 - 2009), http://enrich.manuscriptorium.com/ [accessed on 5 December 2009].
[3] MINERVA Technical Guidelines document, http://www.minervaeurope.org/publications/technicalguidelines.htm [accessed on 5 December 2009].
[4] Brooke, John (1986) SUS - A quick and dirty usability scale, http://www.usabilitynet.org/trump/documents/Suschapt.doc [accessed on 5 December 2009].
[5] UsabilityNet: Usability Resources for practitioners and managers, http://www.usabilitynet.org/home.htm; http://www.usabilitynet.org/tools/methods.htm [accessed on 5 December 2009].
[6] Kligiene, Nerute (2009) E-Accessibility Marking a Quality of Digital Repository, Proceedings of 2nd International Multi-Conference on Society, Cybernetics and Informatics, v. II, July 10-13, 2009, Orlando, Florida, USA, pp. 167-172.
[7] Kligiene, Nerute (2009) Structural Model for Digital Repository Quality Evaluation in Context of Usage, Proceedings of eChallenges 2009 Conference, 21-23 October, Istanbul.
[8] Quality principles for Cultural Websites: a Handbook, 2005, Minerva Project, http://www.minervaeurope.org/userneeds/qualityprinciples.htm [accessed on 5 December 2009].
[9] Testing of e-Applications Developed in ENRICH: Usability, Evaluation of Migration Tool, Personalized Translation, Personalization for Contributors and Users, www.musicalia.lt/sus/; www.musicalia.lt/eta/ [accessed on 5 December 2009].
All project partners have to participate in the evaluation process. Thank you for your efforts in evaluating the first results derived in the WPs until December 2008.