Results of Evaluation and Testing in WP 3 – WP 6 and Quality Criteria

 

The evaluation of the results achieved in the ENRICH project Work Packages (WP), namely WP 3 – WP 6, started in April 2009 and finished in early October. WP 5 was evaluated in May – June and WP 4 in September. In total, 205 respondents took part in the evaluation by filling in the evaluation forms online. Each respondent gave an opinion on a specified result by assigning a score: an integer ranging from 0 (poor or not available) to 4 (excellent). Averaging twice, over respondents and over the questions evaluated, gives a more stable estimator of quality. The structure of the respondents' profile and their general opinions on the results achieved during 24 months of project work are illustrated in Fig. 1 and Fig. 2. The evaluation results were analysed from several aspects and are presented in this report, illustrated by Fig. 1–25 with corresponding comments.
First, the results were examined across the different user groups to investigate how well their needs were satisfied (see Fig. 2–19). Second, the average scores assigned by all target groups to the separate questions on the results achieved in the WPs were recalculated as estimates of the Categories and Quality Criteria over the investigated WPs; these results are shown in Fig. 20 – 21. Approximate 0,95 confidence intervals were fitted to the quality estimators in all aspects considered (WPs, Criteria, Categories and the opinions of target groups) in order to make statistically reliable inferences. At the 0,95 confidence level, the created Processes, Tools and Objects are the strongest properties of the ENRICH results. Similarly, Interoperability and Adaptability are rated best (at the 0,95 confidence level) when compared with other quality Criteria such as Multilinguality or Usability. The numerical values of the estimated quality aspects are shown in the diagrams in two ways: as average scores ranging from 0 to 4 (Fig. 1–19) and as percentages (Fig. 22 – 25) with the lower and upper confidence limits.
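The double average mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name double_average and the small score matrix are hypothetical and are not taken from the ENRICH questionnaires.

```python
from statistics import mean


def double_average(scores_by_respondent):
    """Average each question over respondents, then average the question means.

    scores_by_respondent: list of rows, one row per respondent,
    each row holding one integer score (0..4) per question.
    """
    n_questions = len(scores_by_respondent[0])
    question_means = [
        mean(row[q] for row in scores_by_respondent)
        for q in range(n_questions)
    ]
    return mean(question_means)


# Hypothetical answers (NOT project data): 4 respondents x 3 questions.
example_scores = [
    [4, 3, 3],
    [3, 3, 2],
    [4, 4, 3],
    [3, 2, 2],
]

wp_score = double_average(example_scores)   # score on the 0..4 scale
wp_percent = 100 * wp_score / 4             # same score as a percentage
```

When every respondent answers every question, the double average equals the plain mean of all scores; the two-step form simply mirrors the way the per-question averages are shown in the diagrams.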

 

I. The Results Across the Different Target User Groups

In total, 205 respondents were involved in the evaluation of the ENRICH results and in the testing activities. Four target user groups were considered: content providers – information managers (106), technical personnel – supporting staff (57), scholars – researchers in historical documents, students (20), and general or end users with general interests (22). This structure corresponds well to the aims of the project, which is oriented more towards experts in the area than towards users with a general interest. Regrettably, the activity of scholars during the whole testing period was rather low, closer to that of general users than to that of the experts. It nevertheless appears to reflect the real structure of users, because the ratio of the target groups remained rather stable during the evaluation process, as seen in Fig. 1, cases (a) and (b).


Fig. 1. (a) The distribution of the 205 respondents over the four target user groups: content providers – information managers, technical personnel – supporting staff, scholars – researchers in historical documents, students, and general or end users with general interests. (b) The distribution of 116 respondents at the 18th month of project work; the proportions of the different user groups evaluating the results remained almost stable.

 

Fig. 2. Project results in WP 3, WP 4, WP 5, WP 5n and WP 6 evaluated by the target user groups, compared to the maximum possible score of 4.

 

The summary results in Fig. 2 show rather similar opinions among scholars, technical staff and content providers, while general users seem to be more satisfied with the ENRICH project results than the experts. The results are close to those obtained at month 18 of the ENRICH work (6 months before the project end). In the last part of this report we consider the confidence limits for each group of users in order to draw a statistically correct conclusion.
The data from the questionnaires filled in online by the project partners are shown in Fig. 3. What can be concluded about quality from such data? It would be difficult to judge from the raw data alone, so a detailed statistical analysis was carried out; it is described in Section IV of this report and makes statistically correct inferences possible.

Fig. 3. The investigated properties (19 questions asked) concerning quality in WP 3, WP 4, WP 5, WP 5 new and WP 6, evaluated by the target user groups (individual scores ranged from 0 to 4). The average values are marked numerically on the blue line showing the average.

 

Fig.4. The sum of scores of 19 questions answered by 205 respondents.

The sums of scores in Fig. 4 are highest for WP 3b (15,02) and WP 5c (14,76). The spread of opinions among the user groups is fairly even for the WP 3, WP 4 and WP 5 questions but varies much more in the WP 6 quality evaluation. The smallest sum is for WP 6d (9,28), which also contains the smallest single question score, 1,33, assigned by scholars. The last evaluation, from WP 5 new-a to WP 5 new-f, was difficult for general users and was done only by the expert groups.

Fig.5. The averages of the numerical values, assigned to each of the three questions, reflecting the quality of WP 3 results, evaluated by different users groups. The average of WP 3 (located at the right of this diagram) shows the spread of opinions on quality in WP 3 by target users. The total average of WP 3 is equal to 3,32.

The questions in WP 3 were rather technical and certainly difficult for general users, and sometimes for scholars (researchers in the historical documentary heritage). A closer look at the results in Fig. 5 shows that the four extreme values equal to 4 were assigned at WP3-a and WP3-b by scholars and general users, and they can significantly affect the average of WP 3. Let us exclude the extreme values and apply stratified sampling, evaluating the WP 3 results only by real experts: content providers and information managers (15 respondents in WP 3) and technical personnel and supporting staff (17 respondents in WP 3). Fortunately, they formed the majority of this sample (compared to only 2 scholars and 4 general users).
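Restricting the average to the expert strata, as described above, might look like the following sketch. The records are hypothetical placeholders chosen only to show the mechanics; the group labels and the helper average_for_groups are assumptions, not part of the project's tooling.

```python
from statistics import mean

# Group labels assumed for this illustration.
EXPERT_GROUPS = {"content providers", "technical staff"}
ALL_GROUPS = EXPERT_GROUPS | {"scholars", "general users"}


def average_for_groups(records, groups):
    """Return (average score, sample size) for respondents in the given groups."""
    selected = [score for group, score in records if group in groups]
    return mean(selected), len(selected)


# Hypothetical (group, score) records, NOT the ENRICH responses.
records = [
    ("content providers", 3),
    ("technical staff", 3),
    ("scholars", 4),
    ("general users", 4),
    ("content providers", 4),
    ("technical staff", 3),
]

full_average, full_n = average_for_groups(records, ALL_GROUPS)
expert_average, expert_n = average_for_groups(records, EXPERT_GROUPS)
```

In this toy sample the two extreme scores of 4 from non-experts pull the full average above the expert-only average, which is the same kind of effect the report describes for WP 3.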

Fig. 6. The averages of the numerical values assigned to each of the three questions reflecting the quality of WP 3 results, evaluated only by the expert user groups. The averages at the right of the diagram show the totals assigned by content providers and by technical staff, and the total average of WP 3 given by experts, which is equal to 3,26.

 

Fig.7. The averages of the numerical values, assigned to each of the three questions, reflecting the quality of WP 4 results, evaluated by target users groups. The average calculated across the WP 4 questions shows the total evaluation of a quality in WP 4. The average of WP 4 in total is equal to 3,27.

 

Fig. 8. The averages of the numerical values, assigned to each of the three questions, reflecting the quality of WP 5 results, evaluated by target users groups. The average calculated across the WP 5 questions shows the total evaluation of a quality in WP 5. The average of WP 5 in total is equal to 3,32.

 

Fig. 9. The questions in the WP 5 new evaluation were so specific that users with a general interest and scholars were not able to answer them, so the number of target groups fell from the previous four to the two expert groups: content providers and technical staff.

 

Fig.10. The averages of the numerical values, assigned to each of the six questions, reflecting the quality of WP 5 new evaluation results, evaluated by the expert users groups.

Fig. 10 shows the results provided by 31 respondents, content providers and technical staff, participating in both versions of the evaluation held during the second evaluation round. The two groups hold almost the same opinion, as the averages at the right of the diagram show. The new evaluation resulted in a total average of 3,135.

Fig. 11. The averages of the numerical values assigned by users to the facilities created in WP 5 during the first (I) and the second (II) evaluation. Users with a general interest were very optimistic during the first evaluation, giving a score of 3,78 out of the maximum possible 4.

Fig.12. The averages of the numerical values, assigned to each of the four questions, reflecting the quality of WP 6 results, evaluated by target users groups, 37 respondents.

The average values at the right of the diagram in Fig. 12 show the total evaluation of quality in WP 6 by the different user groups. The total average of WP 6 is 2,66, the lowest average obtained in all the evaluations performed in the WPs.


II. The Scores in Work Packages Assigned by All Users

Fig. 13. The average scores of 3 questions on quality in WP 3 evaluated by the full sample of users (38 respondents). The four extreme values equal to 4, assigned by scholars and general users, are located in WP3(a) and WP3(b).

 

Fig. 14. The average scores of 3 questions on quality in WP 3 evaluated by the stratified sample including only the expert users. The total number of respondents was 32. The extreme values affecting the final average were excluded. According to the experts' evaluation, the total average of WP 3 is 3,26 (compared to the previous result of 3,32).

 

Fig.15. The average scores of 3 questions on a quality in WP 4 evaluated by full sample (50 respondents) of users.

 

Fig.16. The average scores of 3 questions on a quality in WP 5 evaluated during the first evaluation (49 respondents) by all users of WP 5. The general users once more assigned rather high scores to all questions.

 

Fig.17. The average scores of 6 questions on a quality in WP 5 during the second evaluation (31 respondents) done by the expert users of WP 5.

 

Fig. 18. The average scores of 4 questions on quality in WP 6 as evaluated by all users (37 respondents). The scholars' opinions are the most widely spread, in contrast to general users, whose opinions are almost uniformly good.

 

Fig. 19. The average total scores reflecting quality in WP 3, WP 4, WP 5, WP 5 new and WP 6, evaluated by the different user groups in the pooled sample of 205 respondents.

It seems that the quality of the results achieved in WP 3 and WP 5 was evaluated higher than in WP 6; this is the integrated opinion of the 205 respondents involved in the final evaluation, illustrated in Fig. 19. This conclusion is confirmed later by the confidence intervals fitted to the estimates and shown in Fig. 22.

Let us compare the total averages given by the various users to WP 3, WP 4, WP 5, WP 5 new and WP 6. These are 3,32 (or 3,26 when evaluated only by experts), 3,27, 3,32, 3,135 and 2,66, respectively. Are these estimates significantly different? This problem of statistical inference, for example testing the hypothesis that the WP 3 results are better than those of WP 6, is addressed in Section IV of this report and illustrated in Fig. 22.


III. The Scores in Categories and Main Criteria

Now let us derive the average scores across the four categories (digital objects, tools developed in the ENRICH project, processing, and the repository as a whole) and the five Main Criteria of quality from all the data collected from the 205 respondents, who took part in the testing and evaluation activities and gave their opinions during several evaluation sessions. The evaluation of the results achieved in the ENRICH project Work Packages, namely WP 3 – WP 6, started in April 2009 and finished in November 2009. WP 5 was evaluated in May – June and again in October – November 2009, WP 4 in September 2009, and the WP 5 new evaluation finished at the end of November 2009.

Fig.20. The average scores of the four categories as the components of a quality in ENRICH project results achieved in WP 3, WP 4, WP 5, WP 5n, and WP 6 extracted from a pooled sample of 205 respondents.

The maximum possible score for the results shown in Fig. 20 is 4. Visually, the weakest point is the category Repository. Is this true? It is checked in Section IV of this report by constructing confidence intervals for each category.

 

Fig.21. The average scores of the five Main Criteria reflecting the quality of ENRICH project results, achieved in WP 3, WP 4, WP 5, WP 5n, and WP 6 extracted from a total sample of 205 respondents obtained during several evaluation sessions.

Interoperability and Adaptability in Fig. 21 seem to be rated better than Usability, Multilinguality or Security. Are these differences significant? The question of how to compare correctly the available evaluation results for the WPs, Categories and Criteria is answered in the following section.


IV. Statistical Inference on Derived Results – Confidence Limits of Estimators

In order to test correctly a statistical hypothesis, say that the WP 3 results are better than those of WP 6, let us fix the standard significance level 0,05, corresponding to the 0,95 confidence level (and the critical value 1,96 in the normal approximation of the statistics used). Then the following 0,95 confidence interval (N. Kligiene, 2009) is fitted: p* ± 1,96 [p* (1 – p*) / (4nk)]^(1/2). Here p* is the average quality estimator divided by 4, its maximum possible value, n is the number of respondents and k is the number of questions used for the evaluation. Applying this formula gives the confidence intervals for the WPs investigated, summarized in the following table.
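As a numerical illustration of the interval formula, the following sketch evaluates it for the WP 3 figures quoted in this report (average 3,32, 38 respondents, 3 questions). The function name is hypothetical, and reading the interval as symmetric around p* is an assumption; the limits it produces are not guaranteed to match the rounding used in the report's own table.

```python
import math

MAX_SCORE = 4  # maximum possible score on the evaluation scale


def quality_confidence_interval(avg_score, n, k, z=1.96):
    """Approximate confidence interval for the normalised quality estimator p*.

    avg_score: double average on the 0..4 scale
    n: number of respondents, k: number of questions
    z: critical value of the normal approximation (1.96 for the 0.95 level)
    """
    p = avg_score / MAX_SCORE
    # Denominator 4nk is taken literally from the formula in the text.
    half_width = z * math.sqrt(p * (1 - p) / (4 * n * k))
    return p - half_width, p + half_width


# WP 3, all respondents: values quoted in this report.
wp3_low, wp3_high = quality_confidence_interval(3.32, n=38, k=3)
```

Multiplying both limits by 100 gives them on the percentage scale used in the later diagrams.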
 

 

Because the confidence intervals for the WP 3 results (in both cases) and WP 5 overlap, we can conclude that there is no significant difference between them. A different conclusion follows for the difference in quality between WP 3 (or WP 5) and WP 6: it is significant, and with probability 0,95 we can confirm that the results in WP 3 are better than those in WP 6. The same conclusion can be seen in Fig. 18.

The conclusion that follows from this table, and that is evident from Fig. 22, where the lower and upper 0,95 confidence limits are displayed, is the following. Because the confidence intervals for the WP 3 results (in both cases: all respondents and experts only), WP 4 and WP 5 overlap, we can conclude that there is no significant difference between them. The WP 5n results are similar to these. A different conclusion follows for the difference in quality between WP 6 and the other WPs: it is significant, and with probability 0,95 we can confirm that the results in WP 3 (or WP 4, WP 5) are rated better than those in WP 6. The approximate upper and lower confidence limits (CL) are derived for each case and displayed in Fig. 22. The results are related to those in Fig. 19 and show clearly that the only significant difference is the lower quality in WP 6. The other quality results are comparable to each other at the 0,95 confidence level.
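The overlap comparison described above can be reproduced roughly as follows. The averages, respondent counts and question counts are those quoted in the text and figure captions of this report; the helper names are hypothetical, and treating non-overlapping intervals as a significant difference follows the report's own reasoning rather than a formal two-sample test.

```python
import math


def confidence_interval(avg_score, n, k, z=1.96):
    """Interval p* +/- z * sqrt(p*(1 - p*) / (4nk)), with p* = avg_score / 4."""
    p = avg_score / 4
    half_width = z * math.sqrt(p * (1 - p) / (4 * n * k))
    return (p - half_width, p + half_width)


def overlap(a, b):
    """True when two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (average score, respondents n, questions k) as quoted in this report.
work_packages = {
    "WP 3": (3.32, 38, 3),
    "WP 3 (experts)": (3.26, 32, 3),
    "WP 4": (3.27, 50, 3),
    "WP 5": (3.32, 49, 3),
    "WP 5n": (3.135, 31, 6),
    "WP 6": (2.66, 37, 4),
}

intervals = {name: confidence_interval(*v) for name, v in work_packages.items()}

# Work packages whose interval does not overlap the WP 6 interval.
separated_from_wp6 = [
    name
    for name, ci in intervals.items()
    if name != "WP 6" and not overlap(ci, intervals["WP 6"])
]
```

Under these inputs the WP 6 interval lies entirely below every other interval, while the remaining intervals overlap one another, which agrees with the conclusion stated in the text.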

Fig.22. The average evaluation of a quality in the ENRICH project results, achieved in WP 6, WP 5, WP 5n, WP 4, and WP 3 evaluated by all users (by experts only in WP 3 and WP 5n).

 

Fig.23. The percentages of the four categories as the components of a quality in the ENRICH project results, achieved in WP 3, WP 4, WP 5, WP 5n and WP 6 as they were evaluated by 205 respondents. The approximate upper and lower confidence limits (CL) are derived for each category considered.

The results displayed in Fig. 23 show that we can conclude with 0,95 confidence that the developed Processing, Tools and Objects are rated equally well among the ENRICH results, especially when compared with the Repository as a whole. The Repository's evaluation is also rather good. In general, all these aspects were evaluated well: the points assigned to the categories (out of a maximum of 100) are all good. According to the Methodology for Evaluation described in D 7.1, a score in the interval 0 – 25 means that the result is low, 26 – 75 that it is rather good (satisfactory), and 76 – 100 that it is very good. Therefore the Tools, Objects and Processing received a very good evaluation from the ENRICH partners and other related institutions.
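The conversion from the 0 to 4 scale to percentage points and the three verbal bands quoted from D 7.1 can be sketched as below. The function names are hypothetical, and the treatment of fractional values that fall between the integer bands (for example 25,5 or 75,5) is an assumption, since the methodology as quoted only lists integer ranges.

```python
def to_percent(avg_score, max_score=4):
    """Express an average score on the 0..max_score scale as percentage points."""
    return 100.0 * avg_score / max_score


def quality_band(percent):
    """Map a percentage to the verbal bands quoted from D 7.1.

    Assumption: values above 25 count as 'rather good (satisfactory)'
    and values above 75 count as 'very good'.
    """
    if percent <= 25:
        return "low"
    if percent <= 75:
        return "rather good (satisfactory)"
    return "very good"


# Values quoted in this report: WP 3 and WP 6 averages, Multilinguality percentage.
wp3_band = quality_band(to_percent(3.32))
wp6_band = quality_band(to_percent(2.66))
multilinguality_band = quality_band(69.75)
```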

Fig. 24. The percentages of the five Main Criteria reflecting the quality of the ENRICH project results achieved in WP 3, WP 4, WP 5, WP 5n and WP 6, as evaluated by 205 respondents during several evaluation sessions.

 

The approximate upper and lower confidence limits (CL) are derived for each Criterion. Interoperability and Adaptability received very high scores, which differ from those of Multilinguality or Usability at the 0,95 confidence level. The results displayed in Fig. 24 show that we can conclude at the 0,95 confidence level that Interoperability and Adaptability are the best ENRICH results when compared with other Criteria such as Multilinguality, Security or Usability. In general, all the Criteria involved received very high estimates; even the lowest result, 69,75 for Multilinguality (out of a maximum of 100 points), is rather good. Interoperability, Adaptability and Usability received a very good evaluation, or the excellent mark, from the ENRICH partners and users from other related institutions.

Fig.25. The percentages of satisfaction of target groups on a quality of the ENRICH project results, achieved in WP3, WP 4, WP 5, WP5n and WP 6, as they were evaluated by 205 respondents during several evaluation sessions.

The approximate upper and lower confidence limits derived for each group of users show that there is no significant difference among the opinions of the expert users: almost all intervals overlap. A real difference can be stated only between the opinions of the expert groups and those of the general users. The general users are more satisfied with the ENRICH project results than the experts (content providers, technical staff and scholars). The tendency for general users' estimates to be higher than those of the other user groups was noticed in all aspects of the investigation, but the real reason for this effect is unclear; one can only guess (and this lies outside the scope of our Evaluation Report). We can only confirm that, at the first stages of the evaluation, it was not possible to draw similar conclusions about the target groups of users, because sufficient sample sizes in the separate groups were attained only at the final stage.
Finally, we have a sample of 205 respondents containing: content providers – information managers (106), technical personnel – supporting staff (57), scholars – researchers in historical documents, students (20), and general or end users with general interests (22). This enables us to make the statistically reliable inference that general users are more satisfied with the quality of the digital repository than the more qualified expert users.


V. Conclusions and Comments on the Obtained Results

  1. Measures for evaluating the quality of the results achieved in the ENRICH project have been created. They reflect the satisfaction of users in the target groups with the results developed in the project (digital objects, tools, processing and usability of the whole repository) as well as with the main principles of quality: Interoperability, Adaptability, Usability, Security and Multilinguality.
  2. The numerical evaluation results are comparable to each other and clearly show the weak and strong points of the facilities developed and of other aspects of the digital repository.
  3. The summarized results reflecting 24 months of work in the project, based on the opinions of the 205 respondents in the total sample, make it possible to draw statistically correct inferences with probability 0,95. They are the following:

    Interoperability and Adaptability received very high scores and those are significantly different from the evaluated Multilinguality, Usability and Security properties (Fig.24 and the Table II below).

    Processes, Objects and Tools were evaluated better than the Repository as a whole (Fig. 23 and Table III).

    The quality of the WP 6 results concerning multilingual access was evaluated to be significantly lower than other work packages: WP 3, WP 4, WP 5, WP 5n (Fig.22 and the Table I).

    All results are evaluated rather highly: the most optimistic were the general users, the experts less so (Fig.6). Statistically, their opinions differ with probability 0,95 (Fig. 25 and Table IV).

  4. The fact that general users were almost always more optimistic when evaluating quality in its various aspects can be explained by the fact that the specific, rather technical testing and validating tasks were hardly accessible to non-professional users. Their estimates are too optimistic and may not be very reliable, but the differences among the target groups are statistically proven. The experts' evaluations are rather similar in all the aspects considered and demonstrate rather high satisfaction with the functionalities and properties developed in ENRICH, although they are less optimistic than the general users' evaluations.
    First of all, the tasks formulated and performed in each work package (WP) were evaluated separately by asking users in the project partners' institutions to rate the specified results achieved in that WP on a scale from 0 to 4. The double average (over respondents and over the questions asked) was taken as the estimate of quality. These results of the quality evaluation are summarized in Table I.

Table I. Work Packages. Estimated Scores and Their Percentages of the Maximum Possible

Therefore we do not focus on those estimates only. The main idea is to extract from them the information related to the Main Quality Criteria and Categories. The relationships between the WP 3 – WP 6 items and the Quality Criteria were established in advance, in the Methodology D-7.1, and the estimates of the Criteria and Categories were derived from the pooled sample from the several evaluation actions performed in the framework of WP 7. The recalculated estimates, displayed in Fig. 23, show that we can conclude at the 0,95 confidence level that the developed Processes, Tools and Objects are rated equally well among the ENRICH results, especially when compared with the facilities of the Repository as a whole. The Repository evaluation is nevertheless also rather good. In general, all these aspects were evaluated well: the points assigned to the categories (out of a maximum of 100 points if we use percentages) are all good.

Similarly, the results displayed in Fig. 24 show that we can conclude at the 0,95 confidence level that Interoperability and Adaptability are the best ENRICH results when compared with other Criteria such as Multilinguality, Security or Usability. In general, all the Criteria involved received very high estimates; even the lowest result, 69,75 for Multilinguality (out of a maximum of 100 points), is rather good. Interoperability, Adaptability and Usability received a very good evaluation, or the excellent mark, from the ENRICH partners and users from other related institutions. Recall that, according to the Methodology for Evaluation (D-7.1), a score in the interval 0 – 25 means that the result is low, 26 – 75 that it is rather good (satisfactory), and 76 – 100 that it is very good.

As for the satisfaction of the target user groups, the approximate upper and lower confidence limits derived for each group show that there is no significant difference among the opinions of the expert users: almost all intervals in Fig. 25 overlap. A real difference can be stated only between the opinions of the expert groups and those of the general users. The general users are more satisfied with the ENRICH project results than the experts (content providers, technical staff and scholars).

The results of the quality evaluation, based on the total sample of 205 respondents who gave their opinion on the achievements of the ENRICH project, are summarized in the following three tables.

Table II. Quality Criteria – Estimated Scores and Their Percentages of the Maximum Possible Values

 

Table III. The Categories – Estimated Scores and Their Percentages of the Maximum Possible Values

 

Table IV. Satisfaction of Target Groups’ Users – Estimated Scores and Their Percentages of the Maximum Possible Values

The evaluation process consisted of several actions performed from April to December 2009, and the evaluation results kept changing as more data were obtained. The dynamics of those changes were recorded in the Progress Reports made every half year. The last evaluation data show a real stabilization of the estimates concerning the Criteria and user satisfaction; they added only minor changes to the results obtained in the previous evaluations. This means that we have consistent estimates of quality in all aspects.

An original methodology for evaluating the quality of a digital repository was developed, which yields a universal estimator of quality that does not depend on the number of criteria used for the evaluation or on the individual opinions of evaluators. The proposed structural model allows the multifaceted aspects to be evaluated, relationships between Categories and Criteria to be established, and a matrix of those relationships across the set of Criteria / Categories to be built. The approximate confidence intervals of the proposed quality estimators, fitted to each item under investigation, make it possible to draw reliable statistical inferences about the differences between estimated values, or their comparability, at a fixed confidence level, usually 0,95.

 

References

[1] Manuscriptorium Digital Library, http://www.manuscriptorium.eu (www.manuscriptorium.com) [accessed on 5 December 2009].
[2] ENRICH - European Networking Resources and Information concerning Cultural Heritage project (2007 - 2009), http://enrich.manuscriptorium.com/ [accessed on 5 December 2009].
[3] MINERVA Technical Guidelines document, http://www.minervaeurope.org/publications/technicalguidelines.htm [accessed on 5 December 2009].
[4] Brooke, John (1986), SUS - A quick and dirty usability scale, http://www.usabilitynet.org/trump/documents/Suschapt.doc [accessed on 5 December 2009].
[5] UsabilityNet: Usability Resources for practitioners and managers, http://www.usabilitynet.org/home.htm; http://www.usabilitynet.org/tools/methods.htm [accessed on 5 December 2009].
[6] Kligiene, Nerute (2009), E-Accessibility Marking a Quality of Digital Repository, Proceedings of 2nd International Multi-Conference on Society, Cybernetics and Informatics, v. II, July 10-13, 2009, Orlando, Florida, USA, pp. 167-172.
[7] Kligiene, Nerute, Structural Model for Digital Repository Quality Evaluation in Context of Usage, Proceedings of eChallenges 2009 Conference, 21-23 October, Istanbul.
[8] Quality principles for Cultural Websites: a Handbook, 2005, Minerva Project, http://www.minervaeurope.org/userneeds/qualityprinciples.htm [accessed on 5 December 2009].
[9] Testing of e-Applications Developed in ENRICH: Usability, Evaluation of Migration Tool, Personalized Translation, Personalization for Contributors and Users, www.musicalia.lt/sus/; www.musicalia.lt/eta/ [accessed on 5 December 2009].


 

Evaluation and Testing Applied to ENRICH WP 3, WP 4, WP 5, and WP 6

 

All project partners have to participate in the evaluation process. Thank you for your efforts in evaluating the first results obtained in the WPs until December 2008.

 
