Results of Evaluation and Testing in WP 3 – WP 6 and Quality Criteria

 

The evaluation of the results achieved in the ENRICH project Work Packages (WP) 3 to 6 started in April 2009 and finished in early October. WP 5 was evaluated in May and June, WP 4 in September. In total, 174 respondents took part in the evaluation by filling in the evaluation forms online. Each respondent rated a specified result with an integer score ranging from 0 (poor or not available) to 4 (excellent). Averaging twice, over respondents and over the questions evaluated, gave a more stable estimator of quality. The structure of the respondents' profile and their general opinions on the results achieved during 23 months of project work are illustrated in Fig. 1 and Fig. 2. The evaluation results were analyzed from many aspects and are presented in this report, illustrated by Fig. 1–21 with corresponding comments.
First, the results were considered across the different user groups in order to investigate how well their needs were satisfied (see Fig. 2–15). Second, the average scores assigned by all target groups of respondents to the separate questions on the WP results were recalculated as estimates of Categories and Quality Criteria over the investigated WPs; the results are shown in Fig. 16–17. Approximate 0,95 confidence intervals were fitted to the quality estimators in all aspects considered (WPs, Criteria, Categories and opinions of target groups) in order to make statistically reliable inferences. At the 0,95 confidence level we can confirm that the created Processes, Tools and Objects are the strongest properties of the ENRICH results. Similarly, Interoperability and Adaptability are the best (at the 0,95 confidence level) when compared with other quality Criteria such as Multilinguality or Usability. The numerical values of the estimated quality aspects are shown in diagrams in two ways: as average scores ranging from 0 to 4 (Fig. 1–18) and as percentages (Fig. 19–21) with the lower and upper confidence limits.
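The double averaging mentioned above can be sketched as follows. This is only an illustration: the score matrix is hypothetical (it is not project data), and the function name `double_average` is introduced here for the example. It assumes every respondent answered every question, in which case the order of averaging does not change the result.

```python
# Hypothetical scores (NOT project data): one row per respondent,
# one column per question, each score an integer from 0 to 4.
scores = [
    [4, 3, 3],
    [3, 3, 4],
    [2, 4, 3],
    [4, 4, 3],
]


def double_average(score_rows):
    """Average over respondents for each question, then over questions."""
    n_respondents = len(score_rows)
    n_questions = len(score_rows[0])
    # First average: mean score of each question over all respondents.
    per_question = [
        sum(row[q] for row in score_rows) / n_respondents
        for q in range(n_questions)
    ]
    # Second average: mean of the per-question means.
    overall = sum(per_question) / n_questions
    return overall, per_question


overall, per_question = double_average(scores)
print(per_question, round(overall, 2))
```

The per-question means correspond to the individual bars in the diagrams, and the overall value to the WP average shown beside them.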

 

I. The Results Across the Different Target User Groups

The total number of respondents involved in the evaluation of ENRICH results and testing activities was 174. Four target user groups were considered: content providers and information managers (84); technical personnel and supporting staff (51); scholars, that is researchers in historical documents and students (20); and general or end users with general interests (19). The structure of respondents corresponds well to the aims of the project, which is oriented more towards experts in the area than towards users with general interests. Regrettably, the activity of scholars during the whole testing period was rather low, closer to that of general users than of experts. Still, this appears to reflect the real structure of users, because the ratio of target groups remained rather stable during the evaluation process, as seen in Fig. 1 (a) and (b).

Fig. 1. (a) The distribution of 174 respondents over the four target user groups: content providers and information managers; technical personnel and supporting staff; scholars (researchers in historical documents, students); and general or end users with general interests. (b) The distribution of 116 respondents at the 18th month of project work; the proportions of the different user groups evaluating the results were almost stable.

 

Fig. 2. Project results in WP 3, WP 4, WP 5 and WP 6 evaluated by the target user groups, compared with the maximum possible score of 4.

 

The summary results shown in Fig. 2 demonstrate rather similar opinions among scholars, technical staff and content providers, while general users seem to be more satisfied with the ENRICH project results than the experts. The results are rather similar to those obtained at month 18 of ENRICH work (6 months before the project end). In the last part of this report we consider the confidence limits for each group of users in order to draw a statistically correct conclusion.

Fig. 3. The investigated properties (13 questions asked) concerning quality in WP 3, WP 4, WP 5 and WP 6, evaluated by the target user groups (individual scores ranged from 0 to 4). The average score of each property is marked numerically on the blue line showing the average.

 

Fig. 4. The sum of scores is highest for WP 3b (15,02) and WP 5 (14,76). The spread of opinions among the user groups is fairly even for the WP 3, WP 4 and WP 5 questions but varies much more in the WP 6 quality evaluation. The smallest sum is for WP 6d (9,28); it also contains the smallest score, 1,33, assigned by scholars.

 

Fig. 5. The averages of the numerical values assigned to each of the three questions reflecting the quality of WP 3 results, evaluated by the different user groups. The WP 3 average (at the right of the diagram) shows the spread of opinions on WP 3 quality among the target users. The total average for WP 3 is 3,32.

The questions in WP 3 were rather technical and certainly difficult for general users, and sometimes for scholars (researchers in the historical documentary heritage). Looking more closely at the results in Fig. 5, we see that the four extreme values equal to 4 are located at WP 3a and WP 3b, assigned by scholars and general users, and that they can significantly affect the WP 3 average. Let us exclude the extreme values and apply stratified sampling, evaluating the WP 3 results only by real experts: content providers and information managers (15 respondents in WP 3) and technical personnel and supporting staff (17 respondents in WP 3). Fortunately, they formed the majority of this sample (compared with only 2 scholars and 4 general users).

Fig. 6. The averages of the numerical values assigned to each of the three questions reflecting the quality of WP 3 results, evaluated only by the expert user groups. The averages at the right of the diagram show the total averages assigned by content providers and by technical staff, and the total WP 3 average given by experts, which is 3,26.

 

Fig. 7. The averages of the numerical values assigned to each of the three questions reflecting the quality of WP 4 results, evaluated by the target user groups. The average calculated across the WP 4 questions shows the overall evaluation of quality in WP 4. The total average for WP 4 is 3,27.

 

Fig. 8. The averages of the numerical values assigned to each of the three questions reflecting the quality of WP 5 results, evaluated by the target user groups. The average calculated across the WP 5 questions shows the overall evaluation of quality in WP 5. The total average for WP 5 is 3,32.

 

Fig. 9. The averages of the numerical values assigned to each of the four questions reflecting the quality of WP 6 results, evaluated by the target user groups. The average at the right of the diagram shows the overall evaluation of quality in WP 6 by the different users. The total average for WP 6 is 2,66.

II. The Scores in Work Packages Assigned by All Users

Fig. 10. The average scores of the 3 questions on quality in WP 3, evaluated by the full sample of users (38 respondents). The four extreme values equal to 4, assigned by scholars and general users, are located at WP 3a and WP 3b.

 

Fig. 11. The average scores of the 3 questions on quality in WP 3, evaluated by the stratified sample including only the expert users. The total number of respondents was 32. The extreme values affecting the final average were excluded. According to the experts' evaluation, the total average for WP 3 is 3,26 (compare with the previous result, 3,32).

 

Fig. 12. The average scores of the 3 questions on quality in WP 4, evaluated by the full sample of users (50 respondents).

 

Fig. 13. The average scores of the 3 questions on quality in WP 5, evaluated by all users of WP 5 (49 respondents).

 

Fig. 14. The average scores of the 4 questions on quality in WP 6, as evaluated by all users (37 respondents). Scholars have the most widely spread opinions, in contrast to general users, whose opinion is almost uniformly good.

 

Fig. 15. The average total scores reflecting quality in WP 3, WP 4, WP 5 and WP 6, evaluated by the different users. It seems that the quality of the results achieved in WP 3 and WP 5 was evaluated higher than in WP 6; that is the integrated opinion of the 174 respondents involved in the final evaluation.

 

Let us compare the total averages given by the various users to WP 3, WP 4, WP 5 and WP 6. These quantities are 3,32 (or 3,26 when evaluated only by experts), 3,01, 3,32 and 2,66, respectively. Are these estimates significantly different? This problem of statistical inference, for example testing the hypothesis that the WP 3 results are better than those of WP 6, is addressed in section IV of this report (see Fig. 18).

III. The Scores in Categories and Main Criteria

Now let us derive the average scores across the categories (digital objects, tools developed in the ENRICH project, processing, and the repository as a whole) and the Main Criteria of quality from all the data collected from the 174 respondents participating in the testing and evaluation activities.

Fig. 16. The average scores of the four categories as components of quality in the ENRICH project results achieved in WP 3, WP 4, WP 5 and WP 6, as evaluated by 174 respondents. The maximum possible score is 4. Visually, the weakest point is the category 'Repository'. This is checked in section IV of this report.

 

Fig. 17. The average scores of the five Main Criteria reflecting the quality of the ENRICH project results achieved in WP 3, WP 4, WP 5 and WP 6, as evaluated by 174 respondents. Interoperability and Adaptability seem to be rated better than Usability and Multilinguality. Are those differences significant?

 

The question of how to compare the available evaluation results correctly is answered in the following section.

IV. Statistical Inference on Derived Results – Confidence Limits of Estimators

In order to test correctly the statistical hypothesis that, say, the WP 3 results are better than those of WP 6, let us fix the standard significance level 0,05, corresponding to the 0,95 confidence level (and the critical value 1,96 in the normal approximation of the statistics used). Then the following 0,95 confidence interval (N. Kligiene, 2009) is fitted: $p^* \pm 1{,}96\,[\,p^*(1-p^*)/(4nk)\,]^{1/2}$. Here $p^*$ is the average quality estimator divided by 4, its maximum possible value; $n$ is the number of respondents; and $k$ is the number of questions used for evaluation. Applying this formula, we obtain the confidence intervals for the investigated WPs, summarized in the following table.
 

Because the confidence intervals for the WP 3 results (in both cases) and WP 5 overlap, we can conclude that there is no significant difference between them. A different conclusion follows for the difference in quality between WP 3 (or WP 5) and WP 6: it is significant, and with probability 0,95 we can confirm that the results in WP 3 are better than in WP 6. The same conclusion is evident from Fig. 18.
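The interval comparison described above can be reproduced approximately with a short sketch. The averages, respondent counts and question counts are the ones quoted in this report (Fig. 5, 8, 9, 10, 11, 13 and 14); the function names `confidence_interval` and `intervals_overlap` are introduced here for illustration, and the factor 4 in the denominator is taken literally from the formula in this section. The computed limits are a recalculation and are not claimed to match the original table digit for digit.

```python
import math


def confidence_interval(avg_score, n, k, z=1.96):
    """Approximate 0.95 interval for p* = avg_score / 4.

    n is the number of respondents, k the number of questions.
    The denominator 4*n*k follows the formula quoted in this section.
    """
    p = avg_score / 4
    half_width = z * math.sqrt(p * (1 - p) / (4 * n * k))
    return p - half_width, p + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Inputs as quoted in the report: (average score, respondents, questions).
wp3_all = confidence_interval(3.32, n=38, k=3)
wp3_experts = confidence_interval(3.26, n=32, k=3)
wp5 = confidence_interval(3.32, n=49, k=3)
wp6 = confidence_interval(2.66, n=37, k=4)

for name, ci in [("WP 3 (all)", wp3_all), ("WP 3 (experts)", wp3_experts),
                 ("WP 5", wp5), ("WP 6", wp6)]:
    print(f"{name}: {ci[0]:.3f} .. {ci[1]:.3f}")

print("WP 3 vs WP 5 overlap:", intervals_overlap(wp3_all, wp5))
print("WP 3 vs WP 6 overlap:", intervals_overlap(wp3_all, wp6))
```

With these inputs the WP 3 and WP 5 intervals overlap, while the WP 6 interval lies entirely below both, which agrees with the conclusion stated above.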

Fig. 18. The average evaluation of quality in the ENRICH project results achieved in WP 6, WP 5, WP 4 and WP 3, the latter evaluated in two ways: by experts only and by all users. The approximate upper and lower confidence limits (CL) are derived for each case. The results are related to those in Fig. 15 and show clearly that the only significant difference that can be concluded is the lower quality in WP 6. The other results are equally good.

 

Fig. 19. The percentages of the four categories as components of quality in the ENRICH project results achieved in WP 3, WP 4, WP 5 and WP 6, as evaluated by 174 respondents. The approximate upper and lower confidence limits (CL) are derived for each category.

 

The results displayed in Fig. 19 show that we can conclude with 0,95 confidence that the developed Processing, Tools and Objects were evaluated as equally good ENRICH results, especially when compared with the Repository as a whole. The Repository evaluation is also rather good. In general, all these aspects were evaluated well: the points assigned to those categories (out of a maximum possible 100 points) are all good. According to the Methodology for Evaluation described in D 7.1, a score in the interval 0–25 means that the result is low, 26–75 that it is rather good (satisfactory), and 76–100 that it is very good. Therefore the Tools, Objects and Processing received a very good evaluation from the ENRICH partners and other related institutions.
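The mapping from scores to the rating bands quoted above can be sketched as follows. Two points are assumptions made for this illustration and are not stated in the report: that a percentage is the average score divided by the maximum score 4 and multiplied by 100, and that non-integer values falling between the quoted bands (for example 25,5) belong to the lower band. The function names are hypothetical.

```python
def to_percent(avg_score, max_score=4):
    """Convert an average score on the 0..4 scale to points out of 100.

    Assumption: percentage = 100 * score / maximum score.
    """
    return 100 * avg_score / max_score


def rating_band(percent):
    """Classify a percentage using the bands quoted from D 7.1.

    Assumption: values between the quoted integer limits
    (e.g. 25.5 or 75.5) are assigned to the lower band.
    """
    if percent <= 25:
        return "low"
    if percent <= 75:
        return "rather good (satisfactory)"
    return "very good"


# Values quoted in this report: the WP 3 and WP 6 averages and the
# Multilinguality percentage.
print(rating_band(to_percent(3.32)))
print(rating_band(to_percent(2.66)))
print(rating_band(69.75))
```

Under these assumptions the WP 3 average falls in the "very good" band, while the WP 6 average and the Multilinguality result fall in the "rather good" band.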

Fig. 20. The percentages of the five Main Criteria reflecting the quality of the ENRICH project results achieved in WP 3, WP 4, WP 5 and WP 6, as evaluated by 174 respondents. The approximate upper and lower confidence limits are derived for each Criterion. Adaptability and Interoperability received very high evaluations, significantly different from those of the Multilinguality or Usability properties.

 

The results displayed in Fig. 20 show that we can conclude with 0,95 confidence that Interoperability and Adaptability are the best ENRICH results when compared with other Criteria such as Multilinguality, Security or Usability. In general, all the Criteria involved received very high estimates; even the lowest result, 69,75 (out of a maximum possible 100 points), assigned to Multilinguality, is rather good. Interoperability, Adaptability and Usability received a very good evaluation, the excellent mark, from the ENRICH partners and other related institutions.

Fig. 21. The percentages of satisfaction of the target groups with the quality of the ENRICH project results achieved in WP 3, WP 4, WP 5 and WP 6, as evaluated by 174 respondents. The results are related to Fig. 2.

The approximate upper and lower confidence limits derived for each group of users show that there is no significant difference among the expert users' opinions: almost all intervals overlap. A difference can be stated only between the opinions of content providers and general users: the general users are more satisfied with the ENRICH project results than the experts (content providers, technical staff and scholars).

V. Conclusions and Comments

  1. A measure evaluating the quality of the results achieved in the ENRICH project has been created; it reflects the satisfaction of users in the target groups with the results developed in the project: digital objects, tools, processing and usability of the whole repository.
  2. The numerical evaluation results are comparable to each other and show the weak and strong points in the activities and in the other aspects detailed below.
  3. The summarized results of the evaluated quality in the ENRICH project are the following:
    • All results were evaluated rather highly: general users were the most optimistic, the experts less so (Fig. 2). Statistically, their opinions differ with probability 0,95 (Fig. 21).
    • Processes, Objects and Tools were evaluated better than the whole Repository by the 174 respondents, with probability 0,95 (Fig. 19).
    • Adaptability and Interoperability received very high scores, significantly different from those of the Usability, Security and Multilinguality properties (Fig. 20).
    • The quality of the WP 6 results was evaluated as significantly lower than that of WP 3, WP 4 or WP 5 (Fig. 18).
  4. The fact that general users were almost always more optimistic when evaluating quality in various aspects can be explained by the testing and validation tasks being technically rather specific and probably hardly accessible to non-professional users. Therefore the very high estimates of general users should be treated with some reserve. On the other hand, the experts' evaluations in all the aspects considered are rather similar and demonstrate their rather high satisfaction with the functionalities and properties developed in ENRICH; that is the main conclusion from testing and evaluating the accessibility, usability and adaptability of the developed applications.

 

Evaluation and Testing Applied to ENRICH WP 3, WP 4, WP 5, and WP 6

 

All project partners have to participate in the evaluation process. Thank you for your efforts in evaluating the first results obtained in the WPs up to December 2008.

 
