Assessing the applicability of machine learning techniques on data sets

Luca Papariello, Mihnea Tufis, EURECAT, WP4

An important component of the Safe-DEED project is concerned with data valuation. This component receives a data set (or a snapshot thereof) and returns an estimate of its value. Assigning a value to information items is a complex problem, and research in this field is still in its infancy. Therefore, the goal of the Data Valuation Component (DVC) is to go beyond simply assigning a price label to a given data set. Instead, according to the functional requirements developed for the DVC (Deliverable 4.1), data valuation will be performed along three aspects:
1. data quality;
2. data exploitability;
3. economic value.

At a high level, the quality of a data set determines the opportunities for its exploitability, and both aspects together determine the economic value of the data set in a given context. While the qualitative aspects of data sets have been thoroughly studied, their exploitability through machine learning techniques is usually assessed only in the advanced stages of data-centered projects.

One of the purposes of the DVC is to devote more attention to this issue. More precisely, the problem we are working on is estimating the effect of chance on the accuracy of machine learning algorithms. For example, consider a binary classification problem, in which we want to establish whether an object belongs to class A or class B. If the two classes have approximately the same size (i.e. there are as many A-objects as B-objects), then an algorithm that predicts A or B by tossing a coin would have (on average) 50% accuracy. An algorithm that scores around 50% on such a data set is therefore not doing well: it has not learned any pattern hidden in the data. We would say that an algorithm is doing well if it correctly predicts much more than 50% of the objects.
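The coin-toss argument above can be checked with a small simulation. The following is a minimal sketch, not part of the DVC itself: the function name `coin_toss_accuracy`, the data set size, and the random seed are all assumptions chosen for illustration.

```python
import random


def coin_toss_accuracy(labels, rng):
    """Accuracy of a 'classifier' that guesses uniformly among the observed classes."""
    classes = sorted(set(labels))
    # Count how often a uniformly random guess matches the true label.
    hits = sum(rng.choice(classes) == y for y in labels)
    return hits / len(labels)


# Hypothetical balanced data set: as many A-objects as B-objects.
rng = random.Random(0)  # fixed seed so the run is repeatable
labels = ["A"] * 5000 + ["B"] * 5000

acc = coin_toss_accuracy(labels, rng)
print(f"coin-toss accuracy: {acc:.3f}")
```

With 10,000 objects the printed value should land close to 0.5. With far fewer objects it can drift noticeably away from 0.5, which is one way in which the number of samples affects what can be achieved by chance.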

We are currently working on generalizing this observation to data sets with multiple classes, potentially of different sizes. The objective is to provide an estimate of the accuracy that one would obtain by pure chance; this estimate will depend on parameters such as the number of samples, the number of features, and the relative sizes of the different classes. In this way, as in the binary example above, we could say that an algorithm is good if it performs better than it would by pure chance.
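To make the multi-class case concrete, the sketch below computes three standard reference baselines from the class proportions alone. These are well-known textbook baselines and not the estimator being developed for the DVC, which is also meant to account for the number of features. The function names `chance_baselines` and `chance_std` and the example class sizes are assumptions made for illustration.

```python
import math
from collections import Counter


def chance_baselines(labels):
    """Expected accuracy of three trivial strategies, from class proportions only."""
    n = len(labels)
    props = [count / n for count in Counter(labels).values()]
    return {
        # Guess every class with equal probability: 1 / (number of classes).
        "uniform": 1 / len(props),
        # Guess class k with probability p_k: sum of squared proportions.
        "stratified": sum(p * p for p in props),
        # Always predict the largest class: its proportion.
        "majority": max(props),
    }


def chance_std(p, n):
    """Std. deviation of accuracy when each of n predictions is
    independently correct with probability p (binomial model)."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical imbalanced data set with three classes.
labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
baselines = chance_baselines(labels)
spread = chance_std(baselines["uniform"], len(labels))

for name, value in baselines.items():
    print(f"{name:>10}: {value:.3f}")
print(f"std of uniform guessing with n={len(labels)}: {spread:.3f}")
```

In this example, always predicting the largest class already reaches 60% accuracy, so the 50% threshold from the balanced binary case no longer applies. The spread term shows why the number of samples matters: on small data sets, a score somewhat above the baseline can still be explained by chance.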

This solution will plug into the DVC to provide a measure of how applicable selected machine learning techniques are to a given input data set. Together with the qualitative evaluation of the data set, this exploitability measure will be fed into an economic model that determines the value of the data set. Thus, a data set that is more prone to the effects of chance is one from which it is harder to extract valuable information, and it will be valued accordingly.