The progress made by AI over the last few years is remarkable. What started as an alien technology has now become embedded in every walk of life. AI is helping people and organizations augment human intelligence almost everywhere. However, this progress wouldn’t have been possible without the availability of increasingly large and diverse research datasets.
These datasets are collections of images sampled from the internet, and they represent the statistics of real-world images better than images taken in the laboratory. It’s because of these datasets that models generalize better to the real world. They also allow reproducible, quantitative comparison of algorithms, enabling researchers to build efficiently on each other’s work.
Modern machine learning relies on these diverse datasets to function. However, these datasets have technical and ethical shortcomings:
One, there is a copyright issue. While some datasets contain images licensed for use, others do not. Many were collected without oversight, contain personal information, and come with unclear licensing terms.
Two, there is a data protection issue. Most of these images were taken by people for other people to view, so a large number of them contain people. Because it is nearly impossible to obtain consent from everyone pictured, the data is collected without consent.
To address these challenges from a technical, ethical, and legal perspective, the Visual Geometry Group at the University of Oxford has proposed an unlabeled dataset. Named PASS (Pictures without humAns for Self-Supervision), the dataset contains only images with a CC-BY license and complete attribution metadata. It also contains no images of people and avoids images that are problematic for ethics and data protection.
“We do so by starting from a large-scale (100 million random Flickr images) dataset, YFCC100M, meaning that the data is better randomized, and identify a ‘safer’ subset within it. We also focus on data made available under the most permissive Creative Commons license (CC-BY) to address copyright concerns. Given this data, we then conduct an extensive evaluation of SSL methods, discussing performance differences when these are trained using ImageNet and PASS,” the research team said in its paper.
“The annotators were asked to identify images that contain people or body parts, as well as personal information such as IDs, names, license plates, signatures, etc. Additionally, the annotators were asked to flag images with problematic content such as drugs, nudity, blood, and other offensive content. From the remaining images (1.46M) we further removed duplicates and randomly selected a subset with approximately the same size as IN-1k (1,440,191 images),” the team specified.
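The curation steps described in the two quotes (keep CC-BY images, drop anything the annotators flagged, remove duplicates, then draw a random subset) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual pipeline: the `ImageRecord` fields, the use of a content hash for duplicate detection, and the `curate` function are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    # Hypothetical record; the field names are illustrative and not taken from the paper.
    image_id: str
    license: str
    content_hash: str  # assumed stand-in for whatever duplicate detection is used
    flagged: bool      # True if an annotator marked people, personal info, or problematic content


def curate(records, target_size, seed=0):
    """Keep CC-BY, unflagged, de-duplicated images, then sample a random subset."""
    kept = []
    seen_hashes = set()
    for rec in records:
        if rec.license != "CC-BY":
            continue  # copyright: keep only the most permissive license
        if rec.flagged:
            continue  # ethics and data protection: drop annotator-flagged images
        if rec.content_hash in seen_hashes:
            continue  # duplicate of an image that was already kept
        seen_hashes.add(rec.content_hash)
        kept.append(rec)

    if len(kept) <= target_size:
        return kept
    # Random subset of the requested size, reproducible through the seed.
    return random.Random(seed).sample(kept, target_size)
```

The order of the filters mirrors the order in which the article mentions them. In practice the flagging step is human annotation rather than a boolean field, which is why some harmful content can still slip through.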
Compared to ImageNet, the PASS dataset enjoys certain advantages:
- It has three essential differences: lack of class-level curation, lack of community optimization, and lack of people
- Self-supervised approaches such as MoCo, SwAV, and DINO train very well on the PASS dataset
- The lack of humans does not affect downstream task performance
- Models trained on the PASS dataset achieve better results than models trained on ImageNet in 8 of 13 frozen-encoder evaluation benchmarks
However, it still has its fair share of limitations. First, despite the filtering, some harmful content might have slipped through. Second, because PASS contains no people, it cannot be used to learn models of people, for tasks such as pose recognition. Third, as PASS contains no labels, it cannot be used alone for training and benchmarking. This means that curated datasets, which carry privacy and copyright issues, remain necessary.
It will be interesting to see how far PASS can go in reducing ethical and legal risks in datasets. Only time will tell whether it can improve how datasets are curated and introduce a more realistic training scenario that uses images not obtained from labelled datasets.
Source: indiaai.gov.in