These are some of the most intriguing artificial intelligence research papers published this year. They combine innovations in artificial intelligence (AI) and data science. The papers are presented in chronological order, and a link to a longer article is provided for each.
GCR: Gradient Coreset-Based Replay Buffer Selection for Continual Learning
Continual learning (CL) aims to develop methods by which a single model can adapt to a growing number of tasks encountered sequentially, ideally transferring learning across tasks in a resource-efficient way. Catastrophic forgetting, however, is a serious challenge for CL systems: when a model learns a new task, it tends to forget what it has already learned.
Gradient Coreset Replay (GCR) is a novel method for selecting and updating replay buffers, built on a carefully designed optimization criterion. Specifically, it selects and maintains a "coreset" whose gradients approximate the gradients of all the data observed so far with respect to the current model's parameters. The authors investigate the approaches needed to implement this efficiently in a continual learning setting and discuss their findings.
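To make the selection step concrete, here is a minimal sketch of one way a gradient-matching coreset could be chosen. It assumes per-example gradient vectors are already available and uses a greedy, matching-pursuit-style rule; the function name, the greedy rule, and the stopping condition are illustrative assumptions, not the paper's exact optimization criterion.

    import numpy as np

    def select_gradient_coreset(grads, k):
        # grads: (n, p) float array, one flattened per-example gradient per row.
        # Greedily pick k examples whose gradients best explain the mean
        # gradient of all observed data (an illustrative stand-in for GCR's
        # criterion, not the paper's exact algorithm).
        residual = grads.mean(axis=0).copy()
        selected = []
        for _ in range(k):
            scores = grads @ residual              # alignment with what is left
            if selected:
                scores[np.array(selected)] = -np.inf   # no repeats
            i = int(np.argmax(scores))
            selected.append(i)
            g = grads[i]
            coef = float(residual @ g) / (float(g @ g) + 1e-12)
            residual = residual - coef * g         # remove the explained part
        return selected

The selected examples would then populate the replay buffer and be mixed into training batches for subsequent tasks.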
The researchers show that their method outperforms the state of the art by a significant margin (2%-4% absolute) in a well-studied offline continual learning setting. Applied to online and streaming CL settings, it yields improvements of up to 5% over existing solutions. Finally, they show that supervised contrastive loss is beneficial for continual learning, giving a cumulative accuracy gain of up to 5% when combined with their subset-selection technique.
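For reference, the supervised contrastive loss they combine with replay can be written down in a few lines. The sketch below follows the standard SupCon formulation (Khosla et al.) and assumes L2-normalized embeddings; it is not code from the paper.

    import torch

    def supcon_loss(features, labels, temperature=0.1):
        # features: (n, d) L2-normalized embeddings; labels: (n,) class ids.
        n = features.size(0)
        sim = features @ features.t() / temperature
        not_self = ~torch.eye(n, dtype=torch.bool, device=features.device)
        positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
        # log-softmax over every other example in the batch
        sim = sim.masked_fill(~not_self, float("-inf"))
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        # mean log-probability of the positives for each anchor
        pos_count = positives.sum(dim=1).clamp(min=1)
        per_anchor = -log_prob.masked_fill(~positives, 0.0).sum(dim=1) / pos_count
        # average only over anchors that actually have positives in the batch
        return per_anchor[positives.any(dim=1)].mean()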
Merry Go Round: Rotate a Frame and Fool a DNN
Most first-person videos captured today come from wearable cameras. Egocentric vision is one of the hardest problems in computer vision, and most state-of-the-art (SOTA) vision systems rely on deep neural networks (DNNs). However, DNNs are known to be susceptible to adversarial attacks (AAs), which perturb the input with noise invisible to the human eye. Both black-box and white-box attacks have been shown to be effective against image- and video-analysis tasks.
The researchers observe that most AA methods work by altering image intensities, and for videos the process must be repeated for every frame. They emphasize that the notion of imperceptibility used for images may not carry over to videos, because in a video a random shift in intensity can be noticeable even between two consecutive frames.
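As a point of contrast, the dominant intensity-based attacks look roughly like the FGSM sketch below, which adds per-frame pixel noise to a clip in one gradient step. The clip shape and classifier interface are assumptions for illustration; this is the style of attack the authors argue against for video, not their method.

    import torch

    def fgsm_clip(model, clip, label, eps=2.0 / 255.0):
        # clip: (1, T, C, H, W) video tensor in [0, 1]; label: (1,) class id.
        # One-step FGSM: every frame receives its own additive intensity
        # noise, i.e. exactly the per-frame perturbation style described above.
        clip = clip.clone().requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(clip), label)
        loss.backward()
        return (clip + eps * clip.grad.sign()).clamp(0.0, 1.0).detach()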
As the main novelty of this work, the authors propose carrying out AAs on a video-analysis system by perturbing the optical flow. This kind of disruption is particularly well suited to egocentric videos: egocentric recordings already contain a significant amount of shake, and adding a small amount more is almost impossible to detect. Broadly, their idea can be understood as adding structured, parametric noise as the adversarial perturbation. Their instantiation of the idea, which applies 3D rotations to the frames, shows that their technique can mount a black-box AA on an egocentric activity-detection system with one-third fewer queries than the SOTA AA technique.
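A rough sketch of what such a parametric rotational perturbation could look like is given below: a small 3D camera rotation applied to a frame through the homography H = K R K^-1. The intrinsics, angle values, and border handling are illustrative assumptions; the actual attack would search over the rotation parameters rather than over pixel noise.

    import numpy as np
    import cv2

    def rotate_frame_3d(frame, yaw=0.0, pitch=0.0, roll=0.0, focal=500.0):
        # Warp a frame as if the camera rotated slightly (angles in radians),
        # using the planar-scene approximation H = K @ R @ inv(K).
        h, w = frame.shape[:2]
        K = np.array([[focal, 0.0, w / 2.0],
                      [0.0, focal, h / 2.0],
                      [0.0, 0.0, 1.0]])
        cx, sx = np.cos(pitch), np.sin(pitch)
        cy, sy = np.cos(yaw), np.sin(yaw)
        cz, sz = np.cos(roll), np.sin(roll)
        Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
        Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
        Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
        H = K @ (Rz @ Ry @ Rx) @ np.linalg.inv(K)
        return cv2.warpPerspective(frame, H, (w, h),
                                   borderMode=cv2.BORDER_REFLECT)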
Multi-modal Extreme Classification
This research presents MUFIN, a method for extreme classification (XC) problems that involve millions of labels and data points carrying both visual and textual descriptors. MUFIN is demonstrated on applications such as product-to-product recommendation and bid-query prediction across millions of products. Modern multimodal techniques typically rely on embedding-based methods alone. XC approaches, by contrast, use classifier architectures to achieve higher accuracy than embedding-only methods, but have concentrated mainly on text-based classification problems.
MUFIN develops an architecture based on cross-modal attention and trains it in a modular fashion using pre-training and positive and negative mining. A new dataset, MM-AmazonTitles-300K, was compiled from publicly available listings on Amazon.com for product-to-product recommendation; it contains roughly 300 thousand products, each with a title and a set of images. On MM-AmazonTitles-300K, on Polyvore, and on a dataset of over 4 million labels drawn from Bing search-engine click logs, MUFIN delivered at least 3% higher accuracy than leading text-based, image-based, and multimodal approaches.
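To give a sense of what cross-modal attention means here, the sketch below lets text-token embeddings attend over image-patch embeddings. The dimensions, the single-block design, and the residual wiring are assumptions for illustration, not MUFIN's exact architecture.

    import torch
    import torch.nn as nn

    class CrossModalAttention(nn.Module):
        # Text tokens query image patches; the fused representation could
        # then feed downstream extreme classifiers. Illustrative only.
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, text_tokens, image_patches):
            # text_tokens: (B, Lt, dim); image_patches: (B, Li, dim)
            fused, _ = self.attn(text_tokens, image_patches, image_patches)
            return self.norm(text_tokens + fused)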