|
7 | 7 | "source": [ |
8 | 8 | "## Latent Outlier Exposure for Anomaly Detection with Contaminated Data\n", |
9 | 9 | "\n", |
10 | | - "In anomaly detection, our goal is to identify data points that exhibit systematic deviations from the majority of data in an unlabeled dataset. Usually, it is assumed that we have clean training data without anomalies, but in practice, this may not be the case. To overcome this challenge, [Latent OE-AD](https://proceedings.mlr.press/v162/qiu22b.html) propose a strategy for training our anomaly detector when dealing with unlabeled anomalies. This approach is compatible with a wide range of models. The main idea is to jointly infer binary labels (normal vs. anomalous) for each data point while updating the model parameters. Taking inspiration from the concept of outlier exposure, where synthetically created, labeled anomalies are used, we employ a dual loss method. This means using two losses that share parameters, one for normal data and another for anomalous data.\n" |
| 10 | + "In anomaly detection, our goal is to identify data points that exhibit systematic deviations from the majority of data in an unlabeled dataset. Usually, it is assumed that we have clean training data without anomalies, but in practice this may not be the case. To overcome this challenge, [Latent OE-AD](https://proceedings.mlr.press/v162/qiu22b.html) proposes a strategy for training an anomaly detector on data containing unlabeled anomalies. This approach is compatible with a wide range of models. The main idea is to jointly infer binary labels (normal vs. anomalous) for each data point while updating the model parameters. Taking inspiration from the concept of outlier exposure, where synthetically created, labeled anomalies are used, **Latent Outlier Exposure (LOE)** employs a dual-loss method: two losses that share parameters, one for normal data and another for anomalous data.\n"
| 11 | + ] |
| 12 | + }, |
| 13 | + { |
| 14 | + "cell_type": "markdown", |
| 15 | + "id": "creative-quebec", |
| 16 | + "metadata": {}, |
| 17 | + "source": [ |
| 18 | + "## Contaminated Data\n", |
| 19 | + "\n", |
| 20 | + "A common assumption in anomaly detection is that there is access to **clean training data** to teach the model what constitutes **normal** samples. However, this assumption is often violated: datasets can be large and uncurated, and may already contain some of the anomalies the model is intended to detect. For instance, a dataset of medical images may already contain cancer images, and datasets of financial transactions could already contain undetected fraudulent activity. Naively training an unsupervised anomaly detector on such data may result in degraded performance.\n",
| 22 | + "\n", |
| 23 | + "Many anomaly detection approaches rely on the assumption of **a training dataset consisting solely of 'normal' data**, but in real-world scenarios, **unlabeled anomalies may be present in the training data**. This can lead to a decrease in anomaly detection accuracy.\n", |
| 24 | + "\n", |
| 25 | + "In the context of anomaly detection, **contaminated data** refers to a situation where **the training data contains not only 'normal' examples but also anomalous examples that are not properly labeled**. This can negatively impact the performance of anomaly detection algorithms, as they may learn to recognize the anomalies as normal patterns rather than detecting them as anomalies.\n", |
| 26 | + "\n", |
| 27 | + "We refer to the strategy of blindly training an anomaly detector as if the training data were clean as “Blind” training. A second strategy, “Refine”, employs an ensemble of one-class classifiers to iteratively weed out anomalies and then continues training on the refined dataset. "
| 28 | + ] |
| 29 | + }, |
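The contaminated-data setting can be simulated with a small synthetic example. The following is a hypothetical sketch (the distributions, dimensionality, and contamination ratio are made up for illustration): normal samples are mixed with a fraction $\alpha$ of anomalies, and the labels are then discarded, so the training set is unlabeled but contaminated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: alpha is the (usually unknown) contamination ratio.
alpha = 0.1
n_total = 1000
n_anom = int(alpha * n_total)

# "Normal" data and anomalies drawn from two different Gaussians (made up).
normal = rng.normal(loc=0.0, scale=1.0, size=(n_total - n_anom, 2))
anomalies = rng.normal(loc=5.0, scale=1.0, size=(n_anom, 2))

# Mix and shuffle; the normal/anomaly labels are discarded, mimicking an
# uncurated, unlabeled training set.
x_train = np.concatenate([normal, anomalies])
rng.shuffle(x_train)

print(x_train.shape)  # (1000, 2)
```

An anomaly detector trained "blindly" on `x_train` would treat the 10% anomalous points as normal data, which is exactly the failure mode described above.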
| 30 | + { |
| 31 | + "cell_type": "markdown", |
| 32 | + "id": "satisfied-vitamin", |
| 33 | + "metadata": {}, |
| 34 | + "source": [ |
| 35 | + "## Problem Formulation\n", |
| 36 | + "In contrast to most anomaly detection setups, we assume that our dataset is corrupted by anomalies. That means, **we\n",
| 37 | + "assume that a fraction $(1−α)$ of the data is normal, while its complementary fraction $α$ is anomalous.** This corresponds to a more challenging (but arguably more realistic) anomaly detection setup, since the training data cannot be assumed to be normal. We treat the assumed contamination ratio $α$ as a hyperparameter in our approach and refer to the ground-truth contamination ratio explicitly where the distinction is needed.\n",
| 38 | + "\n", |
| 39 | + "The challenge thereby is to simultaneously infer the binary labels $y_i$ during training while optimally exploiting this information for training an anomaly detection model.\n", |
| 40 | + "\n", |
| 41 | + "We consider two losses. Similar to most work on deep anomaly detection, we consider a loss function $\\mathcal{L}_n^\\theta(\\mathbf{x}) \\equiv \\mathcal{L}_n\\left(f_\\theta(\\mathbf{x})\\right)$ that we aim to minimize over \"normal\" data. The function $f_\\theta(\\mathbf{x})$ extracts features from $\\mathbf{x}$, typically based on a self-supervised auxiliary task. When trained on only normal data, the loss yields lower values for normal than for anomalous data, so it can be used to construct an anomaly score.\n",
| 42 | + "\n", |
| 43 | + "In addition, we also consider a second loss for anomalies, $\\mathcal{L}_a^\\theta(\\mathbf{x}) \\equiv \\mathcal{L}_a\\left(f_\\theta(\\mathbf{x})\\right)$ (the feature extractor $f_\\theta(\\mathbf{x})$ is shared). Minimizing this loss on only anomalous data results in low loss values for anomalies and larger values for normal data. The anomaly loss is designed to have the opposite effect of the loss function $\\mathcal{L}_n^\\theta(\\mathbf{x})$. We define $\\mathcal{L}_a^\\theta(\\mathbf{x})=1 /\\left\\|f_\\theta(\\mathbf{x})-\\mathbf{c}\\right\\|^2$, where $\\mathbf{c}$ is a fixed center in feature space, so that minimizing the loss pushes abnormal data away from it.\n",
| 44 | + "\n", |
| 45 | + "Temporarily assuming that all assignment variables $\\mathbf{y}$ were known, consider the joint loss function, **(EQ.1)**\n", |
| 46 | + "$$\n", |
| 47 | + "\\mathcal{L}(\\theta, \\mathbf{y})=\\sum_{i=1}^N\\left(1-y_i\\right) \\mathcal{L}_n^\\theta\\left(\\mathbf{x}_i\\right)+y_i \\mathcal{L}_a^\\theta\\left(\\mathbf{x}_i\\right) \n", |
| 48 | + "$$" |
| 49 | + ] |
| 50 | + }, |
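Given per-sample values of the two losses and the assignments $\mathbf{y}$, Eq. (1) is straightforward to evaluate. The NumPy sketch below is a toy illustration; the per-sample loss values are made up, not produced by a real model.

```python
import numpy as np

def joint_loss(loss_n, loss_a, y):
    # Eq. (1): sum over i of (1 - y_i) * L_n(x_i) + y_i * L_a(x_i).
    # y_i = 1 means sample i is treated as an anomaly.
    return np.sum((1 - y) * loss_n + y * loss_a)

# Made-up per-sample losses for three points; the middle one looks anomalous
# (high normal-loss, low anomaly-loss).
loss_n = np.array([0.1, 2.0, 0.2])
loss_a = np.array([3.0, 0.3, 2.5])
y = np.array([0.0, 1.0, 0.0])  # assign the middle point the anomaly label

print(joint_loss(loss_n, loss_a, y))  # ≈ 0.6  (0.1 + 0.3 + 0.2)
```

Note that assigning $y_i$ to match each point's smaller loss is exactly what minimizing Eq. (1) over $\mathbf{y}$ does, which motivates the label-inference step described next.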
| 51 | + { |
| 52 | + "cell_type": "markdown", |
| 53 | + "id": "extra-wayne", |
| 54 | + "metadata": {}, |
| 55 | + "source": [ |
| 56 | + "This equation resembles the log-likelihood of a probabilistic mixture model, but note that $\\mathcal{L}_n^\\theta\\left(\\mathbf{x}_i\\right)$ and $\\mathcal{L}_a^\\theta\\left(\\mathbf{x}_i\\right)$ are not necessarily data log-likelihoods; rather, self-supervised auxiliary losses can be used and often perform better in practice.\n", |
| 57 | + "\n", |
| 58 | + "\n", |
| 59 | + "\"Hard\" Latent Outlier Exposure $\\left(\\mathbf{L O E}_H\\right)$. In LOE, we seek to optimize the losses' shared parameters $\\theta$ while also optimizing the most likely assignment variables $y_i$. Due to our assumption of a fixed rate of anomalies $\\alpha$ in the training data, we introduce a constraint set: **(EQ. 2)**\n",
| 60 | + "$$\n", |
| 61 | + "\\mathcal{Y}=\\left\\{\\mathbf{y} \\in\\{0,1\\}^N: \\sum_{i=1}^N y_i=\\alpha N\\right\\} .\n", |
| 62 | + "$$\n", |
| 63 | + "The set describes a \"hard\" label assignment, hence the name \"Hard LOE\", which is the default version of the LOE approach. Note that we require $\\alpha N$ to be an integer.\n",
| 64 | + "\n", |
| 65 | + "Since our goal is to use the losses $\\mathcal{L}_n^\\theta$ and $\\mathcal{L}_a^\\theta$ to identify and score anomalies, we seek $\\mathcal{L}_n^\\theta\\left(\\mathbf{x}_i\\right)-\\mathcal{L}_a^\\theta\\left(\\mathbf{x}_i\\right)$ to be large for anomalies, and $\\mathcal{L}_a^\\theta\\left(\\mathbf{x}_i\\right)-\\mathcal{L}_n^\\theta\\left(\\mathbf{x}_i\\right)$ to be large for normal data. Assuming these losses to be optimized over $\\theta$, our best guess to identify anomalies is to minimize Eq. (1) over the assignment variables $\\mathbf{y}$. Combining this with the constraint (Eq. (2)) yields the following minimization problem:\n", |
| 66 | + "$$\n", |
| 67 | + "\\min _\\theta \\min _{\\mathbf{y} \\in \\mathcal{Y}} \\mathcal{L}(\\theta, \\mathbf{y})\n", |
| 68 | + "$$\n", |
| 69 | + "In the following, we describe an efficient optimization procedure for this constrained optimization problem."
| 70 | + ] |
| 71 | + }, |
| 72 | + { |
| 73 | + "cell_type": "markdown", |
| 74 | + "id": "imposed-organization", |
| 75 | + "metadata": {}, |
| 76 | + "source": [ |
| 77 | + "To this end, we consider a sequence of parameters $\\theta^t$ and labels $\\mathbf{y}^t$ and proceed with alternating updates. To update $\\theta$, we simply fix $\\mathbf{y}^t$ and minimize $\\mathcal{L}\\left(\\theta, \\mathbf{y}^t\\right)$ over $\\theta$. In practice, we perform a single gradient step (or stochastic gradient step, see below), yielding a partial update.\n", |
| 78 | + "\n", |
| 79 | + "To update $\\mathbf{y}$ given $\\theta^t$, we minimize the same function subject to the constraint (Eq. (2)). To this end, we define training anomaly scores,\n",
| 80 | + "$$\n", |
| 81 | + "S_i^{\\text {train }}=\\mathcal{L}_n^\\theta\\left(\\mathbf{x}_i\\right)-\\mathcal{L}_a^\\theta\\left(\\mathbf{x}_i\\right) .\n", |
| 82 | + "$$\n", |
| 83 | + "These scores quantify the effect of $y_i$ on minimizing Eq. (1). We rank the scores and assign the labels $y_i$ in the $(1-\\alpha)$-quantile (the lowest scores) the value 0, and the remainder the value 1. This minimizes the loss function subject to the label constraint.\n"
| 84 | + ] |
| 85 | + }, |
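The label-update step above reduces to ranking: give $y_i = 1$ to the $\alpha N$ largest training scores $S_i^{\text{train}}$ and $y_i = 0$ to the rest. A minimal NumPy sketch (the score values are made up for illustration):

```python
import numpy as np

def assign_labels(scores, alpha):
    # Hard LOE label update: y_i = 1 for the alpha*N largest training
    # anomaly scores S_i = L_n(x_i) - L_a(x_i), and y_i = 0 otherwise.
    n = len(scores)
    n_anom = int(round(alpha * n))  # alpha * N is assumed to be an integer
    y = np.zeros(n, dtype=int)
    if n_anom > 0:
        # argsort is ascending, so the last n_anom indices hold the top scores
        y[np.argsort(scores)[-n_anom:]] = 1
    return y

scores = np.array([0.2, 3.1, -0.5, 2.7, 0.1, -1.0, 4.2, 0.3, 0.0, 1.5])
print(assign_labels(scores, alpha=0.2))  # [0 1 0 0 0 0 1 0 0 0]
```

With $\alpha = 0.2$ and $N = 10$, the two largest scores (4.2 and 3.1) receive the anomaly label; in training this step would alternate with a (stochastic) gradient step on $\theta$.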
| 86 | + { |
| 87 | + "cell_type": "markdown", |
| 88 | + "id": "statistical-windsor", |
| 89 | + "metadata": {}, |
| 90 | + "source": [ |
| 91 | + "## Loss Function\n", |
| 92 | + "\n", |
| 93 | + "Rather than using hand-crafted transformations, we utilize the NTL (Neural Transformation Learning) method. NTL learns $K$ neural transformations $\\left\\{T_{\\theta, 1}, \\ldots, T_{\\theta, K}\\right\\}$ and an encoder $f_\\theta$, parameterized by $\\theta$, from data, and uses the learned transformations to detect anomalies. Each neural transformation generates a view $\\mathbf{x}_k=T_{\\theta, k}(\\mathbf{x})$ of sample $\\mathbf{x}$. For normal samples, NTL encourages each transformed view to be similar to the original sample and dissimilar from the other transformed views.\n",
| 94 | + "\n", |
| 95 | + "To achieve this objective, NTL maximizes the normalized probability $p_k=h\\left(\\mathbf{x}_k, \\mathbf{x}\\right) /\\left(h\\left(\\mathbf{x}_k, \\mathbf{x}\\right)+\\sum_{l \\neq k} h\\left(\\mathbf{x}_k, \\mathbf{x}_l\\right)\\right)$ for each view where $h(\\mathbf{a}, \\mathbf{b})=\\exp \\left(\\cos \\left(f_\\theta(\\mathbf{a}), f_\\theta(\\mathbf{b})\\right) / \\tau\\right)$ measures the similarity of two views. For anomalies, we \"flip\" the objective for normal samples: the model instead pulls the transformations close to each other and pushes them away from the original view, resulting in\n", |
| 96 | + "$$\n", |
| 97 | + "\\mathcal{L}_n^\\theta(\\mathbf{x}):=-\\sum_{k=1}^K \\log p_k, \\quad \\mathcal{L}_a^\\theta(\\mathbf{x}):=-\\sum_{k=1}^K \\log \\left(1-p_k\\right) .\n", |
| 98 | + "$$\n", |
| 99 | + "\n", |
| 100 | + "The below image shows the NTL method:\n", |
| 101 | + "\n", |
| 102 | + "<center>\n", |
| 103 | + "<img src=\"https://drive.google.com/uc?export=view&id=1BuWMaMmZiKDFcfTnSciampCG5n8YSiU_\" width=\"750\" align=\"center\">\n",
| 104 | + "</center>\n", |
| 105 | + "\n", |
| 106 | + "As shown in the image, NTL is an end-to-end procedure for self-supervised anomaly detection with learnable neural transformations. Each sample is transformed by a set of learned transformations and then embedded into a semantic space. The transformations and the encoder are trained jointly on a contrastive objective, which is also used to score anomalies.\n", |
| 107 | + "\n", |
| 108 | + "\n" |
11 | 109 | ] |
12 | 110 | }, |
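The NTL objectives $\mathcal{L}_n^\theta$ and $\mathcal{L}_a^\theta$ can be sketched per sample. This is a toy illustration, not the actual implementation: the random linear "transformations", identity "encoder", and temperature value are placeholder assumptions, whereas real NTL trains learned networks for both.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def ntl_losses(x, transforms, encoder, tau=0.1):
    # Per-sample NTL losses. `transforms` and `encoder` are plain callables
    # here, standing in for the learned T_{theta,k} and f_theta.
    views = [t(x) for t in transforms]           # x_k = T_k(x)
    z = [encoder(v) for v in views]              # embedded views
    z0 = encoder(x)                              # embedded original sample
    h = lambda a, b: np.exp(cosine(a, b) / tau)  # similarity h(a, b)

    loss_n = loss_a = 0.0
    for k in range(len(z)):
        num = h(z[k], z0)
        den = num + sum(h(z[k], z[l]) for l in range(len(z)) if l != k)
        p_k = num / den                # normalized probability for view k
        loss_n += -np.log(p_k)         # normal objective: views close to x
        loss_a += -np.log(1.0 - p_k)   # "flipped" objective for anomalies
    return loss_n, loss_a

# Toy usage: K = 4 random linear transformations, identity encoder.
rng = np.random.default_rng(0)
transforms = [lambda x, W=rng.normal(size=(2, 2)): W @ x for _ in range(4)]
ln, la = ntl_losses(rng.normal(size=2), transforms, encoder=lambda v: v)
```

Since each $p_k \in (0, 1)$, both losses are positive; during LOE training, `loss_n` and `loss_a` supply the per-sample terms of Eq. (1) and the training score $S_i^{\text{train}} = \mathcal{L}_n^\theta(\mathbf{x}_i) - \mathcal{L}_a^\theta(\mathbf{x}_i)$.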
13 | 111 | { |
14 | 112 | "cell_type": "markdown", |
15 | 113 | "id": "acknowledged-million", |
16 | 114 | "metadata": {}, |
17 | 115 | "source": [ |
| 116 | + "# Implementation \n", |
18 | 117 | "## Loading libraries\n", |
19 | 118 | "\n", |
20 | 119 | "The configuration is loaded from the \"config_files\" directory, and you can find more detailed information there. For this task, we are utilizing the [thyroid dataset](https://odds.cs.stonybrook.edu/thyroid-disease-dataset/), which should be present in the \"DATA\" folder." |
|
1612 | 1711 | ], |
1613 | 1712 | "metadata": { |
1614 | 1713 | "kernelspec": { |
1615 | | - "display_name": "ssl_env", |
| 1714 | + "display_name": "ssl", |
1616 | 1715 | "language": "python", |
1617 | 1716 | "name": "venv" |
1618 | 1717 | }, |
|