Dimensionality Reduction Algorithm Principles
A brief introduction to two practical dimensionality reduction methods: Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA).
Background
If a dataset contains too many features, two problems arise: computation becomes more burdensome, and problematic features may adversely affect the results. Dimensionality reduction is a data processing method frequently used in machine learning; typically, some mapping is applied to project data points from the original high-dimensional space into a lower-dimensional space. This chapter introduces two classic dimensionality reduction algorithms, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), from both theoretical and practical perspectives.
Linear Discriminant Analysis (LDA)
Principle
Assume there are two classes of data points, as shown in Figure 15-2. Since each data point has two features, we need to reduce the data to one dimension; in other words, we need to find the most suitable projection direction onto which to map all of the points. Figures (a) and (b) show the results of two candidate projection directions. Which one is better?
Looking at the projection results, some of the data points in Figure (a) are still mixed together after projection, so the separation between the two classes could be improved. In Figure (b), the projected points are not mixed, and the classes are separated more clearly than in Figure (a). Naturally, we would choose the projection shown in Figure (b). This shows that dimensionality reduction must not only compress the data's features but also find the most suitable direction, so that the compressed data remains as useful as possible.
To carry out the dimensionality reduction task well and achieve the goal of making the two classes of data points as distinguishable as possible without mixing them together, two objectives are proposed:
Objective 1: For data points of different classes, we want them to be as far apart as possible after projection. (Inter-class dispersion)
Objective 2: For data points of the same class, we want them to be as concentrated as possible, each lying as close to its class center as possible. (Intra-class compactness)
The next task is to achieve these two objectives, which together form the core optimization goal of Linear Discriminant Analysis. The dimensionality reduction task is to find a projection direction that satisfies both objectives at the same time.
Solving and Calculation
First, we need to define the notion of distance. For example, how do we quantify how tightly the points of a class cluster together? Here the mean of a class's data points is used to represent its center; if every point lies close to this center, the points are tightly clustered. The class center is calculated as follows:
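In the standard notation (assuming class $\omega_i$ contains $N_i$ samples), the class mean is:

$$\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$$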
In dimensionality reduction, however, we care less about how the points cluster in the original dataset than about how they cluster after the reduction. For a projection direction $w$, the projected class center is therefore:
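Under the same assumptions, projecting each point $x$ to the scalar $w^T x$ gives the projected class center:

$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i$$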
We can also use another metric, the scatter, which measures how dispersed the samples of one class are after projection. It is defined as follows:
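With $Y_i$ denoting the set of projected samples of class $\omega_i$ (notation assumed here), the scatter of class $i$ is:

$$\tilde{s}_i^2 = \sum_{y \in Y_i} \left(y - \tilde{\mu}_i\right)^2$$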
There are two optimization objectives, so how can we combine them into one? Since we want to maximize the distance between the centers of different classes, we place that term in the numerator; since we want to minimize the scatter within each class, we place that term in the denominator. The overall objective $J(w)$ is then a single ratio to be maximized:
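For the two-class case this gives:

$$J(w) = \frac{\left|\tilde{\mu}_1 - \tilde{\mu}_2\right|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$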
To rewrite this ratio in matrix form, we first need to compute the within-class scatter matrix and the between-class scatter matrix.
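In the standard two-class notation these are:

$$S_W = \sum_{i=1}^{2} \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T, \qquad S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$$

so that the objective can be written as:

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$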
Then, applying the method of Lagrange multipliers to this objective, we obtain:
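Fixing the scale of $w$ with the constraint $w^T S_W w = 1$ and maximizing $w^T S_B w$ with a Lagrange multiplier $\lambda$ yields:

$$S_B w = \lambda S_W w \quad\Longrightarrow\quad S_W^{-1} S_B w = \lambda w$$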
Looking at this equation, it closely resembles an eigenvector equation from linear algebra: if we treat $S_W^{-1} S_B$ as a single matrix, then $w$ is one of its eigenvectors, and the problem is essentially solved. In Linear Discriminant Analysis, we only need to compute the within-class and between-class scatter matrices, take the eigenvectors of $S_W^{-1} S_B$ to obtain the projection directions, and then apply the corresponding matrix transformation to the data to complete the dimensionality reduction.
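As an illustration of this eigen-decomposition route, here is a minimal NumPy sketch (not the code from the original post). It uses the multi-class form of the between-class scatter, measured against the overall mean, so that it also applies to the three-class Iris data used below; the function and variable names are illustrative.

import numpy as np

def lda_projection(X, y, n_components=2):
    # X: (n_samples, n_features) array; y: integer class labels
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    S_W = np.zeros((n_features, n_features))  # within-class scatter
    S_B = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(X_c) * (diff @ diff.T)
    # Eigenvectors of S_W^{-1} S_B give the projection directions;
    # keep the ones with the largest eigenvalues.
    eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eig_vals.real)[::-1]
    W = eig_vecs[:, order[:n_components]].real
    return X @ W

Calling lda_projection(X, y) on the Iris features defined below should produce a (150, 2) array comparable, up to sign and scaling, to the scikit-learn result.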
An important note: according to the principle of LDA, the number of projection dimensions n_components cannot exceed min(n_features, n_classes - 1).
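As a quick sanity check for the Iris data used below (4 features, 3 classes):

$$\min(n\_features,\ n\_classes - 1) = \min(4,\ 3 - 1) = 2$$

so at most two discriminant components can be requested.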
Code
Next, we will use Linear Discriminant Analysis to perform dimensionality reduction on the classic Iris dataset. The dataset contains 150 samples spanning 3 classes, with 50 samples each of Setosa, Versicolor, and Virginica. Each sample has 4 features: sepal length (in cm), sepal width, petal length, and petal width. We now need to reduce these four-dimensional features to two dimensions.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

feature_dict = {i: label for i, label in zip(
    range(4),
    ('sepal length in cm',
     'sepal width in cm',
     'petal length in cm',
     'petal width in cm'))}
label_dict = {i: label for i, label in zip(
    range(1, 4),
    ('Setosa',
     'Versicolor',
     'Virginica'))}

# Load the Iris dataset from the UCI repository
df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',')
# The raw file has no header row, so name the columns explicitly
df.columns = [label for i, label in sorted(feature_dict.items())] + ['class label']
df.dropna(how='all', inplace=True)  # drop the trailing empty line in the file

X = df[['sepal length in cm', 'sepal width in cm',
        'petal length in cm', 'petal width in cm']].values
y = df['class label'].values

# Encode the string class labels as integers 1, 2, 3 (matching label_dict)
enc = LabelEncoder()
y = enc.fit_transform(y) + 1
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Use scikit-learn's LDA implementation to project to two dimensions
sklearn_lda = LDA(n_components=2)
X_lda_sklearn = sklearn_lda.fit_transform(X, y)

# Check the result
print(X_lda_sklearn.shape)
print("explained_variance_ratio_:", sklearn_lda.explained_variance_ratio_)
print("scalings_ (projection directions):", sklearn_lda.scalings_)
We can now see that the data has been reduced from shape (150, 4) to (150, 2), which completes the dimensionality reduction. Next, let's compare the result with the original data. For easier visualization, we select two of the original four dimensions at random and plot them alongside the projected data:
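A minimal plotting sketch of that comparison, assuming matplotlib is available and reusing the X, y, X_lda_sklearn, and label_dict variables defined above; the two original features chosen here (sepal length and sepal width) are an arbitrary illustration rather than the pair used in the original post.

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for label in (1, 2, 3):
    # Left panel: two of the original four features
    ax1.scatter(X[y == label, 0], X[y == label, 1],
                alpha=0.6, label=label_dict[label])
    # Right panel: the two LDA components
    ax2.scatter(X_lda_sklearn[y == label, 0], X_lda_sklearn[y == label, 1],
                alpha=0.6, label=label_dict[label])
ax1.set_xlabel('sepal length in cm')
ax1.set_ylabel('sepal width in cm')
ax1.set_title('Two original features')
ax2.set_xlabel('LD1')
ax2.set_ylabel('LD2')
ax2.set_title('After LDA projection')
ax1.legend()
ax2.legend()
plt.tight_layout()
plt.show()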