Blog Content

Summary of Machine Learning Study

BlogType : Machine Learning releaseTime : 2024-10-13 15:00:00

After spending some time studying the core theory and practice of machine learning, I want to summarize the key ideas and practical methods I have learned.

Preface

My first impression of machine learning dates back to the data mining and prediction tasks I did in university courses. After I started working, I tried to pick the subject up again, but even with plenty of material available online, the barrier to entry was still high. Many concepts were hard to grasp directly, and the unfamiliar formulas and code alone were enough to make my head spin. Whenever I got stuck while writing code, I had to keep searching the web and piecing explanations together, which was very inefficient. As a result, I never managed to stick with the subject long enough to study it in depth.

Since ChatGPT came out, however, everything has changed. The cost of learning, organizing knowledge, and debugging code dropped sharply, which made it possible for me to start studying this field systematically. Many people feel that you should build a solid Python foundation before moving on, but I think that can take too much time and delay the more important material. Python can be picked up along the way by working through real examples, keeping the focus on the principles and applications of machine learning.

In addition, although today's toolkits are very mature, it is still important to master both the underlying algorithms and their practical application. You should not do something blindly; you need to know why you are doing it. The same goes for toolkits: it is not enough to know how to call them, you should also understand what each parameter does and what each step means within the algorithm. When you hit a problem or something you do not understand, ask ChatGPT and look things up as you go, in the spirit of "tap wherever you get stuck." So do not be afraid of the math, and do not get hung up on every detail either; understanding the concepts is enough.

Review of Learning Path

My overall learning path was as follows:

Step 1: Learned the essential Python toolkits: the scientific computing library NumPy, the data analysis library Pandas, and the visualization library Matplotlib.
Step 2: Studied classic machine learning algorithms, such as regression, decision trees, ensemble methods, support vector machines, and clustering.
Step 3: Studied common deep learning algorithms, including neural networks, convolutional neural networks, and recurrent neural networks.

Each step was paired with hands-on projects in which I applied the frameworks or models I had learned to practical business data, using real datasets.

Outline of Content

Regression

Used to predict continuous-valued variables by modeling the relationships between variables.

    Algorithms
      Linear regression
      Random forest regression
      Neural network regression
    Examples
      Random forest temperature prediction
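
As a minimal sketch of the simplest algorithm in this list, linear regression, the following fits a line with NumPy's least-squares solver. The function names `fit_linear_regression` and `predict` and the synthetic data are assumptions made for illustration; they are not taken from the repository, and the temperature example itself uses a random forest instead.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Return (weights, bias) that minimize squared error via least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the bias is learned as an extra weight.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef[:-1], coef[-1]


def predict(X, weights, bias):
    """Apply the fitted linear model to new rows."""
    return np.asarray(X, dtype=float) @ weights + bias


# Hypothetical noiseless data following y = 2*x + 1.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2.0 * X_train[:, 0] + 1.0

weights, bias = fit_linear_regression(X_train, y_train)
prediction = predict(np.array([[4.0]]), weights, bias)
```

Because the toy data has no noise, the fit should recover a slope near 2 and an intercept near 1; on real data the same call returns the best squared-error fit instead.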

Classification

Used to assign data to predefined categories based on the features of the input.

    Algorithms
      K-nearest neighbors (KNN) classification
      Logistic regression classification
      Random forest classification
      Naive Bayes classification
      Support vector machine classification
      Neural network classification
    Examples
      Credit card fraud detection
      MNIST handwritten digit image classification
      CIFAR-10 image classification
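
To make the idea concrete, here is a rough sketch of K-nearest neighbors classification written directly in NumPy. The function name `knn_predict`, the choice of Euclidean distance, and the toy data are assumptions for illustration only; a library implementation would normally be used on datasets like the ones listed here.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Label each query row by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise Euclidean distances, shape (n_query, n_train).
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    predictions = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        # On a tied vote, argmax picks the smallest label value.
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Hypothetical toy data: two well-separated groups labeled 0 and 1.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

labels = knn_predict(X_train, y_train, np.array([[0.5, 0.5], [5.5, 5.5]]), k=3)
```

The parameter `k` controls how many neighbors vote: a small value follows the training data closely, while a larger value smooths the decision boundary.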

Clustering

Used to group similar data points by similarity, without predefined categories.

    Algorithms
      K-means clustering
      DBSCAN clustering
    Examples
      Beer data clustering
      Analysis of congressional voting data
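
The K-means procedure can be sketched in a few lines of NumPy as an alternation between assigning points and recomputing centroids. The function name `kmeans`, the random initialization from data points, the empty-cluster handling, and the toy points are all assumptions for illustration, not the approach used in the examples listed here.

```python
import numpy as np


def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate point assignment and centroid updates."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: move each centroid to the mean of its points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two compact groups far apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
```

Which cluster receives index 0 or 1 depends on the initialization, so only the grouping is meaningful, not the label values themselves.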

Dimensionality Reduction

Used to reduce the number of features while retaining the important information, which simplifies data analysis and visualization.

    Algorithms
      Principal component analysis
      Linear discriminant analysis
      Matrix decomposition
    Examples
      Dimensionality reduction of the Iris dataset
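
As an illustration of principal component analysis, the sketch below centers the data and projects it onto the leading directions found by a singular value decomposition. The function name `pca` and the tiny 2-D dataset are assumptions for illustration; the Iris example would reduce four features rather than two.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components using SVD."""
    X = np.asarray(X, dtype=float)
    # Center each feature so the components capture variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance along each direction, then the share kept by the top ones.
    explained_variance = (S ** 2) / (len(X) - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return X_centered @ components.T, components, ratio


# Hypothetical 2-D points lying exactly on the line y = 2x.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
X_reduced, components, ratio = pca(X, n_components=1)
```

Since these points lie on a single line, one component should retain essentially all of the variance; on real data the ratio indicates how much information the reduced representation keeps.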

GitHub: https://github.com/luguanxing/MachineLearning-Study

Future Prospects

Looking ahead, as a data engineer who has now learned the principles behind these powerful machine learning algorithms and their corresponding models, I believe I should put them to effective use in real scenarios. Because I previously worked at a brokerage firm and am also a long-time retail investor in the stock market, I may develop my own quantitative model, combining various kinds of data to analyze stock trends quantitatively. I also plan to test it with real money and keep refining and optimizing it based on the results and on the theory I continue to learn, because I believe interest is the best teacher.