Deploying a Real-Time Lakehouse Big Data Environment Using Docker
Using Docker to Set Up an HDFS + Flink + Paimon Unified Streaming and Batch Environment for Data Querying, with StarRocks Added as a Query Engine
Overview
When developing Flink data applications, a relatively reliable cluster environment is often required. Previously, I downloaded, installed, and ran components such as the HDFS NameNode, HDFS DataNode, and Flink separately on my Mac, but the dependencies were quite complex, and I only discovered how hard the setup was to reproduce when migrating to another machine. So I turned to Docker Compose (which turned out not to be simple either). This post documents how to install HDFS, Flink, and other components with Docker and get them talking to each other.
1. Installing Docker
First, install Docker. I chose the desktop application, which provides a graphical interface and bundles the relevant Docker command-line tools. After installation, allocate some CPU and memory resources to it.
Once installed, use docker ps to check the running containers. The screenshot below shows the containers on my machine after Flink and HDFS were installed.
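For example, to list just the container names, images, and status (the container names shown in the comments below are the ones used in the compose files later in this post):

```bash
# List running containers with their image and status
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}"
```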
2. Installing HDFS
Most docker-compose.yml files found online or suggested by GPT have problems, such as configuration values that do not match across the config files, or the DataNode failing to come up after startup. Here I provide a template that works.
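For reference, a minimal sketch of what such a compose file can look like, assuming the commonly used bde2020 Hadoop images; the image tags, ports, and environment variables here are illustrative and may differ from the actual template:

```yaml
# Sketch only -- images, tags, ports and env vars are assumptions, adjust to your setup.
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
    container_name: hdfs-namenode
    environment:
      - CLUSTER_NAME=demo
      - CORE_CONF_fs_defaultFS=hdfs://hdfs-namenode:9000
    ports:
      - "9870:9870"   # NameNode web UI
      - "9000:9000"   # HDFS RPC port, used later by Flink and StarRocks
    volumes:
      - namenode-data:/hadoop/dfs/name

  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
    container_name: hdfs-datanode
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://hdfs-namenode:9000
      - SERVICE_PRECONDITION=hdfs-namenode:9870   # wait for the NameNode UI before starting
    depends_on:
      - namenode
    volumes:
      - datanode-data:/hadoop/dfs/data

volumes:
  namenode-data:
  datanode-data:
```

The detail that broke most of the templates I tried is fs.defaultFS: both containers must point it at the same NameNode address, otherwise the DataNode cannot register with the NameNode.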
After launching with docker compose up -d, first check with docker ps whether both the NameNode and the DataNode started correctly, then use docker logs hdfs-namenode and docker logs hdfs-datanode to look for errors in the logs (the screenshot below, for example, shows the DataNode failing to connect to the NameNode because of an incorrect configuration file).
If there are no errors, use docker exec -it hdfs-namenode /bin/bash to enter the NameNode container and perform some file operations, such as uploading a file.
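For example, a hypothetical session inside the container that creates a small CSV and uploads it (the /data/people.csv path and its contents are made up for illustration; the Flink query later assumes something like it exists):

```bash
# Inside the hdfs-namenode container
echo -e "1,Alice\n2,Bob" > /tmp/people.csv      # create a tiny sample CSV
hdfs dfs -mkdir -p /data                        # create a target directory in HDFS
hdfs dfs -put /tmp/people.csv /data/people.csv  # upload the file
hdfs dfs -cat /data/people.csv                  # read it back to verify
```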
3. Installing Flink
Installing Flink involves configuring both a JobManager and a TaskManager; note that they need a fair amount of memory to start. Flink also frequently needs extra dependency components. Here I take Paimon as an example and add the HDFS + Paimon dependencies to Flink.
First, choose a suitable Flink version (currently 1.20). Then download from Maven the Flink Hadoop dependency matching the HDFS version installed earlier, together with the Paimon dependency jar, and place them in the same directory as docker-compose.yml.
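Concretely, the directory next to docker-compose.yml would then look roughly like this (the artifact names and versions are placeholders; pick the ones from Maven Central that match your Flink 1.20, Hadoop, and Paimon versions):

```bash
ls
# docker-compose.yml
# flink-shaded-hadoop-2-uber-<version>.jar   # Hadoop/HDFS filesystem support for Flink
# paimon-flink-1.20-<version>.jar            # Paimon connector for Flink 1.20
```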
We need to mount these dependency jars into the JobManager (JM) and TaskManager (TM) containers started by Docker, so we add the corresponding volume mounts to the compose file.
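A sketch of the Flink part of the compose file, assuming the official flink:1.20 image; the jar file names and the memory/slot settings are illustrative:

```yaml
# Sketch only -- merged into the same docker-compose.yml as the HDFS services.
services:
  jobmanager:
    image: flink:1.20-scala_2.12
    container_name: flink-jobmanager
    command: jobmanager
    ports:
      - "8081:8081"                 # Flink web UI
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: flink-jobmanager
    volumes:
      # mount the extra dependencies into Flink's lib directory
      - ./paimon-flink-1.20.jar:/opt/flink/lib/paimon-flink-1.20.jar
      - ./flink-shaded-hadoop-uber.jar:/opt/flink/lib/flink-shaded-hadoop-uber.jar

  taskmanager:
    image: flink:1.20-scala_2.12
    container_name: flink-taskmanager
    command: taskmanager
    depends_on:
      - jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: flink-jobmanager
        taskmanager.numberOfTaskSlots: 2
        taskmanager.memory.process.size: 2g
    volumes:
      - ./paimon-flink-1.20.jar:/opt/flink/lib/paimon-flink-1.20.jar
      - ./flink-shaded-hadoop-uber.jar:/opt/flink/lib/flink-shaded-hadoop-uber.jar
```

Mounting the jars into /opt/flink/lib means both the SQL client and the job runtime pick them up without any extra classpath configuration.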
After launching with docker compose up -d, first open port 8081 to check whether the Flink web UI is up.
Next, use docker exec -it flink-jobmanager /bin/bash to enter the JobManager container and start the SQL client with ./bin/sql-client.sh. Run SELECT 1+1; if the result 2 comes back, the JobManager is working and successfully submitting tasks to the TaskManager.
Next, we verify that the Flink nodes can access HDFS. When installing HDFS in the earlier step, we wrote a CSV file and uploaded it to HDFS; now we try to read it back with Flink SQL.
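A sketch of the Flink SQL for this check, assuming the hypothetical /data/people.csv uploaded earlier and the NameNode address from the HDFS compose sketch:

```sql
-- Map the CSV file in HDFS as a table via the filesystem connector
CREATE TABLE people_csv (
  id   INT,
  name STRING
) WITH (
  'connector' = 'filesystem',
  'path'      = 'hdfs://hdfs-namenode:9000/data/people.csv',
  'format'    = 'csv'
);

SELECT * FROM people_csv;
```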
If the data appears, the Flink nodes can access HDFS correctly.
Then we test whether the added Paimon dependency works. Following the example code from the official Paimon documentation, we create a table on HDFS and run an insert job (the HDFS directory needs to be granted to the Flink user in advance).
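The statements below follow the Paimon quick-start example, pointed at an HDFS warehouse path (the warehouse path and table names are illustrative):

```sql
-- Create a Paimon catalog whose warehouse lives on HDFS
CREATE CATALOG paimon_catalog WITH (
  'type'      = 'paimon',
  'warehouse' = 'hdfs://hdfs-namenode:9000/paimon'
);
USE CATALOG paimon_catalog;

-- Target table, as in the official word-count example
CREATE TABLE word_count (
  word STRING PRIMARY KEY NOT ENFORCED,
  cnt  BIGINT
);

-- A datagen source producing a continuous stream of words
CREATE TEMPORARY TABLE word_table (
  word STRING
) WITH (
  'connector' = 'datagen',
  'fields.word.length' = '1'
);

-- Paimon commits on checkpoints, so checkpointing must be enabled
SET 'execution.checkpointing.interval' = '10 s';

INSERT INTO word_count
SELECT word, COUNT(*) FROM word_table GROUP BY word;
```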
The job is submitted successfully, and checkpoints complete normally.
Running a SELECT * manually in the Flink SQL client shows the latest results being written into Paimon in real time.
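For the manual check, something along these lines works in the SQL client; batch mode returns a point-in-time snapshot of the Paimon table:

```sql
-- Read the Paimon table as a bounded batch query
SET 'sql-client.execution.result-mode' = 'tableau';
SET 'execution.runtime-mode' = 'batch';
SELECT * FROM word_count;
```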
4. Installing StarRocks
Add a StarRocks service to docker-compose.yml.
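A sketch of the added service, assuming the StarRocks all-in-one quick-start image; the image tag and ports are illustrative and may differ from my setup:

```yaml
# Sketch only -- merged into the same docker-compose.yml as the services above.
services:
  starrocks:
    image: starrocks/allin1-ubuntu:latest
    container_name: starrocks
    ports:
      - "9030:9030"   # FE MySQL-protocol port, used by the mysql client below
      - "8030:8030"   # FE HTTP port
      - "8040:8040"   # BE HTTP port
```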
Start it with docker compose up -d and check with docker ps that the container has started correctly.
Connect to StarRocks with a local MySQL client (note: on macOS, install a MySQL version below 9, e.g. brew install mysql@8.4):
mysql -h127.0.0.1 -uroot -P9030
Following the official Paimon documentation, create a Paimon catalog.
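A sketch of the catalog DDL, with the warehouse path matching the one used on the Flink side (adjust it to your own layout):

```sql
-- External Paimon catalog backed by the HDFS warehouse written by Flink
CREATE EXTERNAL CATALOG paimon_catalog
PROPERTIES (
  "type" = "paimon",
  "paimon.catalog.type" = "filesystem",
  "paimon.catalog.warehouse" = "hdfs://hdfs-namenode:9000/paimon"
);
```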
Switch to that catalog and inspect the metadata.
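For example (Paimon's default database is named default):

```sql
SET CATALOG paimon_catalog;   -- switch the session to the Paimon catalog
SHOW DATABASES;
USE `default`;
SHOW TABLES;                  -- should list the table created from Flink, e.g. word_count
```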
Run a SELECT query; the results match the real-time results seen in the Flink SQL client.
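For instance, querying the hypothetical word_count table from the Paimon example:

```sql
SELECT * FROM word_count ORDER BY cnt DESC LIMIT 10;
```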
In this Flink -> HDFS -> StarRocks setup, the main computational cost of queries is borne by StarRocks, which acts as the distributed SQL execution engine, while HDFS only provides storage and data reads and takes on no computation. Optimizing StarRocks' compute capacity and HDFS read performance will therefore significantly improve query efficiency.