[GPU] Building RKE2 and Testing a GPU on Rocky Linux 9.4 - Preparing the RKE2 Environment
Hello.
This is 꿈꾸는여행자.
I recently had to install a GPU driver, so I would like to cover that topic here.
Since I do not have a high-end GPU card, the test environment uses the discrete graphics built into a ThinkPad P53.
The scenario is as follows: install the Nvidia GPU driver on Rocky Linux 9.4, build an RKE2 environment on top of it, deploy the GPU Operator into that RKE2 cluster, and verify that the GPU is actually being used by running ollama.
This walkthrough assumes that Rocky Linux 9.4 is already installed.
This installment covers the requirements check for the RKE2 environment.
The details are as follows.
Thank you.
________________
Table of Contents
I. Overview
1. Configuration
II. RKE2
1. Installation
1.2. Requirements
1.2.1. Prerequisites
1.2.2. Operating Systems
1.2.2.1. Linux
1.2.2.1.1. RKE2 v1.30
1.2.3. Hardware
1.2.3.1. Linux/Windows
1.2.3.2. VM Sizing Guide
1.2.3.2.1. Disks
1.2.4. Networking
1.2.4.1. Inbound Network Rules
1.2.4.2. CNI Specific Inbound Network Rules
1.2.4.2.1. Cilium
1.2.4.3. Windows Specific Inbound Network Rules
________________
I. Overview
1. Configuration
* Cluster version: RKE2 v1.30.5
* Cluster layout
   * Master (Control Plane): 1 node
   * Worker Nodes: 3 nodes
* Installed software
   * GPU Operator: used to set up the GPU nodes
________________
II. RKE2
1. Installation
1.2. Requirements
RKE2 is very lightweight, but has some minimum requirements as outlined below.
1.2.1. Prerequisites
Two rke2 nodes cannot have the same node name. By default, the node name is taken from the machine's hostname.
If two or more of your machines have the same hostname, you must do one of the following:
* Update the hostname to a unique value
* Set the node-name parameter in the config file to a unique value
* Set the with-node-id parameter in the config file to true to append a randomly generated ID number to the hostname.
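As a hedged sketch of the second and third options above: RKE2 reads its config from /etc/rancher/rke2/config.yaml, so a unique node-name (or with-node-id) can be written there. The node name rke2-worker-01 is illustrative, and CONF_DIR stands in for the real path so the sketch runs unprivileged.

```shell
# Sketch: give this node a unique name via the RKE2 config file.
# CONF_DIR stands in for /etc/rancher/rke2 so the sketch runs unprivileged.
CONF_DIR="${CONF_DIR:-/tmp/rke2-demo}"
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/config.yaml" <<'EOF'
# /etc/rancher/rke2/config.yaml
node-name: rke2-worker-01
# or, to append a random ID to the hostname instead:
# with-node-id: true
EOF
cat "$CONF_DIR/config.yaml"
```

RKE2 picks the file up on the next service start, so set this before joining the node to the cluster.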
1.2.2. Operating Systems
1.2.2.1. Linux
See the RKE2 Support Matrix for all the OS versions that have been validated with RKE2. In general, RKE2 should work on any Linux distribution that uses systemd and iptables.
1.2.2.1.1. RKE2 v1.30
This matrix is revised as of v1.30.5+rke2r1
https://www.suse.com/suse-rke2/support-matrix/all-supported-versions/rke2-v1-30/
| OS | OS Version |
| :---- | :---- |
| SLES | 15 SP6 |
| | 15 SP5 |
| | 15 SP4 |
| | 15 SP3 |
| SLE Micro | 6.0 |
| | 5.5 |
| | 5.4 |
| | 5.3 |
| OpenSUSE Leap | 15.6 |
| | 15.5 |
| | 15.4 |
| | 15.3 |
| SUSE Liberty | 8.9 |
| Oracle Linux | 9.4 |
| | 9.3 |
| | 9.2 |
| | 8.10 |
| | 8.9 |
| | 8.8 |
| RHEL | 9.4 |
| | 9.3 |
| | 9.2 |
| | 9.1 |
| | 8.10 |
| | 8.9 |
| | 8.8 |
| | 8.7 |
| Rocky Linux | 9.4 |
| | 9.3 |
| | 9.2 |
| | 9.1 |
| | 8.10 |
| | 8.9 |
| | 8.8 |
| | 8.7 |
| Ubuntu | 24.04 |
| | 22.04 |
| | 20.04 |
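As a quick, hedged sketch, a host can be compared against the matrix above using the ID and VERSION_ID fields from /etc/os-release. The check_supported helper is mine, and only an excerpt of the matrix rows is encoded here.

```shell
# Sketch: compare an os-release "ID-VERSION_ID" pair against a few rows of the
# RKE2 v1.30 support matrix above. Only an excerpt of the matrix is encoded.
check_supported() {
  case "$1" in
    rocky-9.[1-4]|rocky-8.10|rhel-9.[1-4]|rhel-8.10|ubuntu-2[024].04)
      echo "supported" ;;
    *)
      echo "not in this excerpt - check the full matrix" ;;
  esac
}
# On a real host: . /etc/os-release; check_supported "$ID-$VERSION_ID"
check_supported "rocky-9.4"
```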
K8s Components
* Kubernetes: v1.30.5
* RKE2 Version: v1.30.5+rke2r1
* Etcd: v3.5.13-k3s1
* Containerd: v1.7.21-k3s1
* Runc: v1.1.14
* Metrics-server: v0.7.1
* CoreDNS: v1.11.1
* Ingress-Nginx: v1.10.4-hardened3
* Helm-controller: v0.16.4
* CNI: Canal (Flannel: v0.25.6, Calico: v3.28.1)
* CNI: Calico v3.28.1
* CNI: Cilium v1.16.1
* CNI: Multus v4.1.0
Architecture
* x86_64
* arm64 (experimental)
1.2.3. Hardware
Hardware requirements scale based on the size of your deployments. Minimum recommendations are outlined here.
1.2.3.1. Linux/Windows
* RAM: 4GB Minimum (we recommend at least 8GB)
* CPU: 2 Minimum (we recommend at least 4CPU)
1.2.3.2. VM Sizing Guide
When CPU and RAM on the control-plane + etcd nodes are limited, the number of agent nodes that can be joined under standard workload conditions may be constrained.
| Server CPU | Server RAM | Number of Agents |
| :---- | :---- | :---- |
| 2 | 4 GB | 0-225 |
| 4 | 8 GB | 226-450 |
| 8 | 16 GB | 451-1300 |
| 16+ | 32 GB | 1300+ |
It is recommended to join agent nodes in batches of 50 or less to allow the CPU to free up space, as there is a spike on node join. Remember to modify the default cluster-cidr if desiring more than 255 nodes!
This data was retrieved under specific test conditions. It will vary depending upon environment and workloads. The steps below give an overview of the test that was run to retrieve this. It was last performed on v1.27.4+rke2r1. All of the machines were provisioned in AWS with standard 20 GiB gp3 volumes.
1. Monitor resources on grafana using prometheus data source.
2. Deploy workloads in such a way to simulate continuous cluster activity:
* A basic workload that scales up and down continuously
* A workload that is deleted and recreated in a loop
* A constant workload that contains multiple other resources including CRDs.
3. Join agent nodes in batches of 30-50 at a time.
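On the cluster-cidr note above: RKE2's default pod CIDR is 10.42.0.0/16 and each node is allocated a /24 from it, which caps the cluster at 256 node subnets. A hedged sketch of the server-side change follows; the /15 value is illustrative and must be sized to your target node count.

```shell
# Sketch: a widened cluster-cidr for clusters beyond ~255 nodes.
# Must be set on server nodes before the cluster is first started.
CFG=$(cat <<'EOF'
# /etc/rancher/rke2/config.yaml (server nodes)
cluster-cidr: 10.42.0.0/15
EOF
)
echo "$CFG"
```

Changing cluster-cidr after the cluster is running is disruptive, so decide this before the first server start.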
1.2.3.2.1. Disks
RKE2 performance depends on the performance of its database. Since RKE2 runs etcd embedded and stores its data directory on disk, we recommend using an SSD when possible to ensure optimal performance.
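A rough way to sanity-check the disk backing the etcd data directory is a synchronous-write dd run, which is similar in shape to etcd's WAL fsyncs. This is only a quick sketch with illustrative path and sizes; fio with fdatasync is the more rigorous tool. Compare the reported throughput between candidate disks.

```shell
# Rough sketch: time 200 synchronous 8 KiB writes (oflag=dsync forces a sync per write).
RESULT=$(dd if=/dev/zero of=/tmp/etcd-disk-test bs=8k count=200 oflag=dsync 2>&1 | tail -n 1)
echo "$RESULT"
rm -f /tmp/etcd-disk-test
```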
1.2.4. Networking
* Important
   * If your node has NetworkManager installed and enabled, ensure that it is configured to ignore CNI-managed interfaces.
   * If your node has Wicked installed and enabled, ensure that the forwarding sysctl config is enabled.
The RKE2 server needs ports 6443 and 9345 to be accessible by other nodes in the cluster.
All nodes need to be able to reach other nodes over UDP port 8472 when Flannel VXLAN is used.
If you wish to utilize the metrics server, you will need to open port 10250 on each node.
Important: The VXLAN port on nodes should not be exposed to the world as it opens up your cluster network to be accessed by anyone. Run your nodes behind a firewall/security group that disables access to port 8472.
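On firewalld, one way to follow this advice is a rich rule that accepts UDP 8472 only from the node subnet. A hedged sketch follows: SUBNET is illustrative and must be replaced with the network your RKE2 nodes actually share; the rule is printed rather than executed so it can be reviewed first.

```shell
# Sketch: build a firewalld rich rule limiting VXLAN (UDP 8472) to the node subnet.
# SUBNET is illustrative; substitute the network your RKE2 nodes share.
SUBNET="10.0.0.0/24"
RULE="rule family=\"ipv4\" source address=\"$SUBNET\" port protocol=\"udp\" port=\"8472\" accept"
echo "$RULE"
# Apply with (root required):
#   firewall-cmd --permanent --add-rich-rule="$RULE" && firewall-cmd --reload
```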
1.2.4.1. Inbound Network Rules
| Port | Protocol | Source | Destination | Description |
| :---- | :---- | :---- | :---- | :---- |
| 6443 | TCP | RKE2 agent nodes | RKE2 server nodes | Kubernetes API |
| 9345 | TCP | RKE2 agent nodes | RKE2 server nodes | RKE2 supervisor API |
| 10250 | TCP | All RKE2 nodes | All RKE2 nodes | kubelet metrics |
| 2379 | TCP | RKE2 server nodes | RKE2 server nodes | etcd client port |
| 2380 | TCP | RKE2 server nodes | RKE2 server nodes | etcd peer port |
| 2381 | TCP | RKE2 server nodes | RKE2 server nodes | etcd metrics port |
| 30000-32767 | TCP | All RKE2 nodes | All RKE2 nodes | NodePort port range |
sudo firewall-cmd --add-port=6443/tcp --permanent
sudo firewall-cmd --add-port=9345/tcp --permanent
sudo firewall-cmd --add-port=10250/tcp --permanent
sudo firewall-cmd --add-port=2379-2381/tcp --permanent
sudo firewall-cmd --add-port=30000-32767/tcp --permanent
sudo firewall-cmd --reload
sudo firewall-cmd --list-all
[root@host 20241017_RKE2]# sudo firewall-cmd --add-port=6443/tcp --permanent
sudo firewall-cmd --add-port=9345/tcp --permanent
sudo firewall-cmd --add-port=10250/tcp --permanent
sudo firewall-cmd --add-port=2379-2381/tcp --permanent
sudo firewall-cmd --add-port=30000-32767/tcp --permanent
success
success
success
success
success
[root@host 20241017_RKE2]# sudo firewall-cmd --reload
success
[root@host 20241017_RKE2]# sudo firewall-cmd --list-all
public (active)
target: default
icmp-block-inversion: no
interfaces: wlp82s0
sources:
services: cockpit dhcpv6-client ssh
ports: 5931/tcp 5932/tcp 5933/tcp 6443/tcp 9345/tcp 10250/tcp 2379-2381/tcp 30000-32767/tcp
protocols:
forward: yes
masquerade: yes
forward-ports:
source-ports:
icmp-blocks:
rich rules:
[root@host 20241017_RKE2]#
1.2.4.2. CNI Specific Inbound Network Rules
1.2.4.2.1. Cilium
| Port | Protocol | Source | Destination | Description |
| :---- | :---- | :---- | :---- | :---- |
| 8/0 | ICMP | All RKE2 nodes | All RKE2 nodes | Cilium CNI health checks |
| 4240 | TCP | All RKE2 nodes | All RKE2 nodes | Cilium CNI health checks |
| 8472 | UDP | All RKE2 nodes | All RKE2 nodes | Cilium CNI with VXLAN |
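The TCP/UDP rows above can be opened with firewall-cmd much like the base ports earlier. As a hedged sketch, the commands are emitted rather than executed so they can be reviewed before running them as root; the ICMP health-check row is not covered here.

```shell
# Sketch: emit firewall-cmd invocations for the Cilium ports in the table above.
CMDS=$(
  for spec in 4240/tcp 8472/udp; do
    echo "firewall-cmd --permanent --add-port=$spec"
  done
  echo "firewall-cmd --reload"
)
echo "$CMDS"
# Review the output, then run it as root (e.g. pipe to sh) to apply.
```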
1.2.4.3. Windows Specific Inbound Network Rules
| Port | Protocol | Source | Destination | Description |
| :---- | :---- | :---- | :---- | :---- |
| 4789 | UDP | All RKE2 nodes | All RKE2 nodes | Required for Calico and Flannel VXLAN |
| 179 | TCP | All RKE2 nodes | All RKE2 nodes | Calico CNI with BGP |
Typically, all outbound traffic will be allowed.