[GPU] Building RKE2 and Testing a GPU on Rocky Linux 9.4 - Preparing the RKE2 Environment
Hello.
This is 꿈꾸는여행자.
I recently had to install a GPU driver, so I would like to cover that topic here.
Since I do not have a high-end GPU card, the test environment uses the discrete graphics built into a ThinkPad P53.
The scenario is as follows: install the Nvidia GPU driver on Rocky Linux 9.4, build an RKE2 environment on top of it, deploy the GPU Operator into that RKE2 cluster, and verify that the GPU is actually being used by running ollama.
This walkthrough assumes that Rocky Linux 9.4 is already installed.
This installment covers the requirements check for the RKE2 environment.
The details are as follows.
Thank you.
________________
Table of Contents
I. Overview
1. Configuration
II. RKE2
1. Installation
1.2. Requirements
1.2.1. Prerequisites
1.2.2. Operating Systems
1.2.2.1. Linux
1.2.2.1.1. RKE2 v1.30
1.2.3. Hardware
1.2.3.1. Linux/Windows
1.2.3.2. VM Sizing Guide
1.2.3.2.1. Disks
1.2.4. Networking
1.2.4.1. Inbound Network Rules
1.2.4.2. CNI Specific Inbound Network Rules
1.2.4.2.1. Cilium
1.2.4.3. Windows Specific Inbound Network Rules
________________
I. Overview
1. Configuration
* Cluster version: RKE2 v1.30.5
* Cluster layout
   * Master (Control Plane): 1 node
   * Worker Nodes: 3 nodes
* Installed software
   * GPU Operator: used to set up the GPU nodes
________________
II. RKE2
1. Installation
1.2. Requirements
RKE2 is very lightweight, but has some minimum requirements as outlined below.
1.2.1. Prerequisites
Two rke2 nodes cannot have the same node name. By default, the node name is taken from the machine's hostname.
If two or more of your machines have the same hostname, you must do one of the following:
* Update the hostname to a unique value
* Set the node-name parameter in the config file to a unique value
* Set the with-node-id parameter in the config file to true to append a randomly generated ID number to the hostname.
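As a hedged sketch of the second and third options above: RKE2 reads its config from /etc/rancher/rke2/config.yaml, so a unique node-name (or with-node-id) can be written there. The node name rke2-worker-01 is illustrative, and CONF_DIR stands in for the real path so the sketch runs unprivileged.

```shell
# Sketch: give this node a unique name via the RKE2 config file.
# CONF_DIR stands in for /etc/rancher/rke2 so the sketch runs unprivileged.
CONF_DIR="${CONF_DIR:-/tmp/rke2-demo}"
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/config.yaml" <<'EOF'
# /etc/rancher/rke2/config.yaml
node-name: rke2-worker-01
# or, to append a random ID to the hostname instead:
# with-node-id: true
EOF
cat "$CONF_DIR/config.yaml"
```

RKE2 picks the file up on the next service start, so set this before joining the node to the cluster.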
1.2.2. Operating Systems
1.2.2.1. Linux
See the RKE2 Support Matrix for all the OS versions that have been validated with RKE2. In general, RKE2 should work on any Linux distribution that uses systemd and iptables.
1.2.2.1.1. RKE2 v1.30
This matrix is revised as of v1.30.5+rke2r1
https://www.suse.com/suse-rke2/support-matrix/all-supported-versions/rke2-v1-30/
| OS | OS Version |
| :---- | :---- |
| SLES | 15 SP6 |
| | 15 SP5 |
| | 15 SP4 |
| | 15 SP3 |
| SLE Micro | 6.0 |
| | 5.5 |
| | 5.4 |
| | 5.3 |
| OpenSUSE Leap | 15.6 |
| | 15.5 |
| | 15.4 |
| | 15.3 |
| SUSE Liberty | 8.9 |
| Oracle Linux | 9.4 |
| | 9.3 |
| | 9.2 |
| | 8.10 |
| | 8.9 |
| | 8.8 |
| RHEL | 9.4 |
| | 9.3 |
| | 9.2 |
| | 9.1 |
| | 8.10 |
| | 8.9 |
| | 8.8 |
| | 8.7 |
| Rocky Linux | 9.4 |
| | 9.3 |
| | 9.2 |
| | 9.1 |
| | 8.10 |
| | 8.9 |
| | 8.8 |
| | 8.7 |
| Ubuntu | 24.04 |
| | 22.04 |
| | 20.04 |
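As a quick, hedged sketch, a host can be compared against the matrix above using the ID and VERSION_ID fields from /etc/os-release. The check_supported helper is mine, and only an excerpt of the matrix rows is encoded here.

```shell
# Sketch: compare an os-release "ID-VERSION_ID" pair against a few rows of the
# RKE2 v1.30 support matrix above. Only an excerpt of the matrix is encoded.
check_supported() {
  case "$1" in
    rocky-9.[1-4]|rocky-8.10|rhel-9.[1-4]|rhel-8.10|ubuntu-2[024].04)
      echo "supported" ;;
    *)
      echo "not in this excerpt - check the full matrix" ;;
  esac
}
# On a real host: . /etc/os-release; check_supported "$ID-$VERSION_ID"
check_supported "rocky-9.4"
```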
K8s Components
* Kubernetes: v1.30.5
* RKE2 Version: v1.30.5+rke2r1
* Etcd: v3.5.13-k3s1
* Containerd: v1.7.21-k3s1
* Runc: v1.1.14
* Metrics-server: v0.7.1
* CoreDNS: v1.11.1
* Ingress-Nginx: v1.10.4-hardened3
* Helm-controller: v0.16.4
* CNI: Canal (Flannel: v0.25.6, Calico: v3.28.1)
* CNI: Calico v3.28.1
* CNI: Cilium v1.16.1
* CNI: Multus v4.1.0
Architecture
* x86_64
* arm64 (experimental)
1.2.3. Hardware
Hardware requirements scale based on the size of your deployments. Minimum recommendations are outlined here.
1.2.3.1. Linux/Windows
* RAM: 4GB Minimum (we recommend at least 8GB)
* CPU: 2 Minimum (we recommend at least 4CPU)
1.2.3.2. VM Sizing Guide
When CPU and RAM on the control-plane + etcd nodes are limited, the number of agent nodes that can be joined under standard workload conditions may be constrained.
| Server CPU | Server RAM | Number of Agents |
| :---- | :---- | :---- |
| 2 | 4 GB | 0-225 |
| 4 | 8 GB | 226-450 |
| 8 | 16 GB | 451-1300 |
| 16+ | 32 GB | 1300+ |
It is recommended to join agent nodes in batches of 50 or less to allow the CPU to free up space, as there is a spike on node join. Remember to modify the default cluster-cidr if desiring more than 255 nodes!
This data was retrieved under specific test conditions. It will vary depending upon environment and workloads. The steps below give an overview of the test that was run to retrieve this. It was last performed on v1.27.4+rke2r1. All of the machines were provisioned in AWS with standard 20 GiB gp3 volumes.
1. Monitor resources on grafana using prometheus data source.
2. Deploy workloads in such a way to simulate continuous cluster activity:
* A basic workload that scales up and down continuously
* A workload that is deleted and recreated in a loop
* A constant workload that contains multiple other resources including CRDs.
3. Join agent nodes in batches of 30-50 at a time.
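On the cluster-cidr note above: RKE2's default pod CIDR is 10.42.0.0/16 and each node is allocated a /24 from it, which caps the cluster at 256 node subnets. A hedged sketch of the server-side change follows; the /15 value is illustrative and must be sized to your target node count.

```shell
# Sketch: a widened cluster-cidr for clusters beyond ~255 nodes.
# Must be set on server nodes before the cluster is first started.
CFG=$(cat <<'EOF'
# /etc/rancher/rke2/config.yaml (server nodes)
cluster-cidr: 10.42.0.0/15
EOF
)
echo "$CFG"
```

Changing cluster-cidr after the cluster is running is disruptive, so decide this before the first server start.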
1.2.3.2.1. Disks
RKE2 performance depends on the performance of its database. Since RKE2 runs etcd embedded and stores its data directory on disk, we recommend using an SSD when possible to ensure optimal performance.
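A rough way to sanity-check the disk backing the etcd data directory is a synchronous-write dd run, which is similar in shape to etcd's WAL fsyncs. This is only a quick sketch with illustrative path and sizes; fio with fdatasync is the more rigorous tool. Compare the reported throughput between candidate disks.

```shell
# Rough sketch: time 200 synchronous 8 KiB writes (oflag=dsync forces a sync per write).
RESULT=$(dd if=/dev/zero of=/tmp/etcd-disk-test bs=8k count=200 oflag=dsync 2>&1 | tail -n 1)
echo "$RESULT"
rm -f /tmp/etcd-disk-test
```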
1.2.4. Networking
* Important
   * If your node has NetworkManager installed and enabled, ensure that it is configured to ignore CNI-managed interfaces.
   * If your node has Wicked installed and enabled, ensure that the forwarding sysctl config is enabled.
The RKE2 server needs ports 6443 and 9345 to be accessible by other nodes in the cluster.
All nodes need to be able to reach other nodes over UDP port 8472 when Flannel VXLAN is used.
If you wish to utilize the metrics server, you will need to open port 10250 on each node.
Important: The VXLAN port on nodes should not be exposed to the world as it opens up your cluster network to be accessed by anyone. Run your nodes behind a firewall/security group that disables access to port 8472.
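On firewalld, one way to follow this advice is a rich rule that accepts UDP 8472 only from the node subnet. A hedged sketch follows: SUBNET is illustrative and must be replaced with the network your RKE2 nodes actually share; the rule is printed rather than executed so it can be reviewed first.

```shell
# Sketch: build a firewalld rich rule limiting VXLAN (UDP 8472) to the node subnet.
# SUBNET is illustrative; substitute the network your RKE2 nodes share.
SUBNET="10.0.0.0/24"
RULE="rule family=\"ipv4\" source address=\"$SUBNET\" port protocol=\"udp\" port=\"8472\" accept"
echo "$RULE"
# Apply with (root required):
#   firewall-cmd --permanent --add-rich-rule="$RULE" && firewall-cmd --reload
```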
1.2.4.1. Inbound Network Rules
| Port | Protocol | Source | Destination | Description |
| :---- | :---- | :---- | :---- | :---- |
| 6443 | TCP | RKE2 agent nodes | RKE2 server nodes | Kubernetes API |
| 9345 | TCP | RKE2 agent nodes | RKE2 server nodes | RKE2 supervisor API |
| 10250 | TCP | All RKE2 nodes | All RKE2 nodes | kubelet metrics |
| 2379 | TCP | RKE2 server nodes | RKE2 server nodes | etcd client port |
| 2380 | TCP | RKE2 server nodes | RKE2 server nodes | etcd peer port |
| 2381 | TCP | RKE2 server nodes | RKE2 server nodes | etcd metrics port |
| 30000-32767 | TCP | All RKE2 nodes | All RKE2 nodes | NodePort port range |
sudo firewall-cmd --add-port=6443/tcp --permanent
sudo firewall-cmd --add-port=9345/tcp --permanent
sudo firewall-cmd --add-port=10250/tcp --permanent
sudo firewall-cmd --add-port=2379-2381/tcp --permanent
sudo firewall-cmd --add-port=30000-32767/tcp --permanent
sudo firewall-cmd --reload
sudo firewall-cmd --list-all
[root@host 20241017_RKE2]# sudo firewall-cmd --add-port=6443/tcp --permanent
sudo firewall-cmd --add-port=9345/tcp --permanent
sudo firewall-cmd --add-port=10250/tcp --permanent
sudo firewall-cmd --add-port=2379-2381/tcp --permanent
sudo firewall-cmd --add-port=30000-32767/tcp --permanent
success
success
success
success
success
[root@host 20241017_RKE2]# sudo firewall-cmd --reload
success
[root@host 20241017_RKE2]# sudo firewall-cmd --list-all
public (active)
target: default
icmp-block-inversion: no
interfaces: wlp82s0
sources:
services: cockpit dhcpv6-client ssh
ports: 5931/tcp 5932/tcp 5933/tcp 6443/tcp 9345/tcp 10250/tcp 2379-2381/tcp 30000-32767/tcp
protocols:
forward: yes
masquerade: yes
forward-ports:
source-ports:
icmp-blocks:
rich rules:
[root@host 20241017_RKE2]#
1.2.4.2. CNI Specific Inbound Network Rules
1.2.4.2.1. Cilium
| Port | Protocol | Source | Destination | Description |
| :---- | :---- | :---- | :---- | :---- |
| 8/0 | ICMP | All RKE2 nodes | All RKE2 nodes | Cilium CNI health checks |
| 4240 | TCP | All RKE2 nodes | All RKE2 nodes | Cilium CNI health checks |
| 8472 | UDP | All RKE2 nodes | All RKE2 nodes | Cilium CNI with VXLAN |
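The TCP/UDP rows above can be opened with firewall-cmd much like the base ports earlier. As a hedged sketch, the commands are emitted rather than executed so they can be reviewed before running them as root; the ICMP health-check row is not covered here.

```shell
# Sketch: emit firewall-cmd invocations for the Cilium ports in the table above.
CMDS=$(
  for spec in 4240/tcp 8472/udp; do
    echo "firewall-cmd --permanent --add-port=$spec"
  done
  echo "firewall-cmd --reload"
)
echo "$CMDS"
# Review the output, then run it as root (e.g. pipe to sh) to apply.
```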
1.2.4.3. Windows Specific Inbound Network Rules
| Port | Protocol | Source | Destination | Description |
| :---- | :---- | :---- | :---- | :---- |
| 4789 | UDP | All RKE2 nodes | All RKE2 nodes | Required for Calico and Flannel VXLAN |
| 179 | TCP | All RKE2 nodes | All RKE2 nodes | Calico CNI with BGP |
Typically, all outbound traffic will be allowed.