[Linux Kernel] Cpuidle - idle process

Linux/kernel 2023. 9. 12. 23:53

References

- https://www.weigao.cc/linux/kernel/idle.html#cpu-startup-entry

- https://www.lmlphp.com/user/58023/article/item/787257/

- https://zhuanlan.zhihu.com/p/542475575

- https://www.cnblogs.com/0patrick/p/14198742.html

- https://zhuanlan.zhihu.com/p/539722367

Conventions

- Underlined text marks emphasis.

- The source of each figure is always given directly below the figure.

Contents

- idle process

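: The `idle process` is the process with PID 0. The context that boots the system voluntarily turns into the `idle process` once initialization finishes, which is why the two share the same PID. Note that the `idle process` is a special task that is not created by an ordinary user-space `fork`. On an SMP system every core has its own `run queue`, and every run queue has exactly one `idle process`. When a core's run queue has no process or thread to run, the core runs the `idle process`. In other words, to quantify how long the system has been idle, it is enough to measure how long the `idle process` has run. So how is the `idle process` created?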

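: Suppose the boot loader (GRUB, U-Boot, LK, and so on) has loaded the Linux kernel (vmlinux) into memory and `startup_32` (head.S) is executing. Which process is running this code? The answer is `none`. At this point the notion of a `process` does not exist yet. Processes can only be used once the interfaces and data structures for creating, modifying, and removing them have been initialized, and none of that has happened while this early code runs. Process `0` is therefore the context that carries the boot-up sequence (`start_kernel`) forward: a powerful process that is never allocated separate memory yet can use the entire kernel address space.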

: Once process `0` has finished kernel initialization in `start_kernel`, it calls a `fork`-style function that clones itself to create the first user process. The process created here is the famous `init process` with PID `1`. The `init process` carries out the remaining user-space initialization. At that point process `0` calls `do_idle` and becomes the `idle process`. This post follows the code from `do_idle` onward to see how a CPU enters the idle state.

: In the Linux kernel, a CPU core enters the idle state at the point where the `idle process` starts running. There are two ways the `idle process` ends up running.

1. During boot, each CPU core switches to the `idle process` on its own once it has finished its `startup process`.
2. When a core's run queue has no more threads to run, the scheduler runs the `idle process`.

1. During boot, the `primary processor` switches to the `idle process` as follows.

https://zhuanlan.zhihu.com/p/539722367

2. During boot, a `secondary processor` switches to the `idle process` as follows.

https://zhuanlan.zhihu.com/p/539722367

3. When a CPU has no runnable process, it switches to the `idle process`.

https://zhuanlan.zhihu.com/p/539722367

: All three flows above end up calling `do_idle`. The following shows how an `arm64` CPU goes idle (the step after `cpuidle_enter_state` is the `target_state->enter` callback).

https://zhuanlan.zhihu.com/p/539722367

: Entering CPU idle splits broadly into two approaches.

1. Monitor the idle state by polling: cpu_idle_poll
2. Monitor the idle state by some means other than polling: cpuidle_idle_call
: Case `2` splits again into two.
2.1 The cpuidle framework is not available: default_idle_call
2.2 The cpuidle framework is available: call_cpuidle
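: The branching above can be summarized as a small decision function. This is only a sketch for orientation: the `idle_path` enum and `pick_idle_path` are hypothetical names, and the three boolean inputs stand in for `cpu_idle_force_poll`, `tick_check_broadcast_expired()`, and the negation of `cpuidle_not_available()`.

```c
#include <stdbool.h>

/* Hypothetical labels for the three idle paths listed above. */
enum idle_path {
	IDLE_PATH_POLL,		/* cpu_idle_poll() */
	IDLE_PATH_DEFAULT,	/* default_idle_call() */
	IDLE_PATH_CPUIDLE,	/* call_cpuidle() */
};

/*
 * Follows the branch order of do_idle() and cpuidle_idle_call():
 * polling is checked first, then the availability of the cpuidle framework.
 */
static enum idle_path pick_idle_path(bool force_poll,
				     bool broadcast_expired,
				     bool cpuidle_available)
{
	if (force_poll || broadcast_expired)
		return IDLE_PATH_POLL;
	if (!cpuidle_available)
		return IDLE_PATH_DEFAULT;
	return IDLE_PATH_CPUIDLE;
}
```

The sketch leaves out the suspend-to-idle and forced-latency branches, which also end in `call_cpuidle`.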

: Keep this flow in mind as we analyze the code, starting with `do_idle`.

- do_idle

: `cpu_startup_entry` is the function called when the boot path switches to the `idle process`. During boot-up, process `0` becomes the `idle process` through this function.

// kernel/sched/idle.c - v6.5
void cpu_startup_entry(enum cpuhp_state state)
{
	arch_cpu_idle_prepare();
	cpuhp_online_idle(state);
	while (1)
		do_idle();
}

: Before analyzing `do_idle` in earnest, it helps to know the history of its name. As the patch below shows, the name `do_idle` has been in use for quite a while, yet there is not much material about it. If you want to read more about `do_idle`, searching for `cpu_idle_loop` works just as well.

1. https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1278840.html
- The name changed from `cpu_idle` to `cpu_idle_loop` to `do_idle`. `do_idle` is the current name.

: `do_idle` is the real entry point through which the `idle process` puts the CPU into the idle state. It is called from an infinite loop and checks each time whether a runnable thread exists. As soon as one does, it leaves the idle state.

// kernel/sched/idle.c - v6.5
/*
 * Generic idle loop implementation
 *
 * Called with polling cleared.
 */
static void do_idle(void)
{
	int cpu = smp_processor_id(); // --- 1

	/*
	 * Check if we need to update blocked load
	 */
	nohz_run_idle_balance(cpu);

	/*
	 * If the arch has a polling bit, we maintain an invariant:
	 *
	 * Our polling bit is clear if we're not scheduled (i.e. if rq->curr !=
	 * rq->idle). This means that, if rq->idle has the polling bit set,
	 * then setting need_resched is guaranteed to cause the CPU to
	 * reschedule.
	 */

	__current_set_polling(); // --- 2
	tick_nohz_idle_enter();

	while (!need_resched()) { // --- 3
		rmb();

		local_irq_disable(); // --- 4

		if (cpu_is_offline(cpu)) { // --- 5
			tick_nohz_idle_stop_tick();
			cpuhp_report_idle_dead();
			arch_cpu_idle_dead();
		}

		arch_cpu_idle_enter(); // --- 6
		rcu_nocb_flush_deferred_wakeup();

		/*
		 * In poll mode we reenable interrupts and spin. Also if we
		 * detected in the wakeup from idle path that the tick
		 * broadcast device expired for us, we don't want to go deep
		 * idle as we know that the IPI is going to arrive right away.
		 */
		if (cpu_idle_force_poll || tick_check_broadcast_expired()) { // --- 7
			tick_nohz_idle_restart_tick();
			cpu_idle_poll();
		} else {
			cpuidle_idle_call();
		}
		arch_cpu_idle_exit();
	}

	/*
	 * Since we fell out of the loop above, we know TIF_NEED_RESCHED must
	 * be set, propagate it into PREEMPT_NEED_RESCHED.
	 *
	 * This is required because for polling idle loops we will not have had
	 * an IPI to fold the state for us.
	 */
	preempt_set_need_resched();
	tick_nohz_idle_exit();
	__current_clr_polling();

	/*
	 * We promise to call sched_ttwu_pending() and reschedule if
	 * need_resched() is set while polling is set. That means that clearing
	 * polling needs to be visible before doing these things.
	 */
	smp_mb__after_atomic();

	/*
	 * RCU relies on this call to be done outside of an RCU read-side
	 * critical section.
	 */
	flush_smp_call_function_queue();
	schedule_idle();

	if (unlikely(klp_patch_pending(current)))
		klp_update_patch_state(current);
}

: `do_idle` proceeds as follows.

1. An `idle process` exists for every CPU core. To decide which core is about to go idle, the number of the currently executing CPU is needed.

2. At the architecture level a CPU can wait in idle in two basic ways (see below). `__current_set_polling` sets the `TIF_POLLING_NRFLAG` bit of the idle task, which marks it as polling for `need_resched`. As the comment in the code explains, while this bit is set, setting `need_resched` is guaranteed to make the CPU reschedule.
- Polling: on `ARM`, think of this as executing the `yield` instruction.
- Interrupt & event wait: on `ARM`, think of this as executing the `WFI` and `WFE` instructions.

: `tick_nohz_idle_enter` switches the CPU over to the `dynamic tick`. See the separate post on that topic.

3. That the `idle process` is running on a core means the core's run queue had nothing to run. What should happen when a runnable thread appears? The `idle process` must stop right away and let that thread run. In other words, scheduling must take place. Whether there is a thread to run is determined through `need_resched`.

4. Entry into the idle state must not be disturbed, so interrupts are disabled to guarantee that the CPU gets there cleanly. Do not misread this: interrupts are disabled only for the entry sequence, and wake-up interrupts can still wake the CPU once it is idle. Even so, the code does not re-enable interrupts until the very end. How, then, does the CPU wake up from idle? The end of this post comes back to that.

5. If the CPU is already `offline`, it is in a far stronger power-saving state than idle, so the idle entry sequence ends here.

6. The architecture-specific idle hook is called. On ARM it internally calls `ledtrig_cpu`, which switches an LED on or off to expose the `CPU IDLE EVENT` externally. As an analysis of the function shows, nothing in it is related to the actual idle mechanism [Ref. 1].

7. A different function is called depending on how the CPU monitors the idle state. Taking `ARM` as an example:
- Polling: cpu_idle_poll (shallow-sleep)
- WFI & WFE: cpuidle_idle_call (deeper-sleep)

: This conditional is very important. It decides between a `shallow sleep` and a `deep sleep`.
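: The invariant stated in the `do_idle` comment ("if rq->idle has the polling bit set, then setting need_resched is guaranteed to cause the CPU to reschedule") can be illustrated in user space with C11 atomics. This is a minimal sketch, not kernel code: the two flag bits and both function names are hypothetical stand-ins for `TIF_NEED_RESCHED`, `TIF_POLLING_NRFLAG`, and the kernel helpers that manipulate them.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bits standing in for TIF_NEED_RESCHED / TIF_POLLING_NRFLAG. */
#define FLAG_NEED_RESCHED	(1u << 0)
#define FLAG_POLLING		(1u << 1)

/*
 * Waker side: request a reschedule on the target CPU.
 * Returns true when an IPI would still be needed, i.e. the idle task
 * was NOT polling its flags at the moment NEED_RESCHED was set.
 */
static bool set_need_resched_needs_ipi(atomic_uint *flags)
{
	unsigned int old = atomic_fetch_or(flags, FLAG_NEED_RESCHED);

	return !(old & FLAG_POLLING);
}

/*
 * Idle side: stop polling first, then re-check NEED_RESCHED, in the same
 * order as current_clr_polling_and_test(). Returns true if idle entry
 * must be abandoned because a reschedule is already pending.
 */
static bool clear_polling_and_test(atomic_uint *flags)
{
	atomic_fetch_and(flags, ~FLAG_POLLING);
	/* Sequentially consistent atomics order the clear before this load. */
	return (atomic_load(flags) & FLAG_NEED_RESCHED) != 0;
}
```

Because the waker sets the flag and reads the polling bit in one atomic operation, either the idle side sees the request in its re-check or the waker sees that polling has stopped and falls back to an IPI.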

: Let us look at conditional `7` in more detail, starting with its second operand, `tick_check_broadcast_expired`.

// kernel/time/tick-broadcast.c - v6.5
/*
 * Called before going idle with interrupts disabled. Checks whether a
 * broadcast event from the other core is about to happen. We detected
 * that in tick_broadcast_oneshot_control(). The callsite can use this
 * to avoid a deep idle transition as we are about to get the
 * broadcast IPI right away.
 */
noinstr int tick_check_broadcast_expired(void)
{
#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
	return arch_test_bit(smp_processor_id(), cpumask_bits(tick_broadcast_force_mask));
#else
	return cpumask_test_cpu(smp_processor_id(), tick_broadcast_force_mask);
#endif
}

: The `broadcast timer` is the timer that wakes a CPU up when that CPU's local timer has been `shutdown`. If the broadcast timer has already expired for this CPU, the wake-up IPI is about to arrive, so the CPU must not enter a `deeper-sleep` (it must not call `cpuidle_idle_call`). That is why it polls in a `shallow-sleep` instead (it calls `cpu_idle_poll`). Then when does the `cpu_idle_force_poll` variable become `true`? The broadcast timer is not the only reason to keep a CPU out of `deeper-sleep`. ACPI divides the `C-STATE`s as follows [Ref. 1].

1. C1: when the `hlt` instruction is used
2. C3: when the `mwait` instruction is used

: `C3` above is a `deeper-sleep`. In this state both L1 and L2 are flushed and the clocks supplied to all cores are gated. Because this can cause DMA-related problems, some systems handle idle by `polling` instead [Ref. 1, Ref. 2].

: If the idle state is not monitored by `polling`, `cpuidle_idle_call` is called. Among other things, this function decides whether to stop the tick, and its flow is considerably more complex than polling.

// kernel/sched/idle.c - v6.5
/**
 * cpuidle_idle_call - the main idle function
 *
 * NOTE: no locks or semaphores should be used here
 *
 * On architectures that support TIF_POLLING_NRFLAG, is called with polling
 * set, and it returns with polling set.  If it ever stops polling, it
 * must clear the polling bit.
 */
static void cpuidle_idle_call(void)
{
	struct cpuidle_device *dev = cpuidle_get_device();
	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
	int next_state, entered_state;

	/*
	 * Check if the idle task must be rescheduled. If it is the
	 * case, exit the function after re-enabling the local irq.
	 */
	if (need_resched()) { // --- 1
		local_irq_enable();
		return;
	}

	/*
	 * The RCU framework needs to be told that we are entering an idle
	 * section, so no more rcu read side critical sections and one more
	 * step to the grace period
	 */

	if (cpuidle_not_available(drv, dev)) { // --- 2
		tick_nohz_idle_stop_tick();

		default_idle_call();
		goto exit_idle;
	}

	/*
	 * Suspend-to-idle ("s2idle") is a system state in which all user space
	 * has been frozen, all I/O devices have been suspended and the only
	 * activity happens here and in interrupts (if any). In that case bypass
	 * the cpuidle governor and go straight for the deepest idle state
	 * available.  Possibly also suspend the local tick and the entire
	 * timekeeping to prevent timer interrupts from kicking us out of idle
	 * until a proper wakeup interrupt happens.
	 */

	if (idle_should_enter_s2idle() || dev->forced_idle_latency_limit_ns) { // --- 3
		u64 max_latency_ns;

		if (idle_should_enter_s2idle()) {

			entered_state = call_cpuidle_s2idle(drv, dev);
			if (entered_state > 0)
				goto exit_idle;

			max_latency_ns = U64_MAX;
		} else {
			max_latency_ns = dev->forced_idle_latency_limit_ns;
		}

		tick_nohz_idle_stop_tick();

		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
		call_cpuidle(drv, dev, next_state);
	} else { // --- 4
		bool stop_tick = true;

		/*
		 * Ask the cpuidle framework to choose a convenient idle state.
		 */
		next_state = cpuidle_select(drv, dev, &stop_tick);

		if (stop_tick || tick_nohz_tick_stopped())
			tick_nohz_idle_stop_tick();
		else
			tick_nohz_idle_retain_tick();

		entered_state = call_cpuidle(drv, dev, next_state);
		/*
		 * Give the governor an opportunity to reflect on the outcome
		 */
		cpuidle_reflect(dev, entered_state);
	}

exit_idle:
	__current_set_polling();

	/*
	 * It is up to the idle functions to reenable local interrupts
	 */
	if (WARN_ON_ONCE(irqs_disabled()))
		local_irq_enable();
}

1. Naturally, if a schedulable thread shows up while the CPU is on its way into idle, the CPU must back out.

2. Check whether the `cpuidle framework` is usable. If it is not, `default_idle_call` puts the CPU into the most basic idle state. This is covered again below.

3. There are two ways to reach this code.

1. The idle process reaches it.
2. The system suspend-to-idle sequence reaches it.

In the latter case the CPU has nothing to execute except this code and interrupt handlers. In `suspend-to-idle` all user processes are frozen and all I/O devices are suspended, so the only work left for the CPU is the idle entry code and the handler of whatever interrupt arrives. The kernel therefore treats this as a situation where `deeper-sleep` is possible. It does not ask the `cpuidle governor` for an idle state and goes straight to the deepest idle state. If that is hard to follow, consider the former case.

When the idle state is entered through the idle process, there is a lot to take into account. In the latter case, with every process frozen and every I/O device suspended, there is rarely a reason to wake up. In the former case every process is alive and no I/O device is suspended, so there is no telling when a wake-up will come. The cpuidle governor therefore has to assess the current situation and estimate how long the idle duration will be.

Why is only `suspend-to-idle` mentioned? Does `suspend-to-ram` not pass through this code? It does not. An idle state `halts` the CPU, whereas `suspend-to-ram` `powers off` the CPU. The concepts differ, and suspend-to-ram never takes this path at all.

4. This is the branch that enters idle by way of the cpuidle governor.
- cpuidle_select: as the comment `Ask the cpuidle framework to choose a convenient idle state.` suggests, this returns a reasonable estimate rather than a guaranteed optimum. As shown later, the CPU may enter the selected state or end up in a different one.
- tick_nohz_idle_stop_tick: stops the periodic timer interrupt (`tick`) during idle. `stop_tick` starts out as `true`, but the governor can clear it through the pointer passed to `cpuidle_select`. In that case `tick_nohz_idle_retain_tick` is called instead, unless the tick is already stopped.
- call_cpuidle: actually moves the CPU into the selected idle state. If idle is entered successfully, the code after this call does not run until the CPU wakes up.
- cpuidle_reflect: stores information about the idle period that just ended. That this function is called means the CPU has woken up from idle. For example, it records which idle state was used and, based on the current time, how long that state lasted.
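: The tick decision in branch `4` depends on an out-parameter, which is easy to miss when reading the listing. The sketch below models it under stated assumptions: `toy_select` is a made-up governor with an arbitrary 1 ms threshold, and `tick_action` is a hypothetical enum replacing the calls to `tick_nohz_idle_stop_tick` and `tick_nohz_idle_retain_tick`.

```c
#include <stdbool.h>

/* Hypothetical outcome labels; the kernel calls functions instead. */
enum tick_action { TICK_STOP, TICK_RETAIN };

/*
 * Toy governor: reports its tick preference through the pointer, the way
 * cpuidle_select() may clear *stop_tick. The 1 ms threshold is invented.
 */
static int toy_select(unsigned long long predicted_idle_ns, bool *stop_tick)
{
	if (predicted_idle_ns < 1000000ULL) {
		*stop_tick = false;	/* short idle: keep the tick running */
		return 0;		/* shallow state index */
	}
	return 1;			/* deeper state index */
}

static enum tick_action decide_tick(unsigned long long predicted_idle_ns,
				    bool tick_already_stopped)
{
	bool stop_tick = true;	/* same initial value as in cpuidle_idle_call() */

	(void)toy_select(predicted_idle_ns, &stop_tick);

	/* A tick that is already stopped stays stopped. */
	if (stop_tick || tick_already_stopped)
		return TICK_STOP;
	return TICK_RETAIN;
}
```

With these assumptions, a predicted idle period of 500 ns retains the tick unless it was already stopped, while a 5 ms prediction stops it.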

: `default_idle_call` looks as follows. For `arm64`, all you need to know is that it ultimately executes the `wfi` instruction.

// arch/arm64/kernel/idle.c - v6.5
/*
 *	cpu_do_idle()
 *
 *	Idle the processor (wait for interrupt).
 *
 *	If the CPU supports priority masking we must do additional work to
 *	ensure that interrupts are not masked at the PMR (because the core will
 *	not wake up if we block the wake up signal in the interrupt controller).
 */
void noinstr cpu_do_idle(void)
{
	struct arm_cpuidle_irq_context context;

	arm_cpuidle_save_irq_context(&context);

	dsb(sy);
	wfi();

	arm_cpuidle_restore_irq_context(&context);
}

/*
 * This is our default idle handler.
 */
void noinstr arch_cpu_idle(void)
{
	/*
	 * This should do all the clock switching and wait for interrupt
	 * tricks
	 */
	cpu_do_idle();
}
....

// kernel/sched/idle.c - v6.5
/**
 * default_idle_call - Default CPU idle routine.
 *
 * To use when the cpuidle framework cannot be used.
 */
void __cpuidle default_idle_call(void)
{
	.....
	if (!current_clr_polling_and_test()) {
		....
		arch_cpu_idle();
		....
	}
	....
}

: In conditional `3`, `forced_idle_latency_limit_ns` forces entry into the deepest permitted idle state. When it is set, the cpuidle governor is bypassed, and the deepest cpuidle state whose `exit_latency_ns` does not exceed `forced_idle_latency_limit_ns` is used. For example, with exit latencies of C1 = 500ns, C2 = 1000ns, and C3 = 5000ns and `forced_idle_latency_limit_ns = 2500ns`, the `target cpuidle state` is C2.

We want to specify a latency constraint when choosing an idle state at play_idle time. Instead of duplicating the information in the structure or propagate the latency in the call stack, change the use_deepest_state by forced_latency_limit_ns to introduce this constraint. The idea being that when it is set, idle is forced (i.e. no governors), but there is a latency limit for the state to use.

- Reference: https://lore.kernel.org/linux-pm/4032150.Wh8QACBdyO@kreacher/t/

// drivers/cpuidle/cpuidle.c - v6.5
static int find_deepest_state(struct cpuidle_driver *drv,
			      struct cpuidle_device *dev,
			      u64 max_latency_ns,
			      unsigned int forbidden_flags,
			      bool s2idle)
{
	u64 latency_req = 0;
	int i, ret = 0;

	for (i = 1; i < drv->state_count; i++) {
		struct cpuidle_state *s = &drv->states[i];

		if (dev->states_usage[i].disable ||
		    s->exit_latency_ns <= latency_req ||
		    s->exit_latency_ns > max_latency_ns ||
		    (s->flags & forbidden_flags) ||
		    (s2idle && !s->enter_s2idle))
			continue;

		latency_req = s->exit_latency_ns;
		ret = i;
	}
	return ret;
}

/**
 * cpuidle_use_deepest_state - Set/unset governor override mode.
 * @latency_limit_ns: Idle state exit latency limit (or no override if 0).
 *
 * If @latency_limit_ns is nonzero, set the current CPU to use the deepest idle
 * state with exit latency within @latency_limit_ns (override governors going
 * forward), or do not override governors if it is zero.
 */
void cpuidle_use_deepest_state(u64 latency_limit_ns)
{
	struct cpuidle_device *dev;

	preempt_disable();
	dev = cpuidle_get_device();
	if (dev)
		dev->forced_idle_latency_limit_ns = latency_limit_ns;
	preempt_enable();
}

/**
 * cpuidle_find_deepest_state - Find the deepest available idle state.
 * @drv: cpuidle driver for the given CPU.
 * @dev: cpuidle device for the given CPU.
 * @latency_limit_ns: Idle state exit latency limit
 *
 * Return: the index of the deepest available idle state.
 */
int cpuidle_find_deepest_state(struct cpuidle_driver *drv,
			       struct cpuidle_device *dev,
			       u64 latency_limit_ns)
{
	return find_deepest_state(drv, dev, latency_limit_ns, 0, false);
}

: `forced_idle_latency_limit_ns` is set through `cpuidle_use_deepest_state`. Writing a nonzero value to it has two effects.

1. The cpuidle governor is bypassed and the CPU is forced into the deepest permitted idle state.
2. It acts as a limit on `exit_latency_ns`. A state whose exit latency exceeds this value cannot be chosen, so the deepest state whose exit latency fits within the limit becomes the `target idle` state.

: `cpuidle_find_deepest_state` / `find_deepest_state` receive the limit as `max_latency_ns` and return the index of the deepest idle state whose `exit_latency_ns` does not exceed it.
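: The C1/C2/C3 example given earlier can be checked with a user-space copy of the selection loop. This is a sketch under assumptions: `struct toy_state` is a trimmed, hypothetical stand-in for `struct cpuidle_state`, the flag and s2idle checks are omitted, and a zero-latency state at index 0 is assumed because the kernel loop starts at index 1 and falls back to 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct cpuidle_state. */
struct toy_state {
	const char *name;
	uint64_t exit_latency_ns;
	int disabled;
};

/*
 * Same selection rule as find_deepest_state(), minus the flag and s2idle
 * checks: keep the last enabled state whose exit latency is larger than
 * the previous pick and still within max_latency_ns. Index 0 is the fallback.
 */
static int toy_find_deepest(const struct toy_state *states, size_t count,
			    uint64_t max_latency_ns)
{
	uint64_t latency_req = 0;
	int ret = 0;

	for (size_t i = 1; i < count; i++) {
		const struct toy_state *s = &states[i];

		if (s->disabled ||
		    s->exit_latency_ns <= latency_req ||
		    s->exit_latency_ns > max_latency_ns)
			continue;

		latency_req = s->exit_latency_ns;
		ret = (int)i;
	}
	return ret;
}

/* The example values from the text, with an assumed state at index 0. */
static const struct toy_state example_states[] = {
	{ "C0", 0,    0 },
	{ "C1", 500,  0 },
	{ "C2", 1000, 0 },
	{ "C3", 5000, 0 },
};
```

Calling `toy_find_deepest(example_states, 4, 2500)` returns index 2 (C2), while a limit of `UINT64_MAX`, as used on the s2idle path, returns index 3 (C3).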

: Before calling `cpuidle_enter`, `call_cpuidle` calls `current_clr_polling_and_test`, which clears the current thread's `polling` bit (`TIF_POLLING_NRFLAG`) and checks whether a runnable thread exists (`tif_need_resched`). If `current_clr_polling_and_test` returns `true`, there is a schedulable thread. The CPU therefore does not enter idle: `dev->last_residency_ns` (how long the previous idle period lasted) is set to `0`, and because idle entry failed, interrupts are enabled again.

// include/linux/sched/idle.h - v6.5
static __always_inline void __current_set_polling(void)
{
	set_bit(TIF_POLLING_NRFLAG,
		(unsigned long *)(&current_thread_info()->flags));
}
....

static __always_inline bool __must_check current_clr_polling_and_test(void)
{
	__current_clr_polling();

	/*
	 * Polling state must be visible before we test NEED_RESCHED,
	 * paired by resched_curr()
	 */
	smp_mb__after_atomic();

	return unlikely(tif_need_resched());
}
....

// kernel/sched/idle.c - v6.5
static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,
		      int next_state)
{
	/*
	 * The idle task must be scheduled, it is pointless to go to idle, just
	 * update no idle residency and return.
	 */
	if (current_clr_polling_and_test()) {
		dev->last_residency_ns = 0;
		local_irq_enable();
		return -EBUSY;
	}

	/*
	 * Enter the idle state previously returned by the governor decision.
	 * This function will block until an interrupt occurs and will take
	 * care of re-enabling the local interrupts
	 */
	return cpuidle_enter(drv, dev, next_state);
}
....

// drivers/cpuidle/cpuidle.c - v6.5
/**
 * cpuidle_enter - enter into the specified idle state
 *
 * @drv:   the cpuidle driver tied with the cpu
 * @dev:   the cpuidle device
 * @index: the index in the idle state table
 *
 * Returns the index in the idle state, < 0 in case of error.
 * The error code depends on the backend driver
 */
int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev,
		  int index)
{
	int ret = 0;

	/*
	 * Store the next hrtimer, which becomes either next tick or the next
	 * timer event, whatever expires first. Additionally, to make this data
	 * useful for consumers outside cpuidle, we rely on that the governor's
	 * ->select() callback have decided, whether to stop the tick or not.
	 */
	WRITE_ONCE(dev->next_hrtimer, tick_nohz_get_next_hrtimer());

	if (cpuidle_state_is_coupled(drv, index))
		ret = cpuidle_enter_state_coupled(dev, drv, index);
	else
		ret = cpuidle_enter_state(dev, drv, index);

	WRITE_ONCE(dev->next_hrtimer, 0);
	return ret;
}

: `cpuidle_enter` is the entry point for calling into the `cpuidle driver` registered by the vendor. First it records when the next timer interrupt is due. When `CONFIG_NO_HZ_COMMON` is not set, `tick_nohz_get_next_hrtimer` is the default inline version provided by the kernel. As the code below shows, that default assumes the next timer interrupt fires `TICK_NSEC` (one tick period) after the current time. In other words, the CPU is woken up every tick whether or not there is work to do.

: When `CONFIG_NO_HZ_COMMON` is set, the `tick_nohz_get_next_hrtimer` function in `/kernel/time/tick-sched.c` is used. Its comment says as much, but in summary:

The next timer interrupt is whichever of the tick and the hrtimer expires first. If the `periodic tick` has been stopped, the next timer interrupt is the expiry of the next `hrtimer` rather than the `tick`.
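: The two cases can be written out as plain arithmetic. This is a conceptual sketch only: the `EX_` macros mirror the `TICK_NSEC` formula with an assumed `HZ` of 250, and `next_event_nohz` is a hypothetical model of the rule just described, not how the kernel computes it (the kernel simply reads the already programmed `next_event` of the tick device).

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_NSEC_PER_SEC	1000000000ULL
#define EX_HZ		250ULL	/* assumed tick rate for this example */
#define EX_TICK_NSEC	((EX_NSEC_PER_SEC + EX_HZ / 2) / EX_HZ)

/* Fallback without CONFIG_NO_HZ_COMMON: one tick period from now. */
static uint64_t next_event_periodic(uint64_t now_ns)
{
	return now_ns + EX_TICK_NSEC;
}

/*
 * Conceptual model of the NO_HZ case: the earlier of tick and hrtimer,
 * or just the hrtimer once the tick has been stopped.
 */
static uint64_t next_event_nohz(uint64_t next_tick_ns,
				uint64_t next_hrtimer_ns,
				bool tick_stopped)
{
	if (tick_stopped || next_hrtimer_ns < next_tick_ns)
		return next_hrtimer_ns;
	return next_tick_ns;
}
```

With `HZ` = 250 the tick period is 4,000,000 ns, so the periodic fallback never predicts an idle period longer than 4 ms, whereas the NO_HZ model can report a much later hrtimer once the tick is stopped.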

// include/vdso/jiffies.h - v6.5
/* TICK_NSEC is the time between ticks in nsec assuming SHIFTED_HZ */
#define TICK_NSEC ((NSEC_PER_SEC+HZ/2)/HZ)
....

//include/linux/tick.h - v6.5
#ifdef CONFIG_NO_HZ_COMMON
....

extern ktime_t tick_nohz_get_next_hrtimer(void);
....

#else /* !CONFIG_NO_HZ_COMMON */
....
static inline ktime_t tick_nohz_get_next_hrtimer(void)
{
	/* Next wake up is the tick period, assume it starts now */
	return ktime_add(ktime_get(), TICK_NSEC);
}
....
#endif /* !CONFIG_NO_HZ_COMMON */

// kernel/time/tick-sched.c - v6.5
#ifdef CONFIG_NO_HZ_COMMON
....
/**
 * tick_nohz_get_next_hrtimer - return the next expiration time for the hrtimer
 * or the tick, whatever that expires first. Note that, if the tick has been
 * stopped, it returns the next hrtimer.
 *
 * Called from power state control code with interrupts disabled
 */
ktime_t tick_nohz_get_next_hrtimer(void)
{
	return __this_cpu_read(tick_cpu_device.evtdev)->next_event;
}
....
#endif /* CONFIG_NO_HZ_COMMON */

: Back in `cpuidle_enter`, the function checks whether the state is coupled across CPUs and decides how to enter idle accordingly. Since `cpuidle_enter_state_coupled` also ends up calling `cpuidle_enter_state`, only `cpuidle_enter_state` is examined here.

: `cpuidle_enter_state` is where the vendor-written `target_state->enter` callback is actually called. In that sense it is the final stop of `CPU idle`.

// drivers/cpuidle/cpuidle.c - v6.5
/**
 * cpuidle_enter_state - enter the state and update stats
 * @dev: cpuidle device for this cpu
 * @drv: cpuidle driver for this cpu
 * @index: index into the states table in @drv of the state to enter
 */
noinstr int cpuidle_enter_state(struct cpuidle_device *dev,
				 struct cpuidle_driver *drv,
				 int index)
{
	int entered_state;

	struct cpuidle_state *target_state = &drv->states[index];
	bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP); // --- 1
	ktime_t time_start, time_end;

	instrumentation_begin();

	/*
	 * Tell the time framework to switch to a broadcast timer because our
	 * local timer will be shut down.  If a local timer is used from another
	 * CPU as a broadcast timer, this call may fail if it is not available.
	 */
	if (broadcast && tick_broadcast_enter()) { // --- 2
		index = find_deepest_state(drv, dev, target_state->exit_latency_ns,
					   CPUIDLE_FLAG_TIMER_STOP, false);
		if (index < 0) {
			default_idle_call();
			return -EBUSY;
		}
		target_state = &drv->states[index];
		broadcast = false;
	}

	if (target_state->flags & CPUIDLE_FLAG_TLB_FLUSHED) // --- 3
		leave_mm(dev->cpu);

	/* Take note of the planned idle state. */
	sched_idle_set_state(target_state); // --- 4

	trace_cpu_idle(index, dev->cpu);
	time_start = ns_to_ktime(local_clock_noinstr());

	stop_critical_timings();
	if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE)) { // --- 5
		ct_cpuidle_enter();
		/* Annotate away the indirect call */
		instrumentation_begin();
	}

	/*
	 * NOTE!!
	 *
	 * For cpuidle_state::enter() methods that do *NOT* set
	 * CPUIDLE_FLAG_RCU_IDLE RCU will be disabled here and these functions
	 * must be marked either noinstr or __cpuidle.
	 *
	 * For cpuidle_state::enter() methods that *DO* set
	 * CPUIDLE_FLAG_RCU_IDLE this isn't required, but they must mark the
	 * function calling ct_cpuidle_enter() as noinstr/__cpuidle and all
	 * functions called within the RCU-idle region.
	 */
	entered_state = target_state->enter(dev, drv, index); // --- 6

	if (WARN_ONCE(!irqs_disabled(), "%ps leaked IRQ state", target_state->enter))
		raw_local_irq_disable();

	if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE)) {
		instrumentation_end();
		ct_cpuidle_exit();
	}
	start_critical_timings();

	sched_clock_idle_wakeup_event();
	time_end = ns_to_ktime(local_clock_noinstr());
	trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);

	/* The cpu is no longer idle or about to enter idle. */
	sched_idle_set_state(NULL);

	if (broadcast)
		tick_broadcast_exit();

	if (!cpuidle_state_is_coupled(drv, index))
		local_irq_enable();

	if (entered_state >= 0) { // --- 7
		s64 diff, delay = drv->states[entered_state].exit_latency_ns;
		int i;

		/*
		 * Update cpuidle counters
		 * This can be moved to within driver enter routine,
		 * but that results in multiple copies of same code.
		 */
		diff = ktime_sub(time_end, time_start); // --- 8

		dev->last_residency_ns = diff;
		dev->states_usage[entered_state].time_ns += diff;
		dev->states_usage[entered_state].usage++;

		if (diff < drv->states[entered_state].target_residency_ns) { // --- 9
			for (i = entered_state - 1; i >= 0; i--) {
				if (dev->states_usage[i].disable)
					continue;

				/* Shallower states are enabled, so update. */
				dev->states_usage[entered_state].above++;
				trace_cpu_idle_miss(dev->cpu, entered_state, false);
				break;
			}
		} else if (diff > delay) { // --- 10
			for (i = entered_state + 1; i < drv->state_count; i++) {
				if (dev->states_usage[i].disable)
					continue;

				/*
				 * Update if a deeper state would have been a
				 * better match for the observed idle duration.
				 */
				if (diff - delay >= drv->states[i].target_residency_ns) {
					dev->states_usage[entered_state].below++;
					trace_cpu_idle_miss(dev->cpu, entered_state, true);
				}

				break;
			}
		}
	} else { // --- 11
		dev->last_residency_ns = 0;
		dev->states_usage[index].rejected++;
	}

	instrumentation_end();

	return entered_state;
}

1. A cpuidle state that has the `CPUIDLE_FLAG_TIMER_STOP` flag set is a `deeper-sleep` state in which the local timer stops. Unlike shallow-sleep (`polling`), entering such a state requires stopping the local timer and switching to the broadcast timer.

2. `tick_broadcast_enter` stops the current CPU's local timer and arranges for a broadcast timer that supports `oneshot mode` to wake the CPU up. It returns nonzero on failure, for example when this CPU's local timer is itself serving as the `broadcast timer` for other CPUs.

: If the switch to the broadcast timer fails, `find_deepest_state` is called with `CPUIDLE_FLAG_TIMER_STOP` as the forbidden flag, which yields the deepest state that keeps the local timer running and whose exit latency does not exceed that of the original target. The `target_state` obtained earlier through `cpuidle_select` is replaced with that state.

3. `CPUIDLE_FLAG_TLB_FLUSHED` indicates that the TLB is flushed when this idle state is entered. It is a flag used mainly on x86 [Ref. 1].

4. Records the idle state in the current CPU's run queue. This is covered again below.

5. If the target state does not set `CPUIDLE_FLAG_RCU_IDLE`, `ct_cpuidle_enter` is called here so that RCU knows the CPU is going idle.

6. `target_state->enter`: a CPU vendor must provide a `struct cpuidle_driver` that the Linux kernel can use, together with the various idle states each core can enter. Each state has an `enter` callback, and this callback is what finally puts the CPU into the idle state. From the moment it is called, the code after it does not run until the CPU wakes up. This is revisited later.

7. If the CPU did enter idle, information about that idle period is stored so that the cpuidle governor can use it for later state selection.

8. Stores how long the CPU stayed idle (`dev->last_residency_ns = diff`).

9. If the CPU stayed idle for less time than the state's `target_residency_ns`, the cpuidle governor is considered to have selected the wrong state. That is, `the predicted state was too deep for what was actually measured` (`dev->states_usage[entered_state].above++`).

10. If the CPU stayed idle longer than the vendor-defined `exit_latency_ns` of the entered state, a deeper state may have been the better choice. The condition `diff - delay >= drv->states[i].target_residency_ns` reads as follows.

The time the CPU actually spent idle is at least `the target_residency_ns of the next deeper enabled state + the exit_latency_ns of the entered state`. For example, suppose the vendor gave C3 an exit_latency_ns of 500ns and a target_residency_ns of 2500ns, and the CPU then stayed idle for 1s. That is far longer than a deeper C4 state would need, so the statistics hint that the cpuidle governor could choose C4 or deeper next time. That hint is `dev->states_usage[entered_state].below++`.

11. The CPU failed to enter idle. Because it was never idle, `dev->last_residency_ns = 0` is set, and `dev->states_usage[index].rejected` is incremented to record the failed entry.
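: Steps `9` and `10` can be condensed into a classifier that says whether the entered state was too deep, too shallow, or acceptable. This is a user-space sketch under assumptions: `struct acct_state`, `idle_verdict`, and `classify_idle` are hypothetical names, and the latency values in the table are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down description of one idle state. */
struct acct_state {
	int64_t exit_latency_ns;
	int64_t target_residency_ns;
	int disabled;
};

enum idle_verdict { VERDICT_OK, VERDICT_ABOVE, VERDICT_BELOW };

static enum idle_verdict classify_idle(const struct acct_state *states,
				       size_t count, size_t entered,
				       int64_t diff_ns)
{
	int64_t delay = states[entered].exit_latency_ns;

	if (diff_ns < states[entered].target_residency_ns) {
		/* Too deep, provided some shallower state is enabled. */
		for (size_t i = entered; i-- > 0; )
			if (!states[i].disabled)
				return VERDICT_ABOVE;
	} else if (diff_ns > delay) {
		/* Only the nearest enabled deeper state is examined. */
		for (size_t i = entered + 1; i < count; i++) {
			if (states[i].disabled)
				continue;
			if (diff_ns - delay >= states[i].target_residency_ns)
				return VERDICT_BELOW;
			break;
		}
	}
	return VERDICT_OK;
}

/* Invented example table: index 0 is the shallowest state. */
static const struct acct_state acct_example[] = {
	{ 0,    0,     0 },
	{ 500,  2500,  0 },
	{ 2000, 10000, 0 },
};
```

For state 1 in this table, an idle period of 1,000 ns counts as `above`, 20,000 ns counts as `below`, and 5,000 ns is neither, because 5,000 - 500 is still shorter than the 10,000 ns target residency of state 2.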

: The run queue cannot be left out when discussing the `idle process`. The Linux kernel has one run queue per CPU, so run queues and CPUs map 1:1. Accordingly, the `struct cpuidle_state` pointer in the run queue structure describes the cpuidle state of that CPU. `sched_idle_set_state` internally calls `idle_set_state` on the current CPU's run queue and stores the cpuidle state the CPU is about to enter. The important point is that this value is not the criterion for whether the CPU is idle. It only gives the scheduler information for choosing among idle CPUs, for example preferring the CPU whose idle state has the smallest exit latency (see `find_idlest_group_cpu`). How, then, can we tell whether a CPU is currently idle?

// /kernel/sched/sched.h - v6.5
....
/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
	....
	unsigned int		nr_running;
	....
	struct cfs_rq		cfs;
	struct rt_rq		rt;
	struct dl_rq		dl;
	....

	struct task_struct __rcu	*curr;
	struct task_struct	*idle;
	....

	/* CPU of this runqueue: */
	int			cpu;
	int			online;
	....

#ifdef CONFIG_CPU_IDLE
	/* Must be inspected within a rcu lock section */
	struct cpuidle_state	*idle_state;
#endif
	....
};

#ifdef CONFIG_CPU_IDLE
static inline void idle_set_state(struct rq *rq,
				  struct cpuidle_state *idle_state)
{
	rq->idle_state = idle_state;
}

static inline struct cpuidle_state *idle_get_state(struct rq *rq)
{
	SCHED_WARN_ON(!rcu_read_lock_held());

	return rq->idle_state;
}
#else
....

//kernel/sched/idle.c - v6.5
/**
 * sched_idle_set_state - Record idle state for the current CPU.
 * @idle_state: State to record.
 */
void sched_idle_set_state(struct cpuidle_state *idle_state)
{
	idle_set_state(this_rq(), idle_state);
}
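
: How the scheduler can use the recorded `rq->idle_state` is sketched below: among idle CPUs, prefer the one whose idle state is cheapest to leave. This is a simplified illustration, not the kernel's `find_idlest_group_cpu`. `struct toy_cpu` and `pick_shallowest_idle_cpu` are hypothetical, and tie-breaking by idle timestamp and the fallback to the least loaded CPU are left out.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-CPU summary of what the run queue records. */
struct toy_cpu {
	int is_idle;			/* result of an idle_cpu()-style check */
	unsigned int exit_latency_us;	/* from the recorded idle state */
};

/*
 * Returns the index of the idle CPU with the smallest exit latency,
 * or -1 when no CPU is idle.
 */
static int pick_shallowest_idle_cpu(const struct toy_cpu *cpus, size_t count)
{
	unsigned int min_exit_latency = UINT_MAX;
	int best = -1;

	for (size_t i = 0; i < count; i++) {
		if (!cpus[i].is_idle)
			continue;
		if (cpus[i].exit_latency_us < min_exit_latency) {
			min_exit_latency = cpus[i].exit_latency_us;
			best = (int)i;
		}
	}
	return best;
}
```

Waking the CPU in the shallowest state costs the least latency and leaves CPUs in deeper states undisturbed.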

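: Whether a CPU is currently idle is determined through `idle_cpu` [Ref. 1]. It checks three conditions, and all three must hold for the CPU to count as idle.

1. The thread currently running must be the `idle process`.
2. The run queue must have no runnable threads.
3. The run queue must have no pending `try-to-wake-up`.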

// kernel/sched/core.c - v6.5
/**
 * idle_cpu - is a given CPU idle currently?
 * @cpu: the processor in question.
 *
 * Return: 1 if the CPU is currently idle. 0 otherwise.
 */
int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)
		return 0;

	if (rq->nr_running)
		return 0;

#ifdef CONFIG_SMP
	if (rq->ttwu_pending)
		return 0;
#endif

	return 1;
}

: One point was left for the end. Before `target_state->enter` runs in `cpuidle_enter_state`, the code only ever calls local_irq_disable, and there is no local_irq_enable on the way in. How can the CPU leave the idle state? When `target_state->enter` runs, the `enter` callback of the vendor's cpuidle driver is called. For arm, see `/drivers/cpuidle/cpuidle-psci.c`. In short, on arm, waking up with local interrupts disabled is possible because of `WFI`: a pending interrupt wakes the core from `WFI` even while interrupts are masked at the CPU, and the handler runs once interrupts are enabled again.

- Polling idle

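: If `do_idle` monitors the idle state by polling, `cpu_idle_poll` is called. Its loop condition is much the same as the condition in `do_idle` that selected it, with `tif_need_resched` added: as soon as a runnable thread exists, the loop ends. As long as there is no runnable thread and polling is still required (forced polling, or a broadcast timer that has expired for this CPU), it keeps calling `cpu_relax`, which leaves the CPU in a shallow idle.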

// kernel/sched/idle.c - v6.5
#ifdef CONFIG_GENERIC_IDLE_POLL_SETUP
static int __init cpu_idle_poll_setup(char *__unused)
{
	cpu_idle_force_poll = 1;

	return 1;
}
__setup("nohlt", cpu_idle_poll_setup);

static int __init cpu_idle_nopoll_setup(char *__unused)
{
	cpu_idle_force_poll = 0;

	return 1;
}
__setup("hlt", cpu_idle_nopoll_setup);
#endif

static noinline int __cpuidle cpu_idle_poll(void)
{
	....

	raw_local_irq_enable();
	while (!tif_need_resched() &&
	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
		cpu_relax();
	raw_local_irq_disable();

	....

	return 1;
}

: `cpu_relax` is an architecture-specific function, implemented separately for each architecture. See the separate post on it for details.
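
: The shape of the polling loop can be reproduced in user space. This is a minimal sketch, assuming GCC or Clang inline assembly: `relax_hint` and `poll_idle` are hypothetical names, the `pause` and `yield` instructions are the usual spin-wait hints on x86 and arm64, and the `max_spins` bound exists only so the example terminates (the kernel loop has no such bound).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Spin-wait hint for the current architecture; a compiler barrier elsewhere. */
static inline void relax_hint(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__("pause" ::: "memory");
#elif defined(__aarch64__)
	__asm__ __volatile__("yield" ::: "memory");
#else
	atomic_signal_fence(memory_order_seq_cst);
#endif
}

/*
 * Spins while no reschedule is requested and polling is still wanted.
 * Returns the number of spins performed. max_spins is a test-only bound.
 */
static unsigned long poll_idle(atomic_bool *need_resched, bool keep_polling,
			       unsigned long max_spins)
{
	unsigned long spins = 0;

	while (!atomic_load(need_resched) && keep_polling &&
	       spins < max_spins) {
		relax_hint();
		spins++;
	}
	return spins;
}
```

The loop exits immediately when the flag is already set, which corresponds to the "runnable thread exists" case in `cpu_idle_poll`.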
