Bare Assembly Bring-Up on the CH32V003

The CH32V003 is strange in a useful way: a RISC-V microcontroller with 16 KiB of flash, 2 KiB of SRAM, GPIO, timers, UART, SPI, I2C, and a price low enough to make experiments feel cheap.

Most projects start from a vendor SDK, an IDE template, or a prepared C runtime. That is practical, but it hides the most interesting part: what happens between reset and the first useful instruction of your program.

Here we do the opposite. No C runtime. No vendor startup file. No library. Just one assembly file, one linker script, the CH32V003 reference manual, and a minimal loop that toggles a GPIO pin.

The goal is not to write a complete firmware. The goal is to understand the boot path well enough that the rest of the firmware no longer feels magical.

Hardware and Tools

You need:

  • a CH32V003 board;
  • a WCH-Link or compatible probe;
  • an LED connected to PD4, or a board whose user LED is already on that pin;
  • a bare-metal RISC-V toolchain with riscv32-unknown-elf-as, riscv32-unknown-elf-ld, and riscv32-unknown-elf-objcopy;
  • a flashing tool such as minichlink.

The CH32V003 uses a QingKe V2A core, which implements RV32EC. With GNU binutils, a sensible assembler line is:

riscv32-unknown-elf-as -g -mabi=ilp32e -march=rv32ec_zicsr \
  -o startup.o startup.S

Two options matter:

  • -march=rv32ec_zicsr: RV32E core, compressed instructions, and CSR access;
  • -mabi=ilp32e: ABI for RV32E, where only integer registers x0 through x15 exist.

Reference Manual Sections

For a minimal bring-up, keep these parts of the CH32V003 reference manual open:

  • memory map: 16 KiB of Code Flash at 0x08000000, also visible at 0x00000000, and 2 KiB of SRAM starting at 0x20000000;
  • RCC: peripheral clock enables, especially R32_RCC_APB2PCENR at 0x40021018;
  • PFIC: exception and interrupt vector table, plus the mtvec CSR;
  • GPIO: GPIOx_CFGLR, GPIOx_OUTDR, and GPIOx_BSHR;
  • FLASH: R32_FLASH_ACTLR at 0x40022000 for flash latency when increasing the system clock.

The model is direct: peripherals are controlled through memory-mapped registers. To blink an LED, enable the GPIO port clock, configure the pin as an output, then write to the output register.

Minimal Linker Script

The linker must place the vector table at the start of the image and export the top-of-RAM address for the initial stack pointer.

A minimal linker script looks like this:

ENTRY(vector_base)

MEMORY
{
    flash (rx)  : ORIGIN = 0x00000000, LENGTH = 16K
    ram   (xrw) : ORIGIN = 0x20000000, LENGTH = 2K
}

SECTIONS
{
    . = 0x00000000;
    .vectors :
    {
        KEEP(*(.vectors))
    } > flash

    .text :
    {
        *(.text*)
        *(.rodata*)
    } > flash

    . = ALIGN(4);
    .bss (NOLOAD) :
    {
        _bss_start = .;
        *(.bss*)
        *(COMMON)
        . = ALIGN(4);
        _bss_end = .;
    } > ram

    _ram_start = ORIGIN(ram);
    _ram_end   = ORIGIN(ram) + LENGTH(ram);
    _stack_top = _ram_end;
}

_stack_top becomes 0x20000800: SRAM starts at 0x20000000 and is 2 KiB long. The stack grows downward from there.

KEEP(*(.vectors)) keeps the vector table in the output image even when the linker is told to discard unreferenced sections (--gc-sections). In a bare-metal program, the table may not look referenced from normal code, but the hardware depends on it directly after reset.

Vector Table

The CH32V003 expects 4-byte vector entries at consecutive word addresses: 0x00000000, 0x00000004, 0x00000008, and so on. For the first step, only the reset entry matters: it must transfer control to _start.

.section .vectors, "ax"
.option norvc
.align 2

.globl vector_base
vector_base:
    j _start

    .rept 255
        j default_handler
    .endr

default_handler:
    j default_handler

.option rvc

The critical detail is .option norvc. The core supports 16-bit compressed instructions, but the vector table must remain a table of 4-byte entries. If j _start were assembled as a compressed instruction, the table layout would be wrong.

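For a first boot, every non-reset entry can point to an infinite loop. Interrupts can wait. The 256-entry table used here occupies 1 KiB of the 16 KiB of flash; the count can be trimmed later to the number of vectors listed in the reference manual.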

The _start Entry Point

After reset, put the processor into a known state:

  1. disable interrupts;
  2. initialize sp;
  3. clear .bss;
  4. initialize the GPIO;
  5. enter the main loop.

.section .text
.align 2

.globl _start
_start:
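    csrci mstatus, 0x8          # clear MIE (bit 3): interrupts off

    la   sp, _stack_top         # stack starts at the top of SRAM

    la   t0, _bss_start         # t0 = cursor, t1 = end of .bss
    la   t1, _bss_end

bss_clear_loop:
    bgeu t0, t1, bss_clear_done # done once the cursor reaches the end
    sw   zero, 0(t0)            # zero one word
    addi t0, t0, 4
    j    bss_clear_loop

bss_clear_done:
    call led_init
    j    main_loop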

Even if the first program does not use global variables, clearing .bss is the right habit. Once you add a counter, buffer, or global state variable, it starts from zero instead of whatever happened to be in RAM.
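
As a minimal sketch of that habit paying off, the fragment below reserves one word in .bss and increments it from a small routine. The names blink_count and count_blink are hypothetical and not part of the program in this article; the counter only starts at zero because _start clears .bss before anything else runs.

```asm
# Hypothetical example: blink_count and count_blink are illustrative names.
.section .bss
.align 2
blink_count:
    .space 4                  # one 32-bit word, zeroed by the .bss loop in _start

.section .text
.align 2
count_blink:
    la   t0, blink_count      # address of the counter in SRAM
    lw   t1, 0(t0)            # reads 0 on the first call after reset
    addi t1, t1, 1
    sw   t1, 0(t0)
    ret
```

Calling count_blink from the main loop would give a running toggle count that a debugger can read from SRAM.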

Configure PD4 as an Output

This example uses PD4 as the LED output.

From the reference manual:

  • R32_RCC_APB2PCENR is at 0x40021018;
  • IOPDEN, bit 5, enables the port D clock;
  • R32_GPIOD_CFGLR is at 0x40011400;
  • R32_GPIOD_OUTDR is at 0x4001140C;
  • each pin uses 4 bits in GPIOx_CFGLR.

For PD4, the configuration field starts at bit 4 * 4 = 16. The value 0b0001 configures the pin as a 10 MHz push-pull output.

.equ RCC_APB2PCENR, 0x40021018
.equ RCC_IOPDEN,    (1 << 5)
.equ GPIOD_CFGLR,   0x40011400
.equ GPIOD_OUTDR,   0x4001140C
.equ LED_PIN,       4

.globl led_init
led_init:
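    li   t0, RCC_APB2PCENR      # enable the port D clock first
    lw   t1, 0(t0)
    li   t2, RCC_IOPDEN
    or   t1, t1, t2
    sw   t1, 0(t0)

    li   t0, GPIOD_CFGLR        # then reconfigure only PD4
    lw   t1, 0(t0)
    li   t2, ~(0xF << 16)       # mask that clears bits 19:16 (PD4 field)
    and  t1, t1, t2
    li   t2, (0x1 << 16)        # 0b0001: 10 MHz push-pull output
    or   t1, t1, t2
    sw   t1, 0(t0)

    ret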

The pattern is deliberate: read the register, clear only the field you need, set the new value, then write it back. That avoids clobbering the configuration of other pins on the same port.
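
The hard-coded shift of 16 can also be derived from the pin number, which makes moving the LED to another pin a one-line change. The following is only a sketch: LED_CFG_SHIFT, LED_CFG_MASK, LED_CFG_OUT, and led_cfg_output are illustrative names, and the arithmetic assumes a pin in the range 0 to 7, the pins covered by GPIOx_CFGLR.

```asm
# Sketch: derive the CFGLR field position from the pin number.
# All LED_CFG_* names and led_cfg_output are illustrative, not from the manual.
.equ GPIOD_CFGLR,   0x40011400
.equ LED_PIN,       4
.equ LED_CFG_SHIFT, (LED_PIN * 4)            # 4 bits per pin -> bit 16 for PD4
.equ LED_CFG_MASK,  (0xF << LED_CFG_SHIFT)   # the whole 4-bit field
.equ LED_CFG_OUT,   (0x1 << LED_CFG_SHIFT)   # 0b0001: 10 MHz push-pull output

led_cfg_output:
    li   t0, GPIOD_CFGLR
    lw   t1, 0(t0)
    li   t2, ~LED_CFG_MASK
    and  t1, t1, t2           # clear only this pin's field
    li   t2, LED_CFG_OUT
    or   t1, t1, t2           # set the new mode
    sw   t1, 0(t0)
    ret
```

With LED_PIN set to 4, the assembler computes the same constants as the listing above: a shift of 16, a mask of 0x000F0000, and a value of 0x00010000.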

Minimal Work Loop

Now that the pin is configured, the smallest visible workload is a blink loop:

main_loop:
    call led_toggle
    call delay
    j    main_loop

led_toggle:
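    li   t0, GPIOD_OUTDR
    lw   t1, 0(t0)              # current output state of port D
    li   t2, (1 << LED_PIN)
    xor  t1, t1, t2             # flip only the LED bit
    sw   t1, 0(t0)
    ret

delay:
    li   t0, 100000             # iteration count, tune by eye
delay_loop:
    addi t0, t0, -1
    bnez t0, delay_loop         # spin until the counter reaches zero
    ret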

This delay is not a timer. Its duration depends on the CPU clock frequency and on how many cycles each loop instruction takes, so it is only approximate. For first bring-up, that is fine: the goal is a visible proof that reset, vector dispatch, _start, and GPIO writes all work.
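
The toggle routine reads OUTDR, flips one bit, and writes the result back. GPIOx_BSHR, listed among the manual sections earlier, offers a write-only alternative when the desired level is already known. The sketch below assumes GPIOD_BSHR sits at offset 0x10 from the port base, with the low half-word setting pins and the high half-word resetting them; check both against the reference manual. The names pd4_high and pd4_low are illustrative.

```asm
# Sketch: drive PD4 through GPIOD_BSHR instead of read-modify-write on OUTDR.
# The register address and bit layout are assumptions to verify in the manual.
.equ GPIOD_BSHR, 0x40011410
.equ LED_PIN,    4

pd4_high:
    li   t0, GPIOD_BSHR
    li   t1, (1 << LED_PIN)           # bits 0..15: set the matching output bit
    sw   t1, 0(t0)
    ret

pd4_low:
    li   t0, GPIOD_BSHR
    li   t1, (1 << (LED_PIN + 16))    # bits 16..31: reset the matching output bit
    sw   t1, 0(t0)
    ret
```

Because each routine is a single store, it cannot disturb other pins on the port. Whether the high or the low level lights the LED depends on how the board wires it.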

Complete Program

Here is a standalone startup.S:

.equ RCC_APB2PCENR, 0x40021018
.equ RCC_IOPDEN,    (1 << 5)
.equ GPIOD_CFGLR,   0x40011400
.equ GPIOD_OUTDR,   0x4001140C
.equ LED_PIN,       4

.section .vectors, "ax"
.option norvc
.align 2

.globl vector_base
vector_base:
    j _start

    .rept 255
        j default_handler
    .endr

default_handler:
    j default_handler

.option rvc

.section .text
.align 2

.globl _start
_start:
    csrci mstatus, 0x8
    la   sp, _stack_top

    la   t0, _bss_start
    la   t1, _bss_end

1:
    bgeu t0, t1, 2f
    sw   zero, 0(t0)
    addi t0, t0, 4
    j    1b

2:
    call led_init

main_loop:
    call led_toggle
    call delay
    j    main_loop

led_init:
    li   t0, RCC_APB2PCENR
    lw   t1, 0(t0)
    li   t2, RCC_IOPDEN
    or   t1, t1, t2
    sw   t1, 0(t0)

    li   t0, GPIOD_CFGLR
    lw   t1, 0(t0)
    li   t2, ~(0xF << 16)
    and  t1, t1, t2
    li   t2, (0x1 << 16)
    or   t1, t1, t2
    sw   t1, 0(t0)
    ret

led_toggle:
    li   t0, GPIOD_OUTDR
    lw   t1, 0(t0)
    li   t2, (1 << LED_PIN)
    xor  t1, t1, t2
    sw   t1, 0(t0)
    ret

delay:
    li   t0, 100000
1:
    addi t0, t0, -1
    bnez t0, 1b
    ret

Build it with:

riscv32-unknown-elf-as -g -mabi=ilp32e -march=rv32ec_zicsr \
  -o startup.o startup.S

riscv32-unknown-elf-ld -g -T ch32v003-min.ld \
  startup.o -o blink.elf

riscv32-unknown-elf-objcopy -O binary blink.elf blink.bin

Flash it:

minichlink -w blink.bin flash -b

If the LED blinks, the base path is validated: the image is linked at the right address, the vector table is usable, the stack is initialized, code executes from flash, and the GPIO registers respond.

Next Steps: Clock, Flash, Interrupts

The program above runs entirely from the reset clock configuration. A common next step is switching to a faster system clock, typically 48 MHz through the PLL, fed by either the internal HSI oscillator or an external HSE crystal.

Before increasing the clock, check FLASH_ACTLR.LATENCY. The manual specifies:

  • 00: 0 wait states, recommended up to 24 MHz;
  • 01: 1 wait state, recommended up to 48 MHz.

Set flash latency before speeding up the core. Otherwise the CPU may request instructions faster than flash can deliver them.
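
A minimal sketch of that step, to be called before any clock switch, is shown here. The FLASH_ACTLR address comes from the register list earlier in this article; the assumption that LATENCY occupies bits [1:0] should be confirmed in the manual, and set_flash_latency_1ws is an illustrative name.

```asm
# Sketch: select 1 wait state before raising the system clock.
# Assumes the LATENCY field is FLASH_ACTLR bits [1:0].
.equ FLASH_ACTLR,        0x40022000
.equ FLASH_LATENCY_MASK, 0x3
.equ FLASH_LATENCY_1WS,  0x1

set_flash_latency_1ws:
    li   t0, FLASH_ACTLR
    lw   t1, 0(t0)
    andi t1, t1, ~FLASH_LATENCY_MASK  # clear the LATENCY field
    ori  t1, t1, FLASH_LATENCY_1WS    # 01: one wait state
    sw   t1, 0(t0)
    ret
```

The same read, clear, set, write pattern used for GPIOD_CFGLR applies here, so other bits in the register are left alone.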

Interrupts come after that. The mtvec CSR (0x305) holds the vector table base address, with the entry mode selected by its low bits. In offset-by-interrupt-number mode, hardware jumps to:

base + cause * 4

That is why the vector table must stay aligned and consist of 4-byte entries.
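
Pointing mtvec at the table could look like the sketch below. It assumes that bit 0 of mtvec selects the offset-by-interrupt-number mode and that bit 1 left at 0 means each entry holds an instruction, such as j handler, rather than an absolute address. Both assumptions should be checked against the QingKe V2 processor manual. The name setup_mtvec is illustrative; vector_base is the label from the vector table earlier in this article.

```asm
# Sketch: point mtvec at the table and select per-interrupt offsets.
# The meaning of the mode bits is an assumption to verify in the core manual.
.equ MTVEC_MODE_VECTORED, 0x1

setup_mtvec:
    la   t0, vector_base              # table base, resolved at link time
    ori  t0, t0, MTVEC_MODE_VECTORED  # bit 0: offset = interrupt number * 4
    csrw mtvec, t0
    ret
```

This belongs after real handlers replace the default_handler entries and before mstatus.MIE is set again.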

On the CH32V003, the INTSYSCR CSR (0x804) can also enable hardware stacking and interrupt nesting. That is useful for a more serious firmware, but it is not required for the first blink.
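
For reference, enabling both features is a single CSR write. This sketch assumes that bit 0 of INTSYSCR is the hardware-stack enable and bit 1 the nesting enable; verify the bit positions in the core manual before relying on them. The name enable_hw_stacking is illustrative.

```asm
# Sketch: enable hardware stacking and interrupt nesting via CSR 0x804.
# Bit positions are assumptions: bit 0 = hardware stack, bit 1 = nesting.
.equ INTSYSCR_HWSTKEN, (1 << 0)
.equ INTSYSCR_INESTEN, (1 << 1)

enable_hw_stacking:
    li   t0, (INTSYSCR_HWSTKEN | INTSYSCR_INESTEN)
    csrw 0x804, t0                    # INTSYSCR, written by CSR number
    ret
```

The CSR is addressed by number because the assembler does not necessarily know a name for this vendor-specific register. Handlers then have to be written to match the chosen stacking mode.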

Bring-Up Checklist

When nothing starts:

  1. check that .vectors is placed at 0x00000000;
  2. check that vector entries are 4 bytes wide;
  3. check that _stack_top is 0x20000800;
  4. keep interrupts disabled until handlers are ready;
  5. start with one GPIO before UART, SPI, or timers;
  6. enable the GPIO port clock before touching GPIO registers;
  7. switch to 48 MHz only after setting FLASH_ACTLR.LATENCY;
  8. add mtvec, PFIC, and real interrupts only after the minimal boot path is stable.

This is intentionally small. On a chip with 16 KiB of flash and 2 KiB of SRAM, understanding the first ten instructions is often more valuable than starting from an opaque software stack.