pext/pdep are incredible, I'm hoping to see them in more SIMD ISAs in the future.
But my favorite is the 8x8 bit matrix transpose SIMD instruction (gf2p8affine, which does a bit more, but I care about the transpose). Combined with SIMD byte permutes it allows you to do things like: arbitrarily permute bits in SIMD elements, find the inverse of a permutation, and do very fast histogramming/binning.
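For anyone who hasn't used these, here is a rough scalar sketch in C of what pext, pdep and an 8x8 bit transpose compute. It is a slow reference model, not how the hardware or gf2p8affine does it. The function names and the row/column layout (row r is byte r, column c is bit c of that byte) are my own assumptions for illustration.

```c
#include <stdint.h>

/* pext: gather the bits of src selected by mask into the low bits of the
   result, keeping their relative order. */
uint64_t pext64(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    for (unsigned k = 0; mask != 0; mask &= mask - 1, k++) {
        uint64_t lowest = mask & (~mask + 1);   /* lowest set bit left in mask */
        if (src & lowest)
            out |= UINT64_C(1) << k;
    }
    return out;
}

/* pdep: scatter the low bits of src to the positions of the set bits in mask. */
uint64_t pdep64(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    for (unsigned k = 0; mask != 0; mask &= mask - 1, k++) {
        if ((src >> k) & 1)
            out |= mask & (~mask + 1);
    }
    return out;
}

/* Naive 8x8 bit-matrix transpose. Assumed layout: row r is byte r of m,
   column c is bit c within that byte. */
uint64_t transpose8x8(uint64_t m) {
    uint64_t out = 0;
    for (unsigned r = 0; r < 8; r++)
        for (unsigned c = 0; c < 8; c++)
            if ((m >> (8 * r + c)) & 1)
                out |= UINT64_C(1) << (8 * c + r);
    return out;
}
```

Transposing twice gives back the original matrix, which is a handy sanity check for any faster version.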
I always liked rlwimi on PowerPC. It rotates the source n bits, then writes any contiguous section of bits over the corresponding bits in the destination register. This allows copying any bitfield from any position in one register into another. Basically either of these:
out = (out & ~mask) | ((in << shift) & mask)
out = (out & ~mask) | ((in >> shift) & mask)
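A slightly fuller model, as a sketch: the real instruction rotates rather than shifts, and the mask is given as a begin/end bit pair in IBM numbering (bit 0 is the MSB). The helper names `rotl32`, `ppc_mask` and `rlwimi` below are just mine, and this only models the 32-bit result, not condition-register updates.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by n, 0 <= n <= 31. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Mask with ones from IBM bit mb through IBM bit me (bit 0 = MSB).
   If mb > me the mask wraps around, giving a split mask. */
static uint32_t ppc_mask(unsigned mb, unsigned me) {
    uint32_t from_mb = UINT32_C(0xFFFFFFFF) >> mb;        /* IBM bits mb..31 */
    uint32_t upto_me = UINT32_C(0xFFFFFFFF) << (31 - me); /* IBM bits 0..me  */
    return (mb <= me) ? (from_mb & upto_me) : (from_mb | upto_me);
}

/* rlwimi ra, rs, sh, mb, me: rotate rs left by sh, then write the masked
   bits over the corresponding bits of ra. */
uint32_t rlwimi(uint32_t ra, uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
    assert(sh < 32 && mb < 32 && me < 32);
    uint32_t m = ppc_mask(mb, me);
    return (rotl32(rs, sh) & m) | (ra & ~m);
}
```

With MSB-first numbering, a field sitting at LSB-numbered bits 10..1 is mb=21, me=30, which is the part that trips most people up.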
Z80's EXX to swap with the shadow registers was interesting (meant for fast interrupt response so you didn't have to save registers to memory).
Definitely a nice and pretty much pioneering feature on PowerPC in 1994 (and I guess RS/6000 before that, but I never used one).
Today's Arm64 BFM does both those jobs in one, minus the ability to create a split mask via rotating, but plus adding a choice of sign or zero extension to extracted fields (including extracted to the same place they already were, for pure sign/zero extension). As a result it's got about 100 aliases.
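To make the "both jobs in one" point concrete, here is a sketch in C of how I read the 64-bit BFM semantics (the non-extending variant only, so no UBFM/SBFM zero or sign extension). `bfm64` and `low_ones` are illustrative names, not anything from Arm's documentation.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t low_ones(unsigned width) {
    return width >= 64 ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;
}

/* Model of 64-bit BFM Xd, Xn, #immr, #imms. Returns the new value of Xd. */
uint64_t bfm64(uint64_t dst, uint64_t src, unsigned immr, unsigned imms) {
    assert(immr < 64 && imms < 64);
    if (imms >= immr) {
        /* Extract form (BFXIL alias): src bits [imms:immr] go to the low
           bits of dst; the rest of dst is kept. */
        uint64_t mask = low_ones(imms - immr + 1);
        return (dst & ~mask) | ((src >> immr) & mask);
    } else {
        /* Insert form (BFI alias): the low imms+1 bits of src go to dst
           starting at bit lsb = 64 - immr; the rest of dst is kept. */
        unsigned lsb = 64 - immr;
        uint64_t mask = low_ones(imms + 1) << lsb;
        return (dst & ~mask) | ((src << lsb) & mask);
    }
}
```

So `bfi x1, x2, #1, #10` corresponds to immr = 63, imms = 9 in this model.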
It would be nice to have these in RISC-V but they seriously violate the quite strict "Stanford Standard RISC" 2R1W principle that keeps the RISC-V integer pipeline simple (smaller, faster, cheaper).
When working in the "B" extension working group I suggested adopting the M88000 bitfield instructions which follow the 2R1W principle. Someone had an objection to encoding both field width and offset into a single constant (or `Rs2`), though I think it's well worth it. M88k as a 32 bit ISA used 5 bits for each, but 6 bits for each for RV64 fits RISC-V's 12 bit immediates perfectly.
- ext / extu: Extract signed or unsigned bit field from a register. You specify offset (starting bit position) and width. The extracted field is right-justified (shifted to the low bits) in the destination, with sign-extension or zero-extension.
- mak: Make (insert) a bit field. Takes a value, shifts it left by the offset, and inserts it into the destination while clearing the target field first (or combining in specific ways).
- set: Set (force to 1) a contiguous bit field in a register.
- clr: Clear (force to 0) a contiguous bit field in a register.
All take `Rd`, `Rs1` and a field size:offset as either a literal or as `Rs2`.
Unfortunately, the R-type `mak` violates 2R1W because the `Rd` is also a source, which complicates OoO implementations, making them 3R1W. RISC-V could use an alternative formulation in which `mak` (or some other name) masks off the source field and shifts it into place, and then the insert is completed using `clr` and `or`.
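A minimal C sketch of those operations in the 2R1W-friendly formulation, to show how little each one does. Everything here is hypothetical: the `fld_*` names, the spec layout (offset in bits [11:6], width in bits [5:0]), and treating width 0 as the full 64 bits are my assumptions, not an actual proposal text.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { unsigned offset, width; } field_t;

/* Assumed 12-bit spec: offset in bits [11:6], width in bits [5:0],
   with width 0 meaning 64. */
static field_t decode_spec(uint64_t spec) {
    field_t f = { (unsigned)((spec >> 6) & 63), (unsigned)(spec & 63) };
    if (f.width == 0) f.width = 64;
    assert(f.offset + f.width <= 64);
    return f;
}

static uint64_t low_ones(unsigned width) {
    return width >= 64 ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;
}

/* extu: field moved to the low bits, zero-extended. */
uint64_t fld_extu(uint64_t rs1, uint64_t spec) {
    field_t f = decode_spec(spec);
    return (rs1 >> f.offset) & low_ones(f.width);
}

/* ext: field moved to the low bits, sign-extended (two's complement pattern). */
uint64_t fld_ext(uint64_t rs1, uint64_t spec) {
    field_t f = decode_spec(spec);
    uint64_t v = (rs1 >> f.offset) & low_ones(f.width);
    uint64_t sign = UINT64_C(1) << (f.width - 1);
    return (v ^ sign) - sign;
}

/* Modified mak: mask the low `width` bits and shift them into place;
   all other result bits are 0, so Rd is not read. */
uint64_t fld_mak(uint64_t rs1, uint64_t spec) {
    field_t f = decode_spec(spec);
    return (rs1 & low_ones(f.width)) << f.offset;
}

uint64_t fld_set(uint64_t rs1, uint64_t spec) {
    field_t f = decode_spec(spec);
    return rs1 | (low_ones(f.width) << f.offset);
}

uint64_t fld_clr(uint64_t rs1, uint64_t spec) {
    field_t f = decode_spec(spec);
    return rs1 & ~(low_ones(f.width) << f.offset);
}

/* Full insert = clr on the destination, mak on the source, then or. */
uint64_t fld_insert(uint64_t rd, uint64_t rs1, uint64_t spec) {
    return fld_clr(rd, spec) | fld_mak(rs1, spec);
}
```

Each primitive reads at most two values and writes one; only the composed `fld_insert` needs the old destination as a third input.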
On the other hand the forms with 12 bit literals are expensive in encoding space, but even including just the `Rs2` versions would be great, especially as often several instructions in a row can use the same field specification, which fits `addi Rd,zero,imm12` (aka `li`) perfectly.
On the gripping hand, while the immediate version of `mak` violates RISC-V convention by making the `Rd` also a source, any real pipeline is going to have fields for all of `Rd`, `Rs1`, `Rs2`, and `imm32` so only the decoder is affected.
Also, `ext` / `extu` are not needed as a pair of C-extension shifts do the same job with the same code size, and can be decoded into a single µop on a higher end CPU if desired.
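The shift-pair trick, sketched in C for a 64-bit register (function names are mine). The left shift parks the top of the field at bit 63, then a logical or arithmetic right shift brings it down with zero or sign extension.

```c
#include <assert.h>
#include <stdint.h>

/* extu as slli + srli. */
uint64_t extu_by_shifts(uint64_t x, unsigned offset, unsigned width) {
    assert(width >= 1 && offset + width <= 64);
    x <<= 64 - offset - width;      /* slli: top bit of field to bit 63 */
    return x >> (64 - width);       /* srli: zero-extend down to bit 0  */
}

/* ext as slli + srai. Note: >> on a negative signed value is
   implementation-defined in ISO C; mainstream compilers shift arithmetically. */
int64_t ext_by_shifts(uint64_t x, unsigned offset, unsigned width) {
    assert(width >= 1 && offset + width <= 64);
    int64_t t = (int64_t)(x << (64 - offset - width));   /* slli */
    return t >> (64 - width);                            /* srai */
}
```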
As an example: take a 10 bit field at offset 21 and insert into a destination at offset 1 (this is part of decoding RISC-V J/JAL instructions).
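As a reference point for what each sequence has to compute, here is the same operation in plain C. The function name is just for illustration.

```c
#include <stdint.h>

/* Copy bits [30:21] of insn into bits [10:1] of imm, leaving every other
   bit of imm unchanged. */
uint32_t insert_jal_imm_10_1(uint32_t imm, uint32_t insn) {
    const uint32_t mask = UINT32_C(0x3FF) << 1;    /* 0x7FE: bits [10:1] */
    return (imm & ~mask) | ((insn >> 20) & mask);  /* 21 - 1 = shift by 20 */
}
```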
PowerPC:
rlwimi r4, r3, 12, 21, 30 # rotate left 12 (= right 20), insert under mask MB=21..ME=30 (IBM numbering, i.e. bits [10:1])
Arm64:
ubfx x2, x0, #21, #10 # extract bits[30:21] → low 10 bits of x2 (unsigned)
bfi x1, x2, #1, #10 # insert those 10 bits into x1 starting at bit 1
Alternatively, using `ubfm` / `bfm` directly without aliases (exactly the same instructions, just trickier to get right):
ubfm x2, x0, #21, #30 # immr = lsb, imms = lsb + width - 1
bfm x1, x2, #63, #9 # immr = (64 - lsb) % 64, imms = width - 1
M88k:
extu r3, r1, 21, 10 # extract 10-bit field starting at bit 21 → low bits of r3
mak r2, r3, 1, 10 # make/insert the field at bit 1 in destination
RISC-V:
srli x12, x10, 21 # shift field down to low bits
andi x12, x12, 0x3FF # mask to 10 bits
slli x12, x12, 1 # position at bit 1 (for imm[10:1])
li x13, ~0x7FE # mask to clear bits [10:1] only
and x11, x11, x13
or x11, x11, x12 # insert the field
RISC-V with some M88k inspiration:
extui r3, r1, 21, 10 # extract 10-bit field starting at bit 21 → low bits of r3
maki r4, r3, 1, 10 # modified mak: masks + shifts field to bits [10:1] (others 0)
clri r2, r2, 1, 10 # clear the target field in destination
or r2, r2, r4 # insert the prepared field
Alternatively
li t0, (1<<6) | 10 # specification for insertion bit field
srli a3, a1, 21 # shift 10-bit field starting at bit 21 → low bits of a3
mak a4, a3, t0 # modified mak: masks + shifts field to bits [10:1] (others 0)
clr a2, a2, t0 # clear the target field in destination
or a2, a2, a4 # insert the prepared field
Again, this last formulation of `maki` violates RISC-V instruction format convention in making `a2` both src and dst, BUT if the decoder handles that then the expanded form does NOT cause any issues with the pipeline implementation.
rlwimi was a nice one, especially for emulators.
And it also had eieio, Enforce In-Order Execution of I/O.
> rlwimi / rlwinm
Bitfield insert/extract was also looked at by the scalar efficiency SIG: https://lists.riscv.org/g/sig-scalar-efficiency/topic/115060...
IIRC it didn't go anywhere, because it wasn't worth the encoding space.
But a rlwimi sounds like a good candidate for >32b encoding.
HCF - Halt and Catch Fire.