It wouldn't be surprising if the RP2350 gets officially certified to run at something above the max supported clock at launch (150MHz), though obviously nothing close to 800MHz. That happened to the RP2040[1], which at launch nominally supported 133MHz but now it's up to 200MHz (the SDK still defaults to 125MHz for compatibility, but getting 200MHz is as simple as toggling a config flag[2]).
The 300MHz, 400MHz, and 500MHz points require only 1.1V, 1.3V, and 1.5V respectively, with only the last one getting slightly above body temperature even with no cooling. That seems like something that maybe shouldn't be "officially" supported, but could at least be mentioned in an official blog post or the docs. Getting 3x+ the performance with some config changes is noteworthy. It would be interesting to run an experiment to see whether there's any measurable degradation in stability or increased likelihood of failure at those settings compared to a stock unit running the same workload for the same time.
[1] https://www.tomshardware.com/raspberry-pi/the-raspberry-pi-p...
[2] https://github.com/raspberrypi/pico-sdk/releases/tag/2.1.1
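For reference, a minimal sketch of what that kind of "config change" looks like with the C SDK. `vreg_set_voltage()` and `set_sys_clock_khz()` are the usual pico-sdk calls; the 400MHz / 1.3V pairing below is just the data point quoted above, not an officially supported setting, so treat it as an at-your-own-risk experiment (higher voltages and clocks need extra steps, and flash timing may also need attention).

```c
// Minimal overclock sketch for the pico-sdk (not an endorsed configuration):
// raise the core regulator first, then request the higher system clock.
#include <stdio.h>
#include "pico/stdlib.h"
#include "hardware/vreg.h"
#include "hardware/clocks.h"

int main(void) {
    vreg_set_voltage(VREG_VOLTAGE_1_30);   // ~1.30 V core, per the 400 MHz point above
    sleep_ms(10);                          // let the regulator settle
    set_sys_clock_khz(400000, true);       // panic if no valid PLL config exists

    stdio_init_all();                      // init stdio after the clock change
    while (true) {
        printf("sys clk: %u kHz\n", (unsigned)(clock_get_hz(clk_sys) / 1000));
        sleep_ms(1000);
    }
}
```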
Aurornis 1 days ago [-]
All of their reliability testing and validation happens at the lower voltages and speeds. I doubt they'd include anything in the official docs lest they be accused of officially endorsing something that might later turn out to reduce longevity.
londons_explore 1 days ago [-]
When pushing clock speeds, things get nondeterministic...
Here is an idea for a CPU designer...
Observe that you can get way more performance (increased clock speed) or more performance per watt (lower core voltage) if you are happy to lose reliability.
Also observe that many CPUs do superscalar, out-of-order execution, which requires the ability to backtrack; this is normally implemented with a queue and a 'commit' phase.
Finally, observe that verifying this commit queue is a fully parallel operation, and can therefore be checked more slowly and in a more power-efficient way.
So, here's the idea. You run a blazing fast superscalar CPU well past the safe clock speed limit, so that it makes hundreds of computation or flow-control mistakes per second. You have slow but parallel verification circuitry to verify the execution trace. Whenever a mistake is made, you put a pipeline bubble in the main CPU, clear the commit queue, put in the correct result from the verification system, and continue - just like you would with a branch misprediction.
This happening a few hundred times per second will have a negligible impact on performance: with a 100-cycle 'reset' penalty, 100 mistakes/s * 100 cycles is only ~10,000 wasted cycles out of ~4 billion at 4GHz.
The main fast CPU could also make deliberate mistakes - for example assuming floats aren't NaN, assuming division won't be by zero, etc. Trimming off rarely used logic makes the core smaller, making it easier to make it even faster or more power efficient (since wire length determines power consumption per bit).
You could run an LLM like this, and the temperature parameter would become an actual thing...
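(Not a hardware design, but here's a toy software model of the overhead math; the error rate and rollback penalty are made-up parameters echoing the "100 cycles, a few hundred mistakes per second" figures above.)

```c
// Toy model of the scheme: a "fast core" occasionally produces a wrong result,
// a trailing verifier recomputes everything, and each caught mistake costs a
// fixed rollback penalty. Parameters are illustrative, not measurements.
#include <stdio.h>
#include <stdlib.h>

#define OPS             100000000ULL  // operations executed by the fast core
#define ERROR_PROB      1e-6          // chance any one result is corrupted
#define ROLLBACK_CYCLES 100ULL        // flush penalty per detected mistake

static unsigned long long work(unsigned long long x) {
    return x * 2654435761ULL + 12345;  // stand-in for real computation
}

int main(void) {
    unsigned long long x = 1, cycles = 0, rollbacks = 0;

    for (unsigned long long i = 0; i < OPS; i++) {
        unsigned long long good = work(x);   // what the verifier computes
        unsigned long long fast = good;      // what the fast core produced...
        if ((double)rand() / RAND_MAX < ERROR_PROB)
            fast ^= 1;                       // ...sometimes corrupted

        cycles += 1;                         // one cycle per op when correct
        if (fast != good) {                  // verifier catches the mismatch
            cycles += ROLLBACK_CYCLES;       // pipeline bubble + restart
            rollbacks++;
        }
        x = good;                            // only verified state commits
    }

    printf("rollbacks: %llu, cycle overhead: %.5f%%\n",
           rollbacks, 100.0 * (double)(cycles - OPS) / (double)OPS);
    return 0;
}
```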
boznz 1 days ago [-]
Totally logical, especially with some sort of thermal mass, as you can throttle down the clock when quiet to cool down afterwards. I used this concept in my first sci-fi novel, where the AI was aware of its temperature for these reasons. I run my Pico 2 board in my MP3 jukebox at 250MHz; it has been on for several weeks without missing a beat (pun intended)
tliltocatl 1 days ago [-]
LLMs are memory-bandwidth bound, so higher core frequency would not help much.
ssl-3 1 days ago [-]
How do we know if a computation is a mistake? Do we verify every computation?
If so, then:
That seems like it would slow the ultimate computation to no more than the rate at which these computations can be verified.
That makes the verifier the ultimate bottleneck, and the other (fast, expensive -- like an NHRA drag car) pipeline becomes vestigial since it can't be trusted anyway.
moffkalast 1 days ago [-]
Well, the point is that verification can run in parallel, so if you can verify at 500MHz and have twenty of these units, you can run the core at 10 GHz. Minus, of course, the fixed single-instruction verification time penalty, which gets more and more negligible the more parallel you go. Of course there is lots of overhead in that too, as GPUs painfully show.
ssl-3 1 days ago [-]
Right.
So we have 20 verifiers running at 500MHz, and this stack of verifiers is trustworthy. It does reliably-good work.
We also have a single 10GHz CPU core, and this CPU core is not trustworthy. It does spotty work (hence the verifiers).
And both of these things (the stack of verifiers, the single CPU core) peak out at exactly the same computational speeds. (Because otherwise, the CPU's output can't be verified.)
Sounds great! Except I can get even better performance from this system by just skipping the 10GHz CPU core, and doing all the work on the verifiers instead.
("Even better"? Yep. Unlike that glitch-ass CPU core, the verifiers' output is trustworthy. And the verifiers accomplish this reliable work without that extra step of occasionally wasting clock cycles to get things wrong.
If we know what the right answer is, then we already know the right answer. We don't need to have Mr. Spaz compute it in parallel -- or at all.)
firefly2000 1 days ago [-]
If the workload were perfectly parallelizable, your claim would be true. However, if it has serial dependency chains, it is absolutely worth it to compute it quickly and unreliably and verify in parallel
magicalhippo 1 days ago [-]
This is exactly what speculative decoding for LLMs does, and it can yield a nice boost.
A small, hence fast, model predicts the next tokens serially. Then a batch of tokens is validated by the main model in parallel. If there is a mismatch you reject the speculated token at that position and all subsequent speculated tokens, take the correct token from the main model, and restart speculation from that point.
If the predictions are good and the batch parallelism efficiency is high, you can get a significant boost.
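A bare-bones sketch of that greedy draft-then-verify loop, with both "models" replaced by trivial stand-ins (the draft is deliberately wrong now and then so the rollback path actually runs); only the control flow is the point.

```c
// Greedy speculative decoding sketch: draft K tokens serially, check all K
// positions against the main model in one batch, keep tokens up to (and
// including) the first correction. Both "models" here are placeholders.
#include <stdio.h>

#define K     4    // speculation depth
#define LIMIT 24   // total tokens to generate

static int main_next(const int *ctx, int len) {   // "expensive" main model
    return (int)(((unsigned)ctx[len - 1] * 1103515245u + 12345u) & 0xFFu);
}
static int draft_next(const int *ctx, int len) {  // "cheap" draft model
    int t = main_next(ctx, len);
    return (ctx[len - 1] % 7 == 0) ? (t ^ 1) : t; // deliberately wrong sometimes
}

int main(void) {
    int out[LIMIT + K] = {42};   // out[0] is the prompt
    int len = 1;

    while (len < LIMIT) {
        // 1. Draft up to K tokens serially with the cheap model.
        int n = 0;
        for (; n < K && len + n < LIMIT; n++)
            out[len + n] = draft_next(out, len + n);

        // 2. "Batch-verify" with the main model: every drafted position's
        //    input prefix is already known, so on real hardware these checks
        //    run in parallel. Stop at the first mismatch.
        int accepted = 0, corrected = 0;
        for (int i = 0; i < n && !corrected; i++) {
            int truth = main_next(out, len + i);
            accepted = i + 1;
            if (truth != out[len + i]) {  // mismatch: take the main model's
                out[len + i] = truth;     // token and drop the rest of the draft
                corrected = 1;
            }
        }
        len += accepted;
    }

    for (int i = 0; i < len; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```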
firefly2000 24 hours ago [-]
I have a question about what "validation" means exactly. Does this process work by having the main model compute the "probability" that it would generate the draft sequence, then probabilistically accepting the draft? Wondering if there is a better method that preserves the distribution of the main model.
magicalhippo 20 hours ago [-]
> Does this process work by having the main model compute the "probability" that it would generate the draft sequence, then probabilistically accepting the draft?
It does the generation as normal using the draft model, thus sampling from the draft model's distribution for a given prefix to get the next (speculated) token. But it then uses the draft model's distribution and the main model's distribution for the given prefix to probabilistically accept or reject the speculated token, in a way which guarantees the distribution used to sample each token is identical to that of the main model.
The paper has the details[1] in section 2.3.
The inspiration for the method was indeed speculative execution as found in CPUs.
[1]: https://arxiv.org/abs/2211.17192 Fast Inference from Transformers via Speculative Decoding
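A sketch of how I understand that accept/reject step (section 2.3 of the paper above): p[] is the main model's distribution at the position, q[] the draft's, and x the token the draft sampled. Keeping x with probability min(1, p[x]/q[x]) and otherwise resampling from the renormalised residual max(0, p - q) is what makes the emitted token an exact sample from p.

```c
// Accept/reject step of speculative sampling, plus a tiny Monte Carlo check
// that the emitted tokens really follow the main model's distribution p.
#include <stdio.h>
#include <stdlib.h>

static double urand(void) {               // uniform in [0, 1)
    return (double)rand() / ((double)RAND_MAX + 1.0);
}

static int sample_from(const double *d, int vocab) {  // inverse-CDF sampling
    double r = urand(), acc = 0.0;
    for (int i = 0; i < vocab; i++) {
        acc += d[i];
        if (r < acc) return i;
    }
    return vocab - 1;
}

// p: main model distribution, q: draft distribution, x: token drawn from q.
int accept_or_resample(const double *p, const double *q, int vocab, int x) {
    if (q[x] > 0.0 && urand() * q[x] < p[x])   // accept w.p. min(1, p[x]/q[x])
        return x;

    double z = 0.0;                            // mass of the residual max(0, p - q)
    for (int i = 0; i < vocab; i++)
        if (p[i] > q[i]) z += p[i] - q[i];

    double r = urand() * z, acc = 0.0;         // inverse-CDF sample from the residual
    for (int i = 0; i < vocab; i++) {
        if (p[i] > q[i]) acc += p[i] - q[i];
        if (r < acc) return i;
    }
    return vocab - 1;                          // numerical safety net
}

int main(void) {
    // Tiny 3-token demo: draw from the draft q, run the accept/reject step,
    // and the emitted tokens end up distributed like the main model p.
    const double p[3] = {0.2, 0.5, 0.3};
    const double q[3] = {0.6, 0.3, 0.1};
    int counts[3] = {0, 0, 0};
    for (int i = 0; i < 100000; i++) {
        int x = sample_from(q, 3);
        counts[accept_or_resample(p, q, 3, x)]++;
    }
    printf("emitted frequencies: %.3f %.3f %.3f (target: 0.2 0.5 0.3)\n",
           counts[0] / 1e5, counts[1] / 1e5, counts[2] / 1e5);
    return 0;
}
```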
Haha, well you do have a point there. I guess I had the P!=NP kind of verification in my head, where it's easy to check whether something is right but not as easy to compute the result. If one could build these verifiers on some kind of checksum basis or something it might still make sense, but I'm not sure if that's possible.
ant6n 23 hours ago [-]
You can verify in 100-way parallel and without dependence, but you can’t do it with general computation.
hulitu 1 days ago [-]
> if you are happy to lose reliability.
The only problem here is that reliability is a statistical thing.
You might be lucky, you might not.
hnuser123456 1 days ago [-]
Side channel attacks don't stand a chance!
Avlin67 1 days ago [-]
You never had WHEA errors... or PLL issues on CPU C-state transitions...
Tepix 1 days ago [-]
Both the RP2040 and the RP2350 are amazing value these days with most other electronics increasing in price. Plus you can run FUZIX on them for the UNIX feel.
sandreas 1 days ago [-]
Mmh... I think that the LicheeRV Nano offers kind of more value.
Around 20 bucks for the WiFi variant: 1GHz, 256MB RAM, USB OTG, GPIO and full Linux support, while drawing less than 1W without any power optimizations. It even supports <$15 2.8" LCDs out of the box.
And Rust can be compiled to be used with it...
https://github.com/scpcom/LicheeSG-Nano-Build/
Take a look at the `best-practise.md`.
It is also the base board of NanoKVM[1]
1: https://github.com/sipeed/NanoKVM
I think the ace up the sleeve is PIO; I've seen so many weird and wonderful use cases for the Pico/RP-chips enabled by this feature, that don't seem replicable on other $1-class microcontrollers.
sandreas 1 days ago [-]
Wow thanks, this is definitely something I have to investigate. Maybe the Sipeed Maix SDK provides something similar for the LicheeRV Nano.
I'm currently prototyping a tiny portable audio player[1] whose battery life could benefit a lot from this.
1: https://github.com/sandreas/rust-slint-riscv64-musl-demo
I'd rather have the Linux SOC and a $0.50-$1 FPGA (Renesas ForgeFPGA, Gowin, Efinix, whatever) nearby.
rasz 1 days ago [-]
> $0.50-$1 FPGA
no such thing, 5V tolerant buffers will run you more than that
addaon 1 days ago [-]
The ICE40s start well under $2 even in moderate quantities. They’re 3V3, not 5V0, but for most applications these days that’s an advantage.
RetroTechie 1 days ago [-]
Amazing value indeed!
That said: it's a bit sad there's so little (if anything) in the space between microcontrollers & feature-packed, Linux-capable SoCs.
I mean: these days a multi-core, 64-bit CPU & a few GBs of RAM seem to be the absolute minimum for smartphones, tablets etc, let alone desktop-style work. But remember that around Y2K masses of people were using single-core, sub-1GHz CPUs with a few hundred MB of RAM or less. And running full-featured GUIs, Quake 1/2/3 & co, web surfing etc etc on that. GUIs have been done on sub-1MB RAM machines once.
Microcontrollers otoh seem to top out at ~512KB RAM. I for one would love a part with integrated:
# Multi-core, but 32 bit CPU. 8+ cores cost 'nothing' in this context.
# Say, 8 MB+ RAM (up to a couple hundred MB)
# Simple 2D graphics, maybe a blitter, some sound hw etc
# A few options for display output. Like, DisplayPort & VGA.
Read: relatively low complexity, but with the speed & power-efficient integration of modern ICs. The RP2350pc goes in this direction, but just isn't (quite) there.
vardump 23 hours ago [-]
IIRC, you can use up to 16 MB of PSRAM with RP2350. Maybe up to 32 MB, not sure.
Many dev boards provide 8 MB PSRAM.
alnwlsn 1 days ago [-]
You might like the ESP32-P4
moffkalast 1 days ago [-]
Eh, it's really not when you consider that the ESP32 exists. It has PCNT units for encoders, RMT LED drivers, 18 ADC channels instead of four, a ULP coprocessor and various low-power modes, not to mention wifi integrated into the SoC itself, not optional on the carrier board. And it's like half the price on top of all that. It's not even close.
The PIO units on the RP2040 are... overrated. Very hard to configure, badly documented, and there are only 8 in total. WS2812 control from the Pico is unreliable at best in my experience.
vardump 23 hours ago [-]
They are just different tools; both have their uses. I wouldn't really put either above the other by default.
> And it's like half the price on top of all that. It's not even close.
A reel of 3,400 RP2350 units costs $0.80 each, while a single unit is $1.10. The RP2040 is $0.70 each in a similar size reel. Are you sure about your figures, or are you perhaps comparing development boards rather than SoCs? If you’re certain, could I have a reference for ESP32s being sold at $0.35 each (or single quantities at $0.55)?
PIO units may be tricky to configure, but they're incredibly versatile. If you aren't comfortable writing PIO code yourself, you can always rely on third-party libraries. Driving HDMI? Check. Supporting an obscure, 40-year-old protocol that nothing else handles? Check. The possibilities are endless.
I find it hard to believe the RP2040 would have any issues driving WS2812s, provided everything is correctly designed and configured. Do you have any references for that?
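For what it's worth, the standard recipe is tiny; a minimal sketch below, assuming the ws2812.pio program and the ws2812_program_init() helper that pioasm generates from it (both from the pico-examples repo) are part of the build. Pin number and colour value are arbitrary choices for illustration.

```c
// Minimal WS2812-via-PIO sketch. The PIO state machine handles the 800 kHz
// bit timing in hardware; the CPU just pushes 24-bit GRB words into the FIFO.
#include "pico/stdlib.h"
#include "hardware/pio.h"
#include "ws2812.pio.h"        // generated by pioasm from pico-examples' ws2812.pio

#define WS2812_PIN 16          // arbitrary pin for this sketch
#define NUM_LEDS   8

static void put_pixel(PIO pio, uint sm, uint32_t grb) {
    pio_sm_put_blocking(pio, sm, grb << 8u);   // program shifts out 24 bits, MSB first
}

int main(void) {
    PIO pio = pio0;
    uint sm = pio_claim_unused_sm(pio, true);
    uint offset = pio_add_program(pio, &ws2812_program);
    ws2812_program_init(pio, sm, offset, WS2812_PIN, 800000, false);

    while (true) {
        for (int i = 0; i < NUM_LEDS; i++)
            put_pixel(pio, sm, 0x100800);      // dim colour, GRB order
        sleep_ms(500);
    }
}
```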
tliltocatl 6 hours ago [-]
> wifi integrated into the SoC
I really wish we would stop sticking wireless in every device. The spectrum is limited and the security concerns are just not worth it. And if you try to sell it, certification will be an RPITA even in the US (rightfully so!). I just had to redesign a little Modbus RTU sensor prototype for mass production and noticed the old version used a BT MCU. So I immediately imagined the certification nightmare; and the sensor is deployed underwater, so it's not like BT will be useful anyway. Why? Quote: "but how do we update firmware without a wireless connection"… How do you update firmware on a device with an RS-485 out, a puzzle indeed. In all fairness, the person who did it was by no means a professional programmer and wasn't supposed to know. But conditioning beginners to put wireless on everything - that's just evil. /rant
antirez 1 days ago [-]
What I love about the Pico overclock story is that, sure, not at 870MHz, but otherwise you can basically take for granted that at 300MHz and without any cooling it is rock solid, and many units at 400MHz too.
amluto 1 days ago [-]
It’s amusing to contemplate energy per cycle as one clocks higher and higher — the usual formula has the energy per cycle scaling roughly as voltage squared.
I recently turned turbo off on a small, lightly loaded Intel server. This reduced power by about a factor of 2, dropped core temperatures by 30-40°C, and allowed running the fans much quieter. I’m baffled as to why the CPU didn’t do this on its own. (Apple gets these details right. Intel, not so much.)
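To put rough numbers on the energy-per-cycle point: with the usual dynamic-power approximation P ~ C*V^2*f, energy per cycle goes as C*V^2, and the effective capacitance C cancels out of any ratio. The little calculation below uses the RP2350 voltage/frequency pairs quoted upthread (stock assumed to be 150MHz at a nominal 1.1V) and ignores leakage and I/O power.

```c
// Rough dynamic-power arithmetic: energy/cycle ~ C*V^2, power ~ C*V^2*f,
// expressed as ratios against the stock point so C drops out entirely.
#include <stdio.h>

int main(void) {
    const double pts[][2] = {   // {MHz, volts}
        {150, 1.10},            // assumed stock RP2350 operating point
        {300, 1.10}, {400, 1.30}, {500, 1.50},
    };
    const double e0 = pts[0][1] * pts[0][1];   // ~ energy/cycle at stock
    const double p0 = pts[0][0] * e0;          // ~ power at stock

    for (int i = 0; i < 4; i++) {
        double e = pts[i][1] * pts[i][1];
        double p = pts[i][0] * e;
        printf("%3.0f MHz @ %.2f V: energy/cycle x%.2f, power x%.2f\n",
               pts[i][0], pts[i][1], e / e0, p / p0);
    }
    return 0;
}
```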
fluoridation 1 days ago [-]
It reduced the temperature by 30°? So it originally was "lightly loaded" and running at 60-70° C?
amluto 1 days ago [-]
More like 80-90 before and around 50 afterward.
This is a boring NVR workload with a bit of GPU usage, with total system utilization around 10% with turbo off. Apparently the default behavior is to turbo from the normal ~3GHz up to 5.4GHz, and I don’t know why the results were quite so poor.
This is an i9-13900H (Minisforum MS-01) machine, so maybe it has some weird tuning for gaming workloads? Still seems a bit pathetic. I have not tried monitoring the voltages with turbo on and off to understand exactly why it’s performing quite so inefficiently.
fluoridation 1 days ago [-]
If it was running at over 80° C it was not lightly loaded, it was pegging one or two cores and raising the clocks as high as they could go. That's what gives the best instantaneous performance. It's possible in your particular case that doesn't give the best instructions/J because you're limited by real world constraints (such as the capture rate of the camera), but the data comes fast enough to not give the CPU time to switch back down to a lower power state. Or it's also possible that the CPU did manage to reach a lower power state, but the dinky cooling solution was not able to make up for the difference. I'd monitor the power usage at each setting.
whiskers 1 days ago [-]
Haha — this was a fun day! It's honestly surprising how robust the RP2350 was under such extreme experimentation. Mike's write-up walks through pushing the core voltages far beyond stock limits and dry-ice cooling to see what the silicon could handle.
Credit where it's due: Mike is a wizard. He's been involved in some of our more adventurous tinkering, and his input on the more complex areas of our product software has been invaluable. Check out his GitHub for some really interesting projects: https://github.com/MichaelBell
Blatant plug: We have a wide range of boards based on the RP2350 for all sorts of projects! https://shop.pimoroni.com/collections/pico :-)
Remembering pushing an i7 920 on dry ice with acetone back in the day... also voltmodding the nForce 2 chipset to crank the bus clock for an Opteron 144. So cool!
nottorp 1 days ago [-]
Well, hope no one tries to deploy overclocked Raspberry Pi hardware in production... especially for kiosk-style applications where they're in a metal box in the sun.
They're unstable enough at stock if taken outside an air conditioned room.
crest 1 days ago [-]
The post is about a microcontroller that sips a fraction of a watt under sane conditions. Cooling its CPU cores is not a problem for real-world applications. You have to bypass the internal voltage regulator and crank up the voltage even more before heat becomes an issue.
whiskers 1 days ago [-]
This is about the Raspberry Pi Pico 2 (based on the RP2350), not the original Raspberry Pi.
nottorp 1 days ago [-]
And is it better with bad cooling?
aaronmdjones 1 days ago [-]
It's better with absolutely no cooling. It doesn't even consume (and thus dissipate) 100mW flat-out.
nottorp 2 hours ago [-]
Maybe they should have branded it differently …
whiskers 2 hours ago [-]
They did, it's the Raspberry Pi Pico (as opposed to the Raspberry Pi) as a dev board or the RP2350 (as opposed to the BCMXXXX) as a chip.
whiskers 1 days ago [-]
Yes.