Perceptual Image Codec: What Matters in Practical Learned Image Compression (apple.github.io)

122 points by ksec 1 days ago | 40 comments

srean 14 hours ago [-]

There are a few comments on the unnatural/hallucinated look of the compressed images.

I don't like it, but this is sort of the expected behavior for 'AI' based compression and denoising.

Think of it as encoding followed by decoding where encoding reduces an image to a 'prompt' and decoding a generation of a new image from that prompt.

These techniques have become quite popular in astrophotography and it makes me uncomfortable.

Note though that this strategy is not that fundamentally different from the classic method. There the steps are called analysis followed by synthesis.

The analysis step measures how strongly each preferred basis function (for example, the discrete cosine basis) is present; undesirable components are discarded or down-weighted. Then, in the synthesis phase, the bases, weighted by their detected or accentuated strengths, are recombined to obtain the image.
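To make the analysis/synthesis idea concrete, here is a minimal Python sketch using an orthonormal 2-D DCT on a single block. The helper names (`dct_matrix`, `analyze`, `synthesize`) and the keep-the-largest-coefficients rule are assumptions made for illustration; this is not how JPEG or any particular codec actually quantizes.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix: row k holds the k-th cosine basis function."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] /= np.sqrt(2.0)  # rescale the constant (DC) row so rows are orthonormal
    return m

def analyze(block: np.ndarray, keep: int) -> np.ndarray:
    """Analysis: project a square block onto the cosine bases, keep the strongest ones."""
    d = dct_matrix(block.shape[0])
    coeffs = d @ block @ d.T
    if keep <= 0:
        return np.zeros_like(coeffs)
    magnitudes = np.abs(coeffs).ravel()
    if keep < magnitudes.size:
        # Illustrative rule (an assumption): zero everything below the keep-th largest.
        threshold = np.sort(magnitudes)[-keep]
        coeffs = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)
    return coeffs

def synthesize(coeffs: np.ndarray) -> np.ndarray:
    """Synthesis: recombine the bases, weighted by the retained coefficients."""
    d = dct_matrix(coeffs.shape[0])
    return d.T @ coeffs @ d

# Toy 8x8 block: a smooth diagonal gradient.
x = np.arange(8, dtype=float)
block = np.add.outer(x, x)
approx = synthesize(analyze(block, keep=4))
```

Keeping all 64 coefficients of an 8x8 block reproduces it exactly; keeping fewer yields an approximation built only from the retained cosine bases, which is where the familiar softness comes from.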

What's different this time is the nature of the bases and the Bayesian prior they encode.

perching_aix 8 hours ago [-]

> Think of it as encoding followed by decoding where encoding reduces an image to a 'prompt' and decoding a generation of a new image from that prompt.

This is exactly the type of metaphor where the fact that it's a metaphor doesn't survive very long, while also not even being necessary in the first place. So on the contrary, I hope this thought doesn't spread.

> 'AI'

There's no reason for the apostrophes. It's an application of the field, named just that since 1956, 70 years ago.

In agreement with your concerns though.

srean 4 hours ago [-]

AI is too nebulous, too ill-defined, to be used without quotes. I'm almost serious.

Also the 'prompt' is not a textual prompt of course, but its vector embedded representation.

dahart 1 days ago [-]

Looks very cool, assuming all the comparisons are correct & fair and there are no major failure cases. Quick link to the HTML version of the paper to save you a couple of clicks: https://arxiv.org/html/2605.05148v1

Since this is by Apple, I’m certainly curious if this is aimed at becoming the new default format for Apple devices. What kind of effort does it take to do that, beyond getting the paper published?

On the PR summary page, the “speed” column should be labeled “time”. Time is lower-is-better, whereas speed means higher-is-better.

The BD rate column could also use a less cryptic label. (Though maybe the audience is paper reviewers and not me.) The paper itself doesn’t even write out what the BD acronym in “BD rate” stands for, but it seems like it would be fair and accurate and better to call the column maybe something like relative compressed size, and mention the exact metric in the caption — where there’s already an explanation of BD rate.

I’m somewhat confused by, and slightly skeptical TBH, of the device timings. Are they correct & fair? Why is the NN-only portion almost as fast on an iPhone 17 compared to a V100 when the V100 has 4x the FP throughput? Is it comparing apples to apples (ha!), and is the GPU implementation reasonable? The data suggests the GPU implementation is not saturating the GPU.

Also why are there several different GPU models? And why is V100 even used? V100 is four generations old and not even supported anymore.

CharlesW 50 minutes ago [-]

> Since this is by Apple, I’m certainly curious if this is aimed at becoming the new default format for Apple devices. What kind of effort does it take to do that, beyond getting the paper published?

It depends on the goals. Apple understands that proprietary media and container formats for media distribution are non-starters (ProRes remains "documented but not open" for authoring workflows), so it's likely that the strategy would be to establish and evolve some final-ish version of PICO via Alliance for Open Media. That path would take 3-5 years.

ksec 1 days ago [-]

>what the BD acronym in “BD rate” stands for,

The Bjontegaard Delta-Rate (BD-Rate) metric, proposed in 2001 by Gisle Bjontegaard, is a method for calculating the average difference between two rate-distortion (RD) curves.
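For anyone who wants to see the calculation, here is a rough Python sketch of BD-rate as it is commonly computed: fit a cubic to log-rate as a function of quality (PSNR here) for each codec, integrate both fits over the overlapping quality range, and turn the average log-rate gap into a percentage. The function name and the sample numbers are made up for illustration, and some tools use piecewise-cubic interpolation instead of a single polynomial fit.

```python
import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test) -> float:
    """Average bitrate difference (%) of `test` vs `ref` at equal quality.

    Negative means the test codec needs fewer bits for the same PSNR.
    Each curve needs at least four rate/PSNR points for the cubic fit.
    """
    psnr_ref = np.asarray(psnr_ref, dtype=float)
    psnr_test = np.asarray(psnr_test, dtype=float)
    log_ref = np.log(np.asarray(rate_ref, dtype=float))
    log_test = np.log(np.asarray(rate_test, dtype=float))

    # Cubic fit of log(rate) as a function of PSNR, one per codec.
    fit_ref = np.polyfit(psnr_ref, log_ref, 3)
    fit_test = np.polyfit(psnr_test, log_test, 3)

    # Only compare over the quality range both curves cover.
    lo = max(psnr_ref.min(), psnr_test.min())
    hi = min(psnr_ref.max(), psnr_test.max())

    def integral(fit):
        antiderivative = np.polyint(fit)
        return np.polyval(antiderivative, hi) - np.polyval(antiderivative, lo)

    avg_log_diff = (integral(fit_test) - integral(fit_ref)) / (hi - lo)
    return float((np.exp(avg_log_diff) - 1.0) * 100.0)

# Hypothetical numbers: the test codec needs half the bits at every quality level.
rates = [0.1, 0.2, 0.4, 0.8]        # bits per pixel
psnrs = [30.0, 33.0, 36.0, 39.0]    # dB
half_rates = [r / 2 for r in rates]
saving = bd_rate(rates, psnrs, half_rates, psnrs)
```

With these made-up curves the result comes out at about -50, i.e. the test codec needs roughly half the bitrate for the same PSNR.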

It is extremely common in codec comparisons, along with terms like PSNR, SSIM and VMAF (which is newer and was developed by Netflix, so it tends to get explained a bit more).

>I’m certainly curious if this is aimed at becoming the new default format for Apple devices.

I certainly hope not. Not unless it is deterministic and much much higher quality.

qarl 1 days ago [-]

> I certainly hope not. Not unless it is deterministic and much much higher quality.

You're not comparing fairly. The author is intentionally using low-res images to illustrate how the compression works. You should compare these to, say, a JPEG compression at the same resolution and same bitrate. I think you'll find that this technique is quite an improvement to the compressions you already know and love.

mrob 1 days ago [-]

JPEG has the great advantage that all JPEG artifacts look like JPEG artifacts. Newer codecs create artifacts that can be mistaken for part of the original image. That's a heavy price to pay for improved compression efficiency.

drfloyd51 16 hours ago [-]

You’ve already chosen to go lossy. You can’t trust any pixel in the image to be true.

croon 8 hours ago [-]

I think the hinted implication is that JPEG artifacts rarely look like something else. If these can, I think the distinction is relevant.

F3nd0 14 hours ago [-]

Unlike image bits, trust isn’t binary!

joefourier 22 hours ago [-]

> And why is V100 even used? V100 is four generations old and not even supported anymore.

It wouldn’t surprise me if, due to bureaucratic processes, it’s still somehow the most readily available GPU for Apple researchers despite being almost 10 years old now. I recall even last year seeing V100s used by Microsoft researchers who weren’t working on LLMs.

kllrnohj 1 days ago [-]

> Why is the NN-only portion almost as fast on an iPhone 17 compared to a V100 when the V100 has 4x the FP throughput?

It might have some sequential section, or a block size that struggles to fill a V100, or a large chunk of CPU-only work, or any number of things like that.

klodolph 1 days ago [-]

Interesting, but when I look at the sweater in the second image, the knitting just looks completely lost in the PICO version. The knitting looks correct but soft in other codecs. In the PICO version, it looks just completely wrong to me. The yarn structure has been replaced with a bunch of fuzzy strips. Similar problem in the third picture.

I guess this is what happens when you chase after extremely low data rates but I’m not happy with the results.

crazygringo 1 days ago [-]

I think it's fascinating because it seems to be a completely different type of compression.

You can see it in the hair as well. It seems very clear that it is engaging in a kind of texture synthesis.

So it seems to be looking at an area, and capturing the textural quality. And then reproducing that, so the overall effect is the same, but individual fibers or fuzzy bits are randomly generated from scratch.

And so yes, if you zoom in enough, the knitting looks completely wrong because the regular geometric pattern of irregular yarn it is made of has been replaced by a completely irregular pattern of irregular yarn.

In other words, it is essentially hallucination of details on a micro scale but not on a macro scale.

And I think that raises a really interesting philosophical question of what we consider to be valid image reconstruction from lossy compression.

Because on the one hand, this is no different from blurriness or even the kind of blocky JPEG compression we are familiar with. It's just pixels that are wrong. Those blocks don't appear in the original image. The blurriness isn't there in the original image.

But on the other hand, we see blurriness as being somehow more "honest", and we are easily able to recognize that blockiness is an artifact.

Whereas with textural hallucination, it is no longer clear what is being filled in versus what is original, because it's doing such a good job of emulating so many aspects of the original texture.

And it's really hard to say if one approach is better or worse than the other. It's probably more accurate to say that one is more appropriate than the other in different contexts. Like if it is just a normal news photograph, I am perfectly happy with a sharper image because it's not changing anything substantial – it's not changing the face of a world leader or the number of people in the photo. But on the other hand, if I am doing online shopping for shirts and I want to be able to zoom in on the texture, then it's incredibly important that the texture be accurate and not loosely hallucinated.

srean 1 days ago [-]

This is a potential problem in "AI" denoising as well.

These denoising models, the autoencoders more directly so, work by (lossily) mapping the raw input to a very low dimensional representation. The other part generates the desired image back from the low-d representation.

The problem is that nothing in the vanilla versions prevents the low-d version from being a semantic representation (such as "Moon", "dark hair", etc.), with the generative part then taking cues from that semantic representation to produce a generated sub-image.
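A toy illustration of that failure mode, using a linear autoencoder (plain PCA) as a stand-in for a neural one. Everything here is an assumption made for the sketch: the data is synthetic, the function names are made up, and real denoisers are nonlinear. The mechanism is the point: whatever the low-d code cannot express is silently replaced by what the model learned to expect.

```python
import numpy as np

def fit_linear_autoencoder(data: np.ndarray, k: int):
    """Learn a k-dimensional code via PCA. Returns (mean, basis), basis shape (k, d)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:k]

def encode(x, mean, basis):
    """Lossy map from the raw input to the low-d representation."""
    return basis @ (x - mean)

def decode(code, mean, basis):
    """Generate an output back from the low-d representation."""
    return mean + basis.T @ code

# Toy "training set": 16-sample signals that all lie in a 2-D subspace.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(2, 16))
train = rng.normal(size=(200, 2)) @ mixing
mean, basis = fit_linear_autoencoder(train, k=2)

clean = train[0]
# A genuine detail the model never saw: a component orthogonal to the learned subspace.
r = rng.normal(size=16)
detail = r - basis.T @ (basis @ r)
observed = clean + detail

# The round trip returns the "expected" signal; the real detail is gone.
restored = decode(encode(observed, mean, basis), mean, basis)
```

Here `restored` matches `clean` rather than `observed`: the output looks like a perfectly plausible training-style signal, with no hint that anything was dropped.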

The Samsung phone Moon image was likely a result of deliberate choice / company policy, but these things can happen without explicit intent.

thraway54321 1 days ago [-]

I disagree that it's only on a micro scale. If you look at the picture of the parrots, it completely changes the black/white pattern on the face of the red parrot. In the picture of the green bicycle, the spot where the luggage rack attaches close to the center of the rear wheel is completely mangled, in contrast to the more "blurry" picture, where you can clearly see the bolts where it's attached. The rods going from the wheel hub up to the luggage rack also look very jagged and weird, whereas they look fine in the blurry one. There are certainly other errors as well, but those were the most jarring I noticed at a quick glance. I don't think a compression algorithm that does this poorly on cherry-picked examples is going to fly when you start throwing real pictures at it. If you are going to screw with the ground truth, I bet you could get better results by throwing the blurry pictures into one of those "AI" upscalers.

crazygringo 1 days ago [-]

I would say all of those examples you are picking are at the micro scale. Obviously it's a somewhat arbitrary division between macro and micro, what you consider to be the macro objects versus what you consider to be the micro details.

And this is also going to depend on the level of compression being chosen. Obviously, the greater the compression, the lesser the fidelity. The lesser the compression, the greater the fidelity.

joquarky 15 hours ago [-]

This is going to suck for whoever is going to have to explain this to juries.

crazygringo 1 hours ago [-]

Camera phones have been hallucinating texture and details for years now. This is nothing new, it's just now part of the compression layer as well.

And defense attorneys have been making arguments about the unreliability of all sorts of types of evidence for many centuries. So there is nothing new there either.

If someone's face is clearly visible and recognizable in a photo, this algorithm isn't changing their face to someone else's face.

Npovview 1 days ago [-]

I saw such artifacts mentioned in a video reviewing DLSS from Nvidia.

kllrnohj 1 days ago [-]

I find it very curious that their new image codec did not really compare itself against other image codecs, but instead primarily video codecs pretending to do images. As in, no JPEG or JPEG-XL.

150ms to decode 12mp is also incredibly slow. That's like PNG territory of slow. A more flagship 50mp image would be... oof.

theandrewbailey 1 days ago [-]

> As in, no JPEG or JPEG-XL.

JPEG-XL is designed for archive-grade images. It hasn't been optimized (maybe not even designed?) for low bpp settings (less than 1 bit per pixel), and is awful below that, let alone 0.3 bpp or so. Plain old JPEG is much worse. Video codecs (and the image formats derived from them) have optimized for quality at low settings.

> 150ms to decode 12mp is also incredibly slow.

I think that's sufficiently fast. (Keep in mind that a 4k screen is about 8.5mp.) How fast do you want your slideshow to be?

kllrnohj 24 hours ago [-]

> I think that's sufficiently fast. (Keep in mind that a 4k screen is about 8.5mp.) How fast do you want your slideshow to be?

A modern iPhone can capture at up to 48MP. If the performance scales linearly with pixel count, that would put tapping on a thumbnail to the full size being ready at over half a second. That's going to feel laggy. Now you can throw storage at the problem and pre-compute a downscaled intermediate, sure, but that doesn't fix it when you send the photo to someone else or whatever.

And competitive phones are doing 200mp captures (which is stupid in its own right but phone manufacturers and doing stupid things, name a more iconic duo)
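Spelling out that arithmetic as a quick Python sketch: the 150 ms for 12 MP figure is the one quoted in this thread, and linear scaling with pixel count is purely an assumption (a real decoder may tile, parallelize, or run into memory limits differently).

```python
def scaled_decode_ms(base_ms: float, base_mp: float, target_mp: float) -> float:
    """Decode time at target_mp, assuming time grows linearly with pixel count."""
    return base_ms * target_mp / base_mp

# 12 MP baseline from the thread, then a 48 MP and a 200 MP capture.
for megapixels in (12, 48, 200):
    estimate = scaled_decode_ms(150, 12, megapixels)
    print(f"{megapixels:>3} MP -> {estimate:.0f} ms")
```

Under that assumption a 48 MP photo lands at 600 ms and a 200 MP one at 2.5 s.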

rescbr 15 hours ago [-]

At least JPEG contains downscaled thumbnails embedded into it as part of the EXIF stream. There's no need for the receiving device to rescale it again.

Pretty sure these newer formats do the same.

kllrnohj 7 hours ago [-]

That thumbnail is for grid-size, it's generally 50kb or less in size which ends up being around 300x240 or up to 500x500 on newer codecs.

It'll be visibly low-res and blurry until the full size decodes for full screen. Hence why I said "when tapping on a thumbnail"

jcelerier 1 days ago [-]

> How fast do you want your slideshow to be?

we're in 2026, 240hz screens are becoming common. Nothing in the end-user experience should take more than 3-4ms. My personal goal when developing is keeping things at at least 60FPS and ideally 120 when building the whole stack with ASAN / UBSAN / stdlib's debug modes.

For instance when looking at this the first thing I thought was to try to make an installation which permanently recurses the codec's application on itself at each frame, to give the impression of a constantly moving landscape. Impossible on a smaller machine if computing a single frame takes 150ms.
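For reference, the frame-budget arithmetic behind those numbers, as a small Python sketch. The function names are made up, and the 150 ms decode time is the figure quoted in this thread.

```python
import math

def frame_budget_ms(refresh_hz: float) -> float:
    """Time available per frame at a given refresh rate."""
    return 1000.0 / refresh_hz

def frames_spanned(task_ms: float, refresh_hz: float) -> int:
    """Number of refresh intervals a task of task_ms occupies, rounded up."""
    return math.ceil(task_ms * refresh_hz / 1000.0)

budget_240 = frame_budget_ms(240)       # a little over 4 ms per frame
blocked_240 = frames_spanned(150, 240)  # refresh intervals covered by one 150 ms decode
blocked_60 = frames_spanned(150, 60)
```

At 240 Hz the budget is about 4.2 ms per frame, so a single 150 ms decode covers 36 refresh intervals (9 at 60 Hz), which is why per-frame re-encoding is out of reach at that speed.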

sdenton4 18 hours ago [-]

That's fine reasoning for video, but if someone is actually looking at a still picture for more than 1/240th of a second, the fine detail matters a bit more. These are different applications, with different sweet spots in the time/quality trade-off.

kg 23 hours ago [-]

Think about the scenarios where people are viewing slideshows. If you're on a mobile device, that 150ms spent decoding each image is time where the CPU and/or GPU of the mobile device are running at full tilt, draining the battery. Suddenly applications that would normally be fast and efficient like a photo gallery app become laggy and drain your battery. Not great.

Dwedit 20 hours ago [-]

I can see, by dragging the slider around and comparing to the ground truth, that it is making up things that look plausible for what is there.

Just keep any numbers out of there. Remember what happened with the Xerox scanners and JBIG2 compression where numbers got substituted with similar looking numbers.

ksec 1 days ago [-]

Some Notes.

According to Chrome stats from Google in 2019 [1], ~80% to 85% of images served are above 1.0 bpp (bits per pixel). Around 95% are above 0.5 bpp. I doubt this has changed, if not gotten worse, as images have gotten larger over the years.

There are images, such as logos or specific patterns, that work a lot better when compressed at low bpp (below 0.5). But those are rare on the web, and in the case here, most of the images are photographs.

Which means sub-0.3 bpp is a ridiculously low bitrate, even for web photos.

HEIC on iPhone 17 is based on HEVC, encoded with the H.265 hardware encoder on the iPhone 17.

VVC / VTM is the H.266 codec, the successor to HEVC / H.265 (most people may have heard of one of HEVC's encoders, x265).

ETM is the H.267 codec, currently the best-in-class video compression codec, and still in development.

I assume everyone knows about AV1 and AV2 already since we are on HN.

CPU encoders, generally speaking, produce better image quality, while hardware encoders tend to be fast but lower quality. Both HEIC and AV1 are based on the hardware encoders on the iPhone 17. (At least that is my read of it.)

In case anyone is wondering: JPEG XL is not designed for, or has yet to be optimised for, low bpp. It excels from 0.8 bpp onwards, depending on the type of image. So the XL results would likely be very bad, similar to normal JPEG and H.264 encoders.

I am wondering if this image codec is deterministic.

I would imagine if we use this on the web the actual image size would be a lot smaller. Meaning a lot of the artifacts we see shouldn't matter. And clicking on it would bring up a different image file.

There is finally a possibility that, in the future, some half-decent image could be included within the 14K frame of the webpage.
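To put numbers on the bpp figures, a small Python sketch. The 4000x3000 (12 MP) photo is a hypothetical example, and treating "14K" as 14 * 1024 bytes with no allowance for headers or the rest of the page is an assumption.

```python
import math

def compressed_bytes(width: int, height: int, bpp: float) -> float:
    """File size in bytes for an image stored at `bpp` bits per pixel."""
    return width * height * bpp / 8

def max_pixels(budget_bytes: float, bpp: float) -> float:
    """How many pixels fit into a byte budget at `bpp` bits per pixel."""
    return budget_bytes * 8 / bpp

# Hypothetical 12 MP photo at the two bitrates discussed above.
size_low = compressed_bytes(4000, 3000, 0.3)   # roughly 450 KB
size_web = compressed_bytes(4000, 3000, 1.0)   # roughly 1.5 MB

# Pixels that fit into a 14 KiB budget at 0.3 bpp, and the side of a square image.
pixels_in_14k = max_pixels(14 * 1024, 0.3)
square_side = math.isqrt(int(pixels_in_14k))
```

At 0.3 bpp, a 14 KiB budget holds about 382,000 pixels, i.e. a square image of roughly 618x618.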

Encoding and decoding speed is actually usable on today's hardware. Decoding in under 100ms on iPhone 17.

On another note, VVC is doing extremely well in terms of compression rate and encoding / decoding complexity.

AV2 launching by the end of this month.

[1] Figure 1

https://www.spiedigitallibrary.org/conference-proceedings-of...

warumdarum 1 days ago [-]

I always wondered whether an image wouldn't be best encoded as a sort of spline per color, the intensity being the curves, with those splines then just overlaid and rendered with a thickness per segment.

Tommix11 15 hours ago [-]

What if someone gets convicted of a crime based on generated details in an image?

warumdarum 10 hours ago [-]

That definitely would cross a spline there.

DiogenesKynikos 15 hours ago [-]

The PICO images do look more realistic at first glance, but when you zoom in, it turns out that PICO has changed all sorts of small details in the images.

For example, if you zoom in on the trees in the first image, PICO has moved branches around and invented new branches that didn't exist. It does so in a way that looks very realistic, but it is still altering reality. The JPEG failure mode is much different: it causes ringing artifacts that are obviously artificial, but it doesn't move a branch from location A to location B.

akersten 15 hours ago [-]

And the most critical thing about JPG is that the decoding is deterministic. Who's to say this fancy new PICO thing doesn't produce different pixels in a year when the algorithm improves, or the local model changes, etc.?

Imo, generative AI and its derivatives should be completely shunned as image/video encoders. They are simply an inappropriate tool for the job. And I say that as an AI-pilled token addict.

It would be like saying hey check out my amazing new text compression algorithm, 97x better than LZMA, then you look at the encoded file and it says "generate a romance story between two characters named Romeo and Juliet"

a-dub 1 days ago [-]

this is interesting. would be cool to explore something like integrating a vlm to add a "semantic" term to the loss function. looking through the comparisons, some of the baseline codecs create meaningfully different details (as could be described by text) in the images.

xyzsparetimexyz 19 hours ago [-]

So the decoding step is running the latent values through a model in order to decode them, right? The artifacts are super off-putting and are just going to get this labelled as slop. I would instead follow a neural texture compression model (e.g. RTXNTC) and generate a small model & latent values when compressing an image. It'll take longer, sure, but as long as it can work in less than 10 seconds on an iPhone I think there's a use case. As it is, idk.

jcelerier 1 days ago [-]

would be great to have the weights somewhere

brcmthrowaway 1 days ago [-]

What would this be used for?

