News: 'The biggest speedup I've seen so far' — FFmpeg devs boast of another 100x leap thanks to handwritten assembly code

Reminds me of the Voxel Space engine used circa 1993 in the original Comanche PC game, which was written entirely in assembly, as it needed to perform well without any GPU acceleration.
 
The article said:
Last November, we reported on an FFmpeg performance boost that could speed certain operations by up to 94x.
That speedup was reproducible only if you compiled the totally unoptimized, generic C implementation in debug mode.

When I tried compiling it in release mode and using clang instead of gcc, I got to more than 50% of the speed of the hand-written assembly, without any changes to the generic C sources. Upon reading the sources, it's clear that the C could've been written more optimally, likely yielding further improvements - and I'm not even talking about using any AVX-512 intrinsics!

I will be taking a look at this latest patch when I have a chance.

P.S. Thanks for actually linking the patch this time. Last time, the patch wasn't in ffmpeg; rather, someone had posted a slide from a dav1d presentation on the ffmpeg Twitter account. It took me a while to figure that out.


Edit: I just learned that someone looked into that previous 94x speedup even more deeply than I did! Apparently, the SIMD code implemented a 6-tap convolution, whereas the generic C version implemented an 8-tap one. I'll bet that accounts for a lot of the difference clang couldn't close.
Furthermore, they actually reached out to the original author and got an admission that the comparison was made against C code compiled with optimizations disabled!
 
There's still a lot to be said for hand-optimised machine language code.
Oh, but they never even tried using C with intrinsics. Whenever they optimize something, they go straight to assembly. So, we don't even know how well-optimized C compares.

In the last thread, someone claimed compilers wouldn't be smart enough to fuse two separate operations into a single AVX-512 instruction, which I subsequently demonstrated clang/LLVM doing. I've been quite impressed by its autovectorization. It won't restructure your code to be vector-friendly, but it seems to do a good job with more straightforward vectorization tasks.

If I get a chance to fiddle with this patch, I'll post my findings here.
 
Reminds me of the Voxel Space engine used circa 1993 in the original Comanche PC game, which was written entirely in assembly, as it needed to perform well without any GPU acceleration.
In 1993, compilers weren't nearly as sophisticated, and the first PC 3D graphics accelerator cards didn't yet exist. Nvidia's NV1 didn't launch until May 1995. Cards based on 3Dlabs' GLINT 300SX also showed up around the same time.

BTW, I'm sure plenty of other 3D games from that era used assembly language. It wouldn't surprise me to learn that Wolfenstein and Doom both did. Quake was worked on by Michael Abrash, the author of the book Zen of Assembly Language. I read his columns in Dr. Dobb's Journal, back in the day, and still have a copy of his book Zen of Code Optimization floating around somewhere.
 
  • Like
Reactions: DS426
In 1993, compilers weren't nearly as sophisticated, and the first PC 3D graphics accelerator cards didn't yet exist. Nvidia's NV1 didn't launch until May 1995. Cards based on 3Dlabs' GLINT 300SX also showed up around the same time.

BTW, I'm sure plenty of other 3D games from that era used assembly language. It wouldn't surprise me to learn that Wolfenstein and Doom both did. Quake was worked on by Michael Abrash, the author of the book Zen of Assembly Language. I read his columns in Dr. Dobb's Journal, back in the day, and still have a copy of his book Zen of Code Optimization floating around somewhere.
Great point! It sounds wild today but probably wasn't very rare back then. As for graphics, yep, that design decision makes complete sense (I'd say it wasn't even a decision, as I don't think there was another feasible option) given the goings-on of that time. Even 3dfx's Voodoo accelerator didn't come out commercially until 1996.
 
  • Like
Reactions: bit_user
The writer seems to use 100% and 100x interchangeably. They are not the same: a 100% uplift is twice the performance, while a 100x uplift is 100 times the performance.
Yeah, I noticed that as well. At least the headline matches what the developer claimed. Not my biggest issue with the article, though.

I'm trying to put myself in the headspace of an author or a reader who knows little or nothing about software development or code optimization. So many questions should come up, like: are such performance gains just waiting to be had in any piece of software written in C? That seems like a reasonable question to ask, because if it were true, it'd save everyone having to upgrade their CPUs! So why doesn't the author pursue that angle? And if the gains aren't general, what's so special about this code that makes the compiler so bad at optimizing it? It just strikes me as a rather incurious article.

The only thing more confounding than that is why the developer (if they actually believed these numbers) apparently didn't deem the C code in need of any reworking. ffmpeg is a cross-platform codebase that doesn't have assembly-language versions of every function for every supported CPU ISA, so the C version should at least be made decently fast. And if it's slower by more than an order of magnitude, it should be obvious that something is really wrong with it, because an order of magnitude is roughly what you'd expect from vectorizing code like this.
 
I noticed this patch still hadn't been merged. A quick look at the linked mailing list thread shows why:

The developer (Niklas Haas) said:
Upon further testing, I realized that this logic (both C and SIMD) overflows
for 16-bit inputs. Will fix and resubmit.

Source: https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346727.html

He goes on to say:

I also found that the C versions can be made slightly faster by returning out
of the inner loop, which generates a shorter scalar version that is faster
than the auto-vectorized abomination that was generated before.

So, that also suggests why the generic C version performed so poorly.

He has iterated on these patches two more times, so far, yet they still haven't been merged. Here's the latest patch, featuring a speedup of 55x over generic C*:

* I've learned that ffmpeg doesn't enable compiler autovectorization by default. So, this is comparing AVX-512 (processing up to 64 bytes at a time) against serial code that processes only one byte per loop iteration. Considering that, a 55x speedup actually makes sense. Once his patch is finally merged, I'll probably pull the tree and see what sort of performance the C version gets when autovectorized for SSE2 and AVX-512 by both gcc and clang.


P.S. I think this highlights the risks of reporting on patches before they've even been merged. Also, the article's author probably could've seen the first reply I quoted above, given that it was sent less than an hour after the message cited in the article. While it might be somewhat rare for a developer to send such a retraction, the whole purpose of sending patches to these mailing lists is to solicit feedback from other list members (thus leading to fixes and further iteration), which is quite typical. Anything that's submitted, especially as a first draft, should be considered nothing more than a work in progress.

So, I'd say the article author (Mark Tyson) deserves an extra demerit for failing to look at all the messages in that mailing list thread.