From patchwork Wed Oct 19 09:16:08 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: arm: vp9: Add NEON optimizations of VP9 MC functions
X-Patchwork-Submitter: Martin Storsjö
X-Patchwork-Id: 61752
Message-Id:
To: libav development
Date: Wed, 19 Oct 2016 12:16:08 +0300 (EEST)
From: Martin Storsjö
List-Id: libav development

On Tue, 11 Oct 2016, Martin Storsjö wrote:

> This work is sponsored by, and copyright, Google.
>
> The filter coefficients are signed values, where the product of the
> multiplication with one individual filter coefficient doesn't
> overflow a 16 bit signed value (the largest filter coefficient is
> 127). But when the products are accumulated, the resulting sum can
> overflow the 16 bit signed range. Instead of accumulating in 32 bit,
> we accumulate all filter taps but the largest one in one register, and
> the largest one (either index 3 or 4) in a separate one, added with
> saturation afterwards.
>
> (The VP8 MC asm does something similar, but slightly simpler, by
> accumulating each half of the filter separately. In the VP9 MC
> filters, each half of the filter can also overflow though, so the
> largest component has to be handled individually.)
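[Editor's note: a minimal scalar C sketch of the accumulation scheme described above; the helper names are hypothetical and not part of the patch, which implements this with NEON. The seven smaller tap products are summed in a plain 16 bit accumulator (assumed, per the commit message, not to overflow), and the product of the single largest tap is then added with saturation, mirroring NEON's vqadd.s16.]

```c
#include <stdint.h>
#include <stdlib.h>

/* Saturating 16 bit add, as vqadd.s16 does per lane. */
static int16_t sat_add_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Hypothetical scalar model of one output sample of an 8-tap filter:
 * accumulate every tap except the largest one in a 16 bit register,
 * then add the largest tap's product with saturation. */
static int16_t filter_tap_sum(const uint8_t src[8], const int8_t taps[8])
{
    int largest = 0;
    for (int i = 1; i < 8; i++)
        if (abs(taps[i]) > abs(taps[largest]))
            largest = i;

    int16_t acc = 0;
    for (int i = 0; i < 8; i++)
        if (i != largest)
            acc += taps[i] * src[i]; /* assumed to fit in 16 bits */

    return sat_add_s16(acc, (int16_t)(taps[largest] * src[largest]));
}
```

With hypothetical taps {10, 0, 0, 127, 0, 0, 0, 0} and all-255 input, the partial sum is 2550 and the largest product is 32385; a wrapping add would overflow, while the saturating add clamps the result to 32767.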
>
> Examples of relative speedup compared with the C version, from checkasm:
>                                  Cortex    A7     A8     A9    A53
> vp9_avg4_neon:                           1.63   1.19   1.37   1.54
> vp9_avg8_neon:                           2.24   3.65   3.30   2.56
> vp9_avg16_neon:                          2.64   6.72   2.92   2.80
> vp9_avg32_neon:                          2.50   5.45   2.76   2.45
> vp9_avg64_neon:                          2.69   5.81   2.72   2.79
> vp9_avg_8tap_smooth_4h_neon:             3.29   4.71   2.90   4.78
> vp9_avg_8tap_smooth_4hv_neon:            3.78   4.54   3.33   4.51
> vp9_avg_8tap_smooth_4v_neon:             5.18   6.52   4.29   5.50
> vp9_avg_8tap_smooth_8h_neon:             6.47   9.07   5.47   9.66
> vp9_avg_8tap_smooth_8hv_neon:            6.60   8.16   5.98   7.91
> vp9_avg_8tap_smooth_8v_neon:             9.25  12.67   8.07   9.61
> vp9_avg_8tap_smooth_64h_neon:            6.98  10.36   5.91  11.59
> vp9_avg_8tap_smooth_64hv_neon:           6.39   9.15   6.05   8.48
> vp9_avg_8tap_smooth_64v_neon:           10.64  14.12   9.39  11.05
> vp9_put4_neon:                           1.30   1.15   0.89   1.35
> vp9_put8_neon:                           1.28   2.07   1.80   1.62
> vp9_put16_neon:                          1.64   4.08   1.71   1.93
> vp9_put32_neon:                          1.53   3.68   2.12   1.68
> vp9_put64_neon:                          2.01   3.98   1.91   1.91
> vp9_put_8tap_smooth_4h_neon:             3.05   4.47   2.68   4.53
> vp9_put_8tap_smooth_4hv_neon:            3.74   4.46   3.31   4.50
> vp9_put_8tap_smooth_4v_neon:             5.23   6.19   4.28   5.87
> vp9_put_8tap_smooth_8h_neon:             5.89   8.40   4.97   8.99
> vp9_put_8tap_smooth_8hv_neon:            6.84   8.05   5.90   7.91
> vp9_put_8tap_smooth_8v_neon:             9.41  11.97   7.99  10.49
> vp9_put_8tap_smooth_64h_neon:            6.61   9.79   5.06  11.35
> vp9_put_8tap_smooth_64hv_neon:           7.01   9.13   6.37   9.17
> vp9_put_8tap_smooth_64v_neon:           11.09  13.31   9.32  12.52
>
> For the larger 8tap filters, the speedup vs C code is around 6-14x.
>
> This is significantly faster than libvpx's implementation of the same
> functions, at least when comparing the put_8tap_smooth_64 functions
> (compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from
> libvpx).
>
> Absolute runtimes from checkasm:
>                                  Cortex      A7      A8      A9     A53
> vp9_put_8tap_smooth_64h_neon:           21229.8 14474.5 19790.1 10885.1
> libvpx vpx_convolve8_horiz_neon:        52623.3 19736.4 21907.7 25027.7
>
> vp9_put_8tap_smooth_64v_neon:           14966.6 12297.2 13786.5 11679.4
> libvpx vpx_convolve8_vert_neon:         42090.0 17706.2 17659.9 16941.2
>
> Thus, on the A9, the horizontal filter is only marginally faster than
> libvpx, while our version is significantly faster on the other cores,
> and the vertical filter is significantly faster on all cores. The
> difference is especially large on the A7.
>
> The libvpx implementation does the accumulation in 32 bit, which
> probably explains most of the differences.
> ---
> Since the previous version, I have tuned avg4 and put4 a bit further,
> making avg4 faster than the C version on the A8, and improving put4 a
> little on the A7 and A53. put4 is still marginally slower than the C
> version on the A9, but I don't think it is worth the trouble to try to
> work around that.
>
> Rewrapped some paragraphs in the commit message that were unnecessarily
> narrow. Changed the big benchmark table into one with relative speedups.
> ---
>  libavcodec/arm/Makefile          |   2 +
>  libavcodec/arm/vp9dsp_init_arm.c | 140 +++++++
>  libavcodec/arm/vp9mc_neon.S      | 787 +++++++++++++++++++++++++++++++++++++++
>  libavcodec/vp9.h                 |   1 +
>  libavcodec/vp9dsp.c              |   2 +
>  5 files changed, 932 insertions(+)
>  create mode 100644 libavcodec/arm/vp9dsp_init_arm.c
>  create mode 100644 libavcodec/arm/vp9mc_neon.S
>
> diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
> index bd4dd4e..2638230 100644
> --- a/libavcodec/arm/Makefile
> +++ b/libavcodec/arm/Makefile
> @@ -45,6 +45,7 @@ OBJS-$(CONFIG_MLP_DECODER)             += arm/mlpdsp_init_arm.o
>  OBJS-$(CONFIG_RV40_DECODER)            += arm/rv40dsp_init_arm.o
>  OBJS-$(CONFIG_VORBIS_DECODER)          += arm/vorbisdsp_init_arm.o
>  OBJS-$(CONFIG_VP6_DECODER)             += arm/vp6dsp_init_arm.o
> +OBJS-$(CONFIG_VP9_DECODER)             += arm/vp9dsp_init_arm.o
>
>
>  # ARMv5 optimizations
> @@ -138,3 +139,4 @@ NEON-OBJS-$(CONFIG_RV40_DECODER)       += arm/rv34dsp_neon.o \
>                                            arm/rv40dsp_neon.o
>  NEON-OBJS-$(CONFIG_VORBIS_DECODER)     += arm/vorbisdsp_neon.o
>  NEON-OBJS-$(CONFIG_VP6_DECODER)        += arm/vp6dsp_neon.o
> +NEON-OBJS-$(CONFIG_VP9_DECODER)        += arm/vp9mc_neon.o
> diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c
> new file mode 100644
> index 0000000..db8c683
> --- /dev/null
> +++ b/libavcodec/arm/vp9dsp_init_arm.c
> @@ -0,0 +1,140 @@
> +/*
> + * Copyright (c) 2016 Google Inc.
> + *
> + * This file is part of Libav.
> + *
> + * Libav is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * Libav is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with Libav; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include <stdint.h>
> +
> +#include "libavutil/attributes.h"
> +#include "libavutil/arm/cpu.h"
> +#include "libavcodec/vp9.h"
> +
> +#define declare_fpel(type, sz) \
> +void ff_vp9_##type##sz##_neon(uint8_t *dst, ptrdiff_t dst_stride, \
> +                              const uint8_t *src, ptrdiff_t src_stride, \
> +                              int h, int mx, int my)
> +
> +#define declare_copy_avg(sz) \
> +    declare_fpel(copy, sz);  \
> +    declare_fpel(avg , sz)
> +
> +#define decl_mc_func(op, filter, dir, sz) \
> +void ff_vp9_##op##_##filter##sz##_##dir##_neon(uint8_t *dst, ptrdiff_t dst_stride, \
> +                                               const uint8_t *src, ptrdiff_t src_stride, \
> +                                               int h, int mx, int my)
> +
> +#define define_8tap_2d_fn(op, filter, sz)                                \
> +static void op##_##filter##sz##_hv_neon(uint8_t *dst, ptrdiff_t dst_stride, \
> +                                        const uint8_t *src, ptrdiff_t src_stride, \
> +                                        int h, int mx, int my)           \
> +{                                                                        \
> +    LOCAL_ALIGNED_16(uint8_t, temp, [72 * 64]);                          \
> +    /* We only need h + 7 lines, but the horizontal filter assumes an    \
> +     * even number of rows, so filter h + 8 lines here. */               \
> +    ff_vp9_put_##filter##sz##_h_neon(temp, 64,                           \
> +                                     src - 3 * src_stride, src_stride,   \
> +                                     h + 8, mx, 0);                      \
> +    ff_vp9_##op##_##filter##sz##_v_neon(dst, dst_stride,                 \
> +                                        temp + 3 * 64, 64,               \
> +                                        h, 0, my);                       \
> +}

Since this reads one more line of input than necessary (h + 8 instead of
h + 7), I had to squash in this change locally:

Nothing else is needed, since that part of the buffer doesn't need to be
initialized; it just needs to be allocated and not out of bounds. (In the
hv filter, the last uninitialized line is filtered horizontally when the
filter does 2 lines at a time, but the vertical half doesn't read it.)
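[Editor's note: a small C sketch of the row bookkeeping in define_8tap_2d_fn above, added for illustration; the helper name is hypothetical. An 8-tap vertical filter producing h output rows reads 3 rows above and 4 rows below, i.e. h + 7 input rows, and the horizontal pass rounds this up to an even h + 8, which still fits the 72-row temp buffer for every block size up to 64.]

```c
/* Number of rows the horizontal pass filters into the temp buffer
 * when the 2d filter produces h output rows. */
static int rows_filtered_horizontally(int h)
{
    int needed = h + 7;           /* 3 rows above + h + 4 rows below */
    return needed + (needed & 1); /* round up to an even row count */
}
```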
// Martin

diff --git a/libavcodec/vp9.h b/libavcodec/vp9.h
index 84bed6d..ba622b1 100644
--- a/libavcodec/vp9.h
+++ b/libavcodec/vp9.h
@@ -420,7 +420,7 @@ typedef struct VP9Context {
     // whole-frame cache
     uint8_t *intra_pred_data[3];
     VP9Filter *lflvl;
-    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[71 * 80];
+    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[72 * 80];

     // block reconstruction intermediates
     int16_t *block_base, *block, *uvblock_base[2], *uvblock[2];