From patchwork Wed Oct 19 09:32:37 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: arm: vp9: Add NEON optimizations of VP9 MC functions
X-Patchwork-Submitter: =?utf-8?q?Martin_Storsj=C3=B6?=
X-Patchwork-Id: 61753
Message-Id:
To: libav development
Date: Wed, 19 Oct 2016 12:32:37 +0300 (EEST)
From: =?ISO-8859-15?Q?Martin_Storsj=F6?=
List-Id: libav development

On Wed, 19 Oct 2016, Martin Storsjö wrote:

> On Tue, 11 Oct 2016, Martin Storsjö wrote:
>
>> This work is sponsored by, and copyright, Google.
>>
>> The filter coefficients are signed values, where the product of the
>> multiplication with one individual filter coefficient doesn't
>> overflow a 16 bit signed value (the largest filter coefficient is
>> 127). But when the products are accumulated, the resulting sum can
>> overflow the 16 bit signed range. Instead of accumulating in 32 bit,
>> we accumulate all filter taps but the largest one in one register, and
>> the largest one (either index 3 or 4) in a separate one, added with
>> saturation afterwards.
>>
>> (The VP8 MC asm does something similar, but slightly simpler, by
>> accumulating each half of the filter separately. In the VP9 MC
>> filters, each half of the filter can also overflow though, so the
>> largest component has to be handled individually.)
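
For readers following along, the accumulation scheme described in the commit message can be modelled in scalar C. This is only an illustrative sketch, not the NEON code from the patch: the function names are made up here, the saturating add stands in for whatever saturating instruction the asm uses (something like vqadd.s16), and the sketch assumes that the partial sum of the non-largest taps fits in 16 bits.

```c
#include <stdint.h>

/* Hypothetical helper: saturating signed 16-bit add, the scalar
 * counterpart of a per-lane saturating add in NEON. */
static int16_t sat_add_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX)
        return INT16_MAX;
    if (sum < INT16_MIN)
        return INT16_MIN;
    return (int16_t)sum;
}

/* Round, shift down by 7 bits and clamp to the 8-bit pixel range. */
static uint8_t round_shift_clamp(int32_t sum)
{
    int32_t r = sum + 64;
    if (r < 0)
        return 0;
    r >>= 7;
    return r > 255 ? 255 : (uint8_t)r;
}

/* 16-bit scheme: every tap except the largest (index 3 or 4) goes into
 * one 16-bit accumulator; the largest product is added with saturation.
 * Assumption: the partial sum of the other taps fits in 16 bits. */
static uint8_t filter_px_16bit(const uint8_t src[8], const int16_t taps[8],
                               int largest)
{
    int16_t acc = 0;
    for (int i = 0; i < 8; i++)
        if (i != largest)
            acc = (int16_t)(acc + taps[i] * src[i]);
    return round_shift_clamp(
        sat_add_s16(acc, (int16_t)(taps[largest] * src[largest])));
}

/* Reference: plain 32-bit accumulation. */
static uint8_t filter_px_32bit(const uint8_t src[8], const int16_t taps[8])
{
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += taps[i] * src[i];
    return round_shift_clamp(sum);
}
```

Saturating at the 16-bit maximum is harmless here because any sum that large would be clamped to 255 after the shift anyway, so both functions should agree whenever the stated assumption holds.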
>>
>> Examples of relative speedup compared with the C version, from checkasm:
>>                         Cortex     A7     A8     A9    A53
>> vp9_avg4_neon:                   1.63   1.19   1.37   1.54
>> vp9_avg8_neon:                   2.24   3.65   3.30   2.56
>> vp9_avg16_neon:                  2.64   6.72   2.92   2.80
>> vp9_avg32_neon:                  2.50   5.45   2.76   2.45
>> vp9_avg64_neon:                  2.69   5.81   2.72   2.79
>> vp9_avg_8tap_smooth_4h_neon:     3.29   4.71   2.90   4.78
>> vp9_avg_8tap_smooth_4hv_neon:    3.78   4.54   3.33   4.51
>> vp9_avg_8tap_smooth_4v_neon:     5.18   6.52   4.29   5.50
>> vp9_avg_8tap_smooth_8h_neon:     6.47   9.07   5.47   9.66
>> vp9_avg_8tap_smooth_8hv_neon:    6.60   8.16   5.98   7.91
>> vp9_avg_8tap_smooth_8v_neon:     9.25  12.67   8.07   9.61
>> vp9_avg_8tap_smooth_64h_neon:    6.98  10.36   5.91  11.59
>> vp9_avg_8tap_smooth_64hv_neon:   6.39   9.15   6.05   8.48
>> vp9_avg_8tap_smooth_64v_neon:   10.64  14.12   9.39  11.05
>> vp9_put4_neon:                   1.30   1.15   0.89   1.35
>> vp9_put8_neon:                   1.28   2.07   1.80   1.62
>> vp9_put16_neon:                  1.64   4.08   1.71   1.93
>> vp9_put32_neon:                  1.53   3.68   2.12   1.68
>> vp9_put64_neon:                  2.01   3.98   1.91   1.91
>> vp9_put_8tap_smooth_4h_neon:     3.05   4.47   2.68   4.53
>> vp9_put_8tap_smooth_4hv_neon:    3.74   4.46   3.31   4.50
>> vp9_put_8tap_smooth_4v_neon:     5.23   6.19   4.28   5.87
>> vp9_put_8tap_smooth_8h_neon:     5.89   8.40   4.97   8.99
>> vp9_put_8tap_smooth_8hv_neon:    6.84   8.05   5.90   7.91
>> vp9_put_8tap_smooth_8v_neon:     9.41  11.97   7.99  10.49
>> vp9_put_8tap_smooth_64h_neon:    6.61   9.79   5.06  11.35
>> vp9_put_8tap_smooth_64hv_neon:   7.01   9.13   6.37   9.17
>> vp9_put_8tap_smooth_64v_neon:   11.09  13.31   9.32  12.52
>>
>> For the larger 8tap filters, the speedup vs C code is around 6-14x.
>>
>> This is significantly faster than libvpx's implementation of the same
>> functions, at least when comparing the put_8tap_smooth_64 functions
>> (compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from
>> libvpx).
>>
>> Absolute runtimes from checkasm:
>>                           Cortex       A7       A8       A9      A53
>> vp9_put_8tap_smooth_64h_neon:      21229.8  14474.5  19790.1  10885.1
>> libvpx vpx_convolve8_horiz_neon:   52623.3  19736.4  21907.7  25027.7
>>
>> vp9_put_8tap_smooth_64v_neon:      14966.6  12297.2  13786.5  11679.4
>> libvpx vpx_convolve8_vert_neon:    42090.0  17706.2  17659.9  16941.2
>>
>> Thus, on the A9, the horizontal filter is only marginally faster than
>> libvpx, while our version is significantly faster on the other cores,
>> and the vertical filter is significantly faster on all cores. The
>> difference is especially large on the A7.
>>
>> The libvpx implementation does the accumulation in 32 bit, which
>> probably explains most of the differences.
>> ---
>> Since the previous version, I tuned the avg4 and put4 a bit further,
>> making avg4 faster than C on the A8, and improving put4 a little on A7
>> and A53. put4 is still marginally slower than the C version on the A9,
>> but I think it isn't worth the trouble to try to work around it.
>>
>> Rewrapped some paragraphs in the commit message that were unnecessarily
>> narrow. Changed the big benchmark table into one with relative speedups.
>> ---
>>  libavcodec/arm/Makefile          |   2 +
>>  libavcodec/arm/vp9dsp_init_arm.c | 140 +++++++
>>  libavcodec/arm/vp9mc_neon.S      | 787 +++++++++++++++++++++++++++++++++++++++
>>  libavcodec/vp9.h                 |   1 +
>>  libavcodec/vp9dsp.c              |   2 +
>>  5 files changed, 932 insertions(+)
>>  create mode 100644 libavcodec/arm/vp9dsp_init_arm.c
>>  create mode 100644 libavcodec/arm/vp9mc_neon.S
>>
>> diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
>> index bd4dd4e..2638230 100644
>> --- a/libavcodec/arm/Makefile
>> +++ b/libavcodec/arm/Makefile
>> @@ -45,6 +45,7 @@ OBJS-$(CONFIG_MLP_DECODER)             += arm/mlpdsp_init_arm.o
>>  OBJS-$(CONFIG_RV40_DECODER)            += arm/rv40dsp_init_arm.o
>>  OBJS-$(CONFIG_VORBIS_DECODER)          += arm/vorbisdsp_init_arm.o
>>  OBJS-$(CONFIG_VP6_DECODER)             += arm/vp6dsp_init_arm.o
>> +OBJS-$(CONFIG_VP9_DECODER)             += arm/vp9dsp_init_arm.o
>>
>>
>>  # ARMv5 optimizations
>> @@ -138,3 +139,4 @@ NEON-OBJS-$(CONFIG_RV40_DECODER)       += arm/rv34dsp_neon.o \
>>                                            arm/rv40dsp_neon.o
>>  NEON-OBJS-$(CONFIG_VORBIS_DECODER)     += arm/vorbisdsp_neon.o
>>  NEON-OBJS-$(CONFIG_VP6_DECODER)        += arm/vp6dsp_neon.o
>> +NEON-OBJS-$(CONFIG_VP9_DECODER)        += arm/vp9mc_neon.o
>> diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c
>> new file mode 100644
>> index 0000000..db8c683
>> --- /dev/null
>> +++ b/libavcodec/arm/vp9dsp_init_arm.c
>> @@ -0,0 +1,140 @@
>> +/*
>> + * Copyright (c) 2016 Google Inc.
>> + *
>> + * This file is part of Libav.
>> + *
>> + * Libav is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU Lesser General Public
>> + * License as published by the Free Software Foundation; either
>> + * version 2.1 of the License, or (at your option) any later version.
>> + *
>> + * Libav is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>> + * Lesser General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU Lesser General Public
>> + * License along with Libav; if not, write to the Free Software
>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
>> + */
>> +
>> +#include <stdint.h>
>> +
>> +#include "libavutil/attributes.h"
>> +#include "libavutil/arm/cpu.h"
>> +#include "libavcodec/vp9.h"
>> +
>> +#define declare_fpel(type, sz)                                          \
>> +void ff_vp9_##type##sz##_neon(uint8_t *dst, ptrdiff_t dst_stride,       \
>> +                              const uint8_t *src, ptrdiff_t src_stride, \
>> +                              int h, int mx, int my)
>> +
>> +#define declare_copy_avg(sz) \
>> +    declare_fpel(copy, sz);  \
>> +    declare_fpel(avg , sz)
>> +
>> +#define decl_mc_func(op, filter, dir, sz)                                                \
>> +void ff_vp9_##op##_##filter##sz##_##dir##_neon(uint8_t *dst, ptrdiff_t dst_stride,       \
>> +                                               const uint8_t *src, ptrdiff_t src_stride, \
>> +                                               int h, int mx, int my)
>> +
>> +#define define_8tap_2d_fn(op, filter, sz)                                         \
>> +static void op##_##filter##sz##_hv_neon(uint8_t *dst, ptrdiff_t dst_stride,       \
>> +                                        const uint8_t *src, ptrdiff_t src_stride, \
>> +                                        int h, int mx, int my)                    \
>> +{                                                                                 \
>> +    LOCAL_ALIGNED_16(uint8_t, temp, [72 * 64]);                                   \
>> +    /* We only need h + 7 lines, but the horizontal filter assumes an             \
>> +     * even number of rows, so filter h + 8 lines here. */                        \
>> +    ff_vp9_put_##filter##sz##_h_neon(temp, 64,                                    \
>> +                                     src - 3 * src_stride, src_stride,            \
>> +                                     h + 8, mx, 0);                               \
>> +    ff_vp9_##op##_##filter##sz##_v_neon(dst, dst_stride,                          \
>> +                                        temp + 3 * 64, 64,                        \
>> +                                        h, 0, my);                                \
>> +}
>
> Since this reads one more line of input than necessary (h + 8 instead of h +
> 7), I had to squash in this change locally:
>
> diff --git a/libavcodec/vp9.h b/libavcodec/vp9.h
> index 84bed6d..ba622b1 100644
> --- a/libavcodec/vp9.h
> +++ b/libavcodec/vp9.h
> @@ -420,7 +420,7 @@ typedef struct VP9Context {
>      // whole-frame cache
>      uint8_t *intra_pred_data[3];
>      VP9Filter *lflvl;
> -    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[71 * 80];
> +    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[72 * 80];
>
>      // block reconstruction intermediates
>      int16_t *block_base, *block, *uvblock_base[2], *uvblock[2];
>
>
> Nothing else is needed, since that part of the buffer doesn't need to be
> initialized; it just needs to be allocated and not out of bounds.
>
> (In the hv filter, the last uninitialized line is filtered horizontally when
> the filter does 2 lines at a time, but the vertical half doesn't read it.)
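
The row counts behind the 71 -> 72 change can be spelled out as a small C sketch. All names below are hypothetical and only mirror the numbers quoted in the thread (8 taps, blocks up to 64 rows, strides of 64 and 80 bytes); it assumes the horizontal pass always processes rows in pairs.

```c
#include <stddef.h>

/* All names here are hypothetical; the constants mirror the values
 * quoted in the thread. */
enum {
    FILTER_TAPS = 8,   /* 8-tap subpel filter */
    MAX_BLOCK_H = 64,  /* tallest block handled by the hv function */
};

/* Rows the vertical pass actually reads to produce h output rows. */
static int rows_read_by_v(int h)
{
    return h + FILTER_TAPS - 1;
}

/* Rows the horizontal pass touches if it handles two rows per iteration:
 * the count above rounded up to an even number. */
static int rows_written_by_h(int h)
{
    return (rows_read_by_v(h) + 1) & ~1;
}

/* Bytes needed for a buffer holding that many rows at the given stride. */
static size_t buffer_bytes(int h, size_t stride)
{
    return (size_t)rows_written_by_h(h) * stride;
}
```

With h = MAX_BLOCK_H this gives 71 rows read by the vertical pass but 72 rows touched by the horizontal pass, which matches both the [72 * 64] temp buffer in the macro and the [72 * 80] edge_emu_buffer.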
Just to be safe, I extended this fix a bit further like this:

diff --git a/libavcodec/vp9.h b/libavcodec/vp9.h
index 84bed6d..e4b9f82 100644
--- a/libavcodec/vp9.h
+++ b/libavcodec/vp9.h
@@ -420,7 +420,8 @@ typedef struct VP9Context {
     // whole-frame cache
     uint8_t *intra_pred_data[3];
     VP9Filter *lflvl;
-    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[71 * 80];
+    // This requires 64 + 8 rows, with 80 bytes stride
+    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[72 * 80];

     // block reconstruction intermediates
     int16_t *block_base, *block, *uvblock_base[2], *uvblock[2];
diff --git a/libavcodec/vp9block.c b/libavcodec/vp9block.c
index 5a3b356..ba60b54 100644
--- a/libavcodec/vp9block.c
+++ b/libavcodec/vp9block.c
@@ -1176,8 +1176,10 @@ static av_always_inline void mc_luma_dir(VP9Context *s, vp9_mc_func(*mc)[2],

     ff_thread_await_progress(ref_frame, FFMAX(th, 0), 0);
     // FIXME bilinear filter only needs 0/1 pixels, not 3/4
+    // The arm/aarch64 _hv filters might read one more row than what
+    // actually is needed, so switch to emulated edge one line sooner.
     if (x < !!mx * 3 || y < !!my * 3 ||
-        x + !!mx * 4 > w - bw || y + !!my * 4 > h - bh) {
+        x + !!mx * 4 > w - bw || y + !!my * 5 > h - bh) {
         s->vdsp.emulated_edge_mc(s->edge_emu_buffer,
                                  ref - !!my * 3 * ref_stride - !!mx * 3,
                                  80,

// Martin
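
To illustrate what the changed condition in mc_luma_dir does, here is a standalone C sketch of the bounds check. The function name and the extra_row parameter are hypothetical and exist only so the old (4 rows below) and new (5 rows below) behaviour can be compared side by side; the expression itself is copied from the hunk above.

```c
/* Hypothetical standalone version of the edge check from the diff.
 * extra_row = 0 models the old condition (!!my * 4),
 * extra_row = 1 models the new one (!!my * 5). */
static int needs_edge_emu(int x, int y, int w, int h,
                          int bw, int bh, int mx, int my, int extra_row)
{
    int rows_below = !!my * (4 + extra_row);

    return x < !!mx * 3 || y < !!my * 3 ||
           x + !!mx * 4 > w - bw || y + rows_below > h - bh;
}
```

For example, with w = h = 128, bw = bh = 64, x = 10, y = 60 and both mx and my nonzero, the old condition does not trigger edge emulation while the new one does, since the block now needs one more valid row below it.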