arm: vp9: Add NEON optimizations of VP9 MC functions

Message ID alpine.DEB.2.11.1610191212570.32173@cone.martin.st
State Superseded

Commit Message

Martin Storsjö Oct. 19, 2016, 9:16 a.m.
On Tue, 11 Oct 2016, Martin Storsjö wrote:

> This work is sponsored by, and copyright, Google.
>
> The filter coefficients are signed values, where the product of the
> multiplication with one individual filter coefficient doesn't
> overflow a 16 bit signed value (the largest filter coefficient is
> 127). But when the products are accumulated, the resulting sum can
> overflow the 16 bit signed range. Instead of accumulating in 32 bit,
> we accumulate all filter taps but the largest one in one register, and
> the largest one (either index 3 or 4) in a separate one, added with
> saturation afterwards.
>
> (The VP8 MC asm does something similar, but slightly simpler, by
> accumulating each half of the filter separately. In the VP9 MC
> filters, each half of the filter can also overflow though, so the
> largest component has to be handled individually.)
>
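
The accumulation scheme described above can be sketched in scalar C. This is only an illustration, not the NEON code from the patch: the coefficients in example_filter are hypothetical (chosen to sum to 128, not taken from the VP9 tables), and the names sat_add_s16 and filter_pixel are invented for this sketch. It also assumes that the seven smaller taps together fit in 16 bits, which holds for this example filter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-tap filter summing to 128; not taken from the VP9 tables. */
static const int16_t example_filter[8] = { -3, 8, -20, 100, 55, -16, 6, -2 };

/* Saturating signed 16-bit add: clamps instead of wrapping around. */
static int16_t sat_add_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Filter one output pixel from 8 input pixels, keeping the largest tap
 * (index 3 or 4) out of the main accumulator. */
static uint8_t filter_pixel(const uint8_t *src, const int16_t *filter)
{
    int largest = filter[3] >= filter[4] ? 3 : 4;
    int32_t rest = 0;

    for (int i = 0; i < 8; i++) {
        int32_t prod = src[i] * filter[i];
        /* A single product always fits in 16 bits: 255 * 127 = 32385. */
        assert(prod >= INT16_MIN && prod <= INT16_MAX);
        if (i != largest)
            rest += prod;
    }
    /* Assumption of this sketch: the other seven taps fit in 16 bits. */
    assert(rest >= INT16_MIN && rest <= INT16_MAX);

    /* The largest tap is added last, with saturation. */
    int16_t sum = sat_add_s16((int16_t)rest,
                              (int16_t)(src[largest] * filter[largest]));

    /* Round, shift down by 7 (the taps sum to 128) and clamp to 8 bits. */
    int32_t rounded = sum + 64;
    if (rounded < 0)
        return 0;
    int32_t out = rounded >> 7;
    return out > 255 ? 255 : (uint8_t)out;
}
```

With the input { 0, 255, 0, 255, 255, 0, 255, 0 } the exact sum is 43095, which does not fit in a signed 16 bit value. The saturating add yields 32767 instead, and that still rounds and clamps to 255, the same output pixel a 32 bit accumulator would give.
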
> Examples of relative speedup compared with the C version, from checkasm:
>                       Cortex      A7     A8     A9    A53
> vp9_avg4_neon:                   1.63   1.19   1.37   1.54
> vp9_avg8_neon:                   2.24   3.65   3.30   2.56
> vp9_avg16_neon:                  2.64   6.72   2.92   2.80
> vp9_avg32_neon:                  2.50   5.45   2.76   2.45
> vp9_avg64_neon:                  2.69   5.81   2.72   2.79
> vp9_avg_8tap_smooth_4h_neon:     3.29   4.71   2.90   4.78
> vp9_avg_8tap_smooth_4hv_neon:    3.78   4.54   3.33   4.51
> vp9_avg_8tap_smooth_4v_neon:     5.18   6.52   4.29   5.50
> vp9_avg_8tap_smooth_8h_neon:     6.47   9.07   5.47   9.66
> vp9_avg_8tap_smooth_8hv_neon:    6.60   8.16   5.98   7.91
> vp9_avg_8tap_smooth_8v_neon:     9.25  12.67   8.07   9.61
> vp9_avg_8tap_smooth_64h_neon:    6.98  10.36   5.91  11.59
> vp9_avg_8tap_smooth_64hv_neon:   6.39   9.15   6.05   8.48
> vp9_avg_8tap_smooth_64v_neon:   10.64  14.12   9.39  11.05
> vp9_put4_neon:                   1.30   1.15   0.89   1.35
> vp9_put8_neon:                   1.28   2.07   1.80   1.62
> vp9_put16_neon:                  1.64   4.08   1.71   1.93
> vp9_put32_neon:                  1.53   3.68   2.12   1.68
> vp9_put64_neon:                  2.01   3.98   1.91   1.91
> vp9_put_8tap_smooth_4h_neon:     3.05   4.47   2.68   4.53
> vp9_put_8tap_smooth_4hv_neon:    3.74   4.46   3.31   4.50
> vp9_put_8tap_smooth_4v_neon:     5.23   6.19   4.28   5.87
> vp9_put_8tap_smooth_8h_neon:     5.89   8.40   4.97   8.99
> vp9_put_8tap_smooth_8hv_neon:    6.84   8.05   5.90   7.91
> vp9_put_8tap_smooth_8v_neon:     9.41  11.97   7.99  10.49
> vp9_put_8tap_smooth_64h_neon:    6.61   9.79   5.06  11.35
> vp9_put_8tap_smooth_64hv_neon:   7.01   9.13   6.37   9.17
> vp9_put_8tap_smooth_64v_neon:   11.09  13.31   9.32  12.52
>
> For the larger 8tap filters, the speedup vs C code is around 6-14x.
>
> This is significantly faster than libvpx's implementation of the same
> functions, at least when comparing the put_8tap_smooth_64 functions
> (compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from
> libvpx).
>
> Absolute runtimes from checkasm:
>                          Cortex      A7        A8        A9       A53
> vp9_put_8tap_smooth_64h_neon:    21229.8   14474.5   19790.1   10885.1
> libvpx vpx_convolve8_horiz_neon: 52623.3   19736.4   21907.7   25027.7
>
> vp9_put_8tap_smooth_64v_neon:    14966.6   12297.2   13786.5   11679.4
> libvpx vpx_convolve8_vert_neon:  42090.0   17706.2   17659.9   16941.2
>
> Thus, on the A9, the horizontal filter is only marginally faster than
> libvpx, while our version is significantly faster on the other cores,
> and the vertical filter is significantly faster on all cores. The
> difference is especially large on the A7.
>
> The libvpx implementation does the accumulation in 32 bit, which
> probably explains most of the differences.
> ---
> Since the previous version, I tuned the avg4 and put4 a bit further,
> making avg4 faster than C on the A8, and improving put4 a little on A7
> and A53. put4 is still marginally slower than the C version on A9, but
> I don't think it is worth the trouble to try to work around it.
>
> Rewrapped some paragraphs in the commit message that were unnecessarily
> narrow. Changed the big benchmark table into one with relative speedups.
> ---
> libavcodec/arm/Makefile          |   2 +
> libavcodec/arm/vp9dsp_init_arm.c | 140 +++++++
> libavcodec/arm/vp9mc_neon.S      | 787 +++++++++++++++++++++++++++++++++++++++
> libavcodec/vp9.h                 |   1 +
> libavcodec/vp9dsp.c              |   2 +
> 5 files changed, 932 insertions(+)
> create mode 100644 libavcodec/arm/vp9dsp_init_arm.c
> create mode 100644 libavcodec/arm/vp9mc_neon.S
>
> diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
> index bd4dd4e..2638230 100644
> --- a/libavcodec/arm/Makefile
> +++ b/libavcodec/arm/Makefile
> @@ -45,6 +45,7 @@ OBJS-$(CONFIG_MLP_DECODER)             += arm/mlpdsp_init_arm.o
> OBJS-$(CONFIG_RV40_DECODER)            += arm/rv40dsp_init_arm.o
> OBJS-$(CONFIG_VORBIS_DECODER)          += arm/vorbisdsp_init_arm.o
> OBJS-$(CONFIG_VP6_DECODER)             += arm/vp6dsp_init_arm.o
> +OBJS-$(CONFIG_VP9_DECODER)             += arm/vp9dsp_init_arm.o
>
>
> # ARMv5 optimizations
> @@ -138,3 +139,4 @@ NEON-OBJS-$(CONFIG_RV40_DECODER)       += arm/rv34dsp_neon.o            \
>                                           arm/rv40dsp_neon.o
> NEON-OBJS-$(CONFIG_VORBIS_DECODER)     += arm/vorbisdsp_neon.o
> NEON-OBJS-$(CONFIG_VP6_DECODER)        += arm/vp6dsp_neon.o
> +NEON-OBJS-$(CONFIG_VP9_DECODER)        += arm/vp9mc_neon.o
> diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c
> new file mode 100644
> index 0000000..db8c683
> --- /dev/null
> +++ b/libavcodec/arm/vp9dsp_init_arm.c
> @@ -0,0 +1,140 @@
> +/*
> + * Copyright (c) 2016 Google Inc.
> + *
> + * This file is part of Libav.
> + *
> + * Libav is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * Libav is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with Libav; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include <stdint.h>
> +
> +#include "libavutil/attributes.h"
> +#include "libavutil/arm/cpu.h"
> +#include "libavcodec/vp9.h"
> +
> +#define declare_fpel(type, sz)                                          \
> +void ff_vp9_##type##sz##_neon(uint8_t *dst, ptrdiff_t dst_stride,       \
> +                              const uint8_t *src, ptrdiff_t src_stride, \
> +                              int h, int mx, int my)
> +
> +#define declare_copy_avg(sz) \
> +    declare_fpel(copy, sz);  \
> +    declare_fpel(avg , sz)
> +
> +#define decl_mc_func(op, filter, dir, sz)                                                \
> +void ff_vp9_##op##_##filter##sz##_##dir##_neon(uint8_t *dst, ptrdiff_t dst_stride,       \
> +                                               const uint8_t *src, ptrdiff_t src_stride, \
> +                                               int h, int mx, int my)
> +
> +#define define_8tap_2d_fn(op, filter, sz)                                         \
> +static void op##_##filter##sz##_hv_neon(uint8_t *dst, ptrdiff_t dst_stride,       \
> +                                        const uint8_t *src, ptrdiff_t src_stride, \
> +                                        int h, int mx, int my)                    \
> +{                                                                                 \
> +    LOCAL_ALIGNED_16(uint8_t, temp, [72 * 64]);                                   \
> +    /* We only need h + 7 lines, but the horizontal filter assumes an             \
> +     * even number of rows, so filter h + 8 lines here. */                        \
> +    ff_vp9_put_##filter##sz##_h_neon(temp, 64,                                    \
> +                                     src - 3 * src_stride, src_stride,            \
> +                                     h + 8, mx, 0);                               \
> +    ff_vp9_##op##_##filter##sz##_v_neon(dst, dst_stride,                          \
> +                                        temp + 3 * 64, 64,                        \
> +                                        h, 0, my);                                \
> +}

Since this reads one more line of input than necessary (h + 8 instead of
h + 7), I had to squash in the vp9.h change from the patch below locally;
it grows edge_emu_buffer from 71 * 80 to 72 * 80 bytes.

Nothing else is needed, since that part of the buffer doesn't need to be 
initialized; it just needs to be allocated and not out of bounds.

(In the hv filter, the last uninitialized line is filtered horizontally 
when the filter does 2 lines at a time, but the vertical half doesn't read 
it.)
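
The size arithmetic behind that one-line change can be written out as a small C sketch. The constants are assumptions read off the code above rather than confirmed facts: MAX_BLOCK_H = 64 is taken to be the largest block height and EMU_STRIDE = 80 the row stride of edge_emu_buffer. The helper names are hypothetical.

```c
#include <stddef.h>

enum {
    MAX_BLOCK_H = 64, /* assumed largest block height */
    EMU_STRIDE  = 80, /* assumed row stride of edge_emu_buffer */
};

/* Input rows an 8-tap vertical pass needs to produce h output rows. */
static int rows_needed(int h)
{
    return h + 7;
}

/* Input rows the hv wrapper asks the horizontal pass to read, since
 * that pass works on an even number of rows. */
static int rows_read(int h)
{
    return h + 8;
}

/* Bytes a buffer must provide to hold the given number of rows. */
static size_t emu_buffer_bytes(int rows)
{
    return (size_t)rows * EMU_STRIDE;
}
```

Under these assumptions, rows_needed(MAX_BLOCK_H) is 71 and rows_read(MAX_BLOCK_H) is 72, so the buffer has to grow by one row of 80 bytes, from 5680 to 5760 bytes, for the extra read to stay in bounds.
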

// Martin

Patch

diff --git a/libavcodec/vp9.h b/libavcodec/vp9.h
index 84bed6d..ba622b1 100644
--- a/libavcodec/vp9.h
+++ b/libavcodec/vp9.h
@@ -420,7 +420,7 @@  typedef struct VP9Context {
      // whole-frame cache
      uint8_t *intra_pred_data[3];
      VP9Filter *lflvl;
-    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[71 * 80];
+    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[72 * 80];

      // block reconstruction intermediates
      int16_t *block_base, *block, *uvblock_base[2], *uvblock[2];