arm: vp9: Add NEON optimizations of VP9 MC functions

Message ID alpine.DEB.2.11.1610191231030.32173@cone.martin.st
State Superseded

Commit Message

Martin Storsjö Oct. 19, 2016, 9:32 a.m.
On Wed, 19 Oct 2016, Martin Storsjö wrote:

> On Tue, 11 Oct 2016, Martin Storsjö wrote:
>
>> This work is sponsored by, and copyright, Google.
>> 
>> The filter coefficients are signed values, where the product of the
>> multiplication with one individual filter coefficient doesn't
>> overflow a 16 bit signed value (the largest filter coefficient is
>> 127). But when the products are accumulated, the resulting sum can
>> overflow the 16 bit signed range. Instead of accumulating in 32 bit,
>> we accumulate all filter taps but the largest one in one register, and
>> the largest one (either index 3 or 4) in a separate one, added with
>> saturation afterwards.
>> 
>> (The VP8 MC asm does something similar, but slightly simpler, by
>> accumulating each half of the filter separately. In the VP9 MC
>> filters, each half of the filter can also overflow though, so the
>> largest component has to be handled individually.)
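
The accumulation scheme described above can be sketched in scalar C. This is only an illustration under stated assumptions, not the NEON code from the patch: the tap values in example_taps are hypothetical (chosen to sum to 128 with the largest tap at index 3), and the names sat_add16, filter8_sat16 and filter8_ref32 are made up for this sketch. The rounding shift by 7 assumes taps that sum to 128.

```c
#include <stdint.h>

/* Hypothetical 8-tap filter: sums to 128, largest tap (120) at index 3. */
static const int16_t example_taps[8] = { -2, 6, -18, 120, 30, -10, 3, -1 };

/* Saturating signed 16-bit add (what a saturating vector add does per lane). */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Round, shift down by 7 and clip to 0..255. */
static uint8_t round_clip(int32_t sum)
{
    int32_t v = sum + 64;
    if (v < 0) return 0;
    v >>= 7;
    return v > 255 ? 255 : (uint8_t)v;
}

/* 16-bit variant: all taps except the largest go into one 16-bit
 * accumulator (assumed not to overflow for this filter), and the largest
 * product is added with saturation afterwards. */
static uint8_t filter8_sat16(const uint8_t *src, const int16_t taps[8], int big)
{
    int16_t acc = 0;
    for (int i = 0; i < 8; i++)
        if (i != big)
            acc = (int16_t)(acc + src[i] * taps[i]);
    /* A single product is at most 255 * 127 = 32385, so it fits in int16_t. */
    int16_t largest = (int16_t)(src[big] * taps[big]);
    return round_clip(sat_add16(acc, largest));
}

/* Reference variant accumulating everything in 32 bit. */
static uint8_t filter8_ref32(const uint8_t *src, const int16_t taps[8])
{
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += src[i] * taps[i];
    return round_clip(sum);
}
```

Saturation does not change the result here because a sum that exceeds the signed 16 bit range would be clipped to 255 after the shift anyway: 32767 rounds and shifts to 256, which clips to 255.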
>> 
>> Examples of relative speedup compared with the C version, from checkasm:
>>                       Cortex      A7     A8     A9    A53
>> vp9_avg4_neon:                   1.63   1.19   1.37   1.54
>> vp9_avg8_neon:                   2.24   3.65   3.30   2.56
>> vp9_avg16_neon:                  2.64   6.72   2.92   2.80
>> vp9_avg32_neon:                  2.50   5.45   2.76   2.45
>> vp9_avg64_neon:                  2.69   5.81   2.72   2.79
>> vp9_avg_8tap_smooth_4h_neon:     3.29   4.71   2.90   4.78
>> vp9_avg_8tap_smooth_4hv_neon:    3.78   4.54   3.33   4.51
>> vp9_avg_8tap_smooth_4v_neon:     5.18   6.52   4.29   5.50
>> vp9_avg_8tap_smooth_8h_neon:     6.47   9.07   5.47   9.66
>> vp9_avg_8tap_smooth_8hv_neon:    6.60   8.16   5.98   7.91
>> vp9_avg_8tap_smooth_8v_neon:     9.25  12.67   8.07   9.61
>> vp9_avg_8tap_smooth_64h_neon:    6.98  10.36   5.91  11.59
>> vp9_avg_8tap_smooth_64hv_neon:   6.39   9.15   6.05   8.48
>> vp9_avg_8tap_smooth_64v_neon:   10.64  14.12   9.39  11.05
>> vp9_put4_neon:                   1.30   1.15   0.89   1.35
>> vp9_put8_neon:                   1.28   2.07   1.80   1.62
>> vp9_put16_neon:                  1.64   4.08   1.71   1.93
>> vp9_put32_neon:                  1.53   3.68   2.12   1.68
>> vp9_put64_neon:                  2.01   3.98   1.91   1.91
>> vp9_put_8tap_smooth_4h_neon:     3.05   4.47   2.68   4.53
>> vp9_put_8tap_smooth_4hv_neon:    3.74   4.46   3.31   4.50
>> vp9_put_8tap_smooth_4v_neon:     5.23   6.19   4.28   5.87
>> vp9_put_8tap_smooth_8h_neon:     5.89   8.40   4.97   8.99
>> vp9_put_8tap_smooth_8hv_neon:    6.84   8.05   5.90   7.91
>> vp9_put_8tap_smooth_8v_neon:     9.41  11.97   7.99  10.49
>> vp9_put_8tap_smooth_64h_neon:    6.61   9.79   5.06  11.35
>> vp9_put_8tap_smooth_64hv_neon:   7.01   9.13   6.37   9.17
>> vp9_put_8tap_smooth_64v_neon:   11.09  13.31   9.32  12.52
>> 
>> For the larger 8tap filters, the speedup vs C code is around 6-14x.
>> 
>> This is significantly faster than libvpx's implementation of the same
>> functions, at least when comparing the put_8tap_smooth_64 functions
>> (compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from
>> libvpx).
>> 
>> Absolute runtimes from checkasm:
>>                          Cortex      A7        A8        A9       A53
>> vp9_put_8tap_smooth_64h_neon:    21229.8   14474.5   19790.1   10885.1
>> libvpx vpx_convolve8_horiz_neon: 52623.3   19736.4   21907.7   25027.7
>> 
>> vp9_put_8tap_smooth_64v_neon:    14966.6   12297.2   13786.5   11679.4
>> libvpx vpx_convolve8_vert_neon:  42090.0   17706.2   17659.9   16941.2
>> 
>> Thus, on the A9, the horizontal filter is only marginally faster than
>> libvpx, while our version is significantly faster on the other cores,
>> and the vertical filter is significantly faster on all cores. The
>> difference is especially large on the A7.
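
For reference, the comparison above follows from dividing the libvpx runtimes by ours. A small sketch of that arithmetic, using only the numbers from the table (the array and function names are made up for illustration):

```c
enum { N_CORES = 4 }; /* Column order: Cortex A7, A8, A9, A53 */

/* Absolute checkasm runtimes copied from the table above. */
static const double ours_h[N_CORES]   = { 21229.8, 14474.5, 19790.1, 10885.1 };
static const double libvpx_h[N_CORES] = { 52623.3, 19736.4, 21907.7, 25027.7 };
static const double ours_v[N_CORES]   = { 14966.6, 12297.2, 13786.5, 11679.4 };
static const double libvpx_v[N_CORES] = { 42090.0, 17706.2, 17659.9, 16941.2 };

/* Relative speedup: how many times faster "ours" is than "theirs". */
static double speedup(double theirs, double ours)
{
    return theirs / ours;
}
```

Evaluating speedup(libvpx_h[i], ours_h[i]) gives roughly 2.5x, 1.4x, 1.1x and 2.3x for the horizontal filter, and the vertical arrays give roughly 2.8x, 1.4x, 1.3x and 1.45x.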
>> 
>> The libvpx implementation does the accumulation in 32 bit, which
>> probably explains most of the differences.
>> ---
>> Since the previous version, I tuned the avg4 and put4 a bit further,
>> making avg4 faster than C on the A8, and improving put4 a little on A7
>> and A53. put4 is still marginally slower than the C version on A9 though,
>> but I think it isn't worth the trouble to try to work around it.
>> 
>> Rewrapped some paragraphs in the commit message, that were unnecessarily
>> narrow. Changed the big benchmark table into one with relative speedups.
>> ---
>> libavcodec/arm/Makefile          |   2 +
>> libavcodec/arm/vp9dsp_init_arm.c | 140 +++++++
>> libavcodec/arm/vp9mc_neon.S      | 787 +++++++++++++++++++++++++++++++++++++++
>> libavcodec/vp9.h                 |   1 +
>> libavcodec/vp9dsp.c              |   2 +
>> 5 files changed, 932 insertions(+)
>> create mode 100644 libavcodec/arm/vp9dsp_init_arm.c
>> create mode 100644 libavcodec/arm/vp9mc_neon.S
>> 
>> diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
>> index bd4dd4e..2638230 100644
>> --- a/libavcodec/arm/Makefile
>> +++ b/libavcodec/arm/Makefile
>> @@ -45,6 +45,7 @@ OBJS-$(CONFIG_MLP_DECODER)             += arm/mlpdsp_init_arm.o
>> OBJS-$(CONFIG_RV40_DECODER)            += arm/rv40dsp_init_arm.o
>> OBJS-$(CONFIG_VORBIS_DECODER)          += arm/vorbisdsp_init_arm.o
>> OBJS-$(CONFIG_VP6_DECODER)             += arm/vp6dsp_init_arm.o
>> +OBJS-$(CONFIG_VP9_DECODER)             += arm/vp9dsp_init_arm.o
>> 
>> 
>> # ARMv5 optimizations
>> @@ -138,3 +139,4 @@ NEON-OBJS-$(CONFIG_RV40_DECODER)       += arm/rv34dsp_neon.o            \
>>                                           arm/rv40dsp_neon.o
>> NEON-OBJS-$(CONFIG_VORBIS_DECODER)     += arm/vorbisdsp_neon.o
>> NEON-OBJS-$(CONFIG_VP6_DECODER)        += arm/vp6dsp_neon.o
>> +NEON-OBJS-$(CONFIG_VP9_DECODER)        += arm/vp9mc_neon.o
>> diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c
>> new file mode 100644
>> index 0000000..db8c683
>> --- /dev/null
>> +++ b/libavcodec/arm/vp9dsp_init_arm.c
>> @@ -0,0 +1,140 @@
>> +/*
>> + * Copyright (c) 2016 Google Inc.
>> + *
>> + * This file is part of Libav.
>> + *
>> + * Libav is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU Lesser General Public
>> + * License as published by the Free Software Foundation; either
>> + * version 2.1 of the License, or (at your option) any later version.
>> + *
>> + * Libav is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * Lesser General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU Lesser General Public
>> + * License along with Libav; if not, write to the Free Software
>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
>> + */
>> +
>> +#include <stdint.h>
>> +
>> +#include "libavutil/attributes.h"
>> +#include "libavutil/arm/cpu.h"
>> +#include "libavcodec/vp9.h"
>> +
>> +#define declare_fpel(type, sz)                                          \
>> +void ff_vp9_##type##sz##_neon(uint8_t *dst, ptrdiff_t dst_stride,       \
>> +                              const uint8_t *src, ptrdiff_t src_stride, \
>> +                              int h, int mx, int my)
>> +
>> +#define declare_copy_avg(sz) \
>> +    declare_fpel(copy, sz);  \
>> +    declare_fpel(avg , sz)
>> +
>> +#define decl_mc_func(op, filter, dir, sz)                                                \
>> +void ff_vp9_##op##_##filter##sz##_##dir##_neon(uint8_t *dst, ptrdiff_t dst_stride,       \
>> +                                               const uint8_t *src, ptrdiff_t src_stride, \
>> +                                               int h, int mx, int my)
>> +
>> +#define define_8tap_2d_fn(op, filter, sz)                                         \
>> +static void op##_##filter##sz##_hv_neon(uint8_t *dst, ptrdiff_t dst_stride,       \
>> +                                        const uint8_t *src, ptrdiff_t src_stride, \
>> +                                        int h, int mx, int my)                    \
>> +{                                                                                 \
>> +    LOCAL_ALIGNED_16(uint8_t, temp, [72 * 64]);                                   \
>> +    /* We only need h + 7 lines, but the horizontal filter assumes an             \
>> +     * even number of rows, so filter h + 8 lines here. */                        \
>> +    ff_vp9_put_##filter##sz##_h_neon(temp, 64,                                    \
>> +                                     src - 3 * src_stride, src_stride,            \
>> +                                     h + 8, mx, 0);                               \
>> +    ff_vp9_##op##_##filter##sz##_v_neon(dst, dst_stride,                          \
>> +                                        temp + 3 * 64, 64,                        \
>> +                                        h, 0, my);                                \
>> +}
>
> Since this reads one more line of input than necessary (h + 8 instead of h + 
> 7), I had to squash in this change locally:
>
> diff --git a/libavcodec/vp9.h b/libavcodec/vp9.h
> index 84bed6d..ba622b1 100644
> --- a/libavcodec/vp9.h
> +++ b/libavcodec/vp9.h
> @@ -420,7 +420,7 @@ typedef struct VP9Context {
>     // whole-frame cache
>     uint8_t *intra_pred_data[3];
>     VP9Filter *lflvl;
> -    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[71 * 80];
> +    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[72 * 80];
>
>     // block reconstruction intermediates
>     int16_t *block_base, *block, *uvblock_base[2], *uvblock[2];
>
>
> Nothing else is needed, since that part of the buffer doesn't need to be 
> initialized; it just needs to be allocated and not out of bounds.
>
> (In the hv filter, the last uninitialized line is filtered horizontally when 
> the filter does 2 lines at a time, but the vertical half doesn't read it.)
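
The row counts behind the 71 -> 72 change can be checked with a small sketch. The helper names and the enum below are made up for illustration; only the constants (8 taps, blocks up to 64 rows, strides of 64 and 80 bytes) come from the patch and the discussion above.

```c
enum {
    FILTER_TAPS   = 8,  /* 8-tap filter: 3 rows above, 4 rows below */
    MAX_BLOCK_H   = 64, /* tallest block */
    TEMP_STRIDE   = 64, /* stride of the temp buffer in the hv function */
    EDGE_STRIDE   = 80  /* stride of edge_emu_buffer */
};

/* Rows the vertical filter actually reads for h output rows: h + 7. */
static int rows_needed(int h)
{
    return h + FILTER_TAPS - 1;
}

/* Rows the horizontal pass touches when it works on two rows at a time:
 * rows_needed() rounded up to an even number, i.e. h + 8 for even h. */
static int rows_filtered(int h)
{
    return (rows_needed(h) + 1) & ~1;
}

/* Bytes a buffer must hold so that the extra row stays in bounds. */
static int buffer_bytes(int h, int stride)
{
    return rows_filtered(h) * stride;
}
```

For h = 64 this gives 71 rows needed but 72 rows touched, so buffer_bytes(MAX_BLOCK_H, EDGE_STRIDE) is 72 * 80 and buffer_bytes(MAX_BLOCK_H, TEMP_STRIDE) is 72 * 64, matching the sizes used above.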

Just to be safe, I extended this fix a bit further like this:


// Martin

Patch

diff --git a/libavcodec/vp9.h b/libavcodec/vp9.h
index 84bed6d..e4b9f82 100644
--- a/libavcodec/vp9.h
+++ b/libavcodec/vp9.h
@@ -420,7 +420,8 @@  typedef struct VP9Context {
      // whole-frame cache
      uint8_t *intra_pred_data[3];
      VP9Filter *lflvl;
-    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[71 * 80];
+    // This requires 64 + 8 rows, with 80 bytes stride
+    DECLARE_ALIGNED(32, uint8_t, edge_emu_buffer)[72 * 80];

      // block reconstruction intermediates
      int16_t *block_base, *block, *uvblock_base[2], *uvblock[2];
diff --git a/libavcodec/vp9block.c b/libavcodec/vp9block.c
index 5a3b356..ba60b54 100644
--- a/libavcodec/vp9block.c
+++ b/libavcodec/vp9block.c
@@ -1176,8 +1176,10 @@  static av_always_inline void mc_luma_dir(VP9Context *s, vp9_mc_func(*mc)[2],
      ff_thread_await_progress(ref_frame, FFMAX(th, 0), 0);

      // FIXME bilinear filter only needs 0/1 pixels, not 3/4
+    // The arm/aarch64 _hv filters might read one more row than what
+    // actually is needed, so switch to emulated edge one line sooner.
      if (x < !!mx * 3 || y < !!my * 3 ||
-        x + !!mx * 4 > w - bw || y + !!my * 4 > h - bh) {
+        x + !!mx * 4 > w - bw || y + !!my * 5 > h - bh) {
          s->vdsp.emulated_edge_mc(s->edge_emu_buffer,
                                   ref - !!my * 3 * ref_stride - !!mx * 3,