[RFC] h264dec: add a CUVID hwaccel

Message ID 1487498989-22793-1-git-send-email-anton@khirnov.net
State New

Commit Message

Anton Khirnov Feb. 19, 2017, 10:09 a.m.
Some parts of the code are based on a patch by
Timo Rothenpieler <timo@rothenpieler.org>
---
Compared to the ffmpeg patch which implements cuvid as a separate decoder using
the higher-level parser API (nvcuvid.h), I did it as a classic hwaccel using
the lower-level decoder API (cuviddec.h).
IMO, this has a number of advantages:
 - integrates much better with the existing acceleration infrastructure/APIs
 - supports stream parameter changes
 - the code is much simpler
 - software fallback
 - various features from h264dec, such as handling weird invalid streams or
   exporting metadata from SEIs

One question to be resolved is retrieving the frames. The way the API works is
that the decoder maintains an internal pool of frames, to which the caller
refers by their indices. When you want the data, you map the frame, which allows
you to copy its contents to a normal CUDA frame. To get optimal performance,
this map+copy needs to be delayed wrt decoding by a few frames, so the question
is how this should be done. The options I see are:
 - introduce a new pixel format, AV_PIX_FMT_CUVID, which wraps the frame index
   and allows transfer to CUDA via av_hwframe_transfer_data(). Then either
   * Return those PIX_FMT_CUVID frames to the caller and let them do the copy
     manually. This is most flexible, but more work for the caller and might
     mean synchronization problems, so we'd need to add locks (perhaps to the
     CUVID frames context).
   * Handle delay+map+copy somewhere else in lavc. The question is where
     would the right place be. Janne suggested at FOSDEM to add a dummy decoder,
     h264_cuvid wrapping h264dec, which would do the delay and copy. That should
     work, but isn't very elegant.
 - we could also add some sort of a "postprocess" stage to AVHWaccel, run before
   returning a frame from decode(), or perhaps invoked separately by the lavc
   generic code.
This issue might be relevant to other future hwaccels as well (VT?), so ideally
the solution would be generic. Comments and further suggestions very much
welcome.
---
 Makefile                |   1 +
 avconv.h                |   2 +
 avconv_cuvid.c          |  83 ++++++++++++++++++
 avconv_opt.c            |   3 +
 configure               |   7 ++
 libavcodec/Makefile     |   2 +
 libavcodec/allcodecs.c  |   1 +
 libavcodec/cuvid.c      | 224 ++++++++++++++++++++++++++++++++++++++++++++++++
 libavcodec/cuvid.h      |  52 +++++++++++
 libavcodec/cuvid_h264.c | 170 ++++++++++++++++++++++++++++++++++++
 libavcodec/h264_slice.c |   6 +-
 11 files changed, 550 insertions(+), 1 deletion(-)
 create mode 100644 avconv_cuvid.c
 create mode 100644 libavcodec/cuvid.c
 create mode 100644 libavcodec/cuvid.h
 create mode 100644 libavcodec/cuvid_h264.c

Comments

wm4 Feb. 19, 2017, 1:48 p.m. | #1
On Sun, 19 Feb 2017 11:09:49 +0100
Anton Khirnov <anton@khirnov.net> wrote:

> Some parts of the code are based on a patch by
> Timo Rothenpieler <timo@rothenpieler.org>
> ---
> Compared to the ffmpeg patch which implements cuvid as a separate decoder using
> the higher-level parser API (nvcuvid.h), I did it as a classic hwaccel using
> the lower-level decoder API (cuviddec.h).
> IMO, this has a number of advantages:
>  - integrates much better with the existing acceleration infrastructure/APIs
>  - supports stream parameter changes
>  - the code is much simpler
>  - software fallback
>  - various features from h264dec, such as handling weird invalid streams or
>    exporting metadata from SEIs
> 



> One question to be resolved is retrieving the frames. The way the API works is
> that the decoder maintains an internal pool of frames, to which the caller
> refers by their indices. When you want the data, you map the frame, which allows
> you to copy its contents to a normal CUDA frame. To get optimal performance,
> this map+copy needs to be delayed wrt decoding by a few frames, so the question
> is how this should be done. The options I see are:
>  - introduce a new pixel format, AV_PIX_FMT_CUVID, which wraps the frame index
>    and allows transfer to CUDA via av_hwframe_transfer_data(). Then either
>    * Return those PIX_FMT_CUVID frames to the caller and let them do the copy
>      manually. This is most flexible, but more work for the caller and might
>      mean synchronization problems, so we'd need to add locks (perhaps to the
>      CUVID frames context).
>    * Handle delay+map+copy somewhere else in lavc. The question is where
>      would the right place be. Janne suggested at FOSDEM to add a dummy decoder,
>      h264_cuvid wrapping h264dec, which would do the delay and copy. That should
>      work, but isn't very elegant.
>  - we could also add some sort of a "postprocess" stage to AVHWaccel, run before
>    returning a frame from decode(), or perhaps invoked separately by the lavc
>    generic code.
> This issue might be relevant to other future hwaccels as well (VT?), so ideally
> the solution would be generic. Comments and further suggestions very much
> welcome.

What is with all this complexity? Is this about the final read-back if
you want to decode to system RAM? In that case, let the API user do it,
like any decent API user already does, and as your first point
suggests. (This means you need to hack avconv.c.) Not sure why "locks"
would be needed for this.

(But certainly I insist that av_hwframe_transfer_data() can be called
in any thread - everything else is insane.)

VT is a whole different question.
Luca Barbato Feb. 19, 2017, 5:22 p.m. | #2
On 19/02/2017 11:09, Anton Khirnov wrote:
>  - introduce a new pixel format, AV_PIX_FMT_CUVID, which wraps the frame index
>    and allows transfer to CUDA via av_hwframe_transfer_data(). Then either
>    * Return those PIX_FMT_CUVID frames to the caller and let them do the copy
>      manually. This is most flexible, but more work for the caller and might
>      mean synchronization problems, so we'd need to add locks (perhaps to the
>      CUVID frames context).
>    * Handle delay+map+copy somewhere else in lavc. The question is where
>      would the right place be. Janne suggested at FOSDEM to add a dummy decoder,
>      h264_cuvid wrapping h264dec, which would do the delay and copy. That should
>      work, but isn't very elegant.

The problem I see with this approach is that it would have to be
duplicated for the many other decoders supported by cuvid, so over time
we'd have to make it generic.

>  - we could also add some sort of a "postprocess" stage to AVHWaccel, run before
>    returning a frame from decode(), or perhaps invoked separately by the lavc
>    generic code.

Sounds somewhat nicer; I'd keep it internal to lavc if it doesn't hurt
performance.

lu
Luca Barbato Feb. 19, 2017, 5:31 p.m. | #3
On 19/02/2017 14:48, wm4 wrote:
> VT is a whole different question.

CoreMedia buffers should probably be exported as-is, and we should have
ways to map them to OpenCL.

For CUVID we might go this way as well, but it seems simpler for everybody
to just provide CUDA buffers out.

lu
Anton Khirnov Feb. 24, 2017, 6:35 p.m. | #4
Quoting wm4 (2017-02-19 14:48:36)
> On Sun, 19 Feb 2017 11:09:49 +0100
> Anton Khirnov <anton@khirnov.net> wrote:
> 
> > Some parts of the code are based on a patch by
> > Timo Rothenpieler <timo@rothenpieler.org>
> > ---
> > Compared to the ffmpeg patch which implements cuvid as a separate decoder using
> > the higher-level parser API (nvcuvid.h), I did it as a classic hwaccel using
> > the lower-level decoder API (cuviddec.h).
> > IMO, this has a number of advantages:
> >  - integrates much better with the existing acceleration infrastructure/APIs
> >  - supports stream parameter changes
> >  - the code is much simpler
> >  - software fallback
> >  - various features from h264dec, such as handling weird invalid streams or
> >    exporting metadata from SEIs
> > 
> 
> 
> 
> > One question to be resolved is retrieving the frames. The way the API works is
> > that the decoder maintains an internal pool of frames, to which the caller
> > refers by their indices. When you want the data, you map the frame, which allows
> > you to copy its contents to a normal CUDA frame. To get optimal performance,
> > this map+copy needs to be delayed wrt decoding by a few frames, so the question
> > is how this should be done. The options I see are:
> >  - introduce a new pixel format, AV_PIX_FMT_CUVID, which wraps the frame index
> >    and allows transfer to CUDA via av_hwframe_transfer_data(). Then either
> >    * Return those PIX_FMT_CUVID frames to the caller and let them do the copy
> >      manually. This is most flexible, but more work for the caller and might
> >      mean synchronization problems, so we'd need to add locks (perhaps to the
> >      CUVID frames context).
> >    * Handle delay+map+copy somewhere else in lavc. The question is where
> >      would the right place be. Janne suggested at FOSDEM to add a dummy decoder,
> >      h264_cuvid wrapping h264dec, which would do the delay and copy. That should
> >      work, but isn't very elegant.
> >  - we could also add some sort of a "postprocess" stage to AVHWaccel, run before
> >    returning a frame from decode(), or perhaps invoked separately by the lavc
> >    generic code.
> > This issue might be relevant to other future hwaccels as well (VT?), so ideally
> > the solution would be generic. Comments and further suggestions very much
> > welcome.
> 
> What is with all this complexity? Is this about the final read-back if
> you want to decode to system RAM? In that case, let the API user do it,
> like any decent API user already does, and as your first point
> suggests. (This means you need to hack avconv.c.) Not sure why "locks"
> would be needed for this.

No, this is about reading the frame from the internal decoder pool into
user-managed GPU memory.
Luca Barbato March 13, 2017, 5:03 a.m. | #5
On 19/02/2017 11:09, Anton Khirnov wrote:
> Some parts of the code are based on a patch by
> Timo Rothenpieler <timo@rothenpieler.org>
> ---
> Compared to the ffmpeg patch which implements cuvid as a separate decoder using
> the higher-level parser API (nvcuvid.h), I did it as a classic hwaccel using
> the lower-level decoder API (cuviddec.h).
> IMO, this has a number of advantages:
>  - integrates much better with the existing acceleration infrastructure/APIs
>  - supports stream parameter changes
>  - the code is much simpler
>  - software fallback
>  - various features from h264dec, such as handling weird invalid streams or
>    exporting metadata from SEIs
> 
> One question to be resolved is retrieving the frames. The way the API works is
> that the decoder maintains an internal pool of frames, to which the caller
> refers by their indices. When you want the data, you map the frame, which allows
> you to copy its contents to a normal CUDA frame. To get optimal performance,
> this map+copy needs to be delayed wrt decoding by a few frames, so the question
> is how this should be done. The options I see are:
>  - introduce a new pixel format, AV_PIX_FMT_CUVID, which wraps the frame index
>    and allows transfer to CUDA via av_hwframe_transfer_data(). Then either
>    * Return those PIX_FMT_CUVID frames to the caller and let them do the copy
>      manually. This is most flexible, but more work for the caller and might
>      mean synchronization problems, so we'd need to add locks (perhaps to the
>      CUVID frames context).
>    * Handle delay+map+copy somewhere else in lavc. The question is where
>      would the right place be. Janne suggested at FOSDEM to add a dummy decoder,
>      h264_cuvid wrapping h264dec, which would do the delay and copy. That should
>      work, but isn't very elegant.
>  - we could also add some sort of a "postprocess" stage to AVHWaccel, run before
>    returning a frame from decode(), or perhaps invoked separately by the lavc
>    generic code.
> This issue might be relevant to other future hwaccels as well (VT?), so ideally
> the solution would be generic. Comments and further suggestions very much
> welcome.

I'd land the code as-is and refactor the memory mapping from there.

Did you look at how the deinterlacer can be wired in, btw?

lu

Patch

diff --git a/Makefile b/Makefile
index 98eb3ab..c5575a8 100644
--- a/Makefile
+++ b/Makefile
@@ -82,6 +82,7 @@  ALLAVPROGS  = $(AVBASENAMES:%=%$(EXESUF))
 $(foreach prog,$(AVBASENAMES),$(eval OBJS-$(prog) += cmdutils.o))
 
 OBJS-avconv                   += avconv_opt.o avconv_filter.o
+OBJS-avconv-$(CONFIG_CUVID)   += avconv_cuvid.o
 OBJS-avconv-$(CONFIG_LIBMFX)  += avconv_qsv.o
 OBJS-avconv-$(CONFIG_VAAPI)   += avconv_vaapi.o
 OBJS-avconv-$(CONFIG_VDA)     += avconv_vda.o
diff --git a/avconv.h b/avconv.h
index 3c3f0ef..b3b943a 100644
--- a/avconv.h
+++ b/avconv.h
@@ -56,6 +56,7 @@  enum HWAccelID {
     HWACCEL_VDA,
     HWACCEL_QSV,
     HWACCEL_VAAPI,
+    HWACCEL_CUVID,
 };
 
 typedef struct HWAccel {
@@ -509,5 +510,6 @@  int qsv_init(AVCodecContext *s);
 int qsv_transcode_init(OutputStream *ost);
 int vaapi_decode_init(AVCodecContext *avctx);
 int vaapi_device_init(const char *device);
+int cuvid_init(AVCodecContext *avctx);
 
 #endif /* AVCONV_H */
diff --git a/avconv_cuvid.c b/avconv_cuvid.c
new file mode 100644
index 0000000..cb29f51
--- /dev/null
+++ b/avconv_cuvid.c
@@ -0,0 +1,83 @@ 
+/*
+ * This file is part of Libav.
+ *
+ * Libav is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * Libav is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with Libav; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <cuda.h>
+#include <stdlib.h>
+
+#include "libavutil/dict.h"
+#include "libavutil/hwcontext.h"
+#include "libavutil/hwcontext_cuda.h"
+#include "libavutil/mem.h"
+#include "libavutil/opt.h"
+
+#include "avconv.h"
+
+static void cuvid_uninit(AVCodecContext *s)
+{
+    InputStream *ist = s->opaque;
+    av_buffer_unref(&ist->hw_frames_ctx);
+}
+
+static int cuvid_device_init(InputStream *ist)
+{
+    int err;
+
+    err = av_hwdevice_ctx_create(&hw_device_ctx, AV_HWDEVICE_TYPE_CUDA,
+                                 ist->hwaccel_device, NULL, 0);
+    if (err < 0) {
+        av_log(NULL, AV_LOG_ERROR, "Error creating a CUDA device\n");
+        return err;
+    }
+
+    return 0;
+}
+
+int cuvid_init(AVCodecContext *s)
+{
+    InputStream *ist = s->opaque;
+    AVHWFramesContext *frames_ctx;
+    int ret;
+
+    if (!hw_device_ctx) {
+        ret = cuvid_device_init(ist);
+        if (ret < 0)
+            return ret;
+    }
+
+    av_buffer_unref(&ist->hw_frames_ctx);
+    ist->hw_frames_ctx = av_hwframe_ctx_alloc(hw_device_ctx);
+    if (!ist->hw_frames_ctx)
+        return AVERROR(ENOMEM);
+
+    frames_ctx   = (AVHWFramesContext*)ist->hw_frames_ctx->data;
+
+    frames_ctx->width             = FFALIGN(s->coded_width,  32);
+    frames_ctx->height            = FFALIGN(s->coded_height, 32);
+    frames_ctx->format            = AV_PIX_FMT_CUDA;
+    frames_ctx->sw_format         = AV_PIX_FMT_NV12;
+
+    ret = av_hwframe_ctx_init(ist->hw_frames_ctx);
+    if (ret < 0) {
+        av_log(NULL, AV_LOG_ERROR, "Error initializing a CUDA frame pool\n");
+        return ret;
+    }
+
+    ist->hwaccel_uninit     = cuvid_uninit;
+
+    return 0;
+}
diff --git a/avconv_opt.c b/avconv_opt.c
index e078a0b..39c0a12 100644
--- a/avconv_opt.c
+++ b/avconv_opt.c
@@ -56,6 +56,9 @@ 
 }
 
 const HWAccel hwaccels[] = {
+#if CONFIG_CUVID
+    { "cuvid", cuvid_init, HWACCEL_CUVID, AV_PIX_FMT_CUDA },
+#endif
 #if HAVE_VDPAU_X11
     { "vdpau", vdpau_init, HWACCEL_VDPAU, AV_PIX_FMT_VDPAU },
 #endif
diff --git a/configure b/configure
index 9ebc3bf..8941814 100755
--- a/configure
+++ b/configure
@@ -236,6 +236,7 @@  External library support:
 
   The following libraries provide various hardware acceleration features:
   --enable-cuda    Nvidia CUDA (dynamically linked)
+  --enable-cuvid   Nvidia CUVID video decode acceleration
   --enable-d3d11va Microsoft Direct3D 11 video acceleration [auto]
   --enable-dxva2   Microsoft DirectX 9 video acceleration [auto]
   --enable-libmfx  Intel MediaSDK (AKA Quick Sync Video)
@@ -1257,6 +1258,7 @@  EXAMPLE_LIST="
 
 HWACCEL_LIBRARY_NONFREE_LIST="
     cuda
+    cuvid
     libnpp
 "
 HWACCEL_LIBRARY_LIST="
@@ -2162,6 +2164,8 @@  vda_extralibs="-framework CoreFoundation -framework VideoDecodeAcceleration -fra
 
 h263_vaapi_hwaccel_deps="vaapi"
 h263_vaapi_hwaccel_select="h263_decoder"
+h264_cuvid_hwaccel_deps="cuvid CUVIDH264PICPARAMS"
+h264_cuvid_hwaccel_select="h264_decoder"
 h264_d3d11va_hwaccel_deps="d3d11va"
 h264_d3d11va_hwaccel_select="h264_decoder"
 h264_dxva2_hwaccel_deps="dxva2"
@@ -4599,6 +4603,8 @@  check_lib psapi    "windows.h psapi.h"    GetProcessMemoryInfo -lpsapi
 
 check_struct "sys/time.h sys/resource.h" "struct rusage" ru_maxrss
 
+check_type "cuviddec.h" "CUVIDH264PICPARAMS"
+
 check_type "windows.h dxva.h" "DXVA_PicParams_HEVC" -DWINAPI_FAMILY=WINAPI_FAMILY_DESKTOP_APP -D_CRT_BUILD_DESKTOP_APP=0
 check_type "windows.h d3d11.h" "ID3D11VideoDecoder"
 check_type "d3d9.h dxva2api.h" DXVA2_ConfigPictureDecode -D_WIN32_WINNT=0x0602
@@ -4654,6 +4660,7 @@  done
 enabled avisynth          && require_header avisynth/avisynth_c.h
 enabled avxsynth          && require avxsynth "avxsynth/avxsynth_c.h dlfcn.h" dlopen -ldl
 enabled cuda              && require cuda cuda.h cuInit -lcuda
+enabled cuvid             && require libnvcuvid cuviddec.h cuvidCreateDecoder -lnvcuvid
 enabled frei0r            && require_header frei0r.h
 enabled gnutls            && require_pkg_config "" gnutls gnutls/gnutls.h gnutls_global_init
 enabled libbs2b           && require_pkg_config "" libbs2b bs2b.h bs2b_open
diff --git a/libavcodec/Makefile b/libavcodec/Makefile
index 7d28d66..af33bf6 100644
--- a/libavcodec/Makefile
+++ b/libavcodec/Makefile
@@ -616,6 +616,7 @@  OBJS-$(CONFIG_ADPCM_YAMAHA_DECODER)       += adpcm.o adpcm_data.o
 OBJS-$(CONFIG_ADPCM_YAMAHA_ENCODER)       += adpcmenc.o adpcm_data.o
 
 # hardware accelerators
+OBJS-$(CONFIG_CUVID)                      += cuvid.o
 OBJS-$(CONFIG_D3D11VA)                    += dxva2.o
 OBJS-$(CONFIG_DXVA2)                      += dxva2.o
 OBJS-$(CONFIG_VAAPI)                      += vaapi_decode.o
@@ -623,6 +624,7 @@  OBJS-$(CONFIG_VDA)                        += vda.o
 OBJS-$(CONFIG_VDPAU)                      += vdpau.o
 
 OBJS-$(CONFIG_H263_VAAPI_HWACCEL)         += vaapi_mpeg4.o
+OBJS-$(CONFIG_H264_CUVID_HWACCEL)         += cuvid_h264.o
 OBJS-$(CONFIG_H264_D3D11VA_HWACCEL)       += dxva2_h264.o
 OBJS-$(CONFIG_H264_DXVA2_HWACCEL)         += dxva2_h264.o
 OBJS-$(CONFIG_H264_QSV_HWACCEL)           += qsvdec_h2645.o
diff --git a/libavcodec/allcodecs.c b/libavcodec/allcodecs.c
index 46c42c5..ae829f4 100644
--- a/libavcodec/allcodecs.c
+++ b/libavcodec/allcodecs.c
@@ -68,6 +68,7 @@  void avcodec_register_all(void)
 
     /* hardware accelerators */
     REGISTER_HWACCEL(H263_VAAPI,        h263_vaapi);
+    REGISTER_HWACCEL(H264_CUVID,        h264_cuvid);
     REGISTER_HWACCEL(H264_D3D11VA,      h264_d3d11va);
     REGISTER_HWACCEL(H264_DXVA2,        h264_dxva2);
     REGISTER_HWACCEL(H264_MMAL,         h264_mmal);
diff --git a/libavcodec/cuvid.c b/libavcodec/cuvid.c
new file mode 100644
index 0000000..d98a134
--- /dev/null
+++ b/libavcodec/cuvid.c
@@ -0,0 +1,224 @@ 
+/*
+ * HW decode acceleration through CUVID
+ *
+ * Copyright (c) 2016 Anton Khirnov
+ *
+ * This file is part of Libav.
+ *
+ * Libav is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * Libav is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with Libav; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <cuda.h>
+#include <cuviddec.h>
+
+#include "libavutil/common.h"
+#include "libavutil/error.h"
+#include "libavutil/hwcontext.h"
+#include "libavutil/hwcontext_cuda.h"
+#include "libavutil/pixdesc.h"
+#include "libavutil/pixfmt.h"
+
+#include "avcodec.h"
+#include "cuvid.h"
+#include "internal.h"
+
+static int map_avcodec_id(enum AVCodecID id)
+{
+    switch (id) {
+    case AV_CODEC_ID_H264: return cudaVideoCodec_H264;
+    }
+    return -1;
+}
+
+static int map_chroma_format(enum AVPixelFormat pix_fmt)
+{
+    int shift_h = 0, shift_v = 0;
+
+    av_pix_fmt_get_chroma_sub_sample(pix_fmt, &shift_h, &shift_v);
+
+    if (shift_h == 1 && shift_v == 1)
+        return cudaVideoChromaFormat_420;
+    else if (shift_h == 1 && shift_v == 0)
+        return cudaVideoChromaFormat_422;
+    else if (shift_h == 0 && shift_v == 0)
+        return cudaVideoChromaFormat_444;
+
+    return -1;
+}
+
+int ff_cuvid_decode_init(AVCodecContext *avctx)
+{
+    CUVIDContext *ctx = avctx->internal->hwaccel_priv_data;
+
+    AVHWFramesContext   *frames_ctx;
+    AVCUDADeviceContext *device_hwctx;
+
+    CUVIDDECODECREATEINFO params = { 0 };
+    CUcontext dummy;
+    CUresult err;
+
+    int cuvid_codec_type, cuvid_chroma_format;
+    int ret = 0;
+
+    cuvid_codec_type = map_avcodec_id(avctx->codec_id);
+    if (cuvid_codec_type < 0) {
+        av_log(avctx, AV_LOG_ERROR, "Unsupported codec ID\n");
+        return AVERROR_BUG;
+    }
+
+    cuvid_chroma_format = map_chroma_format(avctx->sw_pix_fmt);
+    if (cuvid_chroma_format < 0) {
+        av_log(avctx, AV_LOG_ERROR, "Unsupported chroma format\n");
+        return AVERROR(ENOSYS);
+    }
+
+    if (!avctx->hw_frames_ctx) {
+        av_log(avctx, AV_LOG_ERROR, "A hardware frames context is "
+               "required for CUVID decoding.\n");
+        return AVERROR(EINVAL);
+    }
+    frames_ctx   = (AVHWFramesContext*)avctx->hw_frames_ctx->data;
+    device_hwctx = frames_ctx->device_ctx->hwctx;
+
+    ctx->cuda_ctx = device_hwctx->cuda_ctx;
+
+    params.ulWidth             = avctx->coded_width;
+    params.ulHeight            = avctx->coded_height;
+    params.ulTargetWidth       = avctx->coded_width;
+    params.ulTargetHeight      = avctx->coded_height;
+    params.OutputFormat        = cudaVideoSurfaceFormat_NV12;
+    params.CodecType           = cuvid_codec_type;
+    params.ChromaFormat        = cuvid_chroma_format;
+    params.ulNumDecodeSurfaces = 32;
+    params.ulNumOutputSurfaces = 1;
+
+    err = cuCtxPushCurrent(ctx->cuda_ctx);
+    if (err != CUDA_SUCCESS)
+        return AVERROR_UNKNOWN;
+
+    err = cuvidCreateDecoder(&ctx->decoder, &params);
+    if (err != CUDA_SUCCESS) {
+        ret = AVERROR_UNKNOWN;
+        goto finish;
+    }
+
+finish:
+    cuCtxPopCurrent(&dummy);
+
+    return ret;
+}
+
+int ff_cuvid_decode_uninit(AVCodecContext *avctx)
+{
+    CUVIDContext *ctx = avctx->internal->hwaccel_priv_data;
+
+    av_freep(&ctx->bitstream);
+    ctx->bitstream_len       = 0;
+    ctx->bitstream_allocated = 0;
+
+    av_freep(&ctx->slice_offsets);
+    ctx->nb_slices               = 0;
+    ctx->slice_offsets_allocated = 0;
+
+    if (ctx->decoder)
+        cuvidDestroyDecoder(ctx->decoder);
+    ctx->decoder = NULL;
+
+    return 0;
+}
+
+int ff_cuvid_start_frame(AVCodecContext *avctx)
+{
+    CUVIDContext *ctx = avctx->internal->hwaccel_priv_data;
+
+    ctx->bitstream_len = 0;
+    ctx->nb_slices     = 0;
+
+    return 0;
+}
+
+int ff_cuvid_end_frame(AVCodecContext *avctx, CUVIDFrame *frame)
+{
+    CUVIDContext  *ctx = avctx->internal->hwaccel_priv_data;
+    CUVIDPICPARAMS *pp = &ctx->pic_params;
+
+    CUVIDPROCPARAMS vpp = { .progressive_frame = 1 };
+    CUresult err;
+    CUcontext dummy;
+    CUdeviceptr devptr;
+
+    unsigned int pitch, i;
+    unsigned int offset = 0;
+    int ret = 0;
+
+    pp->nBitstreamDataLen = ctx->bitstream_len;
+    pp->pBitstreamData    = ctx->bitstream;
+    pp->nNumSlices        = ctx->nb_slices;
+    pp->pSliceDataOffsets = ctx->slice_offsets;
+
+    err = cuCtxPushCurrent(ctx->cuda_ctx);
+    if (err != CUDA_SUCCESS)
+        return AVERROR_UNKNOWN;
+
+    err = cuvidDecodePicture(ctx->decoder, &ctx->pic_params);
+    if (err != CUDA_SUCCESS) {
+        av_log(avctx, AV_LOG_ERROR, "Error decoding a picture with CUVID: %d\n",
+               err);
+        ret = AVERROR_UNKNOWN;
+        goto finish;
+    }
+
+    if (pp->field_pic_flag && !pp->second_field)
+        goto finish;
+
+    err = cuvidMapVideoFrame(ctx->decoder, frame->idx, &devptr, &pitch, &vpp);
+    if (err != CUDA_SUCCESS) {
+        av_log(avctx, AV_LOG_ERROR, "Error mapping a picture with CUVID: %d\n",
+               err);
+        ret = AVERROR_UNKNOWN;
+        goto finish;
+    }
+
+    for (i = 0; frame->f->data[i]; i++) {
+        CUDA_MEMCPY2D cpy = {
+            .srcMemoryType = CU_MEMORYTYPE_DEVICE,
+            .dstMemoryType = CU_MEMORYTYPE_DEVICE,
+            .srcDevice     = devptr,
+            .dstDevice     = (CUdeviceptr)frame->f->data[i],
+            .srcPitch      = pitch,
+            .dstPitch      = frame->f->linesize[i],
+            .srcY          = offset,
+            .WidthInBytes  = FFMIN(pitch, frame->f->linesize[i]),
+            .Height        = avctx->coded_height >> (i ? 1 : 0),
+        };
+
+        err = cuMemcpy2D(&cpy);
+        if (err != CUDA_SUCCESS) {
+            av_log(avctx, AV_LOG_ERROR, "Error copying decoded frame: %d\n",
+                   err);
+            ret = AVERROR_UNKNOWN;
+            goto copy_fail;
+        }
+
+        offset += cpy.Height;
+    }
+
+copy_fail:
+    cuvidUnmapVideoFrame(ctx->decoder, devptr);
+
+finish:
+    cuCtxPopCurrent(&dummy);
+    return ret;
+}
diff --git a/libavcodec/cuvid.h b/libavcodec/cuvid.h
new file mode 100644
index 0000000..e92e807
--- /dev/null
+++ b/libavcodec/cuvid.h
@@ -0,0 +1,52 @@ 
+/*
+ * HW decode acceleration through CUVID
+ *
+ * Copyright (c) 2016 Anton Khirnov
+ *
+ * This file is part of Libav.
+ *
+ * Libav is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * Libav is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with Libav; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_CUVID_H
+#define AVCODEC_CUVID_H
+
+#include <cuviddec.h>
+
+typedef struct CUVIDFrame {
+    int idx;
+    AVFrame *f;
+} CUVIDFrame;
+
+typedef struct CUVIDContext {
+    CUcontext      cuda_ctx;
+    CUvideodecoder decoder;
+    CUVIDPICPARAMS pic_params;
+
+    uint8_t     *bitstream;
+    int          bitstream_len;
+    unsigned int bitstream_allocated;
+
+    unsigned    *slice_offsets;
+    int          nb_slices;
+    unsigned int slice_offsets_allocated;
+} CUVIDContext;
+
+int ff_cuvid_decode_init(AVCodecContext *avctx);
+int ff_cuvid_decode_uninit(AVCodecContext *avctx);
+int ff_cuvid_start_frame(AVCodecContext *avctx);
+int ff_cuvid_end_frame(AVCodecContext *avctx, CUVIDFrame *frame);
+
+#endif /* AVCODEC_CUVID_H */
diff --git a/libavcodec/cuvid_h264.c b/libavcodec/cuvid_h264.c
new file mode 100644
index 0000000..770576f
--- /dev/null
+++ b/libavcodec/cuvid_h264.c
@@ -0,0 +1,170 @@ 
+/*
+ * MPEG-4 Part 10 / AVC / H.264 HW decode acceleration through CUVID
+ *
+ * Copyright (c) 2016 Anton Khirnov
+ *
+ * This file is part of Libav.
+ *
+ * Libav is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * Libav is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with Libav; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <cuviddec.h>
+#include <stdint.h>
+#include <string.h>
+
+#include "avcodec.h"
+#include "cuvid.h"
+#include "internal.h"
+#include "h264dec.h"
+
+static void dpb_add(const H264Context *h, CUVIDH264DPBENTRY *dst, const H264Picture *src,
+                    int frame_idx)
+{
+    const CUVIDFrame *frame = src->hwaccel_picture_private;
+
+    dst->PicIdx             = frame->idx;
+    dst->FrameIdx           = frame_idx;
+    dst->is_long_term       = src->long_ref;
+    dst->not_existing       = 0;
+    dst->used_for_reference = src->reference & 3;
+    dst->FieldOrderCnt[0]   = src->field_poc[0];
+    dst->FieldOrderCnt[1]   = src->field_poc[1];
+}
+
+static int cuvid_h264_start_frame(AVCodecContext *avctx,
+                                  const uint8_t *buffer, uint32_t size)
+{
+    const H264Context *h = avctx->priv_data;
+    const PPS *pps = h->ps.pps;
+    const SPS *sps = h->ps.sps;
+
+    CUVIDContext       *ctx = avctx->internal->hwaccel_priv_data;
+    CUVIDPICPARAMS      *pp = &ctx->pic_params;
+    CUVIDH264PICPARAMS *ppc = &pp->CodecSpecific.h264;
+    CUVIDFrame       *frame = h->cur_pic_ptr->hwaccel_picture_private;
+
+    int i, dpb_size;
+
+    frame->idx = h->cur_pic_ptr - h->DPB;
+    frame->f   = h->cur_pic_ptr->f;
+
+    *pp = (CUVIDPICPARAMS) {
+        .PicWidthInMbs     = h->mb_width,
+        .FrameHeightInMbs  = h->mb_height,
+        .CurrPicIdx        = frame->idx,
+        .field_pic_flag    = FIELD_PICTURE(h),
+        .bottom_field_flag = h->picture_structure == PICT_BOTTOM_FIELD,
+        .second_field      = FIELD_PICTURE(h) && !h->first_field,
+        .ref_pic_flag      = h->nal_ref_idc != 0,
+        .intra_pic_flag    = 0,
+
+        .CodecSpecific.h264 = {
+            .log2_max_frame_num_minus4            = sps->log2_max_frame_num - 4,
+            .pic_order_cnt_type                   = sps->poc_type,
+            .log2_max_pic_order_cnt_lsb_minus4    = FFMAX(sps->log2_max_poc_lsb - 4, 0),
+            .delta_pic_order_always_zero_flag     = sps->delta_pic_order_always_zero_flag,
+            .frame_mbs_only_flag                  = sps->frame_mbs_only_flag,
+            .direct_8x8_inference_flag            = sps->direct_8x8_inference_flag,
+            .num_ref_frames                       = sps->ref_frame_count,
+            .residual_colour_transform_flag       = sps->residual_color_transform_flag,
+            .bit_depth_luma_minus8                = sps->bit_depth_luma - 8,
+            .bit_depth_chroma_minus8              = sps->bit_depth_chroma - 8,
+            .qpprime_y_zero_transform_bypass_flag = sps->transform_bypass,
+
+            .entropy_coding_mode_flag               = pps->cabac,
+            .pic_order_present_flag                 = pps->pic_order_present,
+            .num_ref_idx_l0_active_minus1           = pps->ref_count[0] - 1,
+            .num_ref_idx_l1_active_minus1           = pps->ref_count[1] - 1,
+            .weighted_pred_flag                     = pps->weighted_pred,
+            .weighted_bipred_idc                    = pps->weighted_bipred_idc,
+            .pic_init_qp_minus26                    = pps->init_qp - 26,
+            .deblocking_filter_control_present_flag = pps->deblocking_filter_parameters_present,
+            .redundant_pic_cnt_present_flag         = pps->redundant_pic_cnt_present,
+            .transform_8x8_mode_flag                = pps->transform_8x8_mode,
+            .MbaffFrameFlag                         = sps->mb_aff && !FIELD_PICTURE(h),
+            .constrained_intra_pred_flag            = pps->constrained_intra_pred,
+            .chroma_qp_index_offset                 = pps->chroma_qp_index_offset[0],
+            .second_chroma_qp_index_offset          = pps->chroma_qp_index_offset[1],
+            .ref_pic_flag                           = h->nal_ref_idc != 0,
+            .frame_num                              = h->poc.frame_num,
+            .CurrFieldOrderCnt[0]                   = h->cur_pic_ptr->field_poc[0],
+            .CurrFieldOrderCnt[1]                   = h->cur_pic_ptr->field_poc[1],
+        },
+    };
+
+    /* CUVID takes the 4x4 lists and only the luma 8x8 lists (intra and inter) */
+    memcpy(ppc->WeightScale4x4,    pps->scaling_matrix4,    sizeof(ppc->WeightScale4x4));
+    memcpy(ppc->WeightScale8x8[0], pps->scaling_matrix8[0], sizeof(ppc->WeightScale8x8[0]));
+    memcpy(ppc->WeightScale8x8[1], pps->scaling_matrix8[3], sizeof(ppc->WeightScale8x8[0]));
+
+    /* fill the DPB with the currently active short-term and long-term references */
+    dpb_size = 0;
+    for (i = 0; i < h->short_ref_count; i++)
+        dpb_add(h, &ppc->dpb[dpb_size++], h->short_ref[i], h->short_ref[i]->frame_num);
+    for (i = 0; i < 16; i++) {
+        if (h->long_ref[i])
+            dpb_add(h, &ppc->dpb[dpb_size++], h->long_ref[i], i);
+    }
+
+    for (i = dpb_size; i < FF_ARRAY_ELEMS(ppc->dpb); i++)
+        ppc->dpb[i].PicIdx = -1;
+
+    return ff_cuvid_start_frame(avctx);
+}
+
+static int cuvid_h264_decode_slice(AVCodecContext *avctx, const uint8_t *buffer,
+                                   uint32_t size)
+{
+    CUVIDContext *ctx = avctx->internal->hwaccel_priv_data;
+    void *tmp;
+
+    tmp = av_fast_realloc(ctx->bitstream, &ctx->bitstream_allocated,
+                          ctx->bitstream_len + size + 3);
+    if (!tmp)
+        return AVERROR(ENOMEM);
+    ctx->bitstream = tmp;
+
+    tmp = av_fast_realloc(ctx->slice_offsets, &ctx->slice_offsets_allocated,
+                          (ctx->nb_slices + 1) * sizeof(*ctx->slice_offsets));
+    if (!tmp)
+        return AVERROR(ENOMEM);
+    ctx->slice_offsets = tmp;
+
+    /* prepend an Annex B start code (00 00 01) before each slice NAL */
+    AV_WB24(ctx->bitstream + ctx->bitstream_len, 1);
+    memcpy(ctx->bitstream + ctx->bitstream_len + 3, buffer, size);
+    ctx->slice_offsets[ctx->nb_slices] = ctx->bitstream_len;
+    ctx->bitstream_len += size + 3;
+    ctx->nb_slices++;
+
+    return 0;
+}
+
+static int cuvid_h264_end_frame(AVCodecContext *avctx)
+{
+    H264Context *h = avctx->priv_data;
+    return ff_cuvid_end_frame(avctx, h->cur_pic_ptr->hwaccel_picture_private);
+}
+
+AVHWAccel ff_h264_cuvid_hwaccel = {
+    .name                 = "h264_cuvid",
+    .type                 = AVMEDIA_TYPE_VIDEO,
+    .id                   = AV_CODEC_ID_H264,
+    .pix_fmt              = AV_PIX_FMT_CUDA,
+    .start_frame          = cuvid_h264_start_frame,
+    .end_frame            = cuvid_h264_end_frame,
+    .decode_slice         = cuvid_h264_decode_slice,
+    .init                 = ff_cuvid_decode_init,
+    .uninit               = ff_cuvid_decode_uninit,
+    .priv_data_size       = sizeof(CUVIDContext),
+    .frame_priv_data_size = sizeof(CUVIDFrame),
+};
diff --git a/libavcodec/h264_slice.c b/libavcodec/h264_slice.c
index f1f5fc0..9984d40 100644
--- a/libavcodec/h264_slice.c
+++ b/libavcodec/h264_slice.c
@@ -720,7 +720,8 @@  static enum AVPixelFormat get_pixel_format(H264Context *h)
                      CONFIG_H264_D3D11VA_HWACCEL + \
                      CONFIG_H264_VAAPI_HWACCEL + \
                      (CONFIG_H264_VDA_HWACCEL * 2) + \
-                     CONFIG_H264_VDPAU_HWACCEL)
+                     CONFIG_H264_VDPAU_HWACCEL     + \
+                     CONFIG_H264_CUVID_HWACCEL)
     enum AVPixelFormat pix_fmts[HWACCEL_MAX + 2], *fmt = pix_fmts;
     const enum AVPixelFormat *choices = pix_fmts;
 
@@ -751,6 +752,9 @@  static enum AVPixelFormat get_pixel_format(H264Context *h)
 #if CONFIG_H264_VDPAU_HWACCEL
         *fmt++ = AV_PIX_FMT_VDPAU;
 #endif
+#if CONFIG_H264_CUVID_HWACCEL
+        *fmt++ = AV_PIX_FMT_CUDA;
+#endif
         if (CHROMA444(h)) {
             if (h->avctx->colorspace == AVCOL_SPC_RGB)
                 *fmt++ = AV_PIX_FMT_GBRP;