Learn MP4 Well and Make Your Live Streams Shine

Date: 2019-11-16

Original link: villainhr

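MP4 actually stands for MPEG-4 Part 14; it is just Part 14 of the MPEG-4 standard and is specified as an ISO/IEC standard (ISO/IEC 14496-14). What MP4 mainly buys you is seeking and fast playback, plus playing while downloading. It is based on MOV and later grew into a format of its own. Two related file formats are 3GP and M4V.

MP4 is a bit more complex than FLV: it carries all of its data through nesting. In other words, every piece of content is an object, and to play something you only need to fetch the corresponding object.

The most basic unit of MP4 is the box; a file is built by concatenating independent boxes one after another. So let's start with the box.

PS: for a front-end developer, understanding MP4 is in most situations not just useless but a bit of a waste of time. This article is aimed at readers interested in audio/video development, especially developers working on live streaming or video players.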

MP4 box

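MP4 boxes can be divided into basic boxes and full boxes.

basic box: the plain, basic boxes, for example ftyp and moov.
full box: a box that additionally carries version and flags fields; most boxes that describe the media itself are of this kind.

Once more for emphasis: the box is the core of MP4. When writing a decoder or encoder, it pays to memorize its basic layout; you will have a much happier time writing code (speaking from experience).

OK, let's look at the concrete structure of a box.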

basic box

First, the structure of a basic box. Expressed in code it is:

aligned(8) class Box (unsigned int(32) boxtype, optional unsigned int(8)[16] extended_type) {
   unsigned int(32) size;
   unsigned int(32) type = boxtype;
   if (size==1) {
      unsigned int(64) largesize;
   }  else if (size==0) {
      // box extends to end of file
   }
   // This branch covers MP4 extension box types; it rarely occurs
   if (boxtype=='uuid') {
      unsigned int(8)[16] usertype = extended_type;
   }
}

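The code above is already fairly clear, but here is a short walkthrough.

size[4B]: the size of the box, header and body included. A 4-byte field cannot describe a very large box, hence the branch: when size == 1, an 8-byte largesize field follows the type and holds the real size. When size == 0, the box is the last one in the file and extends to the end of the file.
type[4B]: identifies the kind of box. Its content is simply the ASCII codes of the box's four-letter name. Since a box name is only 4 characters long, calling the charCodeAt API 4 times is enough.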

// Get the contents of the type field for a given box (val is the 4-character box name)
val.charCodeAt(0)
val.charCodeAt(1)
val.charCodeAt(2)
val.charCodeAt(3)
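
As a rough sketch, the four calls can be wrapped into one helper. The name typeToBytes is made up for illustration and is not part of the original code:

```javascript
// Convert a four-character box name into its 4 ASCII bytes.
// typeToBytes is a hypothetical helper name, not an existing API.
function typeToBytes(name) {
  if (name.length !== 4) {
    throw new RangeError('box type must be exactly 4 characters');
  }
  // Uint8Array.from iterates the string and maps each character to its code.
  return Uint8Array.from(name, ch => ch.charCodeAt(0));
}

console.log(typeToBytes('ftyp')); // the bytes 0x66, 0x74, 0x79, 0x70
```

Note that some names are padded with a space to reach 4 characters, for example 'url '.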

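Put together, a box is a header (size, type, then largesize and usertype when present) followed by its body.

One thing to stress: MP4 writes everything in big-endian byte order by default. So every multi-byte field mentioned above, 4B or 8B, is written big-endian.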
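
To make the size rules and the byte order concrete, here is a minimal sketch of reading a box header with DataView, which reads big-endian by default. The function name readBoxHeader and the shape of its return value are assumptions made for this example:

```javascript
// Read the header of the box that starts at `offset` inside `bytes` (a Uint8Array).
// readBoxHeader is a hypothetical helper, not part of the original article's code.
function readBoxHeader(bytes, offset = 0) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let size = view.getUint32(offset); // DataView getters are big-endian by default
  const type = String.fromCharCode(...bytes.subarray(offset + 4, offset + 8));
  let headerSize = 8;

  if (size === 1) {
    // A 64-bit largesize follows the type field.
    const high = view.getUint32(offset + 8);
    const low = view.getUint32(offset + 12);
    size = high * 2 ** 32 + low; // exact as long as the value stays below 2^53
    headerSize = 16;
  } else if (size === 0) {
    // The box extends to the end of the file (here: the end of the buffer).
    size = bytes.byteLength - offset;
  }
  return { type, size, headerSize };
}
```

The body of the box then spans from offset + headerSize to offset + size.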

As mentioned earlier, a box has two basic forms.

The second one is the full box.

full box

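The main difference between a full box and a box is the added version and flags fields. It is used less widely than the basic box, mainly by boxes inside trak. Its basic format is: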

aligned(8) class FullBox(unsigned int(32) boxtype, unsigned int(8) v, bit(24) f) extends Box(boxtype) {
    unsigned int(8) version = v;
    bit(24) flags = f;
}

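In practice, if you have no use case for version and flags, you can simply set them to their defaults, i.e. 0x00. Structurally, a full box is a basic box header followed by a 1-byte version and 3-byte flags, and then the body.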
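
A minimal sketch of those 4 extra bytes follows. The helper name fullBoxPrefix is an assumption for this example; its result would be placed at the start of a full box's payload:

```javascript
// Build the version (1 byte) + flags (3 bytes, big-endian) prefix of a full box.
// fullBoxPrefix is a hypothetical helper name.
function fullBoxPrefix(version = 0, flags = 0) {
  return new Uint8Array([
    version & 0xff,
    (flags >>> 16) & 0xff,
    (flags >>> 8) & 0xff,
    flags & 0xff,
  ]);
}

console.log(fullBoxPrefix());     // all zeros: the usual default
console.log(fullBoxPrefix(0, 1)); // e.g. the flags value 1 that vmhd uses
```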

In an actual remux, the box is the smallest unit of composition. For example, the smallest basic box can be constructed in JS like this:

MP4.box = function (type) {
  let boxLength = 8; // include the total 8 byte length of size and type

  let buffers = Array.prototype.slice.call(arguments, 1);

  buffers.forEach(val => {
    boxLength += val.byteLength;
  });

  let boxBuffer = new Uint8Array(boxLength);
  // the first four byte stands for boxLength
  boxBuffer[0] = (boxLength >> 24) & 0xff;
  boxBuffer[1] = (boxLength >> 16) & 0xff;
  boxBuffer[2] = (boxLength >> 8) & 0xff;
  boxBuffer[3] = boxLength & 0xff;

  // the second four byte is box's type
  boxBuffer.set(type, 4);

  let offset = 8; // the byteLength of type and size

  buffers.forEach(val => {
    boxBuffer.set(val, offset);
    offset += val.byteLength;
  })

  return boxBuffer;

}

A typical call looks like this:

// MP4.symbolValue.FTYP is a concrete buffer
MP4.box(MP4.types.ftyp, MP4.symbolValue.FTYP);

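Next, let's take a proper look at the boxes MP4 actually uses.

We follow the box hierarchy laid out in the MP4 specification.

A note: we only cover the boxes marked with an asterisk there. The others are not mandatory, so we selectively ignore them. Even so, there are quite a few starred boxes. Since our main goal is to generate an MP4 file, keep in mind that a normal MP4 file does not actually need every starred box.

A playable MP4 file comes in one of two flavours: unfragmented MP4 (MP4 for short) and fragmented MP4 (fMP4 for short). So what is the difference between the two?

They are completely different, because the way each one locates the media stream for playback follows an entirely different model.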

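MP4 format

The basic boxes are ftyp, moov and mdat.

That is the bare minimum. A fuller layout, limited to the boxes covered in this article, looks like:

- ftyp
- moov
  - mvhd
  - trak
    - tkhd
    - mdia
      - mdhd
      - hdlr
      - minf
        - vmhd / smhd
        - dinf
          - dref
            - url
        - stbl
          - stts, stsc, stsz, stco, ctts
- mdat

MP4 uses the basic boxes under stbl inside trak (stts, stsc and so on) to index into the mdat box. So what is fMP4, then?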

Non-standard (fMP4): often used to generate files with a single trak.
- ftyp
- moov
- moof
- mdat
Standard (MP4): used to generate files containing multiple traks.
- ftyp
- moov
- mdat

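At first glance the non-standard layout just has one more box. In actual encoding and decoding, though, the standard layout demands much more attention to how the sub-boxes under stbl (stts, stco, ctts and friends) are encoded. The non-standard layout does not care about stbl; the data that would have lived in stbl is simply moved into moof. And the format inside moof is far simpler to produce than stbl. So below we go through the boxes of each of the two layouts in turn.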
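
To tell the two layouts apart in practice, one option is to walk the top-level boxes and look for a moof. The sketch below assumes the whole file is available as a Uint8Array; listTopLevelBoxes and isFragmented are names invented for this example, and checking for moof is only one possible heuristic:

```javascript
// List the types of the top-level boxes in `bytes` (a Uint8Array holding the file).
function listTopLevelBoxes(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const types = [];
  let offset = 0;
  while (offset + 8 <= bytes.byteLength) {
    let size = view.getUint32(offset); // big-endian
    const type = String.fromCharCode(...bytes.subarray(offset + 4, offset + 8));
    if (size === 1) {
      if (offset + 16 > bytes.byteLength) break; // truncated largesize
      size = view.getUint32(offset + 8) * 2 ** 32 + view.getUint32(offset + 12);
    } else if (size === 0) {
      size = bytes.byteLength - offset; // last box, runs to end of file
    }
    types.push(type);
    if (size < 8) break; // malformed size; stop instead of looping forever
    offset += size;
  }
  return types;
}

// Heuristic: treat the file as fragmented if any top-level moof is present.
function isFragmented(bytes) {
  return listTopLevelBoxes(bytes).includes('moof');
}
```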

Standard MP4 boxes

ftyp

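The ftyp box is essentially the mission statement of the MP4 file: it tells the decoder which base version to decode with and which formats the file is compatible with. In short, it tells the client which decoding standard this MP4 uses. ftyp is usually placed at the very beginning of the file.

Its format is: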

aligned(8) class FileTypeBox
   extends Box('ftyp') {
   unsigned int(32)  major_brand;
   unsigned int(32)  minor_version;
   unsigned int(32) compatible_brands[];
}

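All of the fields above go into the data (body) part of the box; see the box description earlier.

major_brand: compatibility can roughly be split into recommended and default compatibility, and major_brand is the recommended one. For decoding on the Web, the catch-all isom is usually enough. If you need a specific format, set it accordingly.
minor_version: the minimum compatible version.
compatible_brands: similar to major_brand; it usually lists the additional formats the MP4 contains, such as the audio/video codec formats AVC and AAC.

Rather than pile up concepts, code is more concrete. Here is how a general-purpose ftyp box is created.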

FTYP: new Uint8Array([
    0x69, 0x73, 0x6F, 0x6D, // major_brand: isom
    0x0, 0x0, 0x0, 0x1, // minor_version: 0x01
    0x69, 0x73, 0x6F, 0x6D, // isom
    0x61, 0x76, 0x63, 0x31 // avc1
  ])
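
The same bytes can also be derived from the brand names instead of being typed by hand. This is only a sketch; ftypPayload is a hypothetical helper name and every brand is assumed to be exactly 4 ASCII characters:

```javascript
// Build the body of an ftyp box from brand strings.
// ftypPayload is a made-up helper name for illustration.
function ftypPayload(majorBrand, minorVersion, compatibleBrands) {
  for (const brand of [majorBrand, ...compatibleBrands]) {
    if (brand.length !== 4) throw new RangeError('brands must be 4 characters');
  }
  const out = new Uint8Array(8 + 4 * compatibleBrands.length);
  const view = new DataView(out.buffer);
  const putBrand = (brand, offset) => {
    for (let i = 0; i < 4; i++) out[offset + i] = brand.charCodeAt(i);
  };
  putBrand(majorBrand, 0);         // major_brand
  view.setUint32(4, minorVersion); // minor_version, big-endian
  compatibleBrands.forEach((brand, i) => putBrand(brand, 8 + 4 * i));
  return out;
}

// Produces the same 16 bytes as the FTYP constant above.
const ftypBody = ftypPayload('isom', 1, ['isom', 'avc1']);
```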

moov

The moov box exists mainly as an important container; its own content is not what matters. moov mostly holds the trak boxes. Its basic format is:

aligned(8) class MovieBox extends Box('moov'){ }

mvhd

mvhd is the first box under moov and describes the media as a whole. Its contents are:

aligned(8) class MovieHeaderBox extends FullBox('mvhd', version, 0) {
   if (version==1) {
      unsigned int(64)  creation_time;
      unsigned int(64)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(64)  duration;
   } else { // version==0
      unsigned int(32)  creation_time;
      unsigned int(32)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(32)  duration;
   }
   template int(32)  rate = 0x00010000; // typically 1.0
   template int(16)  volume = 0x0100;   // typically, full volume
   const bit(16)  reserved = 0;
   const unsigned int(32)[2]  reserved = 0;
   template int(32)[9]  matrix =
      { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 };
      // Unity matrix
   bit(32)[6]  pre_defined = 0;
   unsigned int(32)  next_track_ID;
}

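version: usually 0 by default.
creation_time: creation time, in seconds counted from 1904.
timescale: the time scale. Together with duration it yields the actual time.
duration: the duration, in units determined by timescale. Actual time: duration / timescale = xx seconds.
rate: the playback rate, a 16.16 fixed-point number; 0x00010000 is 1.0.
volume: the volume, an 8.8 fixed-point number; 0x0100 is full volume (1.0).
matrix: the video transformation matrix. I don't really understand it either; the unity matrix from the spec is what you normally write.
next_track_ID: must be greater than the largest track_ID in use. Usually you can just fill in some very large value.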
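
The two time-related rules can be illustrated with a small sketch. The helper names are invented for this example; the constant is the number of seconds between 1904-01-01 and the Unix epoch (1970-01-01), both taken as UTC:

```javascript
// Seconds between 1904-01-01T00:00:00Z and 1970-01-01T00:00:00Z.
const SECONDS_1904_TO_1970 = 2082844800;

// Convert a JS Date into an MP4 creation_time / modification_time value.
// toMp4Time is a hypothetical helper name.
function toMp4Time(date) {
  return Math.floor(date.getTime() / 1000) + SECONDS_1904_TO_1970;
}

// Convert a duration expressed in timescale units into seconds.
function durationInSeconds(duration, timescale) {
  return duration / timescale;
}

console.log(toMp4Time(new Date(0)));        // 2082844800
console.log(durationInSeconds(90000, 1000)); // 90
```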

In fact, most of the values in mvhd can be hard-coded:

new Uint8Array([
        0x00, 0x00, 0x00, 0x00, // version(0) + flags
        0x00, 0x00, 0x00, 0x00, // creation_time
        0x00, 0x00, 0x00, 0x00, // modification_time
        (timescale >>> 24) & 0xFF, // timescale: 4 bytes
        (timescale >>> 16) & 0xFF,
        (timescale >>> 8) & 0xFF,
        (timescale) & 0xFF,
        (duration >>> 24) & 0xFF, // duration: 4 bytes
        (duration >>> 16) & 0xFF,
        (duration >>> 8) & 0xFF,
        (duration) & 0xFF,
        0x00, 0x01, 0x00, 0x00, // Preferred rate: 1.0
        0x01, 0x00, 0x00, 0x00, // PreferredVolume(1.0, 2bytes) + reserved(2bytes)
        0x00, 0x00, 0x00, 0x00, // reserved: 4 + 4 bytes
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x01, 0x00, 0x00, // ----begin composition matrix----
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x01, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x40, 0x00, 0x00, 0x00, // ----end composition matrix----
        0x00, 0x00, 0x00, 0x00, // ----begin pre_defined 6 * 4 bytes----
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, // ----end pre_defined 6 * 4 bytes----
        0xFF, 0xFF, 0xFF, 0xFF // next_track_ID
    ]);

trak

The trak box mainly holds the content of one media stream. Its basic format is just a plain box:

aligned(8) class TrackBox extends Box('trak') { }

Sometimes, though, it can also carry a description of that media stream.

tkhd

tkhd is a direct child of the trak box. It describes the properties of that particular trak. Its contents are:

aligned(8) class TrackHeaderBox
   extends FullBox('tkhd', version, flags) {
   if (version==1) {
      unsigned int(64)  creation_time;
      unsigned int(64)  modification_time;
      unsigned int(32)  track_ID;
      const unsigned int(32)  reserved = 0;
      unsigned int(64)  duration;
   } else { // version==0
      unsigned int(32)  creation_time;
      unsigned int(32)  modification_time;
      unsigned int(32)  track_ID;
      const unsigned int(32)  reserved = 0;
      unsigned int(32)  duration;
   }
   const unsigned int(32)[2]  reserved = 0;
   template int(16) layer = 0;
   template int(16) alternate_group = 0;
   template int(16) volume = {if track_is_audio 0x0100 else 0};
   const unsigned int(16)  reserved = 0;
   template int(32)[9]  matrix =
      { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 };
      // unity matrix
   unsigned int(32) width;
   unsigned int(32) height;
}

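That is a lot of fields, but not all of them need meaningful values. Briefly:

creation_time: creation time; optional.
modification_time: modification time; optional.
track_ID: the ID of the track being described.
duration: how long the content of this track lasts. Usually computed together with timescale.
layer: rarely matters. It is normally used for layering video traks.
alternate_group: alternative track sources. 0 means no alternative source is specified for this track; a non-zero value means the track belongs to a group of multiple sources.
volume: the volume. Full volume is 1 (0x0100).
width and height: the width and height of the video, stored as 16.16 fixed-point numbers.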
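
Because width and height are 16.16 fixed-point values, the pixel size has to be shifted into the upper 16 bits before it is written. A minimal sketch, with helper names chosen for this example only:

```javascript
// Convert a number into an unsigned 16.16 fixed-point integer.
function toFixed16_16(value) {
  return Math.round(value * 65536) >>> 0;
}

// Serialize an unsigned 32-bit integer as 4 big-endian bytes.
function u32be(value) {
  return new Uint8Array([
    (value >>> 24) & 0xff,
    (value >>> 16) & 0xff,
    (value >>> 8) & 0xff,
    value & 0xff,
  ]);
}

// A 1280-pixel-wide track stores 0x05000000 in the width field.
const widthBytes = u32be(toFixed16_16(1280));
```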

mdia

mdia mainly wraps the media-related information. There is not much to say about the box itself; its format is:

aligned(8) class MediaBox extends Box('mdia') { }

mdhd

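mdhd and tkhd carry largely the same content. The difference is that tkhd sets properties for a given track, whereas mdhd sets them for the standalone media. In practice the two generally hold the same values.

The format is: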

aligned(8) class MediaHeaderBox extends FullBox('mdhd', version, 0) {
   if (version==1) {
      unsigned int(64)  creation_time;
      unsigned int(64)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(64)  duration;
   } else { // version==0
      unsigned int(32)  creation_time;
      unsigned int(32)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(32)  duration;
   }
   bit(1) pad = 0;
   unsigned int(5)[3] language; // ISO-639-2/T language code
   unsigned int(16) pre_defined = 0;
}

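It has just 3 extra fields: pad, language and pre_defined.

Their names say it all:

pad: padding, usually 0.
language: the language of the current trak. The field is 15 bits long in total (three 5-bit characters), and together with pad it makes up 2 bytes.
pre_defined: defaults to 0.

In actual code it is computed like this: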

new Uint8Array([
    0x00, 0x00, 0x00, 0x00, // version(0) + flags
    0x00, 0x00, 0x00, 0x00, // creation_time
    0x00, 0x00, 0x00, 0x00, // modification_time
    (timescale >>> 24) & 0xFF, // timescale: 4 bytes
    (timescale >>> 16) & 0xFF,
    (timescale >>> 8) & 0xFF,
    (timescale) & 0xFF,
    (duration >>> 24) & 0xFF, // duration: 4 bytes
    (duration >>> 16) & 0xFF,
    (duration >>> 8) & 0xFF,
    (duration) & 0xFF,
    0x55, 0xC4, // language: und (undetermined)
    0x00, 0x00 // pre_defined = 0
  ])
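
Where do the bytes 0x55, 0xC4 for und come from? Each of the three lowercase letters is stored as its character code minus 0x60, in 5 bits. The following sketch shows the packing; packLanguage is a name made up for this example:

```javascript
// Pack a three-letter lowercase ISO-639-2/T code into the 2-byte pad + language field.
// packLanguage is a hypothetical helper name.
function packLanguage(code) {
  if (!/^[a-z]{3}$/.test(code)) {
    throw new RangeError('expected three lowercase letters');
  }
  // Each letter becomes a 5-bit value: character code minus 0x60.
  const bits = [...code].reduce(
    (acc, ch) => (acc << 5) | (ch.charCodeAt(0) - 0x60),
    0
  );
  // The top bit (pad) stays 0.
  return new Uint8Array([(bits >> 8) & 0xff, bits & 0xff]);
}

console.log(packLanguage('und')); // bytes 0x55, 0xC4
```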

hdlr

hdlr sets how each trak is handled. The common handler types are:

vide : Video track
soun : Audio track
hint : Hint track
meta : Timed Metadata track
auxv : Auxiliary Video track

This plays much the same role as the Content-Type field we set when fetching or receiving a resource, for example application/javascript.

Its basic format is:

aligned(8) class HandlerBox extends FullBox('hdlr', version = 0, 0) {
unsigned int(32) pre_defined = 0;
unsigned int(32) handler_type;
const unsigned int(32)[3] reserved = 0;
string   name;
}

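Two of the fields deserve a note:

handler_type: the handler type of the trak in question, i.e. one of the vide, soun, hint values listed above.
name: a name. It is meant for humans rather than machines, so anything that describes the track clearly will do.

The value written into handler_type is simply the string converted to hex. For example:

vide is 0x76, 0x69, 0x64, 0x65
soun is 0x73, 0x6F, 0x75, 0x6E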
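
Putting the hdlr fields together, a payload could be assembled roughly as below. hdlrPayload is a hypothetical helper, and the name is written as a null-terminated UTF-8 string:

```javascript
// Build the body of an hdlr box (everything after the 8-byte box header).
// hdlrPayload is a made-up helper name for illustration.
function hdlrPayload(handlerType, name) {
  if (handlerType.length !== 4) throw new RangeError('handler_type must be 4 characters');
  const nameBytes = new TextEncoder().encode(name);
  // 4 (version + flags) + 4 (pre_defined) + 4 (handler_type) + 12 (reserved) + name + 1 (terminator)
  const out = new Uint8Array(24 + nameBytes.length + 1);
  for (let i = 0; i < 4; i++) out[8 + i] = handlerType.charCodeAt(i);
  out.set(nameBytes, 24);
  // All other bytes, including the final terminator, are already 0.
  return out;
}

const videoHandler = hdlrPayload('vide', 'VideoHandler');
```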

minf

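minf is an important container box among the children; it holds the basic descriptive information of the current track. There is not much to say about the box itself; its basic format is:

aligned(8) class MediaInformationBox extends Box('minf') { }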

v/smhd

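v/smhd are the boxes that describe the current trak: vmhd is for video, smhd for audio. In practice these two boxes should not be left out (it sometimes depends on the player); without them the file may be considered malformed.

Let's look at the basic format of vmhd first: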

aligned(8) class VideoMediaHeaderBox
   extends FullBox('vmhd', version = 0, 1) {
template unsigned int(16) graphicsmode = 0; // copy, see below 
template unsigned int(16)[3] opcolor = {0, 0, 0};
}

This one is simple, all default values, so I won't dwell on it.

The format of smhd is just as simple:

aligned(8) class SoundMediaHeaderBox
   extends FullBox('smhd', version = 0, 0) {
   template int(16) balance = 0;
   const unsigned int(16)  reserved = 0;
}

The balance field relates to the usual left-channel/right-channel setting.

balance: an 8.8 fixed-point number; 0 is center, 1.0 is right, -1.0 is left.

dinf

dinf describes where, within a trak, the media information is located. It is really just a container with nothing much in it:

aligned(8) class DataInformationBox extends Box('dinf') { }

dref

dref sets the data_entry items that describe where the data lives. Its basic format is:

aligned(8) class DataReferenceBox
   extends FullBox('dref', version = 0, 0) {
   unsigned int(32)  entry_count;
   for (i=1; i <= entry_count; i++) {
    DataEntryBox(entry_version, entry_flags) data_entry; }
}

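The DataEntryBox here is either a DataEntryUrlBox or a DataEntryUrnBox. Put simply, it is one of the two child boxes of dref: url or urn. entry_version and entry_flags need a word of explanation.

entry_version: indicates the format of the current entry.
entry_flags: not a fixed value, but one value is special: 0x000001 means the media data is in the same file as the moov that contains this data reference.

That said, I have never actually needed a dref with real data in it, so I won't elaborate further.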

url

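The url box is a direct child wrapped by dref. It holds descriptive information for different samples, though it is normally just carried along inside other boxes. Its basic format is:

aligned(8) class DataEntryUrlBox (bit(24) flags) extends FullBox('url ', version = 0, flags) {
    string location;
}

I have never actually used the location field, so it is generally not needed.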

stts

stts mainly stores refSampleDelta, that is, the time interval between two adjacent frames. Its basic format is:

aligned(8) class TimeToSampleBox
   extends FullBox('stts', version = 0, 0) {
   unsigned int(32)  entry_count;
      int i;
   for (i=0; i < entry_count; i++) {
      unsigned int(32)  sample_count;
      unsigned int(32)  sample_delta;
   }
}

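The syntax alone does not reveal much, so consider an actual dump. Suppose a track has 14 frames and the decode delta of every one of them is 10. That value is sample_delta, and sample_count is the number of consecutive samples sharing that sample_delta. Here the delta 10 occurs 14 times, so sample_count is 14.

In terms of an RTMP Video Msg, sample_delta is the timestamp delta found in the RTMP header of the following message.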
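
The merging of equal deltas is a plain run-length encoding. A minimal sketch, where buildSttsEntries and the property names sampleCount and sampleDelta are chosen for this example:

```javascript
// Run-length encode per-sample decode deltas into stts entries.
// buildSttsEntries is a hypothetical helper name.
function buildSttsEntries(deltas) {
  const entries = [];
  for (const delta of deltas) {
    const last = entries[entries.length - 1];
    if (last && last.sampleDelta === delta) {
      last.sampleCount += 1; // same delta as the previous sample: extend the run
    } else {
      entries.push({ sampleCount: 1, sampleDelta: delta });
    }
  }
  return entries;
}

// 14 frames, each 10 time units apart, collapse into a single entry.
console.log(buildSttsEntries(new Array(14).fill(10)));
// [ { sampleCount: 14, sampleDelta: 10 } ]
```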

stco

stco is a crucial box inside stbl. It defines where each chunk, and therefore the samples in it, is located in mdat. Its basic format is:

aligned(8) class ChunkOffsetBox
   extends FullBox('stco', version = 0, 0) {
   unsigned int(32) entry_count;
   for (i=1; i <= entry_count; i++) {
      unsigned int(32)  chunk_offset;
   }
}

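stco comes in two forms. If your video is very large, a chunk offset may exceed the 32-bit limit. So for large videos there is an additional box, co64. It does the same job as stco, indicating where samples sit in the mdat box, except that its chunk_offset is 64 bits wide.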

aligned(8) class ChunkLargeOffsetBox extends FullBox('co64', version = 0, 0) {
   unsigned int(32) entry_count;
   for (i=1; i <= entry_count; i++) {
      unsigned int(64)  chunk_offset;
   }
}

stsc

The stsc box is a little convoluted, not because it has many fields but because the meaning of its fields is somewhat odd. Its basic format is:

aligned(8) class SampleToChunkBox
   extends FullBox('stsc', version = 0, 0) {
   unsigned int(32) entry_count;
   for (i=1; i <= entry_count; i++) {
      unsigned int(32) first_chunk;
      unsigned int(32) samples_per_chunk;
      unsigned int(32) sample_description_index;
   }
}

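The crux is its three fields: first_chunk, samples_per_chunk and sample_description_index.

first_chunk: the chunk at which each entry starts.
samples_per_chunk: how many samples each chunk contains.
sample_description_index: the description for each sample. It can usually default to 1.

Together these three fields determine how many chunks an MP4 has and how many samples each chunk holds. A quick word on chunks and samples: in an MP4 file, the basic unit that gets located is the chunk, not the sample.

sample: a slice containing the smallest unit of data. It holds the actual NAL data.
chunk: a group of samples, one after another. It exists to optimise data reads and make I/O more efficient.

If the field list above made immediate sense to you, you are either an expert or bluffing. The official documentation reads the same way; after one read I was lost, and after a second read I was still lost. So here is some extra explanation.

As said, the basic unit in MP4 is the chunk, and the chunk_offset fields defined in stco describe where the chunks sit in mdat. Each stco chunk_offset corresponds to the chunk with that index. first_chunk, then, defines the chunk at which a given entry begins.

Does stsc then have to define every single chunk?

No, because stsc defines a whole entry: as long as samples_per_chunk and sample_description_index stay the same, all following chunks follow the same pattern.

That is, if your stsc contains only: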

first_chunk: 1
samples_per_chunk: 4
sample_description_index: 1

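then, starting from the first chunk, every 4 samples are grouped into a chunk, and the description index of every sample is 1. This scheme continues to the very end. Of course, if your sample count is not divisible by 4, the last few samples are handled as a special case.

Usually, though, the values in stsc vary from entry to entry. Take entries whose (first_chunk, samples_per_chunk) pairs are (1, 2), (2, 1), (5, 2) and (6, 1): chunk 1 contains 2 samples, chunks 2-4 contain 1 sample each, chunk 5 contains 2 samples, and chunks 6 through the last contain 1 sample each.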
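
The expansion from stsc entries to a per-chunk sample count can be sketched as follows. The function name, the property names, and passing the total chunk count in as a parameter (in a real file it would come from stco) are assumptions of this example:

```javascript
// Expand stsc entries into "number of samples" for each chunk (chunks are 1-based).
// samplesPerChunk is a hypothetical helper name.
function samplesPerChunk(entries, chunkCount) {
  const result = [];
  entries.forEach((entry, i) => {
    // An entry applies until the chunk before the next entry's first_chunk,
    // or until the last chunk if it is the final entry.
    const end = i + 1 < entries.length ? entries[i + 1].firstChunk - 1 : chunkCount;
    for (let chunk = entry.firstChunk; chunk <= end; chunk++) {
      result.push(entry.samplesPerChunk);
    }
  });
  return result;
}

const entries = [
  { firstChunk: 1, samplesPerChunk: 2 },
  { firstChunk: 2, samplesPerChunk: 1 },
  { firstChunk: 5, samplesPerChunk: 2 },
  { firstChunk: 6, samplesPerChunk: 1 },
];
console.log(samplesPerChunk(entries, 8)); // [2, 1, 1, 1, 2, 1, 1, 1]
```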

ctts

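ctts exists mainly because of B-frames in video. In other words, if your video has no B-frames, the ctts structure is very simple. Its job is to record the cts offset of every sample, i.e. the gap between its composition time and its decode time. The format is: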

aligned(8) class CompositionOffsetBox extends FullBox('ctts', version = 0, 0) {
    unsigned int(32) entry_count;
      int i;
   for (i=0; i < entry_count; i++) {
      unsigned int(32)  sample_count;
      unsigned int(32)  sample_offset;
   }
}

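An example makes this clearer. List the frames of your video in decode order together with each frame's composition offset. sample_offset is exactly that composition offset, and merging consecutive samples that share the same composition offset gives the corresponding sample_count.

If you are dealing with video from RTMP, which has no B-frames, the whole ctts boils down to a single sample_count and sample_offset. For example: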

sample_count: 100
sample_offset: 0
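
The merging described above can be sketched like this, assuming each sample is given with its dts and pts in timescale units. buildCttsEntries and the property names are made up for this example:

```javascript
// Derive ctts entries from samples listed in decode order.
// The composition offset of a sample is pts - dts.
function buildCttsEntries(samples) {
  const entries = [];
  for (const { dts, pts } of samples) {
    const sampleOffset = pts - dts;
    const last = entries[entries.length - 1];
    if (last && last.sampleOffset === sampleOffset) {
      last.sampleCount += 1; // same offset as the previous sample: extend the run
    } else {
      entries.push({ sampleCount: 1, sampleOffset });
    }
  }
  return entries;
}

// Without B-frames pts equals dts, so everything collapses into one entry.
const noBFrames = Array.from({ length: 100 }, (_, i) => ({ dts: i * 10, pts: i * 10 }));
console.log(buildCttsEntries(noBFrames)); // [ { sampleCount: 100, sampleOffset: 0 } ]
```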

Usually only video tracks need ctts.

stsz

stsz stores the size of every sample. Its basic format is:

aligned(8) class SampleSizeBox extends FullBox('stsz', version = 0, 0) {
unsigned int(32) sample_size;
unsigned int(32) sample_count;
if (sample_size==0) {
    for (i=1; i <= sample_count; i++) {
          unsigned int(32)  entry_size;
    } 
    }
}

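Not much to say here: if every sample has the same size, sample_size holds it and no table follows; otherwise sample_size is 0 and one entry_size is listed per sample.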
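
A sketch of writing an stsz payload along those lines; stszPayload is a hypothetical helper name, and the compact single-size form is only used when all sizes are equal and non-zero:

```javascript
// Build the body of an stsz box from a list of sample sizes (in bytes).
function stszPayload(sizes) {
  const uniform = sizes.length > 0 && sizes[0] > 0 && sizes.every(s => s === sizes[0]);
  const tableLength = uniform ? 0 : sizes.length;
  const out = new Uint8Array(12 + 4 * tableLength);
  const view = new DataView(out.buffer);
  // Bytes 0..3 stay zero: version(0) + flags(0).
  view.setUint32(4, uniform ? sizes[0] : 0); // sample_size
  view.setUint32(8, sizes.length);           // sample_count
  if (!uniform) {
    sizes.forEach((size, i) => view.setUint32(12 + 4 * i, size)); // entry_size table
  }
  return out;
}
```

For example, stszPayload([100, 100, 100]) yields a 12-byte payload with sample_size 100, while stszPayload([5, 7]) yields 20 bytes with sample_size 0 followed by the two entries.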

fragmented MP4

That covers all of the standard boxes. Much of fMP4 overlaps with the standard MP4 format, so I won't repeat those parts and will only pick out what differs.

mvex

mvex is a standard box of fMP4. It tells the decoder that this is an fMP4 file: the sample information is no longer stored in trak but in each moof. Its basic format is:

aligned(8) class MovieExtendsBox extends Box('mvex'){ }

trex

trex is a direct child of mvex and sets default values for the samples of an fMP4. Its contents are:

aligned(8) class TrackExtendsBox extends FullBox('trex', 0, 0){
    unsigned int(32) track_ID;
    unsigned int(32) default_sample_description_index;
    unsigned int(32) default_sample_duration;
    unsigned int(32) default_sample_size;
    unsigned int(32) default_sample_flags 
}

Which values to set depends on what your application needs. If you really don't know, you can simply set them to 0:

new Uint8Array([
        0x00, 0x00, 0x00, 0x00, // version(0) + flags
        (trackId >>> 24) & 0xFF, // track_ID
        (trackId >>> 16) & 0xFF,
        (trackId >>> 8) & 0xFF,
        (trackId) & 0xFF,
        0x00, 0x00, 0x00, 0x01, // default_sample_description_index
        0x00, 0x00, 0x00, 0x00, // default_sample_duration
        0x00, 0x00, 0x00, 0x00, // default_sample_size
        0x00, 0x01, 0x00, 0x01 // default_sample_flags
    ])

moof

moof holds the content specific to fMP4. There is not much to the box itself:

aligned(8) class MovieFragmentBox extends Box('moof'){
}

tfhd

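tfhd mainly sets defaults for a given trak, for example the duration, size and offset of its samples. All of these can be left out, as long as the other boxes are filled in completely:

aligned(8) class TrackFragmentHeaderBox extends FullBox('tfhd', 0, tf_flags){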
    unsigned int(32) track_ID;
// all the following are optional fields
 unsigned int(64) base_data_offset;
 unsigned int(32) sample_description_index;
 unsigned int(32) default_sample_duration;
 unsigned int(32) default_sample_size;
 unsigned int(32) default_sample_flags
}

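base_data_offset is used when computing the data offsets that follow. If it is present it is used; otherwise offsets are taken directly from the relevant start position.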
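
Since every field after track_ID is optional, tf_flags has to announce which ones are written. The sketch below only handles default_sample_duration, whose presence bit in the ISO base media file format is 0x000008; tfhdPayload is a hypothetical helper name:

```javascript
// tf_flags bit announcing that default_sample_duration is present.
const TFHD_DEFAULT_SAMPLE_DURATION_PRESENT = 0x000008;

// Build the body of a tfhd box with track_ID and, optionally, default_sample_duration.
function tfhdPayload(trackId, defaultSampleDuration) {
  const hasDuration = defaultSampleDuration !== undefined;
  const flags = hasDuration ? TFHD_DEFAULT_SAMPLE_DURATION_PRESENT : 0;
  const out = new Uint8Array(hasDuration ? 12 : 8);
  const view = new DataView(out.buffer);
  view.setUint32(0, flags);   // top byte is version 0, low 24 bits are tf_flags
  view.setUint32(4, trackId); // track_ID
  if (hasDuration) view.setUint32(8, defaultSampleDuration);
  return out;
}
```

With no optional field, tfhdPayload(1) is just 8 bytes: four zero bytes followed by the track ID.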

tfdt

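tfdt stores the absolute decode time of the samples. Because fMP4 is a streaming format, you cannot seek straight to a position from the sample tables as you can with MP4. A reference time is needed to quickly locate a specific fragment.

Its basic format is:

aligned(8) class TrackFragmentBaseMediaDecodeTimeBox extends FullBox('tfdt', version, 0) {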
if (version==1) {
    unsigned int(64) baseMediaDecodeTime; 
} else { // version==0
    unsigned int(32) baseMediaDecodeTime;
    }
}

baseMediaDecodeTime is the sum of the durations of all preceding samples of the given track_ID; it is effectively the dts of the first sample in the current traf.
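
As a sketch of both halves of that statement: accumulate the durations of the earlier fragments, then write the value as a version 1 (64-bit) tfdt. Both helper names are invented for this example, and values are assumed to stay below 2^53 so that plain JS numbers remain exact:

```javascript
// Given the total duration of each fragment of one track (in timescale units),
// return the baseMediaDecodeTime of every fragment.
function baseDecodeTimes(fragmentDurations) {
  const times = [];
  let elapsed = 0;
  for (const duration of fragmentDurations) {
    times.push(elapsed); // dts of the first sample in this fragment
    elapsed += duration;
  }
  return times;
}

// Build the body of a version 1 tfdt box.
function tfdtPayload(baseMediaDecodeTime) {
  const out = new Uint8Array(12);
  const view = new DataView(out.buffer);
  out[0] = 1; // version 1; the 3 flag bytes stay 0
  view.setUint32(4, Math.floor(baseMediaDecodeTime / 2 ** 32)); // upper 32 bits
  view.setUint32(8, baseMediaDecodeTime % 2 ** 32);             // lower 32 bits
  return out;
}

console.log(baseDecodeTimes([3000, 3000, 2500])); // [0, 3000, 6000]
```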

trun

trun stores the sample information for this moof, for example each sample's size, duration and offset. Its contents are:

aligned(8) class TrackRunBox
   extends FullBox('trun', version, tr_flags) {
   unsigned int(32) sample_count;
   // the following are optional fields
   signed int(32) data_offset;
   unsigned int(32) first_sample_flags;
   // all fields in the following array are optional
   {
      unsigned int(32)  sample_duration;
      unsigned int(32)  sample_size;
      unsigned int(32)  sample_flags
      if (version == 0)
         { unsigned int(32) sample_composition_time_offset; }
      else
         { signed int(32) sample_composition_time_offset; }
   }[ sample_count ]
}

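The fields of trun are arguably the most important identifying fields in traf:

tr_flags indicates which of the sample-related fields below are present:

0x000001: data-offset-present; the data_offset field is present.
0x000004: first_sample_flags is present and applies only to the first sample; the flags of the remaining samples are left alone.
0x000100: this one matters: every sample has its own duration; otherwise the default is used.
0x000200: every sample has its own sample_size; otherwise the default is used.
0x000400: every sample has its own flags; otherwise the default is used.
0x000800: every sample has its own cts value.

A brief look at the remaining fields:

data_offset: the number of bytes from the start of this moof to the actual data in the accompanying mdat. That amounts to moof.byteLength + mdat.headerSize.
sample_count: how many samples there are in total.
first_sample_flags: applies to the first sample only. It can generally default to 0.

I won't go through the other fields, except for one: sample_flags is very important and is commonly used to indicate which sample is a key frame. It is computed like this: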

(flags.isLeading << 2) | flags.dependsOn, // sample_flags
(flags.isDepended << 6) | (flags.hasRedundancy << 4) | flags.isNonSync
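
The two expressions above produce the first two of the four sample_flags bytes. A self-contained sketch follows; the property names come from the snippet above, while sampleFlagsBytes and isKeyFrame are helper names invented here, and the padding and degradation-priority bits are simply left at 0:

```javascript
// Serialize one sample_flags value (32 bits) into 4 bytes.
function sampleFlagsBytes(flags) {
  return new Uint8Array([
    (flags.isLeading << 2) | flags.dependsOn,
    (flags.isDepended << 6) | (flags.hasRedundancy << 4) | flags.isNonSync,
    0x00, 0x00, // sample_degradation_priority, left at 0 in this sketch
  ]);
}

// A key frame is a sync sample, i.e. the non-sync bit is not set.
function isKeyFrame(bytes) {
  return (bytes[1] & 0x01) === 0;
}

const keyFrame = sampleFlagsBytes({ isLeading: 0, dependsOn: 2, isDepended: 1, hasRedundancy: 0, isNonSync: 0 });
const otherFrame = sampleFlagsBytes({ isLeading: 0, dependsOn: 1, isDepended: 0, hasRedundancy: 0, isNonSync: 1 });
console.log(isKeyFrame(keyFrame), isKeyFrame(otherFrame)); // true false
```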

sdtp

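sdtp describes, for each sample, whether it is an I-frame, whether it is a leading frame, and similar properties. It mainly serves as synchronisation reference information for on-demand playback. It has 4 fields:

is_leading: whether the sample is a leading sample.
- 0: the leading property of this sample is unknown (commonly used)
- 1: this sample is a leading sample and cannot be decoded
- 2: this sample is not a leading sample
- 3: this sample is a leading sample and can be decoded
sample_depends_on: whether the sample is an I-frame.
- 0: it is unknown whether this sample depends on other frames
- 1: this sample is a B/P frame
- 2: this sample is an I-frame
- 3: reserved
sample_is_depended_on: whether other frames depend on this one.
- 0: unknown whether it is depended on (typically B/P)
- 1: depended on (typically an I-frame)
- 2: not depended on by any other sample
- 3: reserved
sample_has_redundancy: whether there is redundant coding.
- 0: unknown whether there is redundancy
- 1: has redundant coding
- 2: no redundant coding
- 3: reserved

The overall format is: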

aligned(8) class SampleDependencyTypeBox extends FullBox('sdtp', version = 0, 0) {
  for (i=0; i < sample_count; i++){
    unsigned int(2) is_leading;
    unsigned int(2) sample_depends_on; 
    unsigned int(2) sample_is_depended_on; 
    unsigned int(2) sample_has_redundancy;
  } 
}

sdtp matters a lot for video, because its fields are designed mainly for video frames. For audio, the defaults are generally used directly:

isLeading: 0,
dependsOn: 1, 
isDepended: 0,
hasRedundancy: 0
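
Since the four 2-bit fields of a sample fit into exactly one byte, an sdtp payload can be sketched like this. sdtpPayload is a hypothetical helper name and the property names follow the defaults listed above:

```javascript
// Build the body of an sdtp box: 4 bytes of version + flags, then one byte per sample.
function sdtpPayload(samples) {
  const out = new Uint8Array(4 + samples.length);
  samples.forEach((sample, i) => {
    out[4 + i] =
      (sample.isLeading << 6) |
      (sample.dependsOn << 4) |
      (sample.isDepended << 2) |
      sample.hasRedundancy;
  });
  return out;
}

const sdtp = sdtpPayload([
  { isLeading: 0, dependsOn: 2, isDepended: 1, hasRedundancy: 0 }, // I-frame
  { isLeading: 0, dependsOn: 1, isDepended: 0, hasRedundancy: 0 }, // B/P frame, or the audio default
]);
console.log(sdtp); // bytes 0, 0, 0, 0, 0x24, 0x10
```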

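That wraps up the walkthrough of MP4 and fMP4. For more detail, see the MP4 & FMP4 doc.

Of course, this only scratches the surface. Knowing what the boxes contain is not enough to do real audio/video processing. Much more of the work lies in the fundamentals of audio and video, such as dts/pts, audio/video synchronisation, packaging video into boxes, and so on.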