I2S on ESP32 – Part 2, WAVs

In this article we follow on from the previous one on I2S and get our software to play back some music. We’ll look at the WAV file format and at how we can get the data out of it and send it over I2S. If you haven’t read or watched the previous episode, doing so is recommended though not essential; it covers the circuit wiring and explains digital sound and I2S in detail. The video for this article is below, and time codes for its various parts are in the video description if you want to jump straight to a particular topic. You’ll find the write-up below the video.

Watch the video for a full explanation covering more than what’s written here.

Parts List
Here’s the parts list for what you are seeing, along with affiliate links for these items.

ESP32 : https://amzn.to/3kb02n8
ESP32 – Pack of three : https://amzn.to/2XfIRqH
I2S Decoder : MAX98357A : https://amzn.to/3fkHEnU
3W Speaker : https://amzn.to/2XeRP7i
Breadboards : https://amzn.to/30fWibZ
Wires : https://amzn.to/3k4PKoC
Pins and sockets : https://amzn.to/39O7a3K (I2S board doesn’t come with them)

WAV File Format
The WAV file format is fairly simple. Here’s the basic structure diagram:

This is the start of the file, or what is termed the header. A WAV file’s structure follows a format called RIFF (Resource Interchange File Format), a generic container format that stores data in what are termed tagged chunks. It’s primarily used to store multimedia such as sound and video, though it can store any arbitrary data. Don’t worry too much about the terminology; as you’ll see, at the end of the day we are just dealing with bytes in a file, nothing more.

As you can see, it’s divided into three sections: the RIFF, format and data sections. The byte offset is the position in the file (or in memory) where each field starts.

RIFF section: the RIFF ID is stored at offset 0, i.e. right at the start of the file. It is literally the letters R, I, F, F and takes up four bytes, so the next field starts at offset 4. The next important one for us is the “Format” at offset 8. This should contain the letters W, A, V, E if it’s a WAVE file. When we look at the code you’ll see that we check these values and reject the file if they are not correct.
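As a rough illustration of those two checks, here is a minimal sketch. The function name isRiffWave is made up for this example and is not taken from the demo sketches; it assumes the start of the file is already in a byte buffer.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: returns true if the buffer starts with a RIFF header
// whose format field says "WAVE". Offsets follow the layout described above:
// "RIFF" at offset 0 and "WAVE" at offset 8.
bool isRiffWave(const uint8_t *buf, size_t len)
{
    if (buf == nullptr || len < 12) {
        return false;  // too short to hold both IDs
    }
    // The IDs are raw ASCII bytes with no terminating zero, so compare
    // exactly four bytes with memcmp rather than using string functions.
    return memcmp(buf, "RIFF", 4) == 0 &&
           memcmp(buf + 8, "WAVE", 4) == 0;
}
```

Calling isRiffWave(wavData, wavLength) before doing anything else gives an early way to reject data that isn’t a WAVE file at all.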

Format section: the ID is four bytes long, holding the letters f, m, t followed by a space. We then make some other checks. At offset 20 we have the audio format of our WAVE file. We only accept uncompressed PCM, so this should be 1 for us; anything else is rejected.

Channels, for our purposes, should only be the number 1 or 2, for mono or stereo sound. This field is two bytes long, so in theory a file could have up to 65535 channels if we wished. Sample rate is the number of samples per second at which we should play back, which is very important if the sound is to run at the correct speed. The next important part for us is the Bits Per Sample. This should only be one of the four types shown, and our program will look at it to ensure it retrieves the correct number of bytes per sample. At the moment, for simplicity, our code will only support 16-bit data.
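The format checks described above could be sketched roughly as follows. The struct WavFormat and the helpers readLE16, readLE32 and parseFormat are illustrative names, not the ones used in the demo sketches. The offsets assume the simple 44-byte header layout from the diagram, in which the multi-byte fields are stored least significant byte first and Bits Per Sample sits at offset 34.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// WAV fields are stored little-endian (least significant byte first).
static uint16_t readLE16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t readLE32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Hypothetical container for the format fields we care about.
struct WavFormat {
    uint16_t audioFormat;    // offset 20: 1 = uncompressed PCM
    uint16_t channels;       // offset 22: 1 = mono, 2 = stereo
    uint32_t sampleRate;     // offset 24: samples per second
    uint16_t bitsPerSample;  // offset 34: 16 is all we support for now
};

// Fills 'out' and returns true only for files this simple player accepts.
bool parseFormat(const uint8_t *buf, size_t len, WavFormat *out)
{
    // The format ID is "fmt" followed by a space, at offset 12.
    if (len < 36 || memcmp(buf + 12, "fmt ", 4) != 0) {
        return false;  // missing or misplaced format section
    }
    out->audioFormat   = readLE16(buf + 20);
    out->channels      = readLE16(buf + 22);
    out->sampleRate    = readLE32(buf + 24);
    out->bitsPerSample = readLE16(buf + 34);

    return out->audioFormat == 1 &&
           (out->channels == 1 || out->channels == 2) &&
           out->bitsPerSample == 16;
}
```

Reading the fields byte by byte, rather than casting the buffer to a struct, avoids any dependence on struct padding or on the byte order of the processor running the code.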

Data section: this is normally the largest part of the file, as it contains the actual digitised sound. The ID at offset 36 must be the letters “data”. The data size is the size of the sample data in bytes, and knowing this along with the sample rate, the number of channels and the Bits Per Sample we can calculate the running time of the sound in seconds.
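The running time calculation is just the data size divided by the number of bytes consumed per second. A minimal sketch, using the made-up function name wavSeconds and taking the channel count into account as well:

```cpp
#include <stdint.h>

// Running time in seconds = data bytes / bytes consumed per second,
// where bytes per second = sample rate x channels x bytes per sample.
float wavSeconds(uint32_t dataSize, uint32_t sampleRate,
                 uint16_t channels, uint16_t bitsPerSample)
{
    uint32_t bytesPerSecond = sampleRate * channels * (bitsPerSample / 8u);
    if (bytesPerSecond == 0) {
        return 0.0f;  // malformed header, avoid dividing by zero
    }
    return (float)dataSize / (float)bytesPerSecond;
}
```

For example, 16-bit stereo at 44100 samples per second uses 44100 x 2 x 2 = 176400 bytes per second, so a data size of 176400 bytes is one second of sound.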

And finally, at offset 44, the actual data starts. How this is laid out depends on the Bits Per Sample and the Channels. If we had mono sound at, say, 16 bits per sample, then the first two bytes would be the first sample and the next two the second sample, as shown:

However, if there were 2 channels (i.e. stereo) and the Bits Per Sample was 24, then the first 3 bytes would be the value of the first left sample and the next 3 the value of the first right sample. That pattern would repeat until the end of the data, as shown:
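To make the two layouts concrete, here is a small sketch that works out where a given sample sits and turns its bytes into a number. The names sampleOffset, readSample16 and readSample24 are invented for this example. It assumes samples are stored least significant byte first as signed values, which is the case for 16-bit and 24-bit PCM.

```cpp
#include <stddef.h>
#include <stdint.h>

// Byte position, relative to the start of the sample data, of the sample
// for channel 'ch' (0 = left or mono, 1 = right) in frame number 'frame'.
// A frame is one sample for every channel.
size_t sampleOffset(size_t frame, uint16_t ch,
                    uint16_t channels, uint16_t bitsPerSample)
{
    size_t bytesPerSample = bitsPerSample / 8u;
    size_t bytesPerFrame = bytesPerSample * channels;
    return frame * bytesPerFrame + ch * bytesPerSample;
}

// Reads one signed 16-bit little-endian sample.
int16_t readSample16(const uint8_t *p)
{
    return (int16_t)(uint16_t)(p[0] | (p[1] << 8));
}

// Reads one signed 24-bit little-endian sample, sign-extended to 32 bits.
int32_t readSample24(const uint8_t *p)
{
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16);
    if (v & 0x800000u) {
        v |= 0xFF000000u;  // top bit set means negative, so fill upper byte
    }
    return (int32_t)v;
}
```

With 16-bit mono the second sample is at sampleOffset(1, 0, 1, 16), which is 2. With 24-bit stereo the first right sample is at sampleOffset(0, 1, 2, 24), which is 3, and the second left sample at sampleOffset(1, 0, 2, 24), which is 6.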

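It’s worth noting that the offset of 44 applies to the simple WAVE layout shown here, where a 16-byte PCM format section is followed directly by the data section. WAVE files that carry extra chunks (metadata, for example) or a longer format section will have their data start later, and for other types of RIFF data it probably won’t be at 44 either.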
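Some WAV files written by audio tools include extra chunks before the data section, in which case a fixed offset is not enough. A more defensive reader could walk the chunks after the 12-byte RIFF header until it finds the one labelled “data”. The sketch below shows the idea; findDataChunk is a hypothetical name and this is not how the demo sketches do it.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Walks the chunks that follow the 12-byte RIFF header and returns the
// offset of the first sample byte, or 0 if no "data" chunk is found.
// On success *dataSize receives the size stored in the chunk header.
size_t findDataChunk(const uint8_t *buf, size_t len, uint32_t *dataSize)
{
    size_t pos = 12;  // first chunk ID follows "RIFF", the size and "WAVE"
    while (pos + 8 <= len) {
        // Each chunk starts with a 4-byte ID and a 4-byte little-endian size.
        uint32_t size = (uint32_t)buf[pos + 4] |
                        ((uint32_t)buf[pos + 5] << 8) |
                        ((uint32_t)buf[pos + 6] << 16) |
                        ((uint32_t)buf[pos + 7] << 24);
        if (memcmp(buf + pos, "data", 4) == 0) {
            *dataSize = size;
            return pos + 8;  // samples start right after the chunk header
        }
        size_t remaining = len - pos - 8;
        if (size > remaining) {
            break;  // chunk claims more bytes than we have
        }
        // Skip this chunk; chunks are padded to an even number of bytes.
        pos += 8 + (size_t)size + (size & 1u);
    }
    return 0;
}
```

For a file with the simple layout described above this returns 44, so it behaves the same as the fixed offset in the common case.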

The Code – Demo 1
Here’s the first demo code shown in the video; download and extract the zip to get the sketch folder. It only supports 16-bit stereo sound but is very simple.

Demo 2 : Supports mono or stereo sound.

Both code examples are deliberately simple so that you can see what is going on. In the future we will support different bits per sample and so on, and the complexity will increase, but you will be able to see the principle of what we are doing in these cut-back examples. The description of the code is in the video, as it is far easier to follow there than in writing. You can use the time codes in the video description to jump straight to what you need.

Putting In Your Own Sounds
Obviously you’ll want to put in your own sounds. If you’ve watched the video then you’ll know just how big good-quality WAV sounds can be. For simple effects this is doable within the ESP32’s built-in memory, but for music it can be challenging without reducing the quality too much. We will present a solution to this in the next article, using an SD card, which will provide ample storage. For now, if you want to add your own WAVs to the ESP32’s internal memory, follow the instructions in the video; again, time codes are in the description.
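To get a feel for how quickly uncompressed sound eats memory, the size can be estimated from the same header values discussed earlier. A minimal sketch, with the made-up name wavBytes and assuming the simple 44-byte header:

```cpp
#include <stdint.h>

// Approximate size in bytes of an uncompressed PCM WAV of the given length:
// the 44-byte header plus one second's worth of bytes for every second.
uint32_t wavBytes(uint32_t seconds, uint32_t sampleRate,
                  uint16_t channels, uint16_t bitsPerSample)
{
    return 44u + seconds * sampleRate * channels * (bitsPerSample / 8u);
}
```

Ten seconds of 16-bit stereo at 44100 samples per second gives wavBytes(10, 44100, 2, 16), which is 1,764,044 bytes, while the same ten seconds as 16-bit mono at 16000 samples per second is 320,044 bytes. That is why short effects fit comfortably and music does not.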

The Next Article
So, as mentioned, we will add an SD card to our system and learn how to read and play WAVs from there. Till then…