In the first of a two-part feature, Pyro Studios tech lead Jesus de Santos Garcia presents a guide for loading streaming data quickly off of media, such as a computer's hard drive.

April 19, 2007

Author: Jesus de Santos Garcia

Reading data efficiently from hard disk and DVD units is vital for video games, and it is one of the most important problems to solve in the next generation of games. While we are getting roughly a 20x improvement in processing power and memory size, we are only getting about a 4x improvement in data devices (the DVD drive, in the case of consoles).

In this article I describe how to efficiently read raw data from disk (HDD, DVD), oriented towards streaming files in a real-time application (although the concepts are useful in other areas). The platform used is Win32, but all the topics covered can easily be ported to other platforms.

I have included a Visual Studio 2005 project with all the code described here and the framework used to test the different techniques. You can download it here. The machine where I ran the tests is a 3.2GHz Pentium 4. The devices used for testing are:

  • A 7200rpm Hard Disk with an average read performance of 46.8 MB/s (measured using Hd Tach)

  • A DVD-ROM unit (using DVD+RW media) with a peak read performance of 2.40 MB/s (measured using Nero CD-DVD Speed)

Windows (like operating systems in general) uses the part of physical memory not being used by processes for file caching (you can see how much memory the file system cache is using in the Task Manager). To prevent Windows from caching my test files, I implemented a function that flushes the cache by reading big files before measuring times. The tests were executed 10 times, each one lasting several minutes.

So let's start the journey. Objective: get a 100MB file into main memory as fast as possible.

1. The Standard and Portable way

The first option is using portable code from the standard C library. We all know the benefits of portability. So we allocate a buffer and fread() the file.

FILE *fp = fopen(FileName, "rb");
fread(&g_buffer[0], 1, FileSize, fp);
fclose(fp);

Stats    Min (MB/s)    Max (MB/s)    Average (MB/s)
HDD      47.847        48.828        48.527
DVD      2.381         2.386         2.383

2. The Win32 Native way

To try to improve on the standard approach, we go for the native Win32 file functions: CreateFile and ReadFile.

HANDLE hFile = CreateFile(FileName, GENERIC_READ, 0, 0, OPEN_EXISTING,
                          FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, 0);

DWORD dwNumberOfBytesRead = 0;
ReadFile(hFile, &g_buffer[0], FileSize, &dwNumberOfBytesRead, 0);

CloseHandle(hFile);

Stats    Min (MB/s)    Max (MB/s)    Average (MB/s)
HDD      47.483        48.852        48.497
DVD      2.383         2.390        2.386

Nearly the same performance. FILE_FLAG_SEQUENTIAL_SCAN directs the Cache Manager to access the file sequentially; it is recommended when reading large files with sequential access. I thought the FILE_FLAG_SEQUENTIAL_SCAN hint would give a bigger improvement than this, but evidently the fread implementation in Win32 is doing a good job.

3. File Memory Mapping

Our next approach is memory-mapped files, where the system reads the data from disk on demand using the same mechanism as virtual memory.

HANDLE hFile = CreateFile(FileName, GENERIC_READ, 0, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, 0);


HANDLE hFileMapping = CreateFileMapping(hFile, 0, PAGE_READONLY, 0, FileSize, 0);

int iPos = 0;
const unsigned int BlockSize = 128 * 1024;

while(iPos < FileSize)
{
    int iLeft = FileSize - iPos;
    int iBytesToRead = iLeft > BlockSize ? BlockSize : iLeft;

    void *rawBuffer = MapViewOfFile(hFileMapping, FILE_MAP_READ, 0, iPos, iBytesToRead);
    memcpy(&g_buffer[iPos], rawBuffer, iBytesToRead);
    UnmapViewOfFile(rawBuffer);

    iPos += iBytesToRead;
}

CloseHandle(hFileMapping);
CloseHandle(hFile);
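A detail worth knowing when mapping views in blocks: MapViewOfFile requires the file offset to be a multiple of the system allocation granularity (typically 64 KB on Windows; query it at runtime with GetSystemInfo). The 128 KB BlockSize above satisfies this automatically, since every block starts at a multiple of 128 KB. As a minimal sketch (the 64 KB constant is an assumption standing in for SYSTEM_INFO::dwAllocationGranularity), rounding an arbitrary offset down looks like this:

```cpp
#include <cassert>

// Assumed allocation granularity; on a real system query it with
// GetSystemInfo() and use SYSTEM_INFO::dwAllocationGranularity.
const unsigned int kGranularity = 64 * 1024;

// Round an arbitrary file offset down to the previous multiple of the
// granularity. Works because the granularity is a power of two.
unsigned int AlignedOffset(unsigned int offset)
{
    return offset & ~(kGranularity - 1);
}

// Extra bytes to skip from the start of the mapped view to reach
// the byte the caller actually asked for.
unsigned int DeltaInsideView(unsigned int offset)
{
    return offset - AlignedOffset(offset);
}
```

A view mapped at AlignedOffset(pos) must then be sized iBytesToRead + DeltaInsideView(pos), and the copy starts DeltaInsideView(pos) bytes into the view.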

Stats    Min (MB/s)    Max (MB/s)    Average (MB/s)
HDD      45.830        48.828        48.190
DVD      2.524         2.528        2.526

We get a significant improvement when reading from the DVD. Reading from the hard disk performs nearly the same.

4. Asynchronous I/O

Async I/O places disk requests in a queue for the disk controller and returns immediately. One of the best advantages of async I/O is that the disk is kept busy without continually entering and leaving kernel mode. It is important to note that we are bypassing the Win32 cache with FILE_FLAG_NO_BUFFERING, avoiding any unnecessary memory copy: the data moves directly into the application via DMA. I found it hard to use FILE_FLAG_OVERLAPPED (for async I/O) without FILE_FLAG_NO_BUFFERING; most of the time, Windows was not overlapping my disk requests.

I got the best results keeping 8 I/O requests active at all times. While those requests are being processed, the CPU is copying the already completed ones.

By the way, when using FILE_FLAG_NO_BUFFERING, you need to read into sector-aligned buffers. I am using VirtualAlloc for this purpose.

for(int i = 0; i < NumBlocks; i++)
{
    // VirtualAlloc() creates storage that is page aligned
    // and so is disk sector aligned
    blocks[i] = static_cast<char *>(VirtualAlloc(0, BlockSize, MEM_COMMIT, PAGE_READWRITE));

    ZeroMemory(&overlapped[i], sizeof(OVERLAPPED));
    overlapped[i].hEvent = CreateEvent(0, FALSE, FALSE, 0);
}

HANDLE hFile = CreateFile(FileName, GENERIC_READ, 0, 0, OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL | FILE_FLAG_NO_BUFFERING |
FILE_FLAG_OVERLAPPED | FILE_FLAG_SEQUENTIAL_SCAN, 0);

int iWriterPos = 0;
int iReaderPos = 0;
int iIOPos = 0;
int iPos = 0;

do
{
    while(iWriterPos - iReaderPos != NumBlocks && iIOPos < FileSize)
    {
        overlapped[iWriterPos & NumBlocksMask].Offset = iIOPos;

        int iLeft = FileSize - iIOPos;
        int iBytesToRead = iLeft > BlockSize ? BlockSize : iLeft;

        const int iMaskedWriterPos = iWriterPos & NumBlocksMask;
        ReadFile(hFile, blocks[iMaskedWriterPos], iBytesToRead, 0,
                 &overlapped[iMaskedWriterPos]);

        iWriterPos++;
        iIOPos += iBytesToRead;
    }

    const int iMaskedReaderPos = iReaderPos & NumBlocksMask;

    WaitForSingleObject(overlapped[iMaskedReaderPos].hEvent, INFINITE);

    int iLeft = FileSize - iPos;
    int iBytesToRead = iLeft > BlockSize ? BlockSize : iLeft;

    memcpy(&g_buffer[iPos], blocks[iMaskedReaderPos], iBytesToRead);

    iReaderPos++;
    iPos += iBytesToRead;
}
while(iPos < FileSize);

CloseHandle(hFile);

for(int i = 0; i < NumBlocks; i++)
{
    VirtualFree(blocks[i], 0, MEM_RELEASE);
    CloseHandle(overlapped[i].hEvent);
}
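The loop above keeps its outstanding requests in a circular array, turning the monotonically growing reader and writer positions into array slots with `pos & NumBlocksMask`. This only works when NumBlocks is a power of two, so that the mask is NumBlocks - 1. A minimal sketch of that indexing invariant, isolated from the I/O:

```cpp
#include <cassert>

// Circular indexing as used by the reader/writer positions above:
// with NumBlocks a power of two, (pos & (NumBlocks - 1)) == pos % NumBlocks.
const int NumBlocks = 8;             // must be a power of two
const int NumBlocksMask = NumBlocks - 1;

// Map a monotonically increasing position to a slot in the block array.
int Slot(int pos)
{
    return pos & NumBlocksMask;
}

// All slots hold pending reads when the writer is a full lap ahead.
bool QueueFull(int writerPos, int readerPos)
{
    return writerPos - readerPos == NumBlocks;
}
```

The writer only issues a new ReadFile while QueueFull() is false, which is exactly the `iWriterPos - iReaderPos != NumBlocks` condition in the loop.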

Stats    Min (MB/s)    Max (MB/s)    Average (MB/s)
HDD      48.239        48.852        48.700
DVD      2.384         2.408        2.399

We get the same hard disk performance as before, but we fall below memory-mapped files for the DVD. Still, this technique gives us something interesting: stability (less difference between the min and max values).

5. Data Compression

We can improve performance further if we put the CPU (mostly idle in the previous tests) to good use. Data compression gives the CPU work to do while the disk is reading and reduces the amount of data to be transferred. Ideally we should decompress faster than the I/O operation runs, so that we get the decompression for free (in the sample code, you can detect when the I/O has to wait for the CPU by activating the LOG_IO_STALLS macro).

I tried LZO and ZLIB to compress the test data. LZO is faster than ZLIB, but it doesn't allow streaming data, so I chose ZLIB. The 100MB file was compressed down to 64MB.
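Before looking at the numbers, it is worth computing the ceiling we can hope for. The device still moves only the compressed bytes, but the application receives the original bytes, so if decompression fully overlaps the I/O the effective rate scales by the compression ratio. A quick sketch of that upper bound:

```cpp
#include <cassert>

// Upper bound on effective read throughput when streaming compressed data:
// the device transfers compressedSize bytes while the application receives
// originalSize bytes, so the rate scales by originalSize / compressedSize
// (assuming decompression never stalls the disk).
double EffectiveRate(double deviceRate, double originalSize, double compressedSize)
{
    return deviceRate * (originalSize / compressedSize);
}
```

For the hard disk, EffectiveRate(46.8, 100.0, 64.0) is about 73.1 MB/s, a theoretical +56% over the raw device rate; the measured results below come close to, but stay under, that ceiling.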

Let’s see how each test benefits from the compression:

HardDisk     Min (MB/s)    Max (MB/s)    Average (MB/s)    Improvement %
ANSIC        59.067        69.348        67.558            +39%
W32          67.797        69.396        68.446            +41%
MMapped      52.247        53.562        52.980            +10%
Async I/O    68.634        69.832        69.242            +42%


DVD          Min (MB/s)    Max (MB/s)    Average (MB/s)    Improvement %
ANSIC        2.469         2.476         2.472             +3%
W32          3.437         3.467         3.455             +44%
MMapped      3.713         3.724         3.720             +47%
Async I/O    3.464         3.475         3.470             +44%

Definitely, compression is a win. We almost reach the theoretical improvement (56% for a 64/100 compression ratio) in all the tests except memory mapping on the hard disk (while on the DVD, memory mapping gives the best performance).

6. Conclusion

We have reviewed several ways to load files from disk and discovered that compression gives us a performance boost. Async I/O with compression is my recommended way to load files as fast as possible:

  • Easy to code. We don’t have to deal with threads

  • Although it is not the best in every scenario, it gives good average performance in all of them

  • The disk is always busy without continually entering and leaving kernel mode. This is vital when reading from devices such as DVD, where seek times are costly, and really important when you stream lots of files from DVD. (Although if you are reading from DVD you should be bundling your data files, but that is another story.)

Although async I/O doesn't need another thread, I recommend doing the async reads in a separate thread in charge of decompressing and parsing, while the main thread consumes the already loaded items.
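That division of labor can be sketched with a minimal producer/consumer queue. This uses C++11 std::thread for brevity (a Win32 thread plus an event would play the same role), and the loader's "read + decompress" step is a stand-in for the real async I/O and ZLIB calls:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hands finished (read + decompressed) blocks from a loader thread to the
// main thread through a mutex-protected queue.
struct BlockQueue
{
    std::queue<std::vector<char> > blocks;
    std::mutex mutex;
    std::condition_variable ready;
    bool done = false;

    void Push(std::vector<char> block)
    {
        std::lock_guard<std::mutex> lock(mutex);
        blocks.push(std::move(block));
        ready.notify_one();
    }

    bool Pop(std::vector<char> &block)   // returns false once the loader is done
    {
        std::unique_lock<std::mutex> lock(mutex);
        ready.wait(lock, [this] { return !blocks.empty() || done; });
        if (blocks.empty())
            return false;
        block = std::move(blocks.front());
        blocks.pop();
        return true;
    }

    void Finish()
    {
        std::lock_guard<std::mutex> lock(mutex);
        done = true;
        ready.notify_all();
    }
};

// A stand-in loader produces numBlocks blocks of blockSize bytes while the
// calling (main) thread consumes them; returns total bytes consumed.
size_t RunLoader(int numBlocks, size_t blockSize)
{
    BlockQueue queue;
    std::thread loader([&] {
        for (int i = 0; i < numBlocks; ++i)
            queue.Push(std::vector<char>(blockSize)); // async read + decompress here
        queue.Finish();
    });

    size_t total = 0;
    std::vector<char> block;
    while (queue.Pop(block))                          // main thread consumes items
        total += block.size();

    loader.join();
    return total;
}
```

The main thread never blocks on the disk itself, only on the queue, which stays full as long as decompression keeps up with the I/O.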

And that is all for this article. We have a fast technique for reading data from files. The second part of this article will be about preparing the data so that it can be processed as fast as possible (using async I/O and compression, of course). The trick? Avoiding preprocessing and parsing altogether: in-place data loading.

You can discuss this article in the following blog entry: http://entland.homelinux.com/blog/2006/10/25/reading-files-as-fas-as-possible/

Thanks for reading.
