How (and why) we ported SHENZHEN SOLITAIRE to MS-DOS

Can two programmers accustomed to making games for modern computers with gigabytes of RAM and high-color HD displays port one of their games to MS-DOS? Neither of us had any experience developing on such old hardware, but we felt compelled to try!

Zach Barth, Blogger

August 28, 2017

20 Min Read

Keith Holman and Zach Barth are developers at the game studio Zachtronics. If you like obscurely technical topics like writing games for 25-year-old computers, you’ll probably like our “puzzle games for programmers,” including SpaceChem, Infinifactory, TIS-100, and SHENZHEN I/O!

Getting Started

Can two programmers who are accustomed to making games for modern computers with gigabytes of RAM and high-color HD displays port one of their games to MS-DOS? Neither of us had any experience developing on such old hardware, but since working within artificially limited systems is something of a Zachtronics game design specialty, we felt compelled to try!

The project began when Zach created a mockup of SHENZHEN SOLITAIRE, the solitaire minigame from our game SHENZHEN I/O (also available as a standalone game), as it might look running on a 256-color VGA display:

It looks just like you would find on a PC in the early 90’s! From there it was simply a matter of writing the code. Right?

The Development Environment

First we had to figure out how to write anything that would run on an ancient DOS computer. Our target hardware was an appropriately vintage IBM-compatible PC from Zach’s collection:

Intel 80386SX CPU at 20 MHz
5 MB of RAM
VGA card
3½ inch floppy disk drive
Serial mouse
MS-DOS version 6.22

The only reasonable choice of programming language for a machine of this age was C. We weren’t going to write the entire game in x86 assembly language! After considering various choices of tools, we settled on Borland C++ 3.1, published in 1992. Running on DOSBox, Borland C++ provided a convenient and accurate emulation of the target machine.

Graphics

Computers with VGA graphics had a couple different drawing modes. Mode 13h was a popular and easy option: 320 x 200 pixel resolution with 256 colors. A more difficult choice was to use the unofficial “Mode X”, which had a higher resolution of 320 x 240 pixels. There are many tutorials on how to use Mode 13h on the internet, and it’s straightforward to program: the 320x200 pixels are represented by a 64,000 byte array, with each byte representing one pixel. Mode X is more obscure and difficult to use, but has certain advantages: Its resolution is 320 by 240, which is one of the few VGA resolutions with square pixels. It also supports drawing with 256 colors chosen from a palette, the most colors you can use in any VGA mode. Because we wanted better graphics and because we enjoy making things hard for ourselves, we decided to use Mode X.

So now we have a graphics mode with lots of pixels and colors (for a computer of that time, anyway). What are we going to draw? The most important categories were:

Full-screen backgrounds (desktop wallpaper, the green card-playing table)
Cards (including card icons and card numbers in three suit colors)
Text (for various buttons and labels)

We already had high-resolution, high-color versions of all these assets from the original version of SHENZHEN SOLITAIRE, but they needed to be converted to a much lower resolution and to use no more than 256 colors in total. There was no clever trick for this conversion process, just a few hours of work in Photoshop manually redrawing cards, symbols, and interface elements and scaling down the resolutions and color palettes of the backgrounds.

Here’s where the major downside of Mode X became relevant. Representing 320 x 240 pixels requires 76,800 bytes. VGA cards have 256 kilobytes of video memory, but divided into four “planes” of 64 kB each, which you can only access one at a time*. This works fine for Mode 13h which only requires 64,000 bytes, but Mode X has to split its video data across multiple planes.

*It’s a little more complicated than this: in some VGA modes, including Mode 13h, you can access multiple planes at once, but doing so comes with other downsides. Mode X limits you to one plane at a time.

At this point the graphics code started to get complicated, so we consulted the the Graphics Programming Black Book by Michael Abrash, the ultimate reference for all things VGA. As the Black Book explains, Mode X divides its pixel data across all four planes; each plane stores a quarter of the pixels, in an interleaved scheme. Plane 0 stores pixels 0, 4, 8, 12, etc. Plane 1 stores pixels 1, 5, 9, 13, and so on.

This is a classic situation in game programming: We know what output we need to produce, and the choice of how to structure the input data (in this case, the images) will have an enormous effect on the complexity and performance of the rendering code. Thankfully we didn’t have to figure it all out ourselves; Abrash has plenty of useful advice on efficiently drawing in Mode X. Since the entire screen is divided into the four planes, and it’s relatively slow to switch between planes, the most convenient (and fastest!) option is to break each image into four blocks of data so that each plane’s worth of data is in one contiguous chunk. This makes the code extremely simple, and also just about as fast as possible. Here’s the code that draws a full-screen (320 x 240) image into VGA memory:

// Code sample: Drawing an image to VGA memory
void far *vgaMemory = MAKE_FAR_POINTER(0xA000, 0x0000);
short bytesPerPlane = (320 / 4) * 240;
for (plane = 0; plane < 4; plane++)
{
SetCurrentVgaPlane(plane);
_fmemcpy(vgaMemory, imageData, bytesPerPlane);
imageData += bytesPerPlane;
}

This code snippet also demonstrates some of the quirks you have to deal with when writing a 16-bit DOS program. With 16-bit pointers, you can only directly address 64 kB of memory. (These are “near” pointers.) However, most DOS computers have much more than 64 kB of memory, and the addresses that correspond to VGA memory take up 64 kB already! In assembler, you deal with this using segment registers; in C, you typically use “far pointers”, which are a 32-bit pointer type that lets you address up to 1 megabyte. (Life becomes much easier once 32 bit processors show up, because then you can represent memory addresses up to 4 gigabytes without worrying about segment registers.)

Fortunately, 64 kB is quite a lot of memory, and nearly the entire game fits within that limit. There are only two parts of the code that require the use of far pointers. The first I’ve already mentioned: VGA memory is mapped to an entire 64 kB range at address 0xA0000. The second chunk was the image data. After converting all our art to low resolution and one byte-per-pixel, we had a total of about 250 kB of image data stored in one big file. Being more than 64 kB, this also required a far pointer. It also marked the one and only instance in the game’s code where we used dynamic memory allocation…

Memory Management

A major source of bugs and complexity in many C programs arises from managing dynamically allocated memory. In Shenzhen Solitaire, we kept things simple: we knew exactly how many cards we needed to keep track of, so we simply allocated the memory for them up front. There were no malloc() calls that could fail, and no free() calls for us to forget. (The same strategy applies to state throughout the rest of the game.) Here’s what the solitaire engine’s state looks like:

// Code sample: Declaring state for the solitaire engine
struct CardOrCell
{
byte Suit;
byte Value;
byte CellType;
struct CardOrCell *Parent;
struct CardOrCell *Child;
};

CardOrCell Cards[NUM_CARDS];
CardOrCell FreeCells[NUM_FREE_CELLS];
CardOrCell FoundationCells[NUM_FOUNDATION_CELLS];
CardOrCell TableauCells[NUM_TABLEAU_CELLS];
CardOrCell FlowerCell;
CardOrCell *DraggedCard;

As mentioned previously, the one place where we couldn’t use this strategy was for image data, because that was much larger than 64 kB. Instead, we stored all the images in one big file (with the image data divided into four planes, as detailed above) and load it in at startup while the loading bar is displayed:

// Code sample: Loading image data from file
// In the actual source code, this is more complicated and has
// more robust error handling.
FILE *f = fopen("SHENZHEN.IMG", "rb"); assert(f);
long size = GetSizeOfFile(f);
ImageData = farmalloc(size); assert(ImageData);
long loaded = 0;
byte huge *dest = ImageData;
while (loaded < size)
{
long result = fread(dest, 1, 1024, f); assert(result != 0);
loaded += result;
dest += result;
// (update the loading bar)
}
fclose(f);

The memory allocated for image data is never freed since we use it for the entire duration of the program.

Here’s a fun detail of the loading code: In the loading screen we display a boot logo above the loading progress bar. However, we are drawing the loading screen before we’ve read in the image data! To work around this circular dependency, we added a special rule to the image-data-packing program to put the image data for the boot logo first. It’s loaded within the first second or two of starting the game, and then it can be shown for the rest of the boot process.

// Code sample: Drawing the boot screen logo
bool logoDrawn = false;
while (loaded < size)
{
...
// Once the boot screen background has loaded, draw it:
if (!logoDrawn && loaded > ImageOffsets[IMG_BOOT_BACKGROUND + 1])
{
Blit(IMG_BOOT_BACKGROUND, 101, 100);
logoDrawn = true;
}
}

Image Preprocessing

When drawing an image, how do you locate and use its data? All the images used in the game were stored in PNG files created with modern software. These PNGs were then processed by a special content baking program. The content baker transformed the PNGs into a format easily consumed by the 16-bit DOS program by producing three files: a large binary file that contained all the actual pixel data as palette indices, a smaller binary file containing the palette of 256 colors shared by all of the images, and a C header file containing metadata allowing images to be named and their data found. Here’s what the metadata looks like:

// Code sample: Image metadata, automatically generated by
// the content baking tool
#define IMG_BUTTON       112
#define IMG_BUTTON_HOVER   113
#define IMG_CARD_FRONT       114
// More image IDs...

long ImageOffsets[] = { 0L, 4464L, 4504L, 4544L, ... };
short ImageWidths[] = { 122, 5, 5, 5, ... };
short ImageHeights[] = { 36, 5, 5, 5, ... };

byte huge *ImageData = /* image data loaded from file */

For each PNG file processed by the content baker, it #defines an ID (such as IMG_CARD_FRONT) and records its width, height, and the location of its data. To draw an image, we call a drawing function such as Blit(IMG_CARD_FRONT, x, y). The drawing function then calls ImageInfo(IMG_CARD_FRONT, ...) to get the image’s data and metadata.

// Code sample: Look up metadata for an image
void ImageInfo(short imageID, short *w, short *h, byte huge **image)
{
assert(imageID >= 0 && imageID < IMAGE_COUNT);
*w = ImageWidths[imageID];
*h = ImageHeights[imageID];
*image = ImageData + ImageOffsets[imageID];
}

Low-Level Optimization: A Little Bit of Assembly

Once we finished implementing the graphics functionality, the solitaire gameplay logic came together quickly, and everything looked good… but it was very slow. As always, the only way to optimize effectively is to first profile your code. Borland C++ includes the Turbo Debugger, which is a remarkably useful profiler considering that it’s 25 years old:

In this case, the profiling results weren’t very surprising. The program was spending nearly all its time in drawing routines, copying image data into VGA memory. Copying hundreds of thousands of pixels per second on a 20 MHz 386 is a very demanding task! Examining the assembly generated by the Borland compiler made it clear that there was a lot of overhead slowing down the inner loops of our drawing routines:

While we wanted to stick to writing C as much as possible, it was clear that some assembly would be necessary when it came to optimizing the critical inner loops of drawing routines. Usage of assembly was common in 90’s game programming, when processors were slower and compiler optimizations were less clever. So we rewrote just the innermost parts of our drawing functions in inline assembly. This is the inline assembly code at the heart of the Blit() function, which is used for the majority of drawing in the game:

// Code sample: Drawing with inline assembly
byte far *vgaPtr = /* address of destination VGA memory */;
byte far *imgPtr = /* address of source image data */;
asm PUSHA
asm PUSH DS
asm LES DI, vgaPtr
asm LDS SI, imgPtr
asm MOV BX, h
asm CLD
each_row:
asm MOV CX, lineWidth
asm REP MOVSB // This instruction does all the work!
asm ADD DI, dstSkip
asm ADD SI, srcSkip
asm DEC BX
asm JNZ each_row
asm POP DS
asm POPA

There’s a lot of C code preceding this assembly code that calculates where to draw, where to read image data from, and how to clip the drawn image to the visible area, but the vast majority of actual execution time is spent in the instruction REP MOVSB. This instruction is the equivalent of C’s memcpy(), but natively supported by x86 processors, and is the fastest* way to copy memory on early x86 processors like the 386. Just like memcpy(), you specify a source pointer, a destination pointer, and a count (in registers) and then the processor automatically loops until that many bytes have been copied.

SHENZHEN SOLITAIRE for MS-DOS only uses inline assembly in three places, and they’re all very similar to this snippet of code. The use of inline assembly in these critical drawing functions increased the speed of these functions by a factor of 5-10x. Despite the dozens of lines of setup code, and all the other code that runs in the game, the vast majority of CPU time was spent simply copying bytes.

*Actually, REP MOVSW or REP MOVSD would have been faster, since they copy 2 or 4 bytes at a time instead of only 1. However, our version of Borland C++ only supported 16-bit instructions, which ruled out MOVSD. We also tried using MOVSW, but had some difficulty getting it to work correctly in all cases. Even if REP MOVSW had made the drawing routines fully twice as fast, that still wouldn’t have been enough to solve our higher-level performance issues.

Double Buffering

The next issue to solve was flickering.

In Mode X, 75 kB of VGA memory is required to describe a full screen of graphics. But the VGA’s 256 kB of memory has enough room to store three “screens” worth of graphics at once. In our code, we imaginatively labelled these Screen 0, Screen 1, and Screen 2. So far we were using only Screen 0. To implement double buffering, we used Screens 0 and 1; at any given time, one screen is the “frontbuffer”, what’s actually displayed, while the other acts as the “backbuffer,” the area on which the game draws new or changed elements. When we’re ready to display the new frame of graphics, we perform a “page flip,” a VGA feature that instantly changes what section of memory is displayed.

Double buffering solved the flickering problem, but the game was still too slow. Merely fine-tuning the low level drawing code wasn’t enough. As usual, this is anticipated by Abrash. A major theme of the Graphics Programming Black Book is that selecting appropriate algorithms and high-level program design yields bigger performance benefits than low-level assembly coding tricks.

High-Level Optimization: The Static Background

Our first observation was that, for most of the game, much of the screen remains the same from one frame to the next.

The mouse cursor was usually moving around, and perhaps a few cards were being dragged, but the background and most of the cards were completely unchanged. To take advantage of this, we allocated our last screen-full of VGA memory, Screen 2, to be the “static background.” We would draw any UI elements that rarely changed— which is almost everything except the mouse cursor— to Screen 2. Then, the drawing procedure for each frame would typically be:

Copy the static background into the backbuffer.
Draw the mouse (and any dragged cards) into the backbuffer.
Flip the backbuffer to be visible.

This avoids quite a bit of work on most frames, but copying the entire static background is still a lot of work. Worse, when the static background does change— for example, when a card is picked up with the mouse— the static background first has to be completely redrawn in addition to the other per-frame work. This meant that the game normally ran smoothly, but there was a noticeable and unacceptable hitch when clicking on cards or buttons.

At this point in development, we had two main performance costs: redrawing the static background (periodically), and copying the static background to the backbuffer (every frame). We found a simple solution to reduce the cost of redrawing the background: each stack of cards has its own region of the screen, and typically only one or two stacks need to be redrawn in any frame. This meant only about 10-20% of the static background had to be redrawn, which took proportionally less time.

High-Level Optimization: Dirty Rectangles

The final issue that prevented the game from running with an acceptably high framerate was the copy of the static background that occurred on every frame. Our solution to this was another classic technique from the time before hardware-accelerated graphics: dirty rectangle tracking. You might recognize “dirty rectangles” from Windows 95 whenever a program froze, a not so uncommon experience. When that happened, you could grab another window and drag it across the frozen program’s window until it was blank or covered in graphical garbage. The areas that you’d overwritten by dragging the window across them were “dirty”, and the frozen program was supposed to be redrawing them as soon as you moved the foreground window away.

We used dirty rectangles in a similar way. Whenever the player moves the mouse, or part of the static background changes, we add that region to the list of dirty rectangles. Then, instead of copying the entire static background into the backbuffer, we copy only the dirty rectangles. Once again, copying fewer pixels means that it takes less time to draw the frame, which improves the framerate.

With this change, the game finally ran smoothly on our test machine. We decided that most VGA-equipped computers would have higher specs than our decidedly sluggish 20 MHz 386, so we felt confident that the game’s performance was good enough to ship.

Input

Compared to graphics, all other aspects of the game were refreshingly straightforward. On IBM-compatible computers, most of the basic services you need are provided by the BIOS and by DOS through an interrupt-based API. These services include reading keyboard and mouse input, reading and writing from disks, opening files, and allocating memory. Here are some interrupts we used in Shenzhen Solitaire:

INT 16h: Keyboard API
INT 21h: DOS system calls
INT 33h: Mouse API

These APIs are intended to be used from assembly code, like so:

// Code sample: Calling BIOS functions with assembly
// Use INT 16h, 1h to get the last key pressed.
mov AL, 16h
mov AH, 01h
int 16h
// The zero flag will be set if a key was pressed.
jnz no_key_pressed
// Process the key input…
no_key_pressed:
// Continue the program...

Reading the mouse status works in a similar way; call “INT 33h, 3h” to get the current mouse position and button status. Compared to accessing hardware directly, or even using a modern OS API, this was incredibly easy.

Audio

Sound effects are an essential part of any game, but playing music and sound on very old PCs is challenging. The simplest option, present on all IBM-compatible PCs, is the “PC speaker”. This speaker is driven by a simple timer chip on the motherboard. This means that the speaker can very easily be configured to continuously produce a tone with a constant frequency. Some PCs also have dedicated sound cards, but they are much more difficult to use and less consistently available.

Given the limitations of the PC speaker, we only play short “music” clips at a few points during the game: startup, shutdown, and after winning a game. 386-era PCs offer few facilities for measuring time precisely, which makes it difficult to play even simple music on the PC speaker while simultaneously doing other things. Once again, we chose the simpler option: since our music clips are very short, we simply freeze the game for the one or two seconds it takes to play the musical effect. The code to define and play one of these clips is straightforward:

void PlayStartupMusic(void)
{
   sound(73);
   delay(220);
   sound(110);
   delay(220);
   sound(165);
   delay(220);
   sound(98);
   delay(220);
   sound(110);
   delay(220);
   sound(147);
   delay(440);
   nosound();
}

In addition to the musical clips, we also wanted to add a sound effect that plays when cards are picked up and dropped to make them feel more physical. This was also implemented with the PC speaker, by temporarily disabling the automatic timer-based tone generator and manually toggling it on and off once to produce a momentary “click”.

Ship It!

Porting SHENZHEN SOLITAIRE to MS-DOS was both easier and harder than we expected. Despite the huge disparity in CPU power between our target hardware (single-core Intel 80386SX at 20 MHz) and our usual hardware (quad-core Intel Core i7-4770K at 3500 MHz), the unoptimized gameplay logic ran lightning fast and had little impact on performance. The downside, of course, being that anything is lightning fast when it takes almost 150 milliseconds to draw a screen’s worth of pixels! In the process we learned a lot about a computer architecture that could now arguably be considered esoteric. Although it was inspiring and informative, we’re unlikely to include segment registers in any future Zachtronics game. We’re not that mean!

If you’re interested in playing the MS-DOS port of SHENZHEN SOLITAIRE and happen to be reading this before September 12th, 2017, consider checking out our Kickstarter project and ordering a copy of the game as part of our floppy-only release. The full source code is included!

About the Author(s)

Zach Barth

Blogger

Zach Barth runs the independent game studio Zachtronics Industries and worked on SpaceChem as the lead designer and producer.

See more from Zach Barth

Related Topics

Related Topics

Recent in More

Related Topics

Related Topics

How (and why) we ported SHENZHEN SOLITAIRE to MS-DOS

Getting Started

The Development Environment

Graphics

Memory Management

Image Preprocessing

Low-Level Optimization: A Little Bit of Assembly

Double Buffering

High-Level Optimization: The Static Background

High-Level Optimization: Dirty Rectangles

Input

Audio

Ship It!

About the Author(s)

Latest News

Trending

Featured Blogs