Optimizing a 6502 Image Decoder: From 70 Minutes to 1 Minute

The author set out to create a program for basic digital photography on the Apple II, using Apple's Quicktake cameras because of their serial port interface. The project evolved from decoding Quicktake 100 images to also supporting the Quicktake 150 and 200 formats, which involved significant image processing challenges on a 6502 processor running at 1MHz.

---

Context and Challenges

Quicktake 150 format: The format is proprietary and undocumented, so initial decoding efforts relied on the dcraw project’s C code. That code is complex and poorly documented, uses Huffman compression with variable-length codes, and requires heavy 16-bit math, all of which is difficult to handle efficiently on a 6502.

Initial result: A functional decoder that took 70 minutes to decode a single image.

The author’s approach prioritized algorithmic optimization over hand-tuning assembler: doing fewer things, rather than doing many things slightly faster.

---

Key Optimization Steps and Their Impact

Dropping Color

Decoding of blue and red pixels was removed; only the green pixels of the Bayer matrix were decoded. This reduced the instruction count from 301 million to around 264 million in x86_64 emulation.

Buffer Optimization

Temporary buffers were analyzed and minimized to reduce copying and looping. Nested loops working on small pixel bands (y [1,2], x [col+1,col]) were unrolled. Dropping the color logic entirely removed unused buffers and conditional complexity. Instead of writing an intermediary 640×480 Bayer matrix and interpolating it, the decoder directly produces a 320×240 grayscale image containing only the relevant pixels. This cut instructions from 238M to 25M in emulated runs.

Understanding Buffer Roles

Only one buffer (bufm[1]) turned out to be essential for image construction. The other buffers and loops were removed, and the image is constructed "on the fly" to avoid unnecessary passes. This made the code clearer and lowered the instruction count to about 22 million.

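The step from a 640×480 Bayer matrix to a directly produced 320×240 grayscale image can be pictured with a small sketch. This is not the author's code: `sample_stream`, `next_sample`, `skip_samples` and `decode_gray` are hypothetical names, the stream of already-decoded samples stands in for the real Huffman decoder, and the position of the kept green sample (even row, odd column of each 2×2 cell) is an assumption that may not match the actual Quicktake layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real decoder: yields raw samples in raster order. */
typedef struct {
    const uint8_t *data;
    size_t pos;
} sample_stream;

static uint8_t next_sample(sample_stream *s) { return s->data[s->pos++]; }

/* Consume n samples without producing any output pixel. */
static void skip_samples(sample_stream *s, size_t n) { s->pos += n; }

/*
 * Fill an out_w x out_h grayscale image from a (2*out_w) x (2*out_h) mosaic,
 * keeping one sample per 2x2 cell and writing it straight to the output.
 * ASSUMPTION: the kept green sample sits at (even row, odd column).
 */
static void decode_gray(sample_stream *s, uint8_t *out, size_t out_w, size_t out_h)
{
    assert(s != NULL && out != NULL);
    uint8_t *dst = out;                 /* advances linearly through the output */
    for (size_t y = 0; y < out_h; y++) {
        for (size_t x = 0; x < out_w; x++) {
            skip_samples(s, 1);         /* non-green sample: discarded */
            *dst++ = next_sample(s);    /* green sample: kept */
        }
        skip_samples(s, 2 * out_w);     /* odd mosaic row contributes nothing */
    }
}
```

With `out_w = 320` and `out_h = 240` this writes 76,800 output bytes and never materializes the full mosaic, which is the general idea behind building the image in a single pass.
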
Division Optimization

Pixel values required division by a factor that changes every two rows. A precomputed division lookup table storing pre-clamped final results turned 153,600 divisions into fewer than 2,000. On the 6502 this produced a huge speed gain, even though the instruction count on x86_64 stayed about the same.

Output Buffer Indexing

Slow buffer[y*WIDTH + x] accesses were replaced with simpler line-by-line indexing. This optimization alone reduced the instruction count by 2 million on x86_64 and drastically improved performance on the 6502, which lacks hardware multiplication.

Huffman Decoding Improvement

The original method used full tables for variable-length Huffman codes and 16-bit bit buffers. The optimized version reads bits one at a time, which slowed the x86_64 build slightly but sped up the 6502 implementation by 20 seconds (from 29s to 9s). It also allowed smaller tables, an important memory saving on such limited hardware.

---

Assembly-Level and Additional Optimizations

The final implementation is hand-optimized 6502 assembly, significantly more efficient than the cc65 compiler's output. Ad-hoc tricks include:

- Two division tables, handled differently based on common divisor factors, for speed.
- Multiplications by 255 replaced by the fast calculation (a<<8)-a.
- Lookup tables for shift operations such as <<4, split into two tables yet still faster than computing the shift.
- "Discarder" functions used when decoding Huffman codes that do not produce pixels (the blue and red pixels), saving processing.
- Buffer accesses patched through self-modifying code rather than slower indirect addressing, saving millions of cycles despite the added complexity.

---

Results and Code Availability

The optimized decoder reduced decoding time from about 70 minutes to 1 minute on the 6502 processor. The approach shows how focusing on algorithm simplification and minimal necessary processing yields far better results than aggressive