The Magic 0xC2

I built a web application with file upload functionality. Some Vue.js in the front and a CouchDB in the back. Everything should be pretty simple and straigt forward. But…
Last updated:

When I uploaded image files, they somehow got mangled. The uploaded file was bigger than the original and the new "file format" was not readable by any means. I got intrigued. What is it, that happens to the files? The changes seemed very random but reproducible, so I created a few test files to see what exactly changes and when.

My first file looked like this:

0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz

To my surprise, the file stayed the same! My curiosity grew. In the meantime I found a very intriguing pattern in uploads hexdump: C3 BF C3. It was everywhere. In another file, I found similar patterns with C2. So I wrote my next test file. This time a binary file:

00 01 02 03 04 05 06 07  08 09 10 11 12 13 14 15 |................|
16 17 18 19 20 21 22 23  24 25 26 27 28 29 30 31 |.... !"#$%&'()01|
32 33 34 35 36 37 38 39  40 41 42 43 44 45 46 47 |23456789@ABCDEFG|
48 49 50 51 52 53 54 55  56 57 58 59 60 61 62 63 |HIPQRSTUVWXY`abc|
64 65 66 67 68 69 70 71  72 73 74 75 76 77 78 79 |defghipqrstuvwxy|
80 81 82 83 84 85 86 87  88 89 90 91 92 93 94 95 |................|
96 97 98 99 a0 a1 a2 a3  a4 a5 a6 a7 a8 a9 aa ab |................|
ac ad ae af b0 b1 b2 b3  b4 b5 b6 b7 b8 b9 ba bb |................|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|

EDIT: As you probably already noticed, I counted up like in Base10 but it is actually Base16. So I skipped A-F until reaching A0. This might look weird but didn't affect the test.

The result after uploading was

00 01 02 03 04 05 06 07  08 09 10 11 12 13 14 15  |................|
16 17 18 19 20 21 22 23  24 25 26 27 28 29 30 31  |.... !"#$%&'()01|
32 33 34 35 36 37 38 39  40 41 42 43 44 45 46 47  |23456789@ABCDEFG|
48 49 50 51 52 53 54 55  56 57 58 59 60 61 62 63  |HIPQRSTUVWXY`abc|
64 65 66 67 68 69 70 71  72 73 74 75 76 77 78 79  |defghipqrstuvwxy|
c2 80 c2 81 c2 82 c2 83  c2 84 c2 85 c2 86 c2 87  |................|
c2 88 c2 89 c2 90 c2 91  c2 92 c2 93 c2 94 c2 95  |................|
c2 96 c2 97 c2 98 c2 99  c2 a0 c2 a1 c2 a2 c2 a3  |................|
c2 a4 c2 a5 c2 a6 c2 a7  c2 a8 c2 a9 c2 aa c2 ab  |................|
c2 ac c2 ad c2 ae c2 af  c2 b0 c2 b1 c2 b2 c2 b3  |................|
c2 b4 c2 b5 c2 b6 c2 b7  c2 b8 c2 b9 c2 ba c2 bb  |................|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

There it was again: The magic 0xC2!

So all bytes with a value higher than 0x79 got followed by a 0xC2. 0x79 is the ASCII code for y. This is at least what I thought. It actually is the other way around: All bytes with value 0x80 or higher got prefixed by a 0xC2! — there the scales fell from my eyes: UTF-8 encoding!

In UTF-8 all characters after 0x7F are at least two bytes long. They get prefixed with 0xC2 until 0xC2BF (which is the inverted question mark ¿), which is then followed by 0xC380. So what happened is, that on the way to the server, the file got encoded to UTF-8 ¯\_(ツ)_/¯

EDIT: Corrected some mistakes after some comments on Hackernews