I believe I had a good start analyzing the file, although I'm currently slightly stuck on the exact details of the encoding.
Spoilers ahead for people who want to try things entirely by themselves.
My initial findings were that the raw file easily compresses from 2100086 bytes to 846149 bytes with zpaq -m5, probably hard to beat its context modeling with other compressors or even manual implementations.
I wrote a Python script to reverse the intern's transformation and analyzed the bits, looks like the file is largely organized into septets of data. I dumped those septets in all the ways that made sense (inverted bits, inverted order of bits) into a file. None of them compressed better than the source.
I opened those files with GIMP as raw data files, and noticed that at 4000x600 8-bit grayscale with a small offset for the header it looks like some sort of 2D data, but it's extremely noisy. There are lots of very flat areas in the file, though, which is why it seems to compress so well. Likely a threshold of some sort on the sensor data.
By only taking every fourth septet I could form four 1000x600 grayscale images. One of them seems to have actual structure (I presume it's the most significant septet), the rest are dominated by noise.
It's recognizable as a picture of a piece of crystalized bismuth. (https://i.imgur.com/rP5sq56.png) :::
There also seems to be some sort of pattern in the picture pixels similar to the raw data behind a Bayer filter in a camera. It might be a 2x2 or a 4x4 filter, I can't quite tell. It could also just be some sort of block artifacts from the picture source (which wouldn't be present in raw sensor data from aliens, but the author needs to source this image somehow).
The binary encoding seems to involve a sign bit in each septet if I'm interpreting things correctly, but I'm not sure how to combine septets into a sensible value yet.
I'm currently stuck at this point and won't have time to work on it more this evening, but seems like good progress for one and a half hour.
I sat down before going to bed and believe I have made some more progress.
I experimented with what I called the sign bit in the earlier post, and I'm certain I got it wrong. By ignoring the sign bit, I can reconstruct a much higher fidelity image. I can also do a non-obvious operation of rotating the bit to least significant place after inverting. I can't visually distinguish the two approaches, though.
I wrote a naive debayering filter and got this image out: https://i.imgur.com/e5ydBTb.png (bit rotated version, 16-bit color RGB. Red channel on even rows/columns is exact, blue channel on odd rows/columns is exact, green channel is exact elsewhere.)
You can reverse image search that image to find that it's a standard stock photo of bismuth, example larger but slightly cropped version: http://blog-imgs-98.fc2.com/s/c/i/scienceminestrone/Bismuth.jpg
Finding the original image would definitely help, even if it wouldn't fit the spirit of this challenge.
I haven't yet tried going in the reverse direction and seeing how much space this saves - doing this properly is tricky. I'm not aware of any good, ready-to-use libraries that provide things like context modeling and arithmetic coding, so writing a custom compressor is a lot of work.
In fact, I believe it may be worth trying to break the author's noise source on the sensor. Most programming languages use a fairly breakable PRNG, either a xorshift variant or an LCG. But this may be a dead end if cryptographic randomness was used. Again, this wouldn't fit the spirit of this challenge, but it would minimize description length if it worked.
I have further discovered that in the bulk of the data, the awkward seventh bit is not in fact part of the values in the image, it is a parity bit. My analysis was confused by counting septets from the beginning of the file, which unfortunately seems incorrect.
Analyzing bi-gram statistics on the septets helped figure out that what I previously believed to be the 'highest' bit is in fact the lowest bit of the previous value, and that value always makes the parity of the septet even.
I was trying to ignore the header for now after failing to find values corresponding to the image width and height, but it looks like it's biting me in the ass right now.
Analyzing the file more carefully:
The first 78 bits are the 'prelude', presumably added by the transmission system itself to estabilish a clock for the signal.
All following bits are divided into septets, with each septet's lowest (last) bit being a parity check.
But: the first 37 septets have inverted (odd) parity (It's possible it's metadata for the transmitting system itself).
The next 50 septets constitute some kind of header (with even parity).
The remaining septets are image data (with even parity), four septets per pixel, most significant first. This means the image has 24-bit depth.
There's a rogue zero bit at the end of the transmission to make things a multiple of 8 bits. Note that the intern's code does not explain the last zero bit. The code would not output extra data at the end, it would truncate the file if anything, so the zero bit must have been part of the transmission.
Curiously, by removing the data mentioned in the spoiler and packing everything into nice big-endian values, zpaq compresses the file worse than it does the original file. Just the main section of the file compresses to 850650 bytes.
Morning progress so far:
I figured out how the values (and the noise) are generated.
The source image is an 8-bits per pixel color image, the source pixel value is chosen from one of the color channels using a bayer filter, with a blue filter at 0, 0.
The final value is given by: clamp(0, 2**24-1, (source_value * 65793 + uniform(0, 1676)))
, where uniform(x, y) is a uniformly chosen random value between x and y inclusive.
Without cracking the noise source, the best we can do to encode the noise itself is 465255 bytes.
...because 347475 pixels in the image have non-255 values, and log2(1677^347475) = 3722036.5 bits.
The best I could do to encode the bulk of the data is 276554 bytes, again with the general purpose zpaq compressor.
Aside on attempted alternative compression methods for the bulk of the data:
Image compression did quite poorly here, I thought adam7 interlacing would help compression due to the bayer filter pattern, but png did not perform well. With zopflipng + pngcrush the best I could achieve was 329939 bytes.
This gives an approximate lower bound of 741809 bytes without either modeling the actual data better, or cracking the noise source. This does not include the data needed to describe the decompressor and glue all the data together into the original bitstream.
My understanding of faul_sname's claim is that for the purpose of this challenge we should treat the alien sensor data output as an original piece of data.
In reality, yes, there is a source image that was used to create the raw data that was then encoded and transmitted. But in the context of the fiction, the raw data is supposed to represent the output of the alien sensor, and the claim is that the decompressor + payload is less than the size of just an ad-hoc gzipping of the output by itself. It's that latter part of the claim that I'm skeptical towards. There is so much noise in real sensors -- almost always the first part of any sensor processing pipeline is some type of smoothing, median filtering, or other type of noise reduction. If a solution for a decompressor involves saving space on encoding that noise by breaking a PRNG, it's not clear to me how that would apply to a world in which this data has no noise-less representation available. However, a technique of measuring & subtracting noise so that you can compress a representation that is more uniform and then applying the noise as a post-processing op during decoding is definitely doable.
Assuming that you use the payload of size 741809 bytes, and are able to write a decompressor + "transmitter" for that in the remaining ~400 KB (which should be possible, given that 7z is ~450 KB, zip is 349 KB, other compressors are in similar size ranges, and you'd be saving space since you just need to the decoder portion of the code), how would we rate that against the claims?
- It would be possible for me, given some time to examine the data, create a decompressor and a payload such that running the decompressor on the payload yields the original file, and the decompressor program + the payload have a total size of less than the original gzipped file
- The decompressor would legibly contain a substantial amount of information about the structure of the data.
(1) seems obviously met, but (2) is less clear to me. Going back to the original claim, faul_sname said 'we would see that the winning programs would look more like "generate a model and use that model and a similar rendering process to what was used to original file, plus an error correction table" and less like a general-purpose compressor'.
So far though, this solution does use a general purpose compressor. My understanding of (2) is that I was supposed to be looking for solutions like "create a 3D model of the surface of the object being detected and then run lighting calculations to reproduce the scene that the camera is measuring", etc. Other posts from faul_sname in the thread, e.g. here seem to indicate that was their thinking as well, since they suggested using ray tracing as a method to describe the data in a more compressed manner.
What are your thoughts?
Regarding the sensor data itself
I alluded to this in my post here, but I was waffling and backpedaling a lot on what would be "fair" in this challenge. I gave a bunch of examples in the thread of what would make a binary file difficult to decode -- e.g. non-uniform channel lengths, an irregular data structure, multiple types of sensor data interwoven into the same file, and then did basically none of that, because I kept feeling like the file was unapproachable. Anything that was a >1 MB of binary data but not a 2D image (or series of images) seemed impossible. For example, the first thing I suggested in the other thread was a stream of telemetry from some alien system.
I thought this file would strike a good balance, but I now see that I made a crucial mistake: I didn't expect that you'd be able to view it with the wrong number of bits per byte (7 instead of 6) and then skip almost every byte and still find a discernible image in the grayscale data. Once you can "see" what the image is supposed to be, the hard part is done.
I was assuming that more work would be needed for understanding the transmission itself (e.g. deducing the parity bits by looking at the bit patterns), and then only after that would it be possible to look at the raw data by itself.
I had a similar issue when I was playing with LIDAR data as an alternative to a 2D image. I found that a LIDAR point cloud is eerily similar enough to image data that you can stumble upon a depth map representation of the data almost by accident.
I actually did not read the linked thread until now, I came across this post from the front page and thought this was a potentially interesting challenge.
Regarding "in the concept of the fiction", I think this piece of data is way too human to be convincing. The noise is effectively a 'gotcha, sprinkle in /dev/random into the data'.
Why sample with 24 bits of precision if the source image only has 8 bits of precision, and it shows. Then why only add <11 bits of noise, and uniform noise at that? It could work well if you had a 16-bit lossless source image, or even an approximation of one, but the way this image is constructed is way too artificial. (And why not gaussian noise? Or any other kind of more natural noise? Uniform noise pretty much never happens in practice.) One can also entirely separate the noise from the source data you used because 8 + 11 < 24.
JPEG-caused block artifacts were visible while I was analyzing the planes of the image, that's why I thought the bayer filter was possibly 4x4 pixels in size. I believe you likely downsampled the image from a jpeg at approximately 2000x1200 resolution, which does affect analysis and breaks the fiction that this is raw sensor data from an alien civilization.
With these kinds of flaws I do believe cracking the PRNG is within limits since the data is already really flawed.
(1) is possibly true. At least it's true in this case, although in practice understanding the structure of the data doesn't actually help very much vs some of the best general purpose compressors from the PAQ family.
It doesn't help that lossless image compression algorithms kinda suck. I can often get better results by using zpaq on a NetPBM file than using a special purpose algorithm like png or even lossless webp (although the latter is usually at least somewhat competitive with the zpaq results).
(2) I'd say my decompressor would contain useful info about the structure of the data, or at least the file format itself, however...
...it would not contain any useful representation of the pictured piece of bismuth. The lossless compression requirement hurts a lot. Reverse rendering techniques for various representations do exist, but they are either lossy, or larger than the source data.
Constructing and raytracing a NeRF / SDF / voxel grid / whatever might possibly be competitive if you had dozens (or maybe hundreds) of shots of the same bismuth piece at different angles, but it really doesn't pay for a single image, especially at this quality, especially with all the jpeg artifacts that leaked through, and so on.
I feel like this is a bit of a wasted opportunity, you could have chosen a lot of different modalities of data, even something like a stream of data from the IMU sensor in your phone as you walk around the house. You would not need to add any artificial noise, it would already be there in the source data. Modeling that could actually be interesting (if the sample rate on the IMU was high enough for a physics-based model to help).
I also think that viewing the data 'wrongly' and figuring out something about it despite that is a feature, not a bug.
Updates on best results so far:
General purpose compression on the original file, using cmix:
\time ./cmix -c /ztmp/mystery_file_difficult.bin /ztmp/mystery_file_difficult.cmix
Detected block types: DEFAULT: 100.0%
2100086 bytes -> 760584 bytes in 5668.77 s.
cross entropy: 2.897
5566.69user 105.07system 1:39:57elapsed 94%CPU (0avgtext+0avgdata 18968788maxresident)k
30749016inputs+5592outputs (3812631major+12008307minor)pagefaults 0swaps
Results with knowledge about the contents of the file: https://gist.github.com/mateon1/f4e2b8e3fad338405fa793fb155ebf29 (spoilers).
Summary:
The best general-purpose method after massaging the structure of the data manages 713248 bytes.
The best purpose specific method manages to compress the data, minus headers, to 712439 bytes.
(1) The first thing I did when approaching this was think about how the message is actually transmitted. Things like the preamble at the start of the transmission to synchronize clocks, the headers for source & destination, or the parity bits after each byte, or even things like using an inversed parity on the header so that it is possible to distinguish a true header from bytes within a message that look like a header, and even optional checksum calculations.
(2) I then thought about how I would actually represent the data so it wasn't just traditional 8-bit bytes -- I created encoders & decoders for 36/24/12/6 bit unsigned and signed ints, and 30 / 60 bit non-traditional floating point, etc.
Finally, I created a mock telemetry stream that consisted of a bunch of time-series data from many different sensors, with all of the sensor values packed into a single frame with all of the data types from (2), and repeatedly transmitted that frame over the varying time series, using (1), until I had >1 MB.
And then I didn't submit that, and instead swapped to a single message using the transmission protocol that I designed first, and shoved an image into that message instead of the telemetry stream.
[0, +N]
, when I should have allowed the noise to be positive or negative. I am less convinced that the uniform distribution is incorrect. In my experience, ADC noise is almost always uniform (and only present in a few LSB), unless there's a problem with the HW design, in which case you'll get dramatic non-uniform "spikes". I was assuming that the alien HW is not so poorly designed that they are railing their ADC channels with noise of that magnitude.Oh nice! Sorry for the slow reply. It looks like Mateon1 might have already solved it, but I'm going to take an independent crack at this before looking at the solution they came up with.
So far I've established that it appears to be
a pattern of bits that is 686 bits of header, followed by 500x600x56 bits of message, followed by 2 0's. The 500 and 600 length sides appearing to have neighboring cells have similar value / behavior, and the side of length 56 representing maybe channels. Every 7th channel seems to be different than the other 6. There is some easily visible structure in channels 1-4, 10-13, 29-32, and 38-41.
My progress so far is at https://imgur.com/a/SM0gY1o
Alright, I managed to figure out what the message is and where it came from. I have not managed to write a better compressor than gzip yet, and expect that doing so will require picking up some computer skills that I don't currently have. But I think it's something modern software and hardware is capable of.
Update on progress: https://imgur.com/a/6r7VcDV
Time to see how far behind Mateon1 I was.
Edit: Answer: extremely far behind. I am very impressed. Also I take back my statement that I could probably beat the best general compression algos here given a reasonable amount of time, because
it's a blurry picture of something that had lots of pretty diffraction patterns, which was then converted to jpeg. Deconvolution is a thing, and can sometimes recover information from blurry pictures, but the conversion to jpeg destroys some data and I don't think deconvolution algos are robust to lost data. And also the data that looked random actually was random.
I probably could tell you some details about the camera used, focal distance, etc though after some playing around with Blender. And someone who knows a lot more physics than I do could maybe tell you interesting things about the light source by looking at the interference patterns in the layer of bismuth oxide in the parts of the photo that aren't blurry.
Interesting choice of message to encode -- it's one that (aside from the lossy compression aspect) would actually be quite a bit more informative about the laws of physics in the universe it came from than a picture of a falling apple would be about the laws of physics in our universe.
Thanks for running this challenge.
This is my reply to faul_sname's claim in the "I No Longer Believe Intelligence To Be Magical" thread:
Here is such a file: https://mega.nz/file/VYxE3T5A#xN3524gW4Q68NXK2rmYgTqq6e-2RSaEF2HW8rLGfK7k.
Unfortunately, the data in this file comes from eavesdropping on aliens.
Background on LV-478
It was a long overdue upgrade of the surface-to-orbit radio transmitter used for routine communication with the space dock. However, a technician forgot to lower the power level of the transmitter after the upgrade process, and the very first transmission of the upgraded system was at the new (and much higher) maximum power level. The aliens accidentally transmitted an utterly mundane and uninteresting file into deep space.
The technician was fired.
Hundreds of years later...
Background on Earth
Astronomers on Earth noticed a bizarre radio transmission from a nearby star system. In particular, the transmission seemed to use frequency modulation to carry some unknown data stream. They were lucky enough to be recording when this stream was received, and they are therefore confident that they received all of the data.
Furthermore, the astronomers were able to analyze the transmission and assign either a "0?-modulated" or "1?-modulated" value. The astronomers are confident that they've analyzed the frequencies correctly, and that the data carried by the transmission was definitely binary in nature. The full transmission contained a total of 16800688 (~2 MB) binary values.
The astronomers ran out of funding before they could finish analyzing the data, so an intern in the lab took the readings, packed the bits together, and uploaded it to the internet, in the hopes that someone else would be able to figure it out.
This is the exact Python code that the intern used when packing the data:
Challenge Details
Decode the alien transmission in the file uploaded at https://mega.nz/file/VYxE3T5A#xN3524gW4Q68NXK2rmYgTqq6e-2RSaEF2HW8rLGfK7k.