Add stats format specification #69

findepi · 2022-04-01T13:45:05Z

Add a specification for Puffin format, a container file format to store
indices and stats for Iceberg tables.

This follows earlier discussion in https://docs.google.com/document/d/1we0BuQbbdqiJS2eUFC_-6TPSuO57GXivzKmcTzApivY

findepi · 2022-04-01T13:47:41Z

The format description is based on the doc (https://docs.google.com/document/d/1we0BuQbbdqiJS2eUFC_-6TPSuO57GXivzKmcTzApivY) and has been preliminarily discussed with @rdblue @losipiuk @alexjo2144 and disseminated as a proposal on the Iceberg dev list and Slack. New comments and feedback are welcome.

The description is added next to the spec file, so within landing-page. Please advise whether this is the right place.
Also, it's not integrated with the spec yet. It seems to me that the table's reference to stats needs another look before being formalized. See the doc link above.


rdblue · 2022-04-03T22:56:02Z

landing-page/content/common/statistics-format-spec.md

+
+| Field Name | Field Type             | Required | Description |
+| ---------- | ---------------------- | -------- | ----------- |
+| blobs      | list of Blob objects   | yes      |


How about list of BlobMetadata objects? That way this doesn't use Blob for the binary payload and for the metadata.


rdblue · 2022-04-03T23:03:27Z

landing-page/content/common/statistics-format-spec.md

+
+| Blob type                      | Description |
+| ------------------------------ | ----------- |
+| ndv-long-little-endian         | 8-bytes integer stored little-endian and representing number of distinct values


Is this needed or can we put some of the small payloads in a map stored in table metadata?

Yes, we can do that. That would obviously make the spec more complicated. Do you think it is worth it?

I meant to make this document simpler by not including this. We can have a map of summary stats in the table spec changes.

I wanted the writer to sort blobs by type, so that the NDVs end up within a single read.
Of course, this still requires two reads to get this information (one for the footer, and one for the NDVs).
Is it a problem? IDK. We do lots of reads when planning a scan over an Iceberg table.

If you feel like this is a problem, or a suboptimality that we shouldn't have, I can remove it from here and add it to the table spec instead.
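
To illustrate the "sort blobs by type" idea: once the footer has been read, a reader could check whether all blobs of one type sit back to back and can be fetched with one range read. This is only a minimal sketch. The helper name `single_read_range` is hypothetical, the blob entries are plain dicts using the `type`, `offset` and `length` field names from the proposed footer table, and nothing here is part of the spec or of any existing reader.

```python
from typing import Dict, List, Optional, Tuple


def single_read_range(blobs: List[Dict], blob_type: str) -> Optional[Tuple[int, int]]:
    """Return (offset, length) covering all blobs of blob_type,
    or None if there are none or they are not contiguous in the file."""
    matching = sorted(
        (b for b in blobs if b["type"] == blob_type),
        key=lambda b: b["offset"],
    )
    if not matching:
        return None
    start = matching[0]["offset"]
    end = start
    for blob in matching:
        if blob["offset"] != end:
            # Another blob (or padding) sits in between: one read is not enough.
            return None
        end = blob["offset"] + blob["length"]
    return start, end - start
```

If the writer sorts by type, the function returns one range for all NDV blobs. If blobs of different types are interleaved, it returns `None` and the reader falls back to one read per blob.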

I think this approach adds a lot of complexity. If a blob can be distilled into a few scalar values, like the theta sketch and a single NDV, then I would simply put the values into blob metadata. No need for additional blobs and encodings.

> I think this approach adds a lot of complexity.

I think it's actually simpler, because it doesn't require any new type of entity like "blob metadata without the blob itself".

Note that NDV information can exist without a Theta sketch, so we can't just assume that NDV is just an attached info to the sketch.

For example, the current Trino version already allows computing NDV, whereas supporting Theta requires some engine and SPI changes.
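
For context on how small the `ndv-long-little-endian` payload under discussion is, here is a minimal sketch of the encoding. The helper names `encode_ndv` and `decode_ndv` are hypothetical, and the signed interpretation is an assumption: the table row quoted above only says "8-bytes integer stored little-endian".

```python
import struct


def encode_ndv(ndv: int) -> bytes:
    # "<q" is a little-endian signed 64-bit integer (signedness is an assumption).
    return struct.pack("<q", ndv)


def decode_ndv(payload: bytes) -> int:
    if len(payload) != 8:
        raise ValueError(f"expected 8 bytes, got {len(payload)}")
    return struct.unpack("<q", payload)[0]
```

The whole blob is eight bytes, which is why the thread debates whether it deserves a blob of its own or belongs in metadata.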

landing-page/content/common/statistics-format-spec.md

rdblue

Awesome work, @findepi! I have a few comments to clarify and make it more strict but this is a great start.

I also had a bigger idea to make this a header-based format. I think that would simplify it a bit. I'd like to hear what you think.

FYI @aokolnychyi, @RussellSpitzer, @jackye1995.

findepi · 2022-04-04T14:28:18Z

(AC, pushed changes)

RussellSpitzer · 2022-04-04T14:41:05Z

Is it in the scope for this doc to specify the lifecycle of these files? Like is a particular stats file owned by a snapshot or is this just some generic description of a file with possibly more information whose lifecycle may need special care?

I mostly just want to understand better the relationship between one of these files and a given table. Should a table know where all these files are? or is it up to a reader to know where potential files may be? Please let me know if this is out of scope for this PR.

rdblue · 2022-04-04T16:06:36Z

landing-page/content/common/statistics-format-spec.md

+| type       | JSON string            | yes      | See [Blob types](#blob-types)
+| columns    | list of JSON long      | yes      | list of column IDs the blob was computed for
+| offset     | JSON long              | yes      | The offset in the file where the blob contents start. Reader should assume the value can be more than 2^32.
+| length     | JSON long              | yes      | The length of the blob stored in the file


Uncompressed length would be nice.

I'd rather let the compression handle that. Both ZSTD and LZ4 can write the uncompressed size in the frame header.

IIRC some compression libraries (in Go) don't make the information easy to obtain from the frame header. I don't recall which compression formats.
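
As a reading aid for the `offset` and `length` fields quoted above, here is a minimal sketch of how a reader might fetch one blob. The `BlobMetadata` class and `read_blob` function are hypothetical names that mirror the four fields in the quoted table row. They are not the API of any existing implementation, and decompression is left out.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, List


@dataclass
class BlobMetadata:
    type: str           # blob type name
    columns: List[int]  # column IDs the blob was computed for
    offset: int         # may exceed 2^32; Python ints are unbounded
    length: int         # length of the blob as stored in the file


def read_blob(f: BinaryIO, meta: BlobMetadata) -> bytes:
    """Seek to the blob and return exactly meta.length stored bytes."""
    f.seek(meta.offset)
    data = f.read(meta.length)
    if len(data) != meta.length:
        raise EOFError(f"expected {meta.length} bytes, got {len(data)}")
    return data


# Usage with an in-memory file; the payload bytes are made up.
example = io.BytesIO(b"headerPAYLOADfooter")
payload = read_blob(example, BlobMetadata("example-type", [1], 6, 7))
```

Because `length` describes the bytes as stored, a reader that wants to pre-allocate an output buffer needs the uncompressed size from somewhere else, either the compression frame header or an extra metadata field as suggested above.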

rdblue · 2022-04-04T16:08:51Z

> Is it in the scope for this doc to specify the lifecycle of these files?

I don't think so. This is focused on how we manage blobs. The design doc linked at the top covers the lifecycle for stats, and other docs cover it for indexes.

> Should a table know where all these files are? or is it up to a reader to know where potential files may be?

The design doc states that these files are tracked in each snapshot with a "statistics" key and an object that tracks what payloads are in the file.

findepi · 2022-04-06T10:47:15Z

AC

Should I also create a PR for table spec changes?

findepi · 2022-04-11T18:05:20Z

A proposed implementation of the reader/writer: apache/iceberg#4537


rdblue · 2022-05-12T02:21:17Z

landing-page/content/common/index-and-statistics-format.md

+
+#### `ndv-long-little-endian` blob type
+
+8-bytes unsigned integer stored little-endian and representing number of distinct values.


I don't see why this is needed, since it isn't a sketch. I think it is valuable to have either in the metadata that tracks this file, or in the footer. But a blob just to hold it seems overly complex to me.

What about introducing a summary object to BlobMetadata that can contain high-level information like this? Then we can specify that for the apache-datasketches-theta-v1 blob, the summary metadata will have a "ndv-count" field with the number of distinct values as a JSON long.

> I don't see why this is needed, since it isn't a sketch.

This file format is not about "sketches". It's a container format for statistics and indexes, and some of them may be "sketches". This is why the spec uses a more vague term "blob".

NDV number is defined as a blob type, to avoid special casing this particular information.
Otherwise we should add this information in the Table spec directly, not in this file format at all.

This is a summary of a blob. We don't need a specific blob type for it and I think that having one makes this overly complex. What is the argument for making this a blob vs keeping it in the sketch metadata as a summary?
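
To make the "summary" alternative concrete, here is a sketch of what a footer entry could look like and how a reader would get the NDV without a second read. The field name `summary`, the helper `ndv_from_footer`, and all numeric values are assumptions for illustration. Only `blobs`, `type`, `columns`, `offset`, `length`, the `apache-datasketches-theta-v1` type name, and the `"ndv-count"` key come from the discussion above.

```python
import json
from typing import Optional

# Illustrative footer with one blob entry; the values are made up.
footer_json = json.dumps({
    "blobs": [
        {
            "type": "apache-datasketches-theta-v1",
            "columns": [1],
            "offset": 4,
            "length": 1024,
            "summary": {"ndv-count": 42},  # hypothetical field name
        }
    ]
})


def ndv_from_footer(footer: str, column_id: int) -> Optional[int]:
    """Return the summarized NDV for a single-column blob, if present."""
    for blob in json.loads(footer)["blobs"]:
        if blob["columns"] == [column_id]:
            ndv = blob.get("summary", {}).get("ndv-count")
            if ndv is not None:
                return ndv
    return None
```

With this layout the NDV arrives with the footer read, but it exists only alongside a sketch blob. That is the case raised earlier in the thread, where an engine can compute an NDV without producing a Theta sketch.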

landing-page/content/common/index-and-statistics-format.md

findepi · 2022-05-13T08:50:45Z

Thank you @rdblue for your awesome review.

AC; PTAL

landing-page/content/common/index-and-statistics-format.md

findepi · 2022-05-17T15:05:51Z

AC

findepi · 2022-05-18T12:26:47Z

This may be ready to go. PTAL.

landing-page/content/common/index-and-statistics-format.md

findepi · 2022-05-26T12:40:40Z

Per offline conversation, the file format got a name. Meet Puffin.

findepi · 2022-05-27T07:20:43Z

@rdblue anything else I could improve here?

findepi · 2022-05-27T08:55:18Z

@rdblue @pvary

Per @pvary's comment #80 (comment),
it looks like this PR targets the wrong repo, even though it has been under review for two months.

Please confirm and I will move over.

landing-page/content/common/puffin.md


findepi · 2022-06-02T12:08:15Z

> Per @pvary's comment #80 (comment),
> it looks like this PR targets the wrong repo, even though it has been under review for two months.

Per apache/iceberg#4944 and offline confirmation, the PR has been moved to the Iceberg repo: apache/iceberg#4944