Skip to content
This repository has been archived by the owner on Feb 6, 2024. It is now read-only.

Add stats format specification #69

Closed
wants to merge 1 commit into from

Conversation

findepi
Copy link
Contributor

@findepi findepi commented Apr 1, 2022

Add a specification for Puffin format, a container file format to store
indices and stats for Iceberg tables.

This follows earlier discussion in https://docs.google.com/document/d/1we0BuQbbdqiJS2eUFC_-6TPSuO57GXivzKmcTzApivY

@findepi
Copy link
Contributor Author

findepi commented Apr 1, 2022

The format description is based on the doc (https://docs.google.com/document/d/1we0BuQbbdqiJS2eUFC_-6TPSuO57GXivzKmcTzApivY) and has been preliminarily discussed with @rdblue @losipiuk @alexjo2144 and disseminated as a proposal on the Iceberg dev list and slack. New comments and feedback is welcome.

The description is added next to spec file, so within landing-page. Please advice whether this is the right place.
Also, it's not integrated with the spec yet. It seems to be that the table's reference to stats needs another look before being formalized. See doc link above.


| Field Name | Field Type | Required | Description |
| ---------- | ---------------------- | -------- | ----------- |
| blobs | list of Blob objects | yes |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about list of BlobMetadata objects? That way this doesn't use Blob for the binary payload and for the metadata.


| Blob type | Description |
| ------------------------------ | ----------- |
| ndv-long-little-endian | 8-bytes integer stored little-endian and representing number of distinct values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed or can we put some of the small payloads in a map stored in table metadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can do that. That would obviously make the spec more complicated. Do you think it is worth it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to make this document simpler by not including this. We can have a map of summary stats in the table spec changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wanted the writer to sort blobs by type, so that NDVs end up within single read.
Of course, this still requires two reads to get this information (1 for the footer, and 1 for the NDV).
Is it a problem? IDK. We do lots of reads when planning a scan over an Iceberg table.

if you feel like this is a problem, or a suboptimality that we shouldn't have, I can remove it from here and add it to table spec instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach adds a lot of complexity. If a blob can be distilled into a few scalar values, like the theta sketch and a single NDV, then I would simply put the values into blob metadata. No need for additional blobs and encodings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach adds a lot of complexity.

I think it's actually simpler, because it doesn't require any new type of entity like "blob metadata without the blob itself".

Note that NDV information can exist without a Theta sketch, so we can't just assume that NDV is just an attached info to the sketch.

For example, current Trino version already allows computing NDV and some engine & SPI changes are required to support Theta.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work, @findepi! I have a few comments to clarify and make it more strict but this is a great start.

I also had a bigger idea to make this a header-based format. I think that would simplify it a bit. I'd like to hear what you think.

FYI @aokolnychyi, @RussellSpitzer, @jackye1995.

@findepi
Copy link
Contributor Author

findepi commented Apr 4, 2022

(AC, pushed changes)

@RussellSpitzer
Copy link
Member

RussellSpitzer commented Apr 4, 2022

Is it in the scope for this doc to specify the lifecycle of these files? Like is a particular stats file owned by a snapshot or is this just some generic description of a file with possibly more information whose lifestyle may need special care?

I mostly just want to understand better the relationship between one of these files and a given table. Should a table know where all these files are? or is it up to a reader to know where potential files may be? Please let me know if this is out of scope for this PR.

| type | JSON string | yes | See [Blob types](#blob-types)
| columns | list of JSON long | yes | list of column IDs the blob was computed for
| offset | JSON long | yes | The offset in the file where the blob contents start. Reader should assume the value can be more than 2^32.
| length | JSON long | yes | The length of the blob stored in the file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uncompressed length would be nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather let the compression handled that. Both ZSTD and LZ4 can write uncompressed size in the frame header.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC some compressions libraries (in Go) don't make the information easy to obtain from the frame header. I don't recall which compression formats.

@rdblue
Copy link
Contributor

rdblue commented Apr 4, 2022

Is it in the scope for this doc to specify the lifecycle of these files?

I don't think so. This is focused on how we manage blobs. The design doc linked at the top covers the lifecycle for stats, and other docs cover it for indexes.

Should a table know where all these files are? or is it up to a reader to know where potential files may be?

The design doc states that these files are tracked in each snapshot with a "statistics" key and an object that tracks what payloads are in the file.

@findepi
Copy link
Contributor Author

findepi commented Apr 6, 2022

AC

should i also create a PR for table spec changes?

@findepi
Copy link
Contributor Author

findepi commented Apr 11, 2022

a proposed implementation of reader/writer: apache/iceberg#4537


#### `ndv-long-little-endian` blob type

8-bytes unsigned integer stored little-endian and representing number of distinct values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why this is needed, since it isn't a sketch. I think it is valuable to have either in the metadata that tracks this file, or in the footer. But a blob just to hold it seems overly complex to me.

What about introducing a summary object to BlobMetadata that can contain high-level information like this? Then we can specify that for the apache-datasketches-theta-v1 blob, the summary metadata will have a "ndv-count" field with the number of distinct values as a JSON long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why this is needed, since it isn't a sketch.

This file format is not about "sketches". It's a container format for statistics and indexes, and some of them may be "sketches". This is why the spec uses a more vague term "blob".

NDV number is defined as a blob type, to avoid special casing this particular information.
Otherwise we should add this information in the Table spec directly, not in this file format at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a summary of a blob. We don't need a specific blob type for it and I think that having one makes this overly complex. What is the argument for making this a blob vs keeping it in the sketch metadata as a summary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@findepi findepi force-pushed the findepi/stats-format branch 4 times, most recently from 95d8fc8 to 95bfd64 Compare May 13, 2022 08:50
@findepi
Copy link
Contributor Author

findepi commented May 13, 2022

Thank you @rdblue for your awesome review.

AC; PTAL

@findepi
Copy link
Contributor Author

findepi commented May 17, 2022

AC

@findepi findepi requested a review from rdblue May 17, 2022 15:05
@findepi
Copy link
Contributor Author

findepi commented May 18, 2022

This may be ready to go. PTAL.

@findepi
Copy link
Contributor Author

findepi commented May 26, 2022

Per offline conversation, the file format got a name. Meet Puffin.

@findepi
Copy link
Contributor Author

findepi commented May 27, 2022

@rdblue anything else I could improve here?

@findepi
Copy link
Contributor Author

findepi commented May 27, 2022

@rdblue @pvary

per @pvary 's comment #80 (comment)
it looks like this PR targets a wrong repo, even though it's under review for two months.

Please confirm and I will move over.

Add a specification for Puffin format, a container file format to store
indices and stats for Iceberg tables.
@findepi
Copy link
Contributor Author

findepi commented Jun 2, 2022

per @pvary 's comment #80 (comment)
it looks like this PR targets a wrong repo, even though it's under review for two months.

Per apache/iceberg#4944, and offline confirmation, the PR is moved to Iceberg repo apache/iceberg#4944

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
6 participants