Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Puffin, indices and stats file format specification #4944

Merged
merged 1 commit into from Jun 6, 2022

Conversation

findepi
Copy link
Collaborator

@findepi findepi commented Jun 2, 2022

Add a specification for Puffin format, a container file format to store
indices and stats for Iceberg tables.

@findepi
Copy link
Collaborator Author

findepi commented Jun 2, 2022

@findepi
Copy link
Collaborator Author

findepi commented Jun 2, 2022

This was already reviewed at apache/iceberg-docs#69.
Should be good to go, per my understanding.


- `Magic`: four bytes, same as at the beginning of the file.
- `FooterPayload`: optionally compressed, UTF-8 encoded JSON payload describing the
blobs in the file, with the structure described below,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: period instead of comma, or we could change to comma instead of period for magic

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, i think it would be best if it was a comma. however, i removed all the EOL punctucation, because it looked weird on the item with sub-items.

- `Magic`: four bytes, same as at the beginning of the file.
- `FooterPayload`: optionally compressed, UTF-8 encoded JSON payload describing the
blobs in the file, with the structure described below,
- `FooterPayloadSize`: a length in bytes of the `FooterPayload` (compressed),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does compressed in brackets signify(given it can be uncompressed)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed this isn't clear, changing to (after compression, if compressed)

format/puffin-spec.md Outdated Show resolved Hide resolved
format/puffin-spec.md Outdated Show resolved Hide resolved
| Field Name | Field Type | Required | Description |
|-------------------|-------------------| -------- | ----------- |
| type | JSON string | yes | See [Blob types](#blob-types)
| fields | list of JSON long | yes | List of field IDs the blob was computed for; the order of items is used to compute sketches stored in the blob.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically talks about sketches.
Since it is generic, should we remove this from here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the wording comes from apache/iceberg-docs#69 (comment)


| Codec name | Description |
|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| lz4 | Single [LZ4 compression frame](https://github.com/lz4/lz4/blob/77d1b93f72628af7bbde0243b4bba9205c3138d9/doc/lz4_Frame_format.md), with content size present |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Does it make sense to add a link to the doc from a released branch?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the precise version (commit) is the only way to guarantee the link remains valid.
for example, the .md file could be renamed on the release branch.

@findepi
Copy link
Collaborator Author

findepi commented Jun 3, 2022

@karuppayya thanks for your review!

updated

Copy link
Contributor

@danielcweeks danielcweeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I'm +1 with a small comment about being more explicit about the context where Puffin can be used.

format/puffin-spec.md Outdated Show resolved Hide resolved
format/puffin-spec.md Outdated Show resolved Hide resolved
format/puffin-spec.md Outdated Show resolved Hide resolved
Add a specification for Puffin format, a container file format to store
indices and stats for Iceberg tables.
@findepi
Copy link
Collaborator Author

findepi commented Jun 6, 2022

thanks @rdblue for your detailed review.

changes applied.

where

- `Magic`: four bytes, same as at the beginning of the file
- `FooterPayload`: optionally compressed, UTF-8 encoded JSON payload describing the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mentioning of lz4 as the compression option should ideally be placed near the first mention of compression imo. Or perhaps a mention of the compression section would suffice.

I had to go looking for an indication of the compression type after reading this line which was somewhat confusing for me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would remove mention of compression from here rather than elaborate this enumeration item with compression specification. However, it was added for the clarity here.

@rdblue
Copy link
Contributor

rdblue commented Jun 6, 2022

I think this is ready to commit. Next step is to bring it up on the dev list and vote on it.

@rdblue rdblue merged commit 9a8b83c into apache:master Jun 6, 2022
@findepi findepi deleted the findepi/stats-format branch June 7, 2022 07:14
@findepi
Copy link
Collaborator Author

findepi commented Jun 7, 2022

thanks for the merge!

I think this is ready to commit. Next step is to bring it up on the dev list and vote on it.

@rdblue who can do this?

@findepi
Copy link
Collaborator Author

findepi commented Jun 9, 2022

per offline conversation, sent an email to the iceberg dev list (https://lists.apache.org/thread/950rz31y3kr3kz0zzncwokvgzbrmmz4q)
hope this is what i was supposed to do.


The blobs can be of a type listed below

#### `apache-datasketches-theta-v1` blob type
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to clarify, alpha-sketch is just one example, and the idea here is it support various statistics/secondary index that is to be defined, right? Example, other sketches, or even bloom filter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the idea is to support bloom filters, dictionaries, and other index types in addition to sketches like this.

We'd also like to get histograms in there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, led me to wonder on #4945, whether these stats/indices would apply to an individual data file, as well as across whole snapshot (I guess both will be very useful).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter for this file format. I think the next step is to start thinking about adding these at the partition level.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea makes sense, its separate.

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
@lirui-apache
Copy link
Contributor

Hey @findepi , do we have any docs regarding how the puffin files should be tracked by snapshots/manifests? We had implemented data file level indices, e.g. bloom filter, bitmap, and wonder whether we could adapt to this new format.

@findepi
Copy link
Collaborator Author

findepi commented Sep 20, 2022

@lirui-apache sure! the Puffin files are currently being deployed to hold table statistics.
The spec changes were added in #4945
and then made visible in TableMetadata in #5450
the whole implementation is still ongoing, so e.g. you cannot set these yet on a table (see eg #5794)

Note that all the above is centered about certain use-case of the file format, namely the table-level statistics.
The file-level indices probably requires a different set of spec changes and API, but the idea is that it could still use Puffin as the container format.

@lirui-apache
Copy link
Contributor

@findepi Thanks for the pointers! Our indices are used for data skipping so they have to be file-level. Currently we just store serialized indices in the files. I guess switching to Puffin format can help us better track metadata like snapshot and sequence numbers.

@findepi
Copy link
Collaborator Author

findepi commented Sep 20, 2022

@lirui-apache sounds very interesting.
BTW there is #secondary-index slack channel (https://apache-iceberg.slack.com/archives/C02TZ3NMBRC/p1654850392577929) which contains more people interested in use-case like yours!

@lirui-apache
Copy link
Contributor

@lirui-apache sounds very interesting. BTW there is #secondary-index slack channel (https://apache-iceberg.slack.com/archives/C02TZ3NMBRC/p1654850392577929) which contains more people interested in use-case like yours!

Joined. Thanks 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants