Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose table statistics in Table API #4741

Merged
merged 1 commit into from Sep 29, 2022

Conversation

findepi
Copy link
Collaborator

@findepi findepi commented May 10, 2022

Follows #4945 and #5021

@rdblue
Copy link
Contributor

rdblue commented May 10, 2022

@findepi, I think this is too big to go in on its own. Can you split it into separate PRs for different parts?

@findepi
Copy link
Collaborator Author

findepi commented May 10, 2022

@rdblue i have this split. #4537 has already been reviewed.

@findepi findepi force-pushed the findepi/stats-in-table branch 2 times, most recently from 2a9585d to 290152b Compare May 18, 2022 14:21
@alexjo2144
Copy link
Contributor

We need to remember to keep remove_orphan_files from deleting Statistics files

@findepi
Copy link
Collaborator Author

findepi commented May 19, 2022

We need to remember to keep remove_orphan_files from deleting Statistics files

Thanks for reminding about this here. It's on my todo list, see eg trinodb/trino#12317 (comment)

@findepi findepi force-pushed the findepi/stats-in-table branch 2 times, most recently from 24b1a32 to 3e94f20 Compare May 20, 2022 09:14
@findepi
Copy link
Collaborator Author

findepi commented May 20, 2022

We need to remember to keep remove_orphan_files from deleting Statistics files

Fixed & test added.

@github-actions github-actions bot added the spark label May 20, 2022
@findepi findepi force-pushed the findepi/stats-in-table branch 5 times, most recently from 486c731 to ac219c0 Compare May 25, 2022 10:42
@findepi
Copy link
Collaborator Author

findepi commented Aug 29, 2022

Updated to the current state of #5021

@findepi
Copy link
Collaborator Author

findepi commented Aug 30, 2022

Rebased after #5021 has been merged to make Conflicts disappear.

@findepi
Copy link
Collaborator Author

findepi commented Sep 6, 2022

@rdblue please let me know if you have any comments.

*
* @return the current statistics files for the table
*/
List<StatisticsFile> statisticsFiles();
Copy link
Contributor

@rdblue rdblue Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to expose these directly? What about instead adding an API for finding statistics? That way consumers don't need to decide which one to use themselves.

Either way, this is for consuming stats and we should add the implementation of the update stats API in its own PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to expose these directly?

Yes, these should be exposed, just like we expose other information about table (files, their min/maxes, etc.) and we provide convenience functionality on top of them (like planFiles).

What about instead adding an API for finding statistics?

Good idea. What would you like to see here?

public class GenericBlobMetadata implements BlobMetadata {
public class GenericBlobMetadata implements BlobMetadata, Serializable {

public static BlobMetadata from(org.apache.iceberg.puffin.BlobMetadata puffinMetadata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we make it so that the Puffin BlobMetadata implements BlobMetadata? Then this would be unnecessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Puffin BlobMetadata contains all the blob metadata information from the Puffin footer.
In particular, it contains more fields, like offset and length, which are not carried over to the table metadata.

We can make Puffin BlobMetadata implement the BlobMetadata, but

  • we don't want to serialize (as in Serializable) the Puffin BlobMetadata, as we're not interested in sending over those additional fields (I know these are small today, and maybe are never big in the future)
  • there is a challenge with equality semantics. GenericBlobMetadata implements equals, but with more complicated class hierarchy it wouldn't be useful anymore.

Let me know what you think

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want to serialize (as in Serializable) the Puffin BlobMetadata

i see #4741 (comment) now

@findepi
Copy link
Collaborator Author

findepi commented Sep 19, 2022

Per #4741 (comment), I've shrunk the scope of this PR
and extracted #5794 and also #5795 .

@rdblue
Copy link
Contributor

rdblue commented Sep 19, 2022

Thanks, @findepi! I'll give this another look.

@findepi
Copy link
Collaborator Author

findepi commented Sep 19, 2022

Thanks @rdblue . #5799 might be the easiest to start from.

@findepi findepi changed the title Add implementation for statistics information in table snapshot Expose table statistics in Table API Sep 21, 2022
This adds support in `Table` for the table statistics.
@findepi
Copy link
Collaborator Author

findepi commented Sep 27, 2022

@rdblue this is ready to review.

@rdblue rdblue merged commit ae8e669 into apache:master Sep 29, 2022
@rdblue
Copy link
Contributor

rdblue commented Sep 29, 2022

Thanks, @findepi!

@findepi findepi deleted the findepi/stats-in-table branch September 30, 2022 09:46
@findepi
Copy link
Collaborator Author

findepi commented Sep 30, 2022

Thank you for the merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants