Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reader and writer for Puffin, indexes and stats file format #4537

Merged
merged 1 commit into from Jun 15, 2022

Conversation

findepi
Copy link
Collaborator

@findepi findepi commented Apr 11, 2022

Format documentation: #4944

build.gradle Outdated Show resolved Hide resolved
@findepi
Copy link
Collaborator Author

findepi commented Apr 13, 2022

@rdblue i applied most of the comments (no more jackson-datatype-jdk8, json-related annotations, get in accessor names). I also switched over from base16-encoded test "fixtures" to test resources.

I switched writing to use ByteBuffer, this removes removed data copying upon compression.
This is as a fixup for now. Please let me know if you want me to use ByteBuffer in the reader/writer APIs as well?
It seems byte[] works there well (eg because uncompressed size is known, so no need to re-allocation), so please confirm before I apply further changes.

@rdblue
Copy link
Contributor

rdblue commented Apr 13, 2022

@findepi, I would prefer to use ByteBuffer everywhere. If the backing buffer was easy to allocate and work with, that's great. But you're not forced to reallocate if it is not.

@findepi
Copy link
Collaborator Author

findepi commented Apr 20, 2022

( squashed and rebased on top of current version of #4534, no other changes in this push )

@findepi
Copy link
Collaborator Author

findepi commented Apr 20, 2022

AC
@rdblue @singhpk234 thanks for your review, let me know what else I can change.

@findepi
Copy link
Collaborator Author

findepi commented Apr 22, 2022

AC & rebased after #4534 merged.

@findepi
Copy link
Collaborator Author

findepi commented Jun 14, 2022

Rebased after #5019 merged.

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM, just a few suggestions here and there

return type;
}

public List<Integer> inputFields() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the spec mentions that input fields are JSON longs, so I'm just wondering whether this should be a List<Long>?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iceberg field IDs are integers, so the implementation is chosen to be limited to integers.

We can maybe change fields | list of JSON long in the puffin spec to be list of integers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should update the spec to match the type used by the table spec for field IDs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi findepi force-pushed the findepi/stats-file branch 2 times, most recently from d3ca971 to be6bb83 Compare June 15, 2022 11:01
@findepi
Copy link
Collaborator Author

findepi commented Jun 15, 2022

Thanks @rdblue and @nastra for your reviews!

Updated the code accordingly. Please take another look.

@findepi findepi requested review from danielcweeks, nastra and rdblue and removed request for rdblue and danielcweeks June 15, 2022 11:52
ByteBuffer footerPayload = PuffinFormat.compress(footerCompression, footerJson);
outputStream.write(MAGIC);
int footerPayloadLength = footerPayload.remaining();
writeFully(footerPayload);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: why not call IOUtil.writeFully(outputStream, footerPayload) directly rather than having a private writeFully method that just adds the output stream?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!buffer.hasRemaining()) {
return;
}
byte[] chunk = new byte[WRITE_CHUNK_SIZE];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than allocating every time this is called, can you create a ThreadLocal to share this buffer? Alternatively, you could pass the temporary buffer in.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you could pass the temporary buffer in.

This poses sizing challenge. I.e. the caller needs to provide a reasonably sized buffer.

Rather than allocating every time this is called, can you create a ThreadLocal to share this buffer?

Sure, this is feasible. Do you happen to know what would be the expected reuse ratio for such a buffer?

Alternatively we can have a static buffer pool that lends buffers to the current thread.
(assuming we have a problem that we want to fix here)

@rdblue rdblue merged commit a7f7c1a into apache:master Jun 15, 2022
@rdblue
Copy link
Contributor

rdblue commented Jun 15, 2022

Thanks, @findepi! This looks good to me. We can still follow up, but I think the majority of the changes are ready so I've merged it to avoid keeping a big PR outstanding.

@findepi findepi deleted the findepi/stats-file branch June 20, 2022 07:32
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants