Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposing more Envoy configuration knobs #2225

Closed
pims opened this issue Feb 13, 2020 · 23 comments
Closed

Exposing more Envoy configuration knobs #2225

pims opened this issue Feb 13, 2020 · 23 comments
Assignees

Comments

@pims
Copy link
Contributor

pims commented Feb 13, 2020

Please describe the problem you have
[A clear, concise, description of the problem you are facing. What is the problem that feature X would solve for you?]

When deploying Contour/Envoy in a large cluster with heterogenous workloads (gRPC, websockets, h2, etc.), some Envoy settings become quite important to get right.

Here's a subset of settings we have configured through a custom patch:

{
  DrainTimeout:      safetime(customCtx.DrainTimeout),
  IdleTimeout:       safetime(customCtx.IdleTimeout), 
  RequestTimeout:    safetime(customCtx.RequestTimeout),
  StreamIdleTimeout: safetime(customCtx.StreamIdleTimeout),

  PerConnectionBufferLimitBytes: customCtx.PerConnectionBufferLimitBytes,
  MaxConcurrentStreams:          customCtx.MaxConcurrentStreams,
  ...
}

Most of these are part of the HTTPConnectionManager filter configuration.
We have been using these settings with great success for several months in production now, and would be interested in contributing these changes upstream.

My questions to the Contour team: would this fall into the scope of what you envision for Contour?

If you believe that this is out of scope for this project, would you be open to minor code changes that would make it easier for people to patch.

A concrete example being the addition of requestTimeout in commit d197482

If this could be wrapped in some kind of type HTTPConnectionManagerConfiguration struct{}, adding unsupported fields would no longer require changing the function signature, which is used extensively through the comprehensive test suite.

Examples of features we have added over the years:

  • tracing
  • timeouts
  • limits for annotations (maxRetries, maxTimeout, maxPendingRequests…)
  • ext-authz

Worth noting, we were able to retain all of Contour's simplicity through custom defaults while giving us the ability to configure Envoy to our needs.

wdyt?

@jpeach
Copy link
Contributor

jpeach commented Feb 13, 2020

xref #1902
xref #1375

@stevesloka
Copy link
Member

#1192

@davecheney
Copy link
Contributor

davecheney commented Feb 13, 2020

Thank you for raising this issue. I want to give a bit of background.

The general policy of this project has been to avoid knobs and tuneable for their own sake where possible. We've done this in the past by adjusting our defaults, and only in the cases where there are no generally applicable defaults exposed those tuneable to contour users.

What I would propose is that rather than tackling six different aspects of connection manager parameters, which are only tangentially related by the fact that they are all kind of tuneable things and are all presented in a single issue, we break this request into different parts, because before implementing anything, we need to have a user story, a reason to do something, and a way to know when we did it correctly. The more different requests are mixed into one issue, the lower the chance of success.

Does that sound reasonable to you?

@pims
Copy link
Contributor Author

pims commented Feb 13, 2020

@davecheney totally reasonable. Thanks for sharing your reasoning.

I mentioned the HTTPConnectionManager configuration to share some of the options we chose to explicitly configure, and how we were able to configure all of them as a group of options, rather than individual changes.

Our starting point was the recommended configuration described at https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/edge

and discussed by @mattalberts in this thread: #1375 (comment)

What do you think about starting with max_connection_duration. I'll open an issue for it unless, there is another configuration setting that you think is more important.

@jpeach
Copy link
Contributor

jpeach commented Feb 13, 2020

@davecheney Although historically Contour has always aimed for the defaults to be universal, maybe there's a case for a general envoy parameterization mechanism so that we don't end up adding support for a lot of different parameters on a case-by-case basis?

@davecheney
Copy link
Contributor

WRT max connection duration, what’s the use case for limiting it?

@davecheney
Copy link
Contributor

davecheney commented Feb 13, 2020 via email

@pims
Copy link
Contributor Author

pims commented Feb 18, 2020

WRT max connection duration, what’s the use case for limiting it?

@davecheney our experience of providing ingress capabilities for all the teams using our cluster has led us to enforce cluster-wide limits to help with potential application issues.

One concrete example: we’ve hit this issue many times golang/go#30702 and applying max-duration for connection helped in mitigating its impact.

Choosing a good default applicable to different use cases seems somewhat tricky, hence the original desire to make this a configurable option with sane default.

@davecheney
Copy link
Contributor

@pims thanks for the information. What would be a good value for you?

@pims
Copy link
Contributor Author

pims commented Feb 18, 2020

@davecheney we’ve configured most of our timeouts to have a value of limit(annotationValue, 300s) with a drainTimeout of 30s.

Let me know if you'd like to move this conversation to a dedicated issue. And we can close this one.

@davecheney
Copy link
Contributor

davecheney commented Feb 18, 2020 via email

@linecolumn
Copy link

I do have a need to tune (little bit) Envoy with exactly same values.

If this is out of scope for Contour what other suggestion are there as a replacement?

@pims is it possible to open-source your work?

@youngnick
Copy link
Member

@linecolumn Can you please log a separate issue with your use case? As we say in our philosophy document, we prefer not to add tunables without trying to find better defaults first. Outlining what you are trying to do and how tuning those values will help you allows us to determine if we can change a default first, or if not, which settings need to be made tunable.

I think it's important to note that Envoy's timeouts are not straightforward, when I first looked into how to do client timeouts, the timeout I thought we needed to set was not correct.

@pims
Copy link
Contributor Author

pims commented Apr 16, 2020

@linecolumn https://github.com/mattalberts/contour/tree/v1.0.1-cloudflare

Some recent changes haven't been pushed yet, but this covers a few things we had to change.

@youngnick
Copy link
Member

youngnick commented Apr 23, 2020

I'm bringing some conversation from #2247 back here, as it's larger in scope than that single use case. Also, sorry for the giant wall of text.

After some discussion amongst the Contour maintainers, I'd like to talk to everyone about adding more tweakability for these timeout parameters. I can see from the discussion on #2247 that for many of these timeouts, default values are missing valid use cases, and from @pims' fork that you, our users, really want a lot more customisation than you have.

(As an aside, @pims Is there any chance I could schedule a call with you to talk about your fork? I'm really interested in how you've used all the tweakability you've added.)

What I really want to avoid is having the contour serve --help command be a hundred lines of config flags though, so I'd like to start small and try to build a pattern for where we go from here.

I'll add the action items for this into a separate comment so you don't need to parse this giant wall of text to figure out what's happening next.

As part of this, I'd like everyone watching this issue to think about how you'd change these defaults and why. Are there any changes that would be useful for all users of Contour, or are the changes you're making specific to your use cases?

About ext_auth and other settings, I'm still strongly of the opinion that, if Contour accepts configuration for something, we should surface information for that config back to the people who it affects. For services that involve passing a third service to Envoy, then I believe that Contour needs to be able to expose some information about that service back to its consumers. I've covered this requirement in #2325, and I'd really like some sort of answer to this before we do anything more to provide any functionality that uses a similar pattern.

For external auth, that means having a way for Contour to check that the external auth service is healthy and processing requests - the intent is to allow someone who creates a HTTPProxy object to be able to see why all of a sudden everything is failing by inspecting the HTTPProxy object itself, rather than having to know where to go to check on the auth service. Maybe there are reasons that's not practical, but if so, I want to have thought about them, and recorded why we're not doing that.

I should also be clear that I'm very skeptical of the idea that we need to allow the passing of arbitrary Envoy config. Part of the point of Contour is to abstract away Envoy details, I'm yet to be convinced this would be a good way to allow more flexibility.

@youngnick
Copy link
Member

youngnick commented Apr 23, 2020

Here's where I would like to take this configurability forward:

  • We add the ability to tune the defaults for the settings listed in the initial comment here to start with. (That's drain_timeout, idle_timeout, request_timeout, and stream_idle_timeout, and also the max_connection_duration from Make MaxConnectionDuration of core.HttpProtocolOptions configurable #2247.) These will be available as configuration file entries and command line options, and will set the defaults for all config passed to Envoy.
  • We'll also add the ability to override the defaults for these settings to HTTPProxy objects (we already have a TimeoutPolicy there to hold them.) Edit: I've crossed this out as there's a use case I hadn't considered - when the admin wants to change the defaults, but app devs want to set them to something the admin does not want. The example here is request_timeout, we currently allow app devs to set this to unlimited, no matter what the admin has said. We need to carefully consider what to do in this case before adding these things.
  • We'll consider the names and documentation we give to these settings carefully - the names that Envoy gives them make sense for Envoy's strict context, but I would prefer that the names we expose them with make sense for people who aren't intimately familiar with Envoy, but are familiar with running a reverse proxy. I'm thinking here of settings like the route-level timeout, which is kind of a response timeout, but also includes any processing time at the backend service, but it's also relevant that most of the Envoy timeouts include a few large paragraphs about caveats and other info. I'd like to be able to make it so you don't need to spend a long time parsing Envoy's documentation to use Contour's timeouts correctly.

@pims
Copy link
Contributor Author

pims commented Apr 23, 2020

@youngnick thanks for the detailed follow-up.

These will be available as configuration file entries and command line options

If we believe contour serve --help would get unwieldy, having these settings only defined in the configuration file, as a start, could be a low friction way of keeping complexity to a minimum.

Happy to get on a call and talk about our use case. I'll email you.

@youngnick
Copy link
Member

Thanks for the catch-up @pims. I agree that having all these settings on the command line will end up unwieldy, so I'll have a think about a more general stance there too.

@youngnick
Copy link
Member

youngnick commented May 4, 2020

Okay, from here, I'm going to propose the following:

  • we add a new timeouts stanza to the Contour configuration file.
  • Inside this stanza, we will have settings to allow the tweaking of drain_timeout, idle_timeout, request_timeout, stream_idle_timeout, and max_connection_duration.
  • They may have different names, as the names should reflect what the timeout does from the external client point of view, not Envoys.
  • The above work will be done under this issue.
  • Once this work is complete, we will move the current request-timeout value in the configuration file into this stanza, and add request-timeout-max and request-timeout-min optional parameters that specify the bounds in which HTTPProxy or other objects may configure the timeout. This will be the pattern from now for all timeouts that have Contour-tunable defaults with a HTTPProxy override. This will be done under a separate issue.

There's some other stuff in the queue before this, but I will work on scheduling this work for after 1.5.

@skriss skriss moved this from Unprioritized to In progress in Contour Project Board Jun 24, 2020
@skriss skriss self-assigned this Jun 24, 2020
@pims
Copy link
Contributor Author

pims commented Jul 1, 2020

From our slack discussion about timeouts.

idle-timeout is one that we had to tweak on a per upstream basis, so we made it an annotation and only enforce a limit/upperbound

@moderation
Copy link

I found this resource on Envoy timeouts a while ago and it looks quite comprehensive - https://github.com/chemicL/envoy-timeouts.

Please ensure that the documentation and clarity for timeouts makes it clear what applies to what covering:

  • HTTP/1.1
  • HTTP/2
  • Websockets
  • TCP

@skriss
Copy link
Member

skriss commented Jul 13, 2020

Still to do:

@skriss
Copy link
Member

skriss commented Jul 21, 2020

I'm going to close this out as the discussed timeouts have been added as options to the config file and documented. There's some additional work I'm still doing to reconcile the timeout syntax between the API and the config file (see #2697 and #2698), and #2690 has been created to cover moving request-timeout and adding min and max settings.

@skriss skriss closed this as completed Jul 21, 2020
Contour Project Board automation moved this from In progress to 1.7 release Jul 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants