Document requirements/recommended process for updating cluster TLS certs/keys #30575

Open
Tracked by #44609 ...
jimmycuadra opened this issue May 9, 2016 · 42 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. language/en Issues or PRs related to English language lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/security Categorizes an issue or PR as relevant to SIG Security. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jimmycuadra
Contributor

If you're running Kubernetes with the master components secured with TLS, eventually you will need to update the certificate and key, and possibly even the CA cert. Right now there is no documentation about how this should be approached. What services need to be restarted when the CA cert, endpoint cert, or private key are changed on disk? If all the master components are running via the kubelet's static manifest directory, is it sufficient to just restart kubelet on the host? Or is it necessary to somehow manually restart each containerized master component that reads those files?

@pwittrock
Member

@kelseyhightower Do you know how folks normally figure this out?

@jimmycuadra
Contributor Author

I just had to go through this process more or less manually, and learned that restarting the kubelet does not restart any k8s master components that were launched from static manifests the kubelet is observing. In order to get kube-apiserver to restart and pick up the new TLS credentials, I had to move the kube-apiserver manifest out of the directory the kubelet was watching, restart the kubelet, then move the manifest back in and restart the kubelet again.
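
A minimal sketch of the workaround described above, for anyone who wants to script it. This is an illustration under stated assumptions, not an official procedure: `bounce_static_pod`, `holding_dir` and the `restart_kubelet` callback are hypothetical names, and how the kubelet is restarted depends on the host, so it is passed in by the caller.

```python
import shutil
from pathlib import Path
from typing import Callable


def bounce_static_pod(
    manifest_dir: Path,
    manifest_name: str,
    holding_dir: Path,
    restart_kubelet: Callable[[], None],
) -> None:
    """Move a static pod manifest out of the watched directory and back.

    Mirrors the manual steps from the comment above: move the manifest out,
    restart the kubelet, move it back in, restart the kubelet again.
    holding_dir must be outside manifest_dir (assumption: the kubelet only
    watches manifest_dir).
    """
    watched = manifest_dir / manifest_name
    parked = holding_dir / manifest_name
    if not watched.is_file():
        raise FileNotFoundError(f"no manifest at {watched}")
    holding_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: take the manifest away so the kubelet stops the component.
    shutil.move(str(watched), str(parked))
    restart_kubelet()

    # Step 2: put it back so the component starts again with the new files.
    shutil.move(str(parked), str(watched))
    restart_kubelet()
```

On a systemd host, `restart_kubelet` could be something like `lambda: subprocess.run(["systemctl", "restart", "kubelet"], check=True)`, and `manifest_dir` might be `/etc/kubernetes/manifests`; both are assumptions about the host, not something this thread confirms.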

This definitely needs to be documented. Hopefully there is a better way of telling the kubelet to restart the master components, too. If not, there really should be.

@jimmycuadra
Contributor Author

Would a maintainer please bring a Kubernetes developer who can answer this into the conversation? Not having a way to restart the kube-system components to pick up TLS credential changes is a blocking issue for my team to roll out Kubernetes in production. Thanks!

@jimmycuadra
Contributor Author

Comment from chancez on Slack which helps with a workaround for the time being:

I've only tried once or twice, to test a solution for an issue a user was having. I basically did what you did, but I ran docker kill $container instead of messing with the manifest files.
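
A rough sketch of the `docker kill` approach from the quote above, assuming the node uses Docker as its container runtime and that the kubelet starts a replacement container once the old one is gone. The function names and the `runner` parameter are hypothetical; only `docker ps -q --filter name=...` and `docker kill` come from the Docker CLI.

```python
import subprocess
from typing import Callable, List

# A runner takes a command as a list of strings and returns its stdout.
Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def kill_containers(name_filter: str, runner: Runner = run) -> List[str]:
    """Kill every running container whose name matches name_filter.

    Returns the IDs of the containers that were killed.
    """
    # `docker ps -q --filter name=...` prints one container ID per line.
    ids = runner(["docker", "ps", "-q", "--filter", f"name={name_filter}"]).split()
    for container_id in ids:
        runner(["docker", "kill", container_id])
    return ids
```

Which value to pass as `name_filter` depends on how the runtime names the component's container on your nodes, so check `docker ps` output first rather than trusting a guessed pattern.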

@eugene-chow
Contributor

I just had to re-make my cert and deploy it to all the nodes. These are my notes:

  1. Create the new certs and deploy them to all the nodes.
  2. Restart every node, one by one. Alternatively, restart every Kubernetes service on each node if you don't want to bring the node down completely.
  3. Delete the default service account token, which is linked to the old TLS certs, in every namespace (e.g. default-token-b7scp). A new default token will be created automatically.
  4. Restart every pod that uses the default token, so that it reads the new token.
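
Step 3 of the list above can be sketched as follows. This is only an illustration: it assumes the cluster stores service account tokens as Secrets (as in the notes above) and that you feed it the parsed JSON output of `kubectl get secrets --all-namespaces -o json`. The function name is hypothetical, and the code only builds the delete commands; it does not run them.

```python
from typing import Dict, List

SA_TOKEN_TYPE = "kubernetes.io/service-account-token"
SA_NAME_ANNOTATION = "kubernetes.io/service-account.name"


def default_token_delete_commands(secret_list: Dict) -> List[List[str]]:
    """Build one `kubectl delete secret` command per default service account token.

    secret_list is the parsed JSON of a Secret list (a dict with an "items" key).
    """
    commands = []
    for secret in secret_list.get("items", []):
        meta = secret.get("metadata", {})
        annotations = meta.get("annotations") or {}
        # Skip anything that is not a service account token Secret.
        if secret.get("type") != SA_TOKEN_TYPE:
            continue
        # Only tokens that belong to the "default" service account.
        if annotations.get(SA_NAME_ANNOTATION) != "default":
            continue
        commands.append(
            ["kubectl", "delete", "secret", meta["name"],
             "--namespace", meta["namespace"]]
        )
    return commands
```

Printing the commands and reviewing them before running anything seems safer than deleting directly, since pods using the old token still need the restart from step 4.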

@liggitt
Member

liggitt commented Feb 22, 2017

Steps 3 and 4 are not required if you keep the old key as an additional valid public key (you can pass multiple --service-account-key-file args to the apiserver), or if you use a dedicated service account token signing key that is separate from the apiserver's private TLS key.
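
To illustrate the first option, here is a small sketch that builds the repeated flag for a list of key files. The helper name and the file paths in the usage note are hypothetical; the only thing taken from the comment above is that `--service-account-key-file` may be passed more than once.

```python
from typing import Iterable, List


def service_account_key_flags(public_key_files: Iterable[str]) -> List[str]:
    """Return one --service-account-key-file flag per key file, without duplicates.

    Order is preserved, so the caller decides whether the new or the old key
    is listed first.
    """
    flags = []
    seen = set()
    for path in public_key_files:
        if path in seen:
            continue
        seen.add(path)
        flags.append(f"--service-account-key-file={path}")
    return flags
```

For example, `service_account_key_flags(["/path/to/new-key.pub", "/path/to/old-key.pub"])` yields two flags that could be appended to the apiserver's command line, so tokens signed with either key keep validating during the rotation.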

@eugene-chow
Contributor

Thanks for the heads up. I forgot to mention that my setup isn't a production system; it follows Kelsey's https://github.com/kelseyhightower/kubernetes-the-hard-way. The cluster in this tutorial uses one cert to rule them all.

@xiangpengzhao
Contributor

/sig docs

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@ankon
Contributor

ankon commented Jan 29, 2018

Clearly this problem still exists, and it really needs to be addressed: at a minimum in documentation, ideally in code as well.

@emilhdiaz

This is a major blocker for us to run Kubernetes in production. Any advice would be much appreciated!

I already tried to simulate a certificate rotation in a test environment and couldn't do so without causing downtime for running applications.

/remove-lifecycle stale

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@ankon
Contributor

ankon commented Mar 5, 2018

/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@trunet: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sftim
Contributor

sftim commented Nov 20, 2021

/reopen

@k8s-ci-robot
Contributor

@sftim: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 20, 2021
@k8s-ci-robot
Contributor

@jimmycuadra: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sftim
Contributor

sftim commented Nov 20, 2021

/transfer website

@k8s-ci-robot k8s-ci-robot transferred this issue from kubernetes/kubernetes Nov 20, 2021
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 20, 2021
@sftim
Contributor

sftim commented Nov 20, 2021

/language en
/remove-area security
/sig security

@k8s-ci-robot k8s-ci-robot added the language/en Issues or PRs related to English language label Nov 20, 2021
@k8s-ci-robot
Contributor

@sftim: Those labels are not set on the issue: area/security

In response to this:

/language en
/remove-area security
/sig security

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the sig/security Categorizes an issue or PR as relevant to SIG Security. label Nov 20, 2021
@sftim
Contributor

sftim commented Nov 20, 2021

/sig auth

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Nov 20, 2021
@sftim
Contributor

sftim commented Nov 20, 2021

/kind feature

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 11, 2022
@ritazh
Member

ritazh commented Apr 11, 2022

/assign @aramase

@divya-mohan0209
Contributor

Hello @aramase: could we please have an update on whether this is being worked on at the moment?

@mehabhalodiya
Contributor

@aramase I don't see any updates, so I'm unassigning you. Please feel free to assign yourself again if you come back and are willing to work on it. Thank you! 🙂
/unassign @aramase

@tomkivlin
Contributor

This appears to have been completed in 92b56db and could be closed?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 15, 2022
@tomkivlin
Contributor

Ah my bad - I am still working on this, but slowly. Will aim to get a PR ready by end of this week.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 15, 2022
@sftim
Contributor

sftim commented Jan 30, 2023

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 30, 2023
@sftim
Contributor

sftim commented Mar 2, 2023

Slightly relevant to #39694

@sftim
Contributor

sftim commented Jul 9, 2023

@tomkivlin, you made a start on this. How did that go?

@tomkivlin
Contributor

@sftim I did, and then forgot about it, sorry. I have blocked out some time in the second half of August to get this done. Apologies. I think the issue I created a branch for was #14725 so I'll reassign that to me as well.

@tomkivlin
Contributor

/assign

@sftim
Contributor

sftim commented Jul 30, 2023

Duplicated by (part of) #42258

@sftim
Contributor

sftim commented Jan 2, 2024

Anyone who'd like to help with this issue is very welcome to work on it.

@sftim
Contributor

sftim commented Jan 21, 2024

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jan 21, 2024
@sftim
Contributor

sftim commented Mar 14, 2024

Help is (still) welcome
