Session cookie auth scheme for console#326
Conversation
| } | ||
| } | ||
| Ok(cookies) | ||
| } |
There was a problem hiding this comment.
This is based on how Rocket does it — note that the Rocket dev, @SergioBenitez, is also the dev of the cookie crate.
Dropshot currently does not provide this functionality but it easily could. Or not. Doing it here seems fine too.
| // Some(exp) => { | ||
| // // continue | ||
| // } | ||
| // } |
There was a problem hiding this comment.
I figure we should check both cookie expiration time and database? or we could just ignore the one on the cookie since the one we have stored is really the source of truth
There was a problem hiding this comment.
The expires set on a cookie is intended for the client. After the expires date lapses from the client's perspective the cookie will be removed. I'd say if we get a cookie we should try to verify it. The expiration we really care about should be managed server side. Typically you might have something like Redis or some other datastore with a TTL setting on a row that'd remove it after it expires. Cockroach doesn't have that though (see bulk deleting expired data).
davepacheco
left a comment
There was a problem hiding this comment.
I realize this is still draft so these comments might be getting ahead of things.
As I look at this I'm finding myself lacking a lot of context about how this typically works. Is there an RFD (or RFC or anything else you'd recommend) describing exactly how Cookies are securely used for session authn (like the format of the cookie, what fields are expected to be present, how replay attacks are avoided, are cookies always bearer tokens, etc.).
| Ok(string) => string, | ||
| Err(_) => continue, | ||
| }; | ||
| for chunk in raw_str.split(';').map(|s| s.trim()) { |
There was a problem hiding this comment.
I'm not sure if this matters, but according to MDN, the key-value pairs in the Cookie header are separated by ; , not just ;.
Along these lines, is there something out there that's already vetted that parses the key-value pairs out of the Cookie header? It looks like Cookie::parse already does this, but we're pre-splitting the header and using it here to parse just a chunk?
Sorry if a lot of this is understood. It's obviously outside my wheelhouse. I'm just nervous because this seems like pretty important code to get right and I'm trying to really convince myself it's correct.
There was a problem hiding this comment.
I was following this comment: rwf2/cookie-rs#137 (comment)
I admit I did not look into the reasoning.
I agree that it's odd that the cookie lib doesn't have something that does this, and so do others. @SergioBenitez seems to think it's too simple to warrant a utility for it, or rather any appropriately general utility would be too complex to be worth it. See also https://github.com/SergioBenitez/Rocket/blob/7ffe3a73601aabdfe3190a809e660538acd5bdce/core/lib/src/request/request.rs#L995-L1007.
|
I appreciate the early feedback. I'll put together some stuff on how this is supposed to work. In the CP tactical you missed, we discussed Justin and I sprucing up RFD 169 and probably splitting it into two RFDs: one for Node vs. Rust, and one for console auth, and we could cover that in the latter. |
|
In the meantime this is a good resource that addresses almost everything in your list: https://cheatsheetseries.owasp.org/cheatsheets/Session_Management_Cheat_Sheet.html When I polish up this PR, I’ll write up how it achieves the various recommendations. |
| expected_status, | ||
| ) | ||
| .await | ||
| } |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
4ece207 to
b5b28c7
Compare
| // TODO: the 16 should be a constant somewhere, maybe even in config? | ||
| let mut random_bytes: [u8; 16] = [0; 16]; | ||
| rng.fill_bytes(&mut random_bytes); | ||
| hex::encode(random_bytes) |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
|
||
| let authn = apictx.external_authn.authn_request(&rqctx).await?; | ||
| println!("authn_request(): {:?}", authn); | ||
|
|
There was a problem hiding this comment.
@davepacheco unlike Spoof, this authn method depends on the rest of Nexus because it depends on the sessions table, so in order to test it I had to add this call to a real Nexus endpoint. Ideally I could test by adding a special test-only endpoint that I can add the auth call to and leave the real endpoints alone, but I'm not sure how to do that. Ideas?
There was a problem hiding this comment.
Answering my own question after some sleep, as it often is: the way these often work when they're pluggable middleware type things (kind of like what we have here) is to factor out the session lookup function into something that can be passed in. It would probably have to take a generic argument referring to the server context type, which in the real Nexus case would look like
async fn lookup_session(ctx: &Arc<ServerContext>, token: &str) -> Result<Session, Error> {
ctx.nexus.session_fetch(token.to_string()).await
}and in the test case it would ignore the argument and hard code in a map:
async fn lookup_session(_ctx: <something>, token: &str) -> Result<Session, Error> {
match token {
"good" => Ok(<good_token>),
"expired" => Ok(<expired_token>),
_ => Err(<not found>),
}
}I will try to figure out how to do this.
There was a problem hiding this comment.
Which part are you talking about testing? For "spoof", I separated out the guts (simple as it is) so that I could test it with a unit test without a whole HTTP server. I also have an integration test that looks at the external HTTP response for various authn results. That one should cover you here, too.
For what it's worth, while working on "spoof" I considered adding a separate "test" dropshot server with its own API that we could use for stuff like this. But I found I could do it entirely outside with a custom Dropshot server in the integration test rather than Nexus, and that seemed better.
There was a problem hiding this comment.
Sean helped me with this (change here: 54282ad )— the scheme is now generic so it can take any kind of context that implements SessionBackend, which just means it has a session lookup function. I need to make the type of Session generic and write some tests and you'll see what I mean
| -- TODO: let's stay agnostic about what this means for now, but obviously the | ||
| -- simplest interpretation is that it points to a row in the User table | ||
| user_id UUID NOT NULL |
There was a problem hiding this comment.
I think we have to store a record of users existing somehow because we need something to reference in the ACLs. I understood the conclusion of those discussions to be about how much additional info to store, besides the basic requirement of an internal UUID plus an external ID representing the user in the customer's auth system.
There was a problem hiding this comment.
I understood the conclusion from the product chat to be only that we wouldn't support POST /users and DELETE /users/:user. I think it's an important open question exactly what data is stored and how it's kept in sync with the directory, but I expect it'll be more than nothing.
| let token = match cookies.get(SESSION_COOKIE_COOKIE_NAME) { | ||
| Some(cookie) => cookie.value(), | ||
| None => return SchemeResult::NotRequested, | ||
| }; |
There was a problem hiding this comment.
If you're aiming to cut down on the control flow paths, this is a decent candidate for ok_or_else, to convert the option to a Result (and to use the ? operator more):
let token = cookies.get(SESSION_COOKIE_NAME)
.map(|c| c.value())
.ok_or_else(|| SchemeResult::NotRequested)?;Ultimately it's not too different from what you're already doing, but I've seen some minor preference for "reduce the number of explicit return calls" in Rust functions to make following control flow easier.
Thanks. That link is very helpful! |
davepacheco
left a comment
There was a problem hiding this comment.
Nice. It's starting to feel a lot more real.
| time_deleted TIMESTAMPTZ, | ||
|
|
||
| token STRING(32) PRIMARY KEY, | ||
| time_expires TIMESTAMPTZ NOT NULL, |
There was a problem hiding this comment.
nit: store "last_used" instead of "time_expires"? That's just as useful for expiring it, gives the client more flexibility in interpreting it, and is also a potentially useful fact for us to record.
I haven't gotten to where we delete records for expired sessions, but I suspect we'll want an index on whichever field we use for this so that we can quickly enumerate the oldest sessions.
| /* | ||
| * Sessions for use by web console. | ||
| */ | ||
| CREATE TABLE omicron.public.Session ( |
There was a problem hiding this comment.
Feel free to punt this to later, but a relatively easy thing I saw in the cheat sheet is binding the session to certain properties of the client, like the source IP and user agent headers. We could add those here and check them on fetch.
There was a problem hiding this comment.
That's a good idea. Your IP would change if you went from wi-fi to LTE or something, but that's rare enough that I wouldn't feel bad. Will probably make an issue or add to the tracking list rather than implementing here. (Edit: I see it's already in #341)
| /* | ||
| * Sessions for use by web console. | ||
| */ | ||
| CREATE TABLE omicron.public.Session ( |
There was a problem hiding this comment.
suggestion: Call this ConsoleSession? I suspect we're going to wind up with a lot of things called sessions (e.g., TLS session, cloud shell session, VNC session, serial console session, etc.)
Alternatively, maybe better: we could try to keep the console stuff in a separate CockroachDB schema or database (not cluster). That might be good for improved security and smaller blast radius.
There was a problem hiding this comment.
Yes, I'll change the name. I'd want to wait to see how much more stuff we have. It might only be this table and maybe some user preferences.
| DiskAttachment, | ||
| Instance, | ||
| Rack, | ||
| Session, |
There was a problem hiding this comment.
I think this is used for cases where we're reporting something like "object not found: session "foo"". For the usual security reasons, I wonder if we don't want to leak that information, and so maybe don't need this. (If we do, thoughts on "console session"?)
| config::SchemeName::Spoof => Box::new(HttpAuthnSpoof), | ||
| config::SchemeName::SessionCookie => { | ||
| Box::new(HttpAuthnSessionCookie) | ||
| // TODO: gross |
There was a problem hiding this comment.
Do you know why this is needed here but not for the the Spoof one?
There was a problem hiding this comment.
Sean helped me fix this by moving the type annotation to map: c2a9329#diff-5a2c045a788ce52d2612b2ffa2e163fe66cbee0fb72776cef9df4e9695262213R49
The reason it wasn't a problem with Spoof alone is that the error was about the arms of the match not matching. When there was only Spoof, there was only one arm.
|
|
||
| let authn = apictx.external_authn.authn_request(&rqctx).await?; | ||
| println!("authn_request(): {:?}", authn); | ||
|
|
There was a problem hiding this comment.
Which part are you talking about testing? For "spoof", I separated out the guts (simple as it is) so that I could test it with a unit test without a whole HTTP server. I also have an integration test that looks at the external HTTP response for various authn results. That one should cover you here, too.
For what it's worth, while working on "spoof" I considered adding a separate "test" dropshot server with its own API that we could use for stuff like this. But I found I could do it entirely outside with a custom Dropshot server in the integration test rather than Nexus, and that seemed better.
| .get_result_async(self.pool()) | ||
| .await | ||
| .map_err(|e| { | ||
| public_error_from_diesel_pool_create( |
There was a problem hiding this comment.
What will this return if the 'id' is in use? That seems almost certainly a server bug (500 error).
There was a problem hiding this comment.
It returns one of these. I think I'd consider it the job of the nexus wrapper or even the endpoint calling that to turn it into a 500. I just added an entry to #341 to make sure we're not leaking info about the nature of the error when we respond.
Err(ObjectAlreadyExists { type_name: ConsoleSession, object_name: "a_token" })
There was a problem hiding this comment.
I'm not sure that's right. ObjectAlreadyExists is already an api::external::Error, right? That's the type we turn directly into an HTTP response and the usual pattern is to just ? and bail early when you get one. It feels pretty error prone to have cases where consumers have to check which one they got and potentially translate it to a different variant.
There was a problem hiding this comment.
Yep, you’re right. Now I understand what @smklein meant when he was helping me with this PR and said it wasn’t ideal that we’re using these external errors basically everywhere. The DB Result types should not be using the public-facing error type. I’ll look into it but it probably goes beyond this PR. For now I’ll fix this one so it doesn’t leak info.
There was a problem hiding this comment.
There was a problem hiding this comment.
We may as well just do the better thing, as long we're going to do #347.
| if session.time_expires < Utc::now() { | ||
| return SchemeResult::Failed(Reason::BadCredentials { | ||
| actor, | ||
| source: anyhow!("expired session"), |
There was a problem hiding this comment.
I'd suggest putting the expire time in here, and maybe even the time we checked against.
| // TODO: the 16 should be a constant somewhere, maybe even in config? | ||
| let mut random_bytes: [u8; 16] = [0; 16]; | ||
| rng.fill_bytes(&mut random_bytes); | ||
| hex::encode(random_bytes) |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
| // request with expired session should 401 | ||
| let _ = get_projects_with_cookie( | ||
| &client, | ||
| Some("session=expired"), |
There was a problem hiding this comment.
What about trying with one that's actually expired (i.e., use session_create_with with a time in the past)?
There was a problem hiding this comment.
that's what I'm doing! these could be more nicely grouped visually, I did think about that
let ago_5_minutes = Utc::now() - Duration::seconds(300);
let _ =
nexus.session_create_with("expired".into(), user2, ago_5_minutes).await;There was a problem hiding this comment.
Sorry! My mistake. I saw "expired" was the session token itself and thought it was a magic value.
eeb1750 to
e269bc7
Compare
8ac90fa to
abe663f
Compare
|
This is close enough to start looking at, though I still have a couple of things left to address in this PR that I don't think belong in followups:
|
|
|
||
| fn absolute_timeout(&self) -> Duration { | ||
| Duration::hours(8) | ||
| } |
There was a problem hiding this comment.
these should probably be in the config — @davepacheco what's a nice way to plumb through the values here? I almost want to stick the entire config on server context so I can do something like
fn absolute_timeout(&self) -> Duration {
Duration::minutes(self.config.session_absolute_ttl_minutes)
}There was a problem hiding this comment.
That sounds good. I think it's reasonable to expose simple tunables like this on the Context. The thing I worry about is exposing more complicated things that ought to be better encapsulated. It would be easy for someone to mistakenly think they can figure out the server's bound address by grabbing config.dropshot_external.bind_address for example. As long as this is behind a method, that doesn't seem like a problem. We could even put this in a tunables section of the config so that and only store that in the Context.
I'd just make the name more descriptive -- absolute_timeout doesn't say anything to me about being config-related or session-related.
There was a problem hiding this comment.
Ah right. I suppose the fact that's it's a method on the SessionStore trait doesn't help when you're just doing context.absolute_timeout() or you see it in an autocomplete menu.
| // without passing TestServerContext around as mutable | ||
| struct TestServerContext { | ||
| sessions: Mutex<HashMap<String, FakeSession>>, | ||
| } |
There was a problem hiding this comment.
this was an interesting learning experience. because the logic for session expiration requires making multiple calls to the session store, I couldn't extract out a pure function of the request that's easily unit testable, like the Spoof scheme has. Instead I end up testing the full authn method with a fake context that implements SessionStore with a hashmap as the store.
|
|
||
| let authn = apictx.external_authn.authn_request(&rqctx).await?; | ||
| println!("authn_request(): {:?}", authn); | ||
|
|
There was a problem hiding this comment.
This is not going to stay in, I just need to figure out out to use a custom server for integration tests like WhoamiServer is being used to test the spoof scheme. There are some annoying parts but they're not at all insurmountable.
The main annoying thing is that in order to implement SessionStore on Arc<WhoamiServerState> (source) I have to wrap it in a newtype because I don't own SessionStore or Arc.
There was a problem hiding this comment.
Well, it turns out the server state didn't need to be wrapped in Arc anyway, so getting rid of that solves my orphan rule problem. The other problem is that the diff is gnarly as hell and the test is getting messy. So I think what I'll do is still basically this, but move it into its own file so it's not mixed up with all the spoof tests. Unfortunately there's a lot of similar boilerplate between that we'll want to think about cleaning up later.
cookie-auth...cookie-whoami done properly in bcd45d3
|
|
||
| pub mod authn; // Public only for testing | ||
| mod config; | ||
| pub mod config; // Public only for testing |
There was a problem hiding this comment.
this can go away once I stop using the real nexus for the integration tests
| let authn = omicron_nexus::authn::external::Authenticator::new( | ||
| authn_schemes_configured, | ||
| ); | ||
| Arc::new(WhoamiServerState { authn }) |
There was a problem hiding this comment.
had to make this stop being Arc so I could implement SessionStore directly on WhoamiServerState
There was a problem hiding this comment.
I'm surprised that worked but that's no problem.
I would think that if you impl'd SessionStore on WhoamiServerState, you could still use Arc<WhoamiServerState> where you need a something that impl'd SessionStore, as long as you took the value by reference and didn't need it to be mutable.
There was a problem hiding this comment.
I tried it that way and it complained that Arc<WhoamiServerState> didn't have SessionStore impl'd, but it's very possible I wasn't doing it right.
| let delete_again = datastore.session_hard_delete(token.clone()).await; | ||
| assert_eq!(delete_again, Ok(())); | ||
|
|
||
| let _ = db.cleanup().await; |
There was a problem hiding this comment.
I followed Tess's lead here adding tests directly to datastore. I find these tests pretty reassuring. Too bad they're slow to spin up, at least on Mac (see #53).
|
|
||
| -- to be used for cleaning up old tokens | ||
| CREATE INDEX ON omicron.public.ConsoleSession ( | ||
| time_created |
There was a problem hiding this comment.
Still seems like this should be time_last_used right?
There was a problem hiding this comment.
I liked @teisenbe's idea that if this was time_created, a) the index would change less often, and b) it's basically just as good for the purpose of running a cleanup query.
There was a problem hiding this comment.
I'm not sure why (b) would be true...but we can see how it goes.
There was a problem hiding this comment.
For me (b) is motivated by these assumptions:
- both TTLs are pretty short (less than a day)
- the max number of sessions per day for the foreseeable future (say 50k = 10k users x 5 sessions per day) would mean we probably don't need to run a cleanup job more than daily.
There was a problem hiding this comment.
Yeah, we're using about 100 bytes per row (not as familiar with the CockroachDB internals, so this is a ballpark), so even if we had 1M sessions per day and the absolute limit were the original 1 week, the datastore is still not particularly large. Indexing on time_last_used seems like a risk for creating database contention from all API requests that renew the session due to the roughly sequential index (cautioned against in the cockroach docs). time_created also has this problem, but it'll only be on every session creation instead of every renewal, which should reduce the risk. Though, it sounds like Cockroach has a database-side fix for this in the works, with their experimental hash-sharded indexes.
| authn_schemes_external = [] | ||
| [authn] | ||
| schemes_external = [] | ||
| session_idle_timeout_minutes = 60 |
There was a problem hiding this comment.
Thank you for putting the units in this name :)
| let authn = omicron_nexus::authn::external::Authenticator::new( | ||
| authn_schemes_configured, | ||
| ); | ||
| Arc::new(WhoamiServerState { authn }) |
There was a problem hiding this comment.
I'm surprised that worked but that's no problem.
I would think that if you impl'd SessionStore on WhoamiServerState, you could still use Arc<WhoamiServerState> where you need a something that impl'd SessionStore, as long as you took the value by reference and didn't need it to be mutable.
| match ctx.session_update_last_used(token.clone()).await { | ||
| Some(_) => {} | ||
| None => { | ||
| // couldn't renew session wtf |
There was a problem hiding this comment.
Should this be a TODO or 500 or something?
There was a problem hiding this comment.
Yep, I'll fix it. I don't think I want it to 500 because at this point the request is successfully authenticated, it's just that the next one might fail because we didn't move the active window forward.
| // request. | ||
| let updated_session = ctx.session_update_last_used(token.clone()).await; | ||
| if updated_session.is_none() { | ||
| debug!(log, "failed to extend session") |
There was a problem hiding this comment.
We'll definitely want to have a counter for this that we raise an alarm on, I'd say.
If console is going to remain client-side only, then Nexus likely needs to be able to accept cookie-based auth. What I envision is very basic and standard: on login (by whatever mechanism), Nexus generates a random session token, stores it in a table with a TTL and an associated user, and sends it down to the console in a cookie. On subsequent requests, if the cookie is present, we look it up in the database to tell whether the request is authenticated.
This PR handles the part where we auth subsequent requests. I will do a provisional login endpoint that sets the cookie in a separate PR.
A lot of the decisions here make reference to the OWASP Session Management Cheat Sheet.
Follow-up work
See #345 v1 session management cleanup
How it works
When a request comes in, we look at the cookie with the key
session(a name we choose; a generic name is recommended). If the cookie isn't there, we consider this aNotRequested(apassorcontinueornextif you think of this as a middleware).If it is there, the value of the cookie is a random token that may point to a row in the
ConsoleSessiontable. We look up the row, and if it exists, we have to decide whether the session is active or expired. There are two limits on session length: an idle TTL and an absolute TTL (see the relevant section in the OWASP guide). The absolute TTL is a maximum lifetime on the token regardless of extension. Say the idle TTL is 1 hour and the absolute TTL is 8 hours. As long as the user pokes the site and makes API requests every 59 minutes, their session will keep getting extended, up until 8 hours, at which point it will get expired no matter what and they will have to reauthenticate.Expiration and extension logic
If a session cookie is present on the request:
time_last_usedis more thanIDLE_TTLago, the session is expired, so fail.time_createdis more thanABSOLUTE_TTLago, the session is expired, so fail.last_usedtimestamp to now.Manual expiration, i.e., logout
Not implemented in this PR. There will be a logout endpoint that does this.
When the user logs out we have to expire their session. This means clearing the cookie in their browser, but more importantly marking the session expired in the database. Right now this is assumed to mean a hard delete, but I'm still thinking about that and it could easily be changed. This is also relevant to the question of what we do when we get a request with an expired token. If expiring means deleting, we should probably also delete expired sessions when requests for them come in.
Components
Session token
Relevant OWASP quotes
Session tokens are random strings that point to a row in the sessions table. OWASP says you need at least 64 bits of entropy. I went with 16 bytes = 128 bits of random data and hex encoded it, which makes the session token a 32 character string. I did this because Thomas Ptacek said it's "hard to beat."
Cookie
The cookie name is
session. We will setSecureandHttpOnlyattributes on it (probably a separate PR with provisional login endpoints). TTL is TBD.