Skip to content

aws_sigv4: URI encoding#7600

Closed
outscale-mgo wants to merge 7 commits intocurl:masterfrom
outscale:http_aws_sigv4_encoding
Closed

aws_sigv4: URI encoding#7600
outscale-mgo wants to merge 7 commits intocurl:masterfrom
outscale:http_aws_sigv4_encoding

Conversation

@outscale-mgo
Copy link
Contributor

@outscale-mgo outscale-mgo commented Aug 20, 2021

It seems AWS signature V4 was lacking part of the algorith, AWSv4 should in theory URL Encode, the data before using it in the algorytme.

problem is that depending of whatever V4 signature is use for s3 or another API, the URL encoding is not the same.
https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html
https://docs.aws.amazon.com/general/latest/gr/sigv4-create-canonical-request.html

In one case the data is single encoded, and in the other case, it is double encode.

So the patch add a new option to curl, that enable the user to choose the number of encoding you want. (default is 0)

Here is the changes of this PR:

  • add --aws-sigv4-nb-encoding option, default to 0
  • add some static function in https_aws_sigv4 file .
  • uri_encore data->state.up.query and data->state.up.path , if the option is set.

some notes:

  1. I didn't test, nor implement double encoding yet.
  2. I couldn't use strlen_url in urlapi.c as the algorytm use for the encoding is slightly different
  3. I'm not sure hostname trim is useful, maybe curl already do it when storing it in hostname ?
  4. AWS documentation say nothing explicitly about = encoding, but say that each value and key of the query need to be encoded, while the = use to slipt the key from the value need to be left as it. as data->state.up.query contain all queries parameters as a single string, I need to determine which '=' are part of a value and which one a "true" equal

should fix: #7559 #9717

@drTr0jan
Copy link

drTr0jan commented Sep 2, 2021

Hi @outscale-mgo,

thx for awesome patch! Good job!

But there're some questions about it. How do you think about PR to you rep?

About hostname... It's not need to trim the hostname. Curl uses trimmed hostname variable. AWS recommends to trim hostname while using "Host:" header, but Curl generate 'Host:' header after using hostname variable.

@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch from 176d74f to 9318d88 Compare September 2, 2021 11:52
@outscale-mgo
Copy link
Contributor Author

outscale-mgo commented Sep 2, 2021

Hi @outscale-mgo,

thx for awesome patch! Good job!

Hi @drTr0jan
Thanks for the feedback !

But there're some questions about it. How do you think about PR to you rep?

As I want to upstream my change I would prefer a review here over a PR on my rep, though I wouldn't say no to a PR that improve some of my patchs.

Also I will archive my repo once my modifications are upstream (and unarchive it if I need it again).

About hostname... It's not need to trim the hostname. Curl uses trimmed hostname variable. AWS recommends to trim hostname while using "Host:" header, but Curl generate 'Host:' header after using hostname variable.

Ok, I've just force push a new version that don't trim, thanks for the feedback again !

Comment on lines 134 to 138
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not need to parse a space character separately, Default case parses a space correctly.
Why do you parse it?

Comment on lines 128 to 131
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use the clause?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's useful to verify multibyte characters also. For example: %TESTNUMBER/testapi/test+=+тест+++测试+
What do you think about it?

@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch 2 times, most recently from da99393 to 1e812fa Compare September 2, 2021 14:29
Copy link

@drTr0jan drTr0jan Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's useful to verify 2-byte (not only 3-byte UTF-8 symbols) characters also. For example: GET /%TESTNUMBER/testapi/test===カールロックスкурл рокс!!!=== HTTP/1.1
What do you think about it?

Copy link
Contributor Author

@outscale-mgo outscale-mgo Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure why not,
Also I've try to write "curl rocks" in katakana (japanese), which seems to give "kaaru rokusu". but can also be translate as karl rocks also. and I guess "КарлРокс" mean karl rocks in cyrilics isn't it ?
Also google translate seems tu tell me that "керл Рокс" would be a better translation, is that right ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let it be "курл рокс" )

Comment on lines 84 to 85
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strlen_uri() returns the length of the given URI this function is a modified version of strlen_url in urlapi.c

Comment on lines 112 to 114
Copy link

@drTr0jan drTr0jan Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strcpy_uri() copies a url to an output buffer and URL-encodes the URI accordingly this function is a modified version of strcpy_url in urlapi.c

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's "a URL". The leading U in URL is pronounced as "you" with a starting consonant sound, and thus "a URL" is correct.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bagder, okay, thanks.

Comment on lines 191 to 193
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about renaming vars to: canonical_uri and canonical_query_str ?

@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch 2 times, most recently from f1b8c64 to cbd8841 Compare September 6, 2021 09:52
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not possible to read these functions now without thinking: why does this file need to almost duplicate these functions?

How do they differ and why isn't tweaking the ones in urlapi.c better?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@outscale-mgo, I think, good idea is taking if(in_query) and default clauses out from case-statement.
case '&' clause should be executing always only in if(in_query) for "without thinking".

Copy link
Contributor Author

@outscale-mgo outscale-mgo Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like:

  for(;
      *iptr;         /* until zero byte */
      iptr++) {

    if(in_query) {
      if(*iptr == '=') {
        if(!count_equal++) {
          msnprintf(optr, 4, "%%%02X", *iptr);
          optr += 3;
        } else
          *optr++=*iptr;
        continue;
      } else if (*iptr == '&')
        count_equal = 0;
    }

    switch(*iptr) {
    case '%':
      *optr++ = *iptr;
      if(iptr[1]) {
        iptr++;
        *optr++ = (char)TOUPPER(*iptr);
        if(*iptr != '%' && iptr[1]) {
          iptr++;
          *optr++ = (char)TOUPPER(*iptr);
        }
      }
      break;
    default:
      if(urlchar_needs_escaping(*iptr, in_query)) {
        msnprintf(optr, 4, "%%%02X", *iptr);
        optr += 3;
      }
      else
        *optr++=*iptr;
      break;
    }
  }

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why it is curl's job to encode the equals signs like this. The user has provided a URL to use. Why would the user provide a wrong one for curl to re-encode like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.aws.amazon.com/AmazonS3/latest/API/images/sigV4-using-query-params.png
In this document AWS say that you need to encode every parameter and arguments of the query string. problem is that curl seems to contain the whole query string as a single, so I had to find a way to determine which part need to be encode, and which part of need to be left as it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bagder, AWS doesn't recommend using standard UriEncode functions provided by development platform may not work because of differences in implementation and related ambiguity in the underlying RFCs.

Copy link
Contributor Author

@outscale-mgo outscale-mgo Sep 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 small details:
the equal is for example, if an user call something like https://api.region.provider/Call&Argument====&Argument2=2= in which case, if as I've understand correctly, all equal but the first after Argument and Argument2 must be encode. (but this is a corner case, which I'm not sure is even possible)

Also I could patch the functions in url API, but it would add a lot of if/else which might be ugly (for examples curl use lovercase for hexa and AWS require upper case), also as @drTr0jan said, AWS advise to reimplement the encoding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's borderline crazy since a URL is already URL encoded as otherwise it's not a URL. Does the spec also clearly say what bytes it needs to re-encode?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, doesn't that also imply that you should first URL decode the query string so that you get it truly canonicalized? Otherwise existing URL-encoded characters like %41 might thwart this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I think about it, I do think the AWS doc is about not yet encoded URL, (so not URL).
also the issue is about using "https://${ES_DOMAIN_ENDPOIN}/tempo_data-*/event/_mapping" instead of "https://${ES_DOMAIN_ENDPOIN}/tempo_data-%2A/event/_mapping".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though it might still be useful to add an option to force encoding ? (in which case we won't have to deal with already encoded URL)

Copy link
Member

@bagder bagder Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue here is that they want to make the url canonical, and to do that you need to first URL decode and then URL encode because curl always operate on a URL as input and that means it is encoded. So a * could be passed in as a * or as %2a or as %2A. They're all fine according to the URL syntax. But they will then genrate different hashes and that's not what a user wants.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't "copy a URI" it (re-)encodes a query string according to the aws sig v4 rules.

How about instead call it encode_query() ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function returns the new length of a query string, it's not really a "uri". strlen_encoded_querry() maybe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, doesn't that also imply that you should first URL decode the query string so that you get it truly canonicalized? Otherwise existing URL-encoded characters like %41 might thwart this.

@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch 2 times, most recently from 9d34c53 to 28c0865 Compare September 14, 2021 14:33

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you compare *iptr with '%'?

Copy link
Contributor Author

@outscale-mgo outscale-mgo Sep 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if you have an encoded URL, containing '%2a' for example I can use '%2A' in the canonical header. I detect '%' so I can decode -> encode URL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about for example %41 present in the original URL. Shouldn't that rather get decoded to A in a canonical URL?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@outscale-mgo, but you've detected '%' while case '%' yet, haven't you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about for example %41 present in the original URL. Shouldn't that rather get decoded to A in a canonical URL?

ok I'll do that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


@outscale-mgo, but you've detected '%' while case '%' yet, haven't you?

yes, because %% is deferent from %01 which represent an ASCII character, anyway I'm going to change the algo to make a full decode -> encode.

@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch 2 times, most recently from f511c01 to 7025344 Compare September 27, 2021 15:17
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do need this code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question, I've take inspiration from what is done in urlapi.c, where TOLOWER is use there.
TBH I don't know what's the advantage over using toupper, I've just assume it was there for a reason, maybe it was wrong, @bagder ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to use Curl_raw_toupper() and not toupper() to avoid it being affected by locales.

Comment on lines 94 to 111
Copy link

@drTr0jan drTr0jan Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better understanding it can be like as:

for(ptr = (unsigned char *)url; *ptr; ptr++) {
  if (in_query) {
    switch (*ptr) {
      case '=':
        if(count_equal++)
          newlen += 2;
        newlen++;
        continue;
      case '&':
        count_equal = 0;
    }
  }
  if(urlchar_needs_escaping(*ptr, in_query))
    newlen += 2;
  newlen++;
 }

Copy link
Contributor Author

@outscale-mgo outscale-mgo Oct 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case, if you go into the case '=': it will bug, because, newlen will be increment twice or 4 times.
so the code should be:

for(ptr = (unsigned char *)url; *ptr; ptr++) {
  if (in_query) {
    switch (*ptr) {
      case '=':
        if(count_equal++)
          newlen += 2;
        newlen++;
        continue; // replace break by continue here.
      case '&':
        count_equal = 0;
    }
  }
  if(urlchar_needs_escaping(*ptr, in_query))
    newlen += 2;
  newlen++;
 }

TBH I don't think it simplify the code a lot, so I don't think I will change that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, continue should be there. Sorry for my typo.

@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch from 7025344 to 0f55c65 Compare November 5, 2021 10:41

ascii_code = strtol(num_str, NULL, 16);
if(ascii_code) {
*optr++ = (char)ascii_code;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CWE 197 advisory: potential loss of information in cast
(at-me in a reply with help or ignore)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hello an idea for that ?
would an

if (ascii_code > 255)
    handle_errror()

be enough ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

char is often signed, so that typecast is not limiting the value to 0 - 255. And since you want an unsigned value, maybe strtoul() is actually what you want? The fact that char is also not certain to be signed or unsigned is also a reason to use it unsigned explicitly.

The code in escape.c does this:

      hex = strtoul(hexstr, &ptr, 16);

      in = curlx_ultouc(hex); /* this long is never bigger than 255 anyway */

@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch from 0f55c65 to 13b5c43 Compare November 8, 2021 17:02
@lgtm-com
Copy link

lgtm-com bot commented Nov 8, 2021

This pull request introduces 1 alert when merging 13b5c43557ea6404b10c82260afd43c436d92b3e into 351b181 - view on LGTM.com

new alerts:

  • 1 for Implicit function declaration

@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch 2 times, most recently from 08180ac to 8591bdf Compare November 16, 2021 15:14
@outscale-mgo outscale-mgo reopened this Jan 31, 2022
@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch 2 times, most recently from ee2e49b to fe3edad Compare January 31, 2022 15:18
@outscale-mgo
Copy link
Contributor Author

Ok, I'll try to do that.

So I've try to use only urlapi to manipulate the query string, it seems urlapi doesn't decode/encode the query yet. (but doesn't seems too hard to change either)
So if I had to put all url handling code in urlapi.c/h here's how I would do it: add 3 flags to urlapi set/get

#define CURLU_ENCODE_UPPERCASE (1<<12)  /* Encode URL using uppercase hex */
#define CURLU_ENCODE_DECODE_QUERY (1<<13)        /* enable the encoding/decoding of the query string */
#define CURLU_SORT_QUERY (1<<14)        /* Sort Query on set */

modify this line

#define CURLU_URLDECODE (1<<6)          /* URL decode on get */

to

#define CURLU_URLDECODE (1<<6)          /* URL decode on get, or on set on query if CURLU_ENCODE_DECODE_QUERY is set */

and then patch get/set methode so it handle sort, and encode/decode of query.
then in sigv4 code, In would use curl_url_set with CURLU_ENCODE_DECODE_QUERY | CURLU_SORT_QUERY | CURLU_URLDECODE to decode the query, so if I have some lowercased encoded url, I will decode it, and sort the query at the same time.
and then use curl_url_get with CURLU_ENCODE_DECODE_QUERY | CURLU_URLENCODE in order to decode query in a format that will always suit the signature process.
Is this okay for you ?

My bad it seems urlapi does handle query.

@bagder
Copy link
Member

bagder commented Jan 31, 2022

#define CURLU_ENCODE_UPPERCASE (1<<12) /* Encode URL using uppercase hex */

This one I think can probably be fine to provide publicly.

#define CURLU_ENCODE_DECODE_QUERY (1<<13) /* enable the encoding/decoding of the query string */

Why not use the decode flag when asking for or setting the query?

#define CURLU_SORT_QUERY (1<<14) /* Sort Query on set */

This doesn't feel like material for the public API. It's just too weird and specific.

#define CURLU_URLDECODE (1<<6) /* URL decode on get, or on set on query if CURLU_ENCODE_DECODE_QUERY is set */

I don't understand this change.

@outscale-mgo
Copy link
Contributor Author

outscale-mgo commented Mar 8, 2022

Hello, some update on this PR:

I did try to use urlapi but fail to do so (my test is here: https://github.com/outscale-dev/curl/commits/http_aws_sigv4_encoding_urlapi)

The major problem I encounter is that aws url encoding, encode characters as '/', '=' and some other reserved characters.
So in order to use urlapi I would have to recode partially strlen/cpy_url only for aws-sigv4 code, so I guess it's still better to keep the code in http_aws_sigv4.c.

Also, it seems that depending of the API use (see here: https://docs.aws.amazon.com/general/latest/gr/sigv4-create-canonical-request.html and here: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html) the number of encoding that must be done is not the same. (double encoding on equal on some parameter and 1 on some other)

So I've add an option to specify the number of encoding that should be done, and with 0 encoding, none is done, but can still be done manually, which I'd like to keep, so if there is a bug in the encoding, curl user will still be able to do it themself.

Note that presently, only 0 and 1 are supported,.

I still need to write test, and improve the man, but as I didn't give feedback since some time, I've push my current work on this branch.

Edit: the docs and tests should be done now

@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch 5 times, most recently from f948538 to 55be259 Compare March 11, 2022 14:46
@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch from 55be259 to f630d83 Compare March 21, 2022 09:31
Following that: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html

note that the documentation say nothing explicitely about '=' encoding,
but say that each value and key of the query need to be encoded while
the '=' use to slipt the key from the value need to be left as it.
as data->state.up.query contain all queries parameters as a single string,
I need to determine which '=' are part of a value and which one a "true" equal

should fix: curl#7559

Signed-off-by: Matthias Gatto <matthias.gatto@outscale.com>
Signed-off-by: Matthias Gatto <matthias.gatto@outscale.com>
Signed-off-by: Matthias Gatto <matthias.gatto@outscale.com>
Signed-off-by: Matthias Gatto <matthias.gatto@outscale.com>
Signed-off-by: Matthias Gatto <matthias.gatto@outscale.com>
add new option aws-sigv4-nb-encoding which is here
to tell curl to encode url in aws v4 signature

Signed-off-by: Matthias Gatto <matthias.gatto@outscale.com>
Signed-off-by: Matthias Gatto <matthias.gatto@outscale.com>
@outscale-mgo outscale-mgo force-pushed the http_aws_sigv4_encoding branch from f630d83 to 0933688 Compare March 21, 2022 09:35
@bagder bagder changed the title aws signature V4 URI encoding aws_sigv4: URI encoding Aug 29, 2022
@bagder
Copy link
Member

bagder commented Aug 29, 2022

Is this still work in progress? It has merge conflicts and I would like to review the code. Please also update the description of this PR so that it is very clear what the purpose of it is.

@outscale-mgo
Copy link
Contributor Author

outscale-mgo commented Aug 30, 2022

Is this still work in progress? It has merge conflicts and I would like to review the code. Please also update the description of this PR so that it is very clear what the purpose of it is.

hello, I've try to update the PR description, it it's purpose is more clear.
It's still work in progress but, "on hold", as it's very easy to work around the encoding problem by encoding the data manually.
I'll came back on this PR, once the header refactoring is merge.

I can rebase it in the meantime, if you want, but I would prefer to take more time, so I can add the double encoding support.

@bagder
Copy link
Member

bagder commented Aug 30, 2022

We can wait, sure!

@bagder
Copy link
Member

bagder commented Jul 27, 2023

This PR just died. Feel free to submit a new/reopen if work resumes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

--aws-sigv4 doesn't sign requests with * correctly

4 participants