Skip to content

Allow environment variables in connection configs#943

Merged
zonotope merged 19 commits intomainfrom
feature/config-overrides
Dec 16, 2024
Merged

Allow environment variables in connection configs#943
zonotope merged 19 commits intomainfrom
feature/config-overrides

Conversation

@zonotope
Copy link
Contributor

This patch allows a user to specify environment variables in json-ld connection configs. The config system will then load the actual config values from the specified environment variable. Users specify the environment variable by setting the "@type" of a value map to "https://ns.flur.ee/system#envVar", as in the following example

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "@value": "FLUREE_CONNECTION_PARALLELISM",
        "@type": "envVar"
      },
      "cacheMaxMb": {
        "@value": "FLUREE_CACHE_MAX_MB",
        "@type": "envVar"
      }
    }
  ]
}

@zonotope zonotope requested a review from a team November 28, 2024 06:50
@zonotope zonotope self-assigned this Nov 28, 2024
#?(:clj (or (System/getenv v)
(throw (ex-info (str "Environment variable " v " unset.")
{:status 400 :error :db/missing-env-var})))
:cljs nil) ; TODO: Support environment variable overrides in cljs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should throw in this case rather than emit nil just to be on the safe side.

Copy link
Contributor

@dpetran dpetran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! When this issue came up I figured it would be really tough to specify env vars for a configuration graph, but the @type "envVar" approach is really elegant.

♻️

@zonotope
Copy link
Contributor Author

zonotope commented Dec 2, 2024

Nice work! When this issue came up I figured it would be really tough to specify env vars for a configuration graph, but the @type "envVar" approach is really elegant.

♻️

yeah, implementing this the way i was trying to do it at first was really tough. the old solution was far less flexible both for the user and the implementation, and much less powerful. thankfully, something higher priority came up (vector search) so i laid this down for a bit to work on that. that gave me some unintentional hammock time on this problem that allowed the back of my brain to marinate on it for a while. when the priority for this bumped back up this idea came to me pretty quickly after i started working on it again. i think it's better than my previous solution in every way.

@bplatz
Copy link
Contributor

bplatz commented Dec 2, 2024

Suggestion on syntax.

I think instead of these holding a custom data type, they should hold instead a node... this IMO aligns much better with RDF

So the node would simply look like:

{
        "envVar": "FLUREE_CACHE_MAX_MB"
}

Which you can image, actually looks more like below once expanded/inferred... but that step doesn't need to happen... just better illustrates what the node would really look like 'virtually':

{
        "@id": "_b:irrelevant"
        "@type": "EnvironmentVariable"
        "envVar": "FLUREE_CACHE_MAX_MB"
        "envVal": "" // string "value" would end up here
}

So to rewrite your example above, it would instead look like below. Where the value for e.g. cacheMaxMb can be an explicit scalar, or it could be a node that specifies the envVar used.

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "envVar": "FLUREE_CONNECTION_PARALLELISM",
      },
      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB"
      }
    }
  ]
}

@zonotope
Copy link
Contributor Author

zonotope commented Dec 2, 2024

🤔

Suggestion on syntax.

I think instead of these holding a custom data type, they should hold instead a node... this IMO aligns much better with RDF

So the node would simply look like:

{
        "envVar": "FLUREE_CACHE_MAX_MB"
}

Which you can image, actually looks more like below once expanded/inferred... but that step doesn't need to happen... just better illustrates what the node would really look like 'virtually':

{
        "@id": "_b:irrelevant"
        "@type": "EnvironmentVariable"
        "envVar": "FLUREE_CACHE_MAX_MB"
        "envVal": "" // string "value" would end up here
}

So to rewrite your example above, it would instead look like below. Where the value for e.g. cacheMaxMb can be an explicit scalar, or it could be a node that specifies the envVar used.

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "envVar": "FLUREE_CONNECTION_PARALLELISM",
      },
      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB"
      }
    }
  ]
}

@kurtharriger
Copy link
Contributor

kurtharriger commented Dec 3, 2024

I'm curious if you could leverage integrant to do the resolution
I noticed that for other systems you use type to specify a hierarchy for integrant keys so maybe it woudl be more consistent to do something like this?

(defn derive-node-id
  [node]
  (let [id (get-id node)]
    (cond
      (env-var-datatype? node)  (derive id :fluree.db/envVar) 
      (connection? node)          (derive id :fluree.db/connection)
      (system? node)              (derive id :fluree.db/remote-system)
      (memory-storage? node)      (derive id :fluree.db.storage/memory)
      (file-storage? node)        (derive id :fluree.db.storage/file)
      (s3-storage? node)          (derive id :fluree.db.storage/s3)
      (ipfs-storage? node)        (derive id :fluree.db.storage/ipfs)
      (ipns-nameservice? node)    (derive id :fluree.db.nameservice/ipns)
      (storage-nameservice? node) (derive id :fluree.db.nameservice/storage))
    node))

(defmethod ig/init-key :fluree.db/envVar  [var]
  (System/getenv (get (keyword "@id") var)) ;; or however that works 

@kurtharriger
Copy link
Contributor

Any thoughts about ability to specifying default values?
it might be nice to say have storage path be an env var but also have default if not specified.

@kurtharriger
Copy link
Contributor

Also curious about supporting java properties in addition to env vars?

The reason is that I can use System/setProperty in the repl and define values in deps.edn via jvmOpts in the profile. Env vars are readonly and I'm not sure of an easy way to set env vars when starting the repl form vscode. The best workaround I can think of to do this is sourcing an env file in bash and starting repl using command line and then attaching to vscode. Ability to use properties would be more convenient in repl and easier to write unit tests, if env and property resolution code paths are very similar than unit tests for properties could cover most of the env resolution as well.

@kurtharriger
Copy link
Contributor

Also perhaps not directly related to this PR, but I'm trying to understand how profiles work
From what I could see they weren't eg start-file takes profile but its ignored. I did see there is a config/finalize that maybe perhaps was intended can call on config after reading and then perhaps call start-config? I'm mot sure if that worked or not but if so I think they symantics may have changes as it looks like start-config now expects a raw string rather than parsed map

@kurtharriger
Copy link
Contributor

I also find the parsed integrant config map somewhat difficult to read with the angle bracket substitutions ...
What is the motivation behind these replacements? Could we just use string keys instead? although I found I can type :https://flur.ee as keyword just fine so at least some of these replacements seem unnecessary. I think the symbol I had most trouble with is :@id as clojure doesn't allow the :@ but I've been able to use (keyword "@id") so it seems special characters can still be keywords just maybe not keyword literals.

@zonotope
Copy link
Contributor Author

zonotope commented Dec 3, 2024

I'm curious if you could leverage integrant to do the resolution I noticed that for other systems you use type to specify a hierarchy for integrant keys so maybe it woudl be more consistent to do something like this?

(defn derive-node-id
  [node]
  (let [id (get-id node)]
    (cond
      (env-var-datatype? node)  (derive id :fluree.db/envVar) 
      (connection? node)          (derive id :fluree.db/connection)
      (system? node)              (derive id :fluree.db/remote-system)
      (memory-storage? node)      (derive id :fluree.db.storage/memory)
      (file-storage? node)        (derive id :fluree.db.storage/file)
      (s3-storage? node)          (derive id :fluree.db.storage/s3)
      (ipfs-storage? node)        (derive id :fluree.db.storage/ipfs)
      (ipns-nameservice? node)    (derive id :fluree.db.nameservice/ipns)
      (storage-nameservice? node) (derive id :fluree.db.nameservice/storage))
    node))

(defmethod ig/init-key :fluree.db/envVar  [var]
  (System/getenv (get (keyword "@id") var)) ;; or however that works 

Integrant would only work for the resolution if the configuration value being initialized/loaded was a subject node with an explicit @id. Integrant loads a configuration specified as a map with keyword keys, and those keys serve as the unique identifiers to the configs held in the corresponding value. As this is implemented now, the environment variable is an @value object which doesn't have an id, so we couldn't use integrant to load the variable.

Brian is proposing that we use an implicit subject node to represent environment variables instead of an @value object. We'd be able to implicitly add a blank node id and use integrant to load the variable in that case, but I'm not totally convinced that it would be worth it at that point, but it might be.

@zonotope
Copy link
Contributor Author

zonotope commented Dec 3, 2024

Any thoughts about ability to specifying default values? it might be nice to say have storage path be an env var but also have default if not specified.

Yes, that would be nice. I can't immediately think of a nice way to represent this using the @value object approach for representing these environment variables though.

I think it's another point for Brian's suggestion. We could do something like this if the env vars are represented as implicit subjects instead:

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "envVar": "FLUREE_CONNECTION_PARALLELISM",
        "defaultVal": 32
      },
      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB",
        "defaultVal": 512
      }
    }
  ]
}

@zonotope
Copy link
Contributor Author

zonotope commented Dec 3, 2024

Also curious about supporting java properties in addition to env vars?

The reason is that I can use System/setProperty in the repl and define values in deps.edn via jvmOpts in the profile. Env vars are readonly and I'm not sure of an easy way to set env vars when starting the repl form vscode. The best workaround I can think of to do this is sourcing an env file in bash and starting repl using command line and then attaching to vscode. Ability to use properties would be more convenient in repl and easier to write unit tests, if env and property resolution code paths are very similar than unit tests for properties could cover most of the env resolution as well.

For this particular use case, I think it would be best to just use a different "development" config file with explicit values that don't use environment variables or system properties at all. The config itself is just data, so you'd be able to change the explicit scalars in the config maps before passing it off to starting the system. That will always be simpler than relying on environment variables or system properties.

In general though, I think we should eventually support Java properties because it's another way people will try to manage system configuration in production. We could and should support Java properties the same way we support env vars because there are lots of parallels between the two. I think either an @value object or subject node approach could work.

Another option could be a modified subject node approach similar to Brian's suggestion, but we would think of the implied/inferred type of the subject node to be ConfigurationValue instead of EnvironmentVariable and set up a priority between configuration methods (say env var takes precedence over java property, and then use the default if neither is set). Then we could do this:

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "envVar": "FLUREE_CONNECTION_PARALLELISM",
        "javaProp": "connectionParallelism",
        "defaultVal": 32
      },
      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB",
        "javaProp": "cachMaxMb",
        "defaultVal": 512
      }
    }
  ]
}

@zonotope
Copy link
Contributor Author

zonotope commented Dec 3, 2024

Also perhaps not directly related to this PR, but I'm trying to understand how profiles work From what I could see they weren't eg start-file takes profile but its ignored. I did see there is a config/finalize that maybe perhaps was intended can call on config after reading and then perhaps call start-config? I'm mot sure if that worked or not but if so I think they symantics may have changes as it looks like start-config now expects a raw string rather than parsed map

profiles used to work in the old configuration system, but they're as yet unimplemented here in the new JSON-LD based configuration. That's why you see profile arguments scattered throughout the code unused.

They are meant to allow users to use the same config in different environments. Say in development, staging, and production. You would define a base config, and then you'd add a profiles map where the key was the environment and value was overrides that would be merged on top of the base config when you started in that environment. Then you would do (start-resource "config.json" :staging) or whatever to apply that profile.

When it's (re)implemented, a config map with profiles would look something like this:

{
  "@context": {
    "@base": "https://ns.flur.ee/config/main/",
    "@vocab": "https://ns.flur.ee/system#",
    "profiles": {
      "@container": [ "@graph", "@index" ]
    }
  },
  "@id": "standaloneServer",
  "@graph": [
    {
      "@id": "localDiskStorage",
      "@type": "Storage",
      "filePath": "/opt/fluree-server/data"
    },
    {
      "@id": "connection",
      "@type": "Connection",
      "parallelism": 4,
      "cacheMaxMb": 1000,
      "commitStorage": {
        "@id": "localDiskStorage"
      },
      "indexStorage": {
        "@id": "localDiskStorage"
      },
      "primaryPublisher": {
        "@type": "Publisher",
        "storage": {
          "@id": "localDiskStorage"
        }
      }
    },
    {
      "@id": "consensus",
      "@type": "Consensus",
      "consensusProtocol": "standalone",
      "maxPendingTxns": 512,
      "connection": {
        "@id": "connection"
      }
    },
    {
      "@id": "http",
      "@type": "API",
      "httpPort": 8090,
      "maxTxnWaitMs": 120000
    }
  ],
  "profiles": {
    "dev": [
      {
        "@id": "localDiskStorage",
        "filePath": "dev/data"
      },
      {
        "@id": "connection",
        "cacheMaxMb": 200
      },
      {
        "@id": "consensus",
        "maxPendingTxns": 16
      }
    ],
    "staging": [
      {
        "@id": "connection",
        "cacheMaxMb": 512
      },
      {
        "@id": "consensus",
        "maxPendingTxns": 64
      }
    ]
  }
}

@zonotope
Copy link
Contributor Author

zonotope commented Dec 3, 2024

I also find the parsed integrant config map somewhat difficult to read with the angle bracket substitutions ... What is the motivation behind these replacements?

Could we just use string keys instead?

No. Integrant initializes components identified by keyword keys specifically because it uses Clojure's built in keyword hierarchy along with the ig/init-key multimethod to know which method to execute when initializing a component. Note the restriction on Clojure hierarchies from the page I linked:

You can define hierarchical relationships with (derive child parent). Child and parent can be either symbols or keywords, and must be namespace-qualified

This is what the derive-node-id function is doing: it detects what kind of configuration a node is based on both the type and which properties exist, and take the id keyword and derives it in terms of the representative keyword for that kind of node. It has to be either a keyword or a symbol for that to work.

although I found I can type :https://flur.ee as keyword just fine so at least some of these replacements seem unnecessary. I think the symbol I had most trouble with is :@id as clojure doesn't allow the :@ but I've been able to use (keyword "@id") so it seems special characters can still be keywords just maybe not keyword literals.

Here are the official rules for what can appear in a Clojure symbol. The rules for keywords are the same as those for symbols, with some additional rules around colons. The reader itself is much more permissive in what you can get away with (now) because there is no validation when creating keywords/symbols, but we shouldn't rely on keywords that explicitly break the documented rules because that could cause subtle bugs later on. There is also the possibility that the reader could be made more stringent in the future and validate it's input, which would mean that our code would break if we relied on undocumented behavior. Furthermore, slashes, colons, and dots have special meaning in a keyword which are not intended in these config symbols, which is why I also decided to escape those as well.

I opted for this bit of complexity to allow us to use Integrant for initialization. It's a feature rich library for initializing and halting complex component hierarchies, which is exactly what we need here. There needed to be some massaging to transform the JSON-LD config format into an integrant config map in exchange for being able to use all the features integrant provides. The alternative would have been to reimplement what's in integrant ourselves. Also, the compiled integrant config map is an internal representation that users (or even developers) should not have to interact with in practice after everything settles down.

@bplatz
Copy link
Contributor

bplatz commented Dec 4, 2024

I like the below, because it allows for something that has bothered me:

      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB",
        "defaultVal": 512
      }

We've had issues where people have a misspelling in env var, and we happily assign a default which in some circumstances could be a significant issue.

I like a default being explicit, and if not default and no envVar, I think we should throw/exit - as it probably is an environment variable not coming through properly.

@dpetran
Copy link
Contributor

dpetran commented Dec 4, 2024

Oh I really like that @bplatz, sensible or at least explicit defaults can make a big difference for usability.

@zonotope
Copy link
Contributor Author

zonotope commented Dec 16, 2024

@fluree/core I've re-worked this based on the discussion above. Instead of using a value node to encode environment variables, I went with an approach inspired by @bplatz's suggestion for using subject nodes with specific properties and, informed by @kurtharriger's questions about default values and java system properties, I've included properties so users can reference environment variables, java system properties, and default values in the same node.

I've set up a hierarchy for these. First, we check the environment variable if that property is specified, then, if the environment variable isn't set or the env var property isn't specified, we check if the java system property property is specified. If that property isn't set ore the java property property isn't specified, we fall back to the default value. if that isn't set, then we throw an exception.

Here is an example json-ld config using this new scheme:

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "envVar": "FLUREE_CONNECTION_PARALLELISM",
        "javaProp": "connectionParallelism",
        "defaultVal": 32
      },
      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB",
        "javaProp": "cachMaxMb",
        "defaultVal": 512
      }
    }
  ]
}

EDIT:
Users can specify environment variables in place of any scalar value, and users can of course still use scalar values as before and everything will still work as normal.

@kurtharriger
Copy link
Contributor

kurtharriger commented Dec 16, 2024

I would suggest that if both env and java system property is specified the system property should have priority. For 2 reasons:
Environment vars tend to have wider scope than system properties which are specified on java commandline. The more specific the location the higher the preference.
Also env vars cannot be changed at runtime, java properties can. Giving system properties higher preference allows tests to override the value regardless of any env vars present when running tests, if env vars have priority the tests become dependent on the runtime environment and cannot easily override the env vars

@zonotope
Copy link
Contributor Author

I would suggest that if both env and java system property is specified the system property should have priority. For 2 reasons: Environment vars tend to have wider scope than system properties which are specified on java commandline. The more specific the location the higher the preference. Also env vars cannot be changed at runtime, java properties can. Giving system properties higher preference allows tests to override the value regardless of any env vars present when running tests, if env vars have priority the tests become dependent on the runtime environment and cannot easily override the env vars

I think you bring up a good point about the scope arguments, so I'll switch the priority of system properties and environment variables.

As far as the tests go however, we would again just use a different config that wouldn't reference any environment variables or system properties for tests and development. Env vars and java properties really only make sense for production and production like deployments where some configuration values may be sensitive so they should be stored separately from the configuration and executable of the system and injected outside of band at startup. I don't think any of our credentials should be sensitive for development or test environments, so we should use specific configs that contain all the information the development or test systems need to run.

Copy link
Contributor

@kurtharriger kurtharriger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zonotope zonotope merged commit 38a7ab4 into main Dec 16, 2024
@zonotope zonotope deleted the feature/config-overrides branch December 16, 2024 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants