Allow environment variables in connection configs by zonotope · Pull Request #943 · fluree/db

zonotope · 2024-11-28T06:50:24Z

This patch allows a user to specify environment variables in json-ld connection configs. The config system will then load the actual config values from the specified environment variable. Users specify the environment variable by setting the "@type" of a value map to "https://ns.flur.ee/system#envVar", as in the following example

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "@value": "FLUREE_CONNECTION_PARALLELISM",
        "@type": "envVar"
      },
      "cacheMaxMb": {
        "@value": "FLUREE_CACHE_MAX_MB",
        "@type": "envVar"
      }
    }
  ]
}

dpetran · 2024-12-02T16:21:31Z

src/clj/fluree/db/connection/config.cljc

+      #?(:clj (or (System/getenv v)
+                  (throw (ex-info (str "Environment variable " v " unset.")
+                                  {:status 400 :error :db/missing-env-var})))
+         :cljs nil) ; TODO: Support environment variable overrides in cljs


I think we should throw in this case rather than emit nil just to be on the safe side.

dpetran

Nice work! When this issue came up I figured it would be really tough to specify env vars for a configuration graph, but the @type "envVar" approach is really elegant.

♻️

zonotope · 2024-12-02T19:16:51Z

Nice work! When this issue came up I figured it would be really tough to specify env vars for a configuration graph, but the @type "envVar" approach is really elegant.

♻️

yeah, implementing this the way i was trying to do it at first was really tough. the old solution was far less flexible both for the user and the implementation, and much less powerful. thankfully, something higher priority came up (vector search) so i laid this down for a bit to work on that. that gave me some unintentional hammock time on this problem that allowed the back of my brain to marinate on it for a while. when the priority for this bumped back up this idea came to me pretty quickly after i started working on it again. i think it's better than my previous solution in every way.

bplatz · 2024-12-02T19:32:43Z

Suggestion on syntax.

I think instead of these holding a custom data type, they should hold instead a node... this IMO aligns much better with RDF

So the node would simply look like:

{
        "envVar": "FLUREE_CACHE_MAX_MB"
}

Which you can image, actually looks more like below once expanded/inferred... but that step doesn't need to happen... just better illustrates what the node would really look like 'virtually':

{
        "@id": "_b:irrelevant"
        "@type": "EnvironmentVariable"
        "envVar": "FLUREE_CACHE_MAX_MB"
        "envVal": "" // string "value" would end up here
}

So to rewrite your example above, it would instead look like below. Where the value for e.g. cacheMaxMb can be an explicit scalar, or it could be a node that specifies the envVar used.

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "envVar": "FLUREE_CONNECTION_PARALLELISM",
      },
      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB"
      }
    }
  ]
}

zonotope · 2024-12-02T20:01:11Z

🤔

Suggestion on syntax.

I think instead of these holding a custom data type, they should hold instead a node... this IMO aligns much better with RDF

So the node would simply look like:
{
        "envVar": "FLUREE_CACHE_MAX_MB"
}
Which you can image, actually looks more like below once expanded/inferred... but that step doesn't need to happen... just better illustrates what the node would really look like 'virtually':
{
        "@id": "_b:irrelevant"
        "@type": "EnvironmentVariable"
        "envVar": "FLUREE_CACHE_MAX_MB"
        "envVal": "" // string "value" would end up here
}
So to rewrite your example above, it would instead look like below. Where the value for e.g. cacheMaxMb can be an explicit scalar, or it could be a node that specifies the envVar used.
{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "envVar": "FLUREE_CONNECTION_PARALLELISM",
      },
      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB"
      }
    }
  ]
}

kurtharriger · 2024-12-03T00:07:09Z

I'm curious if you could leverage integrant to do the resolution
I noticed that for other systems you use type to specify a hierarchy for integrant keys so maybe it woudl be more consistent to do something like this?

(defn derive-node-id
  [node]
  (let [id (get-id node)]
    (cond
      (env-var-datatype? node)  (derive id :fluree.db/envVar) 
      (connection? node)          (derive id :fluree.db/connection)
      (system? node)              (derive id :fluree.db/remote-system)
      (memory-storage? node)      (derive id :fluree.db.storage/memory)
      (file-storage? node)        (derive id :fluree.db.storage/file)
      (s3-storage? node)          (derive id :fluree.db.storage/s3)
      (ipfs-storage? node)        (derive id :fluree.db.storage/ipfs)
      (ipns-nameservice? node)    (derive id :fluree.db.nameservice/ipns)
      (storage-nameservice? node) (derive id :fluree.db.nameservice/storage))
    node))

(defmethod ig/init-key :fluree.db/envVar  [var]
  (System/getenv (get (keyword "@id") var)) ;; or however that works

kurtharriger · 2024-12-03T00:07:30Z

Any thoughts about ability to specifying default values?
it might be nice to say have storage path be an env var but also have default if not specified.

kurtharriger · 2024-12-03T00:07:43Z

Also curious about supporting java properties in addition to env vars?

The reason is that I can use System/setProperty in the repl and define values in deps.edn via jvmOpts in the profile. Env vars are readonly and I'm not sure of an easy way to set env vars when starting the repl form vscode. The best workaround I can think of to do this is sourcing an env file in bash and starting repl using command line and then attaching to vscode. Ability to use properties would be more convenient in repl and easier to write unit tests, if env and property resolution code paths are very similar than unit tests for properties could cover most of the env resolution as well.

kurtharriger · 2024-12-03T00:12:01Z

Also perhaps not directly related to this PR, but I'm trying to understand how profiles work
From what I could see they weren't eg start-file takes profile but its ignored. I did see there is a config/finalize that maybe perhaps was intended can call on config after reading and then perhaps call start-config? I'm mot sure if that worked or not but if so I think they symantics may have changes as it looks like start-config now expects a raw string rather than parsed map

kurtharriger · 2024-12-03T00:42:13Z

I also find the parsed integrant config map somewhat difficult to read with the angle bracket substitutions ...
What is the motivation behind these replacements? Could we just use string keys instead? although I found I can type :https://flur.ee as keyword just fine so at least some of these replacements seem unnecessary. I think the symbol I had most trouble with is :@id as clojure doesn't allow the :@ but I've been able to use (keyword "@id") so it seems special characters can still be keywords just maybe not keyword literals.

zonotope · 2024-12-03T04:56:22Z

I'm curious if you could leverage integrant to do the resolution I noticed that for other systems you use type to specify a hierarchy for integrant keys so maybe it woudl be more consistent to do something like this?

(defn derive-node-id
  [node]
  (let [id (get-id node)]
    (cond
      (env-var-datatype? node)  (derive id :fluree.db/envVar) 
      (connection? node)          (derive id :fluree.db/connection)
      (system? node)              (derive id :fluree.db/remote-system)
      (memory-storage? node)      (derive id :fluree.db.storage/memory)
      (file-storage? node)        (derive id :fluree.db.storage/file)
      (s3-storage? node)          (derive id :fluree.db.storage/s3)
      (ipfs-storage? node)        (derive id :fluree.db.storage/ipfs)
      (ipns-nameservice? node)    (derive id :fluree.db.nameservice/ipns)
      (storage-nameservice? node) (derive id :fluree.db.nameservice/storage))
    node))

(defmethod ig/init-key :fluree.db/envVar  [var]
  (System/getenv (get (keyword "@id") var)) ;; or however that works

Integrant would only work for the resolution if the configuration value being initialized/loaded was a subject node with an explicit @id. Integrant loads a configuration specified as a map with keyword keys, and those keys serve as the unique identifiers to the configs held in the corresponding value. As this is implemented now, the environment variable is an @value object which doesn't have an id, so we couldn't use integrant to load the variable.

Brian is proposing that we use an implicit subject node to represent environment variables instead of an @value object. We'd be able to implicitly add a blank node id and use integrant to load the variable in that case, but I'm not totally convinced that it would be worth it at that point, but it might be.

zonotope · 2024-12-03T05:02:42Z

Any thoughts about ability to specifying default values? it might be nice to say have storage path be an env var but also have default if not specified.

Yes, that would be nice. I can't immediately think of a nice way to represent this using the @value object approach for representing these environment variables though.

I think it's another point for Brian's suggestion. We could do something like this if the env vars are represented as implicit subjects instead:

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "envVar": "FLUREE_CONNECTION_PARALLELISM",
        "defaultVal": 32
      },
      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB",
        "defaultVal": 512
      }
    }
  ]
}

zonotope · 2024-12-03T05:26:40Z

Also curious about supporting java properties in addition to env vars?

The reason is that I can use System/setProperty in the repl and define values in deps.edn via jvmOpts in the profile. Env vars are readonly and I'm not sure of an easy way to set env vars when starting the repl form vscode. The best workaround I can think of to do this is sourcing an env file in bash and starting repl using command line and then attaching to vscode. Ability to use properties would be more convenient in repl and easier to write unit tests, if env and property resolution code paths are very similar than unit tests for properties could cover most of the env resolution as well.

For this particular use case, I think it would be best to just use a different "development" config file with explicit values that don't use environment variables or system properties at all. The config itself is just data, so you'd be able to change the explicit scalars in the config maps before passing it off to starting the system. That will always be simpler than relying on environment variables or system properties.

In general though, I think we should eventually support Java properties because it's another way people will try to manage system configuration in production. We could and should support Java properties the same way we support env vars because there are lots of parallels between the two. I think either an @value object or subject node approach could work.

Another option could be a modified subject node approach similar to Brian's suggestion, but we would think of the implied/inferred type of the subject node to be ConfigurationValue instead of EnvironmentVariable and set up a priority between configuration methods (say env var takes precedence over java property, and then use the default if neither is set). Then we could do this:

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "envVar": "FLUREE_CONNECTION_PARALLELISM",
        "javaProp": "connectionParallelism",
        "defaultVal": 32
      },
      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB",
        "javaProp": "cachMaxMb",
        "defaultVal": 512
      }
    }
  ]
}

zonotope · 2024-12-03T05:38:36Z

Also perhaps not directly related to this PR, but I'm trying to understand how profiles work From what I could see they weren't eg start-file takes profile but its ignored. I did see there is a config/finalize that maybe perhaps was intended can call on config after reading and then perhaps call start-config? I'm mot sure if that worked or not but if so I think they symantics may have changes as it looks like start-config now expects a raw string rather than parsed map

profiles used to work in the old configuration system, but they're as yet unimplemented here in the new JSON-LD based configuration. That's why you see profile arguments scattered throughout the code unused.

They are meant to allow users to use the same config in different environments. Say in development, staging, and production. You would define a base config, and then you'd add a profiles map where the key was the environment and value was overrides that would be merged on top of the base config when you started in that environment. Then you would do (start-resource "config.json" :staging) or whatever to apply that profile.

When it's (re)implemented, a config map with profiles would look something like this:

{
  "@context": {
    "@base": "https://ns.flur.ee/config/main/",
    "@vocab": "https://ns.flur.ee/system#",
    "profiles": {
      "@container": [ "@graph", "@index" ]
    }
  },
  "@id": "standaloneServer",
  "@graph": [
    {
      "@id": "localDiskStorage",
      "@type": "Storage",
      "filePath": "/opt/fluree-server/data"
    },
    {
      "@id": "connection",
      "@type": "Connection",
      "parallelism": 4,
      "cacheMaxMb": 1000,
      "commitStorage": {
        "@id": "localDiskStorage"
      },
      "indexStorage": {
        "@id": "localDiskStorage"
      },
      "primaryPublisher": {
        "@type": "Publisher",
        "storage": {
          "@id": "localDiskStorage"
        }
      }
    },
    {
      "@id": "consensus",
      "@type": "Consensus",
      "consensusProtocol": "standalone",
      "maxPendingTxns": 512,
      "connection": {
        "@id": "connection"
      }
    },
    {
      "@id": "http",
      "@type": "API",
      "httpPort": 8090,
      "maxTxnWaitMs": 120000
    }
  ],
  "profiles": {
    "dev": [
      {
        "@id": "localDiskStorage",
        "filePath": "dev/data"
      },
      {
        "@id": "connection",
        "cacheMaxMb": 200
      },
      {
        "@id": "consensus",
        "maxPendingTxns": 16
      }
    ],
    "staging": [
      {
        "@id": "connection",
        "cacheMaxMb": 512
      },
      {
        "@id": "consensus",
        "maxPendingTxns": 64
      }
    ]
  }
}

zonotope · 2024-12-03T15:54:07Z

I also find the parsed integrant config map somewhat difficult to read with the angle bracket substitutions ... What is the motivation behind these replacements?

Could we just use string keys instead?

No. Integrant initializes components identified by keyword keys specifically because it uses Clojure's built in keyword hierarchy along with the ig/init-key multimethod to know which method to execute when initializing a component. Note the restriction on Clojure hierarchies from the page I linked:

You can define hierarchical relationships with (derive child parent). Child and parent can be either symbols or keywords, and must be namespace-qualified

This is what the derive-node-id function is doing: it detects what kind of configuration a node is based on both the type and which properties exist, and take the id keyword and derives it in terms of the representative keyword for that kind of node. It has to be either a keyword or a symbol for that to work.

although I found I can type :https://flur.ee as keyword just fine so at least some of these replacements seem unnecessary. I think the symbol I had most trouble with is :@id as clojure doesn't allow the :@ but I've been able to use (keyword "@id") so it seems special characters can still be keywords just maybe not keyword literals.

Here are the official rules for what can appear in a Clojure symbol. The rules for keywords are the same as those for symbols, with some additional rules around colons. The reader itself is much more permissive in what you can get away with (now) because there is no validation when creating keywords/symbols, but we shouldn't rely on keywords that explicitly break the documented rules because that could cause subtle bugs later on. There is also the possibility that the reader could be made more stringent in the future and validate it's input, which would mean that our code would break if we relied on undocumented behavior. Furthermore, slashes, colons, and dots have special meaning in a keyword which are not intended in these config symbols, which is why I also decided to escape those as well.

I opted for this bit of complexity to allow us to use Integrant for initialization. It's a feature rich library for initializing and halting complex component hierarchies, which is exactly what we need here. There needed to be some massaging to transform the JSON-LD config format into an integrant config map in exchange for being able to use all the features integrant provides. The alternative would have been to reimplement what's in integrant ourselves. Also, the compiled integrant config map is an internal representation that users (or even developers) should not have to interact with in practice after everything settles down.

bplatz · 2024-12-04T15:43:51Z

I like the below, because it allows for something that has bothered me:

      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB",
        "defaultVal": 512
      }

We've had issues where people have a misspelling in env var, and we happily assign a default which in some circumstances could be a significant issue.

I like a default being explicit, and if not default and no envVar, I think we should throw/exit - as it probably is an environment variable not coming through properly.

dpetran · 2024-12-04T16:19:20Z

Oh I really like that @bplatz, sensible or at least explicit defaults can make a big difference for usability.

zonotope · 2024-12-16T03:31:47Z

@fluree/core I've re-worked this based on the discussion above. Instead of using a value node to encode environment variables, I went with an approach inspired by @bplatz's suggestion for using subject nodes with specific properties and, informed by @kurtharriger's questions about default values and java system properties, I've included properties so users can reference environment variables, java system properties, and default values in the same node.

I've set up a hierarchy for these. First, we check the environment variable if that property is specified, then, if the environment variable isn't set or the env var property isn't specified, we check if the java system property property is specified. If that property isn't set ore the java property property isn't specified, we fall back to the default value. if that isn't set, then we throw an exception.

Here is an example json-ld config using this new scheme:

{
  "@context": {
    "@base": "https://ns.flur.ee/dev/config/",
    "@vocab": "https://ns.flur.ee/system#"
  },
  "@id": "testSystem",
  "@graph": [
    ...
    {
      "@id": "testConnection",
      "@type": "Connection",
      "parallelism": {
        "envVar": "FLUREE_CONNECTION_PARALLELISM",
        "javaProp": "connectionParallelism",
        "defaultVal": 32
      },
      "cacheMaxMb": {
        "envVar": "FLUREE_CACHE_MAX_MB",
        "javaProp": "cachMaxMb",
        "defaultVal": 512
      }
    }
  ]
}

EDIT:
Users can specify environment variables in place of any scalar value, and users can of course still use scalar values as before and everything will still work as normal.

kurtharriger · 2024-12-16T03:42:17Z

I would suggest that if both env and java system property is specified the system property should have priority. For 2 reasons:
Environment vars tend to have wider scope than system properties which are specified on java commandline. The more specific the location the higher the preference.
Also env vars cannot be changed at runtime, java properties can. Giving system properties higher preference allows tests to override the value regardless of any env vars present when running tests, if env vars have priority the tests become dependent on the runtime environment and cannot easily override the env vars

zonotope · 2024-12-16T14:16:37Z

I would suggest that if both env and java system property is specified the system property should have priority. For 2 reasons: Environment vars tend to have wider scope than system properties which are specified on java commandline. The more specific the location the higher the preference. Also env vars cannot be changed at runtime, java properties can. Giving system properties higher preference allows tests to override the value regardless of any env vars present when running tests, if env vars have priority the tests become dependent on the runtime environment and cannot easily override the env vars

I think you bring up a good point about the scope arguments, so I'll switch the priority of system properties and environment variables.

As far as the tests go however, we would again just use a different config that wouldn't reference any environment variables or system properties for tests and development. Env vars and java properties really only make sense for production and production like deployments where some configuration values may be sensitive so they should be stored separately from the configuration and executable of the system and injected outside of band at startup. I don't think any of our credentials should be sensitive for development or test environments, so we should use specific configs that contain all the information the development or test systems need to run.

kurtharriger

LGTM

zonotope added 11 commits November 6, 2024 19:24

move config parsing to initialization fn

26d2d00

Merge remote-tracking branch 'origin/main' into feature/config-overrides

6fcabfb

add a function to initialize pre-parsed configs

1608c46

cleanup

fbe1b45

add EnvVar datatype

37a70bb

add a function to get the data type, but only for a value node

6d51068

add functions to consider env vars when getting config values

ec74572

throw for unset env vars

055b0e6

add get-values fn

93befe5

parse/coerce config values into integers and strings

229e444

use lowerCamelCase for env var datatype

f92ff57

zonotope requested a review from a team November 28, 2024 06:50

zonotope self-assigned this Nov 28, 2024

add functions to parse boolean values from configs

246202e

zonotope mentioned this pull request Nov 28, 2024

Allow environment variables in server configs fluree/server#101

Merged

dpetran reviewed Dec 2, 2024

View reviewed changes

dpetran approved these changes Dec 2, 2024

View reviewed changes

throw if trying to read an env var in cljs

39c2ffc

zonotope added 5 commits December 14, 2024 12:07

Merge remote-tracking branch 'origin/main' into feature/config-overrides

c26eb55

add configuration value subject nodes

15fa386

remove env-var parsing based on data types

9a06b8a

remove nil values when getting strings

fcbd289

initialize configuration value subject nodes with integrant

f8212ee

make java properties take precedence over environment variables

0bffbf6

kurtharriger approved these changes Dec 16, 2024

View reviewed changes

zonotope merged commit 38a7ab4 into main Dec 16, 2024

zonotope deleted the feature/config-overrides branch December 16, 2024 22:52

Conversation

zonotope commented Nov 28, 2024

Uh oh!

dpetran Dec 2, 2024

Choose a reason for hiding this comment

Uh oh!

dpetran left a comment

Choose a reason for hiding this comment

Uh oh!

zonotope commented Dec 2, 2024

Uh oh!

bplatz commented Dec 2, 2024

Uh oh!

zonotope commented Dec 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kurtharriger commented Dec 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kurtharriger commented Dec 3, 2024

Uh oh!

kurtharriger commented Dec 3, 2024

Uh oh!

kurtharriger commented Dec 3, 2024

Uh oh!

kurtharriger commented Dec 3, 2024

Uh oh!

zonotope commented Dec 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zonotope commented Dec 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zonotope commented Dec 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zonotope commented Dec 3, 2024

Uh oh!

zonotope commented Dec 3, 2024

Uh oh!

bplatz commented Dec 4, 2024

Uh oh!

dpetran commented Dec 4, 2024

Uh oh!

zonotope commented Dec 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kurtharriger commented Dec 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zonotope commented Dec 16, 2024

Uh oh!

kurtharriger left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

zonotope commented Dec 2, 2024 •

edited

Loading

kurtharriger commented Dec 3, 2024 •

edited

Loading

zonotope commented Dec 3, 2024 •

edited

Loading

zonotope commented Dec 3, 2024 •

edited

Loading

zonotope commented Dec 3, 2024 •

edited

Loading

zonotope commented Dec 16, 2024 •

edited

Loading

kurtharriger commented Dec 16, 2024 •

edited

Loading