feat!: Reorganize pipelines and infra files into their respective folders#292
Merged
Conversation
4 tasks
leahecole
approved these changes
Feb 10, 2022
leahecole
left a comment
Contributor
There was a problem hiding this comment.
Non blocking nits. Otherwise, lgtm
| - GCS bucket to store final, downstream, customer-facing data | ||
| - Sometimes, for very large datasets, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) job | ||
| - GCS bucket to store downstream data, such as those linked to in the [Datasets Marketplace](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset). | ||
| - Sometimes, for very large datasets that requires processing to be parallelized, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) (i.e. Apache Beam) job |
Contributor
There was a problem hiding this comment.
nit
Suggested change
| - Sometimes, for very large datasets that requires processing to be parallelized, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) (i.e. Apache Beam) job | |
| - Sometimes, for very large datasets that requires processing to be parallelized, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) (Apache Beam) job |
| # Environment Setup | ||
|
|
||
| We use Pipenv to make environment setup more deterministic and uniform across different machines. | ||
| We use Pipenv to make environment setup more deterministic and uniform across different machines. If you haven't done so, install Pipenv using the instructions found [here](https://pipenv-fork.readthedocs.io/en/latest/install.html#installing-pipenv). |
Contributor
There was a problem hiding this comment.
Suggested change
| We use Pipenv to make environment setup more deterministic and uniform across different machines. If you haven't done so, install Pipenv using the instructions found [here](https://pipenv-fork.readthedocs.io/en/latest/install.html#installing-pipenv). | |
| We use Pipenv to make environment setup more deterministic and uniform across different machines. If you haven't done so, install Pipenv using these [instructions](https://pipenv-fork.readthedocs.io/en/latest/install.html#installing-pipenv). |
nit - more screenreader friendly hyperlink
| In addition, the command above creates a "dot env" directory in the project root. The directory name is the value you set for `--env`. If it's not set, the value defaults to `dev` which generates the `.dev` folder. | ||
|
|
||
| Consider this "dot" directory as your own dedicated space for prototyping. The files and variables created in that directory will use an isolated environment. All such directories are gitignored. | ||
| Consider a dot directory as your own sandbox, specific to your machine, that's mainly used for prototyping. As will be seen later, this directory is where you will set the variables specific to your environment: such as actual GCS bucket names, GCR repository URLs, and secrets (we recommend using [Secret Manager](https://cloud.google.com/composer/docs/secret-manager) for this). The files and variables created or copied in the dot directories are isolated from the main repo, i.e. all dot directories are gitignored. |
Contributor
There was a problem hiding this comment.
small nits
Suggested change
| Consider a dot directory as your own sandbox, specific to your machine, that's mainly used for prototyping. As will be seen later, this directory is where you will set the variables specific to your environment: such as actual GCS bucket names, GCR repository URLs, and secrets (we recommend using [Secret Manager](https://cloud.google.com/composer/docs/secret-manager) for this). The files and variables created or copied in the dot directories are isolated from the main repo, i.e. all dot directories are gitignored. | |
| We strongly recommend using a dot directory as your own sandbox, specific to your machine, that's mainly used for prototyping. This directory is where you will set the variables specific to your environment: such as actual GCS bucket names, GCR repository URLs, and secrets (we recommend using [Secret Manager](https://cloud.google.com/composer/docs/secret-manager) for this). The files and variables created or copied in the dot directories are isolated from the main repo, meaning that all dot directories are gitignored. |
happyhuman
approved these changes
Feb 10, 2022
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR sets the stage for adding more categories (subfolders) in every
datasets/$DATASETfolder, such as the upcomingdocsfolder for the datasets' documentation set.To do this, we need to change how the files and folders under every dataset is organized.
From
to introducing two levels
infraandpipelines:which allows us to also add a
docsfolder. When the docset feature is ready, the hierarchy will look likeand we can keep adding other domain-specific folders as necessary without affecting infra- or pipelines-related stuff.
Checklist
Note: If an item applies to you, all of its sub-items must be fulfilled
READMEaccordinglydatasets/<DATASET_NAME>and nothing outside of that directorytestsfolder)