-
Notifications
You must be signed in to change notification settings - Fork 16.3k
Add BigQueryToPostgresOperator #30658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BigQueryToPostgresOperator #30658
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
|
pankajastro
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usually, we keep a transfer operator in Google provider if data move from some services to Google services like from Postgres to Bigquery or from s3 to Bigquery/GCS so not sure if Google provider is the right place for this transfer operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might not catch something like dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored the implementation. I am now using a BaseBigQueryToSqlOperator. In order to remain as backward compatible as possible, I used the old implementation tied to the MySQL operator.
If the user does not provide a correct input it will fail at runtime anyway (in the execute method).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why fmt off?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially copy/pasted the MySQL implementation without thinking too much about it. After digging in the commit history it looks like those fmt directives are old artifacts that can be removed with no side effect, which I did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also add a test to check bigquery_get_data has been called correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current assertion already ensures bigquery_get_data has been called as a side effect:
mock_hook.return_value.list_rows.assert_called_once_with(
list_rows comes from bigquery_get_data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing docs string for this param
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed those missing docstrings
We should not duplicate code. We should have either As for the |
I'm up to it, will happen during this week |
I am new to AirFlow so I don't have an opinion on this. I added the module next to other similar operators ( |
Summarizing: yes the current location is good. cc: @pankajastro : The rule is a bit more complex here and not only based on whether the provider is on the "To" or "From" side of the transfer. In most cases, yes, the "To" side is more important, but not for the reason you think. The reason is that usually the "To" side (means - service) is more interested in receiving the data and is more lilely to be maintained by whoever puts the focus on the "to" provider. But it only makes sense if you have provider like "S3 <=> GCS" - where you have Amazon on one side and Google on the other. In this case Google is more insterested to maintain For Bigquery to MySQL or Postgres , this is not as easy - because there is no "service provider" who would be interested in getting data to Postgres / MySQL, because those are generic databases, and while they have soime "stakeholders" those stakeholders do not own the service that they would be interested to get data in. And in this case Google is determined to be the "more likely to maintain it". This is succintly described in our CONTRIBUTING.rst:
|
9458e98 to
6aeb2d1
Compare
@eladkal ok, deal. I refactored the two twin operators with a I took the opportunity to factorize docstrings. That said, I'm not 100% sure that documenting only two parameters in The generated documentation looks fine. Here is an example (I also checked the other impacted pages): The only tests I have not been able to run with Breeze are the example dags in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jbbqqf - just noticed we are getting the class name split across multiple lines in the documentation, is there anything we could do to fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phanikumv I don't want to fix this issue in this PR. The diff is not the cause of the issue from what I can see on Airflow's latest online documentation (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/transfers/bigquery_to_mysql/index.html).
That said, this is how I would have done it:
thanks for the detailed explanation makes sense 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it makes sense to combine Postgres and MySQL system test here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you'd suggest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting it due to the similarities between the files?
While it may improve running time, I would rather keep the test files specific to their intended purpose. I believe that combining different SQL providers into a single file may cause confusion when trying to identify the source of an operator failure during debugging.
Separating them into distinct files should make the debugging process more manageable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left as is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Better to keep them separate. In this case usability of those examples trumps DRY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting request changes just to make sure #30748 is merged before this PR so the delegate_to parameter won't be in BaseBigQueryToSqlOperator
|
@jbbqqf you can now rebase and the delegate_to won't be an issue as it was removed in main |
6aeb2d1 to
c744f08
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was removed from main. This parameter doesn't exist any more
6e711ea to
eaecfb1
Compare
|
I made everything I could with the feedback I've received on this PR. The code is factorized so that the new PostgresOperator is not a copy/paste version of the Mysql one. The documentation is also as factorized as it can be. I rebased and removed the I built the documentation locally. It looks fine. I tested operators with a local airflow / postgresql / mysql + test gcp project. Basic operations work. By basic operation I mean I tested those tasks: During my tests I noticed there was an error with I performed my local tests with a postgres and a mysql database. I didn't manually test the mssql operator. Is there some additional tests I should perform that would cover common pitfalls? @eladkal you seem to have some perspective on the airflow project. What is the next step? How many people should validate the PR so that it can be merged? 🙂 |
To start - maybe get all the tests succeed :D (they are failing currently)? |
|
And the second thing -> wait for a maintainer to merge it. And sometimes be patient with it @jbbqqf. Yes it takes sometimes a number of passes, multiple builds, conflict resolving and sometimes even few positive reviews to pass. All that so that your change will be good, reviewed, tested, and being ready for anyone else to take over and develop further. Yes it can be annoying but if you consider that you might disappear tomorrow and there are a number of people here who will take over the maintenance of the code with confidence. I hope you understand all that. |
eladkal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks OK to me.
left one comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's best to deprecate this parameter and replace it with a unified parameter in the base class (target_table_name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I changed that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd emphasize here that it's a base operator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a docstring and remove the pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion it is unnecessary because concrete implementations are self-explanatory.
I have nonetheless taken into account your remark by adding a docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edited to explain myself better
I understand your opinion that it might seem redundant, however:
- Sometimes methods that look trivial might do something completely different than what you think they do - especially in an open source where many people with different code styles take part.
- Putting opinions aside, it's required by PEP 257, at least for public methods :)
Therefore, I suggest getting used to writing docstrings starting with self-explanatory methods, even as a preparation for more complicated functions. That being said, for simple methods - one line could do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public methods (including the init constructor) should also have docstrings.
The wording of PEP 257 implies it's optional.
In Clean Code, Robert Martin enumerates a collection of smells and heuristics including:
C3: Redundant Comment
A comment is redundant if it describes something that adequately describes itself. For example:
i++; // increment iAnother example is a Javadoc that says nothing more than (or even less than) the function signature.
Comments should say things that the code cannot say for itself.
As suggested, I put my opinion aside and described persist_links with """This function persists the connection to the SQL provider""".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As is it trivial to reference all children implementation / calls to this function the best added value I could provide was to inform future readers what the history of this method is.
If you have another opinion, can you propose a docstring you think would fit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't mean to overcomplicate :)
A one-liner describing what this function does would do better (e.g.,: This function persists the connection to the SQL provider). The history of this method in this context seems irrelevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting it due to the similarities between the files?
While it may improve running time, I would rather keep the test files specific to their intended purpose. I believe that combining different SQL providers into a single file may cause confusion when trying to identify the source of an operator failure during debugging.
Separating them into distinct files should make the debugging process more manageable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could pass source_project_dataset_table directly to dataset_table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you compare the old vs new version, it would introduce a slight difference of behavior between the two versions. It can lead to production issues.
I'd keep it this way because, unless I am mistaken, this version achieves 100% backward compatibility.
I agree that it doesn't make sense to have two distinct parameters for this use case. I searched in the google provider for helper functions to parse project.dataset.table vs dataset.table. I didn't find any.
If you insist on deprecating something here I'd vote to deprecate it in another PR because this one is already becoming quite complex to manage / review ✌️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, now I realize that I missed the project definition in the dataset_table :)
It could wait for another time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you 🙏
eaecfb1 to
38fc173
Compare
Touché 😉 @potiuk Actually when I posted my comment, CI was all green and I thought for a moment I had solved the issue. As I can see it takes time so that all pipelines finish. I tried to run those tests with breeze but I can't manage to find the appropriate command that would specifically target the modified operators (running static checks with breeze is OK though). Can you suggest arguments I should pass to this tool to understand locally what's going on ? (CI failures don't seem to output verbose enough logs)
Sure 👍 I read in Airflow's contributing guide that being pushy with maintainers was interpreted well, as a sign of interest. This was my intent. I will be patient as long as necessary. I appreciate your efforts to demonstrate diplomacy. Reference: https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst
|
Complain to GitHub - not to us, this is an isue I reported to them ~ 2 years ago and there is no fix in sight.
All the test types of airflow, including the way how to distribute them is decribed in https://github.com/apache/airflow/blob/main/TESTING.rst Generally speaking - seems that unit tests are failing for you and you should be able to follow "unit test" chapter there. If you follow the instructions there, it should help. Also Breeze commands are described in https://github.com/apache/airflow/blob/main/BREEZE.rst - including help and screen outputs. You can also run
There is a word "considerate". Have you considered that people responding here have likley even 100s of PRs to look at, while you have one to contribute? and that they are often doing it in their free time and people are demanding attention becasue they think they have to get response NOW? If not - you can take a look at my talk about empathy: https://airflowsummit.org/sessions/2022/hey-maintainer-exercise-your-empathy/ Mildly annoying does not mean unempathettic. |
38fc173 to
c459c42
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets use AirflowProviderDeprecationWarning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets not do target_table_name or mssql_table
it makes it harder to understand where the "real" value is.
we should handle the value as part of the deprecation check. If user passed mssql_table then we also assign it to self. target_table_name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to understand what you have in mind. Something like this?
selected_target_table_name = target_table_name
if mssql_table is not None:
selected_target_table_name = mssql_table
warnings.warn(
"The `mssql_table` parameter has been deprecated. Use `target_table_name` instead.",
DeprecationWarning,
)
super().__init__(
target_table_name=selected_target_table_name,
dataset_table=f"{dataset_id}.{table_id}",
**kwargs,
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like
airflow/airflow/providers/slack/operators/slack.py
Lines 210 to 221 in 23b5b31
| if channel: | |
| warnings.warn( | |
| "Argument `channel` is deprecated and will removed in a future releases. " | |
| "Please use `channels` instead.", | |
| DeprecationWarning, | |
| stacklevel=2, | |
| ) | |
| if channels: | |
| raise ValueError(f"Cannot set both arguments: channel={channel!r} and channels={channels!r}.") | |
| channels = channel | |
| self.channels = channels |
basically once you pass that block
New parameter takes precedence but if older parameter is supplied we will assign it to the new parameter + raise warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is missing from the added docs. is this intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was inspired by the implementation of BaseSQLToGCSOperator. This is also an abstract class which is not documented either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm asking about the BigQueryToMsSqlOperator not on the base class :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is intentional. There was a slight difference between the MySQL documentation and the MsSQL documentation. I standardized the location of the example snippet. I used the MsSQL version. It did not require to change the MsSQL version, but this is why there is a change on the MySQL version.
As for the new interface (s/mssql_table/target_table_name/), it has been changed directly in example_bigquery_to_mssql.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm asking about the rst files.
docs/apache-airflow-providers-google/operators/transfer/bigquery_to_mssql.rst
Is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbbqqf can you add docs/apache-airflow-providers-google/operators/transfer/bigquery_to_mssql.rst
?
b50b591 to
d483676
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Copying data from one BigQuery table to another is performed with the | |
| Copying data from BigQuery table to Postgres table is performed with the |
potiuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small NIT: in the docs, otherwise it looks good to me - @eladkal ?
d483676 to
4948e8a
Compare
4948e8a to
0bed8a8
Compare
|
@jbbqqf can you add |
Right. sorry! |
eladkal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
will merge when CI is green
|
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
|
looking for next ones :) |


Yet another operator.
I added that transfer operator by mimicry with the existing
BigQueryToMySqlOperator.I am aware this operator adds duplication with
BigQueryToMySqlOperatorandBigQueryToMsSqlOperator. The implementation is basically the same. On the other hand, I feel that factorizing that code would make it unreadable.The
BigQueryToMySqlOperatorhad a deprecated argument:delegate_to. I removed it from thePostgresadaptation.I noticed the
BigQueryToMySqlOperatoris referenced inairflow/contrib/operators/__init__.py(__deprecated_classes) and intests/always/test_project_structure.py(MISSING_EXAMPLES_FOR_CLASSES). I am not 100% sure of what those references mean but I guess the newBigQueryToPostgresOperatordoesn't need to be referenced here.