Skip to content

feat: Add a wait strategy that waits until the ADO.NET database is available#1401

Merged
HofmeisterAn merged 12 commits intotestcontainers:developfrom
0xced:UntilDatabaseIsAvailable
Apr 30, 2025
Merged

feat: Add a wait strategy that waits until the ADO.NET database is available#1401
HofmeisterAn merged 12 commits intotestcontainers:developfrom
0xced:UntilDatabaseIsAvailable

Conversation

@0xced
Copy link
Contributor

@0xced 0xced commented Mar 15, 2025

What does this PR do?

This pull request introduces a new wait strategy for containers implementing the IDatabaseContainer interface. It tries to open a connection with the container connection string until it works or until it times out.

Why is it important?

This new wait strategy is quite powerful as it can be used for all database containers. It could have been used to workaround #1383 and it could be used as an alternative to #1392.

Related issues

Mentioned above.

How to test this PR

New tests have been added to use this new wait strategy for database containers in addition to the existing wait strategies.

Follow-ups

Explicitly supporting PostgreSQL < 9.3 should be done in a follow-up pull request by using this new wait strategy.

@netlify
Copy link

netlify bot commented Mar 15, 2025

Deploy Preview for testcontainers-dotnet ready!

Name Link
🔨 Latest commit 39500e0
🔍 Latest deploy log https://app.netlify.com/sites/testcontainers-dotnet/deploys/68125b3fc1818d00081f332f
😎 Deploy Preview https://deploy-preview-1401--testcontainers-dotnet.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link
Collaborator

@HofmeisterAn HofmeisterAn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update the Db2 tests as well. If you're busy, I can take care of both tasks.

@0xced
Copy link
Contributor Author

0xced commented Apr 11, 2025

Now using a DbContainerFixture in Db2ContainerTest (done in a15cfc5)

@0xced
Copy link
Contributor Author

0xced commented Apr 11, 2025

Now using a DbContainerFixture in FirebirdSqlContainerTest too (done in ca8a178)

And here something extremely interesting happened! The test executes in 2 seconds when using the UntilDatabaseIsAvailable vs 30 seconds for the default wait strategy!

FirebirdSqlDefault vs FirebirdSqlWaitForDatabase

@0xced 0xced force-pushed the UntilDatabaseIsAvailable branch from 27610b2 to ca8a178 Compare April 11, 2025 20:37
Copy link
Collaborator

@HofmeisterAn HofmeisterAn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the adjustments.

Unfortunately, I'm running into an issue locally when I run the Cassandra tests multiple times.

The ConnectionStateReturnsOpen() test fails for CassandraContainerDefaultConfiguration when I run all the Cassandra tests together (including CassandraFixtureWaitForDatabase). But if I run just the CassandraContainerDefaultConfiguration tests on their own, everything passes.

All hosts tried for query failed (tried 127.0.0.1:52448: SocketException 'A request to send or receive data was disallowed because the socket is not connected and (when sending on a datagram socket using a sendto call) no address was supplied.')

I believe this is a connection pooling issue. I think I've seen something similar before.

Copy link
Collaborator

@HofmeisterAn HofmeisterAn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the remaining test projects as well. They now all use the same structure and test at least the default and overridden wait strategy configurations.

I'll update the module documentation later today. At least, we need to review it, as it might no longer be accurate.

IMO, the PR looks good. We just need to figure out why Cassandra is failing and check whether other modules are experiencing similar issues.

Edit:

On CI Cassandra failed with a similar issue:

Cassandra.NoHostAvailableException : All hosts tried for query failed (tried 127.0.0.1:32769: SocketException 'Transport endpoint is not connected')
Raw output

@0xced
Copy link
Contributor Author

0xced commented Apr 15, 2025

On CI Cassandra failed with a similar issue

Of course, it works on my machine. 🤷🏻‍♂️

I tried something in 3d22bf5 for FirebirdSql. This could be applied to all other db modules too.

The default wait strategy automatically switches to the new UntilDatabaseIsAvailable strategy if the DbProviderFactory is available at runtime (probed through reflection in order not to take a hard dependency on the ADO.NET provider). This requires two tests projects, one that does not reference the ADO.NET provider and one that does.

What do you think? Good idea or not?

Note that applying the same mechanism to the PostgreSQL module would automatically add support for older images where pg_isready is not available, fixing #1392 without relying on specific log messages.

@HofmeisterAn
Copy link
Collaborator

Of course, it works on my machine. 🤷🏻‍♂️

I've tried it on two different machines and it's failing in CI. I'll take a look at it after the public holidays.

What do you think? Good idea or not?

I'd prefer to discuss this in a separate PR. The PR looks good overall, except for the necessary Cassandra fix.

I'm not a huge reflection fan (I guess no one is 😆), but it's an interesting idea. It could help support more configurations and might even be faster in some cases. But we definitely need a way to opt out or fallback.

@0xced 0xced force-pushed the UntilDatabaseIsAvailable branch from d1a6578 to aa2d3a9 Compare April 20, 2025 07:48
@0xced
Copy link
Contributor Author

0xced commented Apr 20, 2025

I'd prefer to discuss this in a separate PR. The PR looks good overall, except for the necessary Cassandra fix.

OK, I removed the commits that use reflection to get the DbProviderFactory instances and force-pushed. I kept the work in a new UntilDatabaseIsAvailable+Reflection branch to discuss later.

@0xced 0xced force-pushed the UntilDatabaseIsAvailable branch from d1a6578 to aa2d3a9 Compare April 20, 2025 07:53
@kiview
Copy link
Member

kiview commented Apr 28, 2025

For Postgres a specific hint, relying on a single SQL query (or even on pg_isready for that matter), is usually a tight race condition on the vanilla PostgreSQL image, since the engine will restart after bootstrapping (and strategy might succeed before the restart).

@HofmeisterAn
Copy link
Collaborator

HofmeisterAn commented Apr 28, 2025

For Postgres a specific hint, relying on a single SQL query (or even on pg_isready for that matter), is usually a tight race condition on the vanilla PostgreSQL image, since the engine will restart after bootstrapping (and strategy might succeed before the restart).

For pg_isready, the race condition won't happen if you specify the host argument AFAIK. That's what we've been using for a long time (we did have issues without the host argument, indeed).

@HofmeisterAn
Copy link
Collaborator

I had some time to look into the Cassandra issue. It looks like a connection pooling issue. The test that opens the DB connection fails after a few runs. The connection string in the exception message doesn't match the one that's actually configured and used by/for the container:

  • Exception message: All hosts tried for query failed (tried 127.0.0.1:51732: SocketException '[...]')
  • Connection string: contact points=127.0.0.1;port=51742

The only way I found to configure pooling is through Cluster.Builder(), not through ADO.NET (nor can we clear the pool).

Not sure how to move forward from here. Maybe we can ask for guidance in the upstream repo (I believe this is a general issue with the client implementation).

WDYT?

@0xced
Copy link
Contributor Author

0xced commented Apr 28, 2025

I was finally able to reproduce the problem by changing the CassandraContainerTest constructor to include an artificial delay for the default configuration!

public CassandraContainerTest(CassandraDefaultFixture fixture)
{
    this.fixture = fixture;
    if (GetType().Name.Contains("CassandraDefaultConfiguration"))
    {
        Thread.Sleep(TimeSpan.FromSeconds(2));
    }
}

I see two possible solutions to this problem that occurs because the CqlConnection class uses a static concurrent dictionary of clusters keyed by cluster name, i.e. shared across all connections.

  1. We update the test to explicitly set different cluster names.
diff --git a/tests/Testcontainers.Cassandra.Tests/CassandraContainerTest.cs b/tests/Testcontainers.Cassandra.Tests/CassandraContainerTest.cs
index d351eab..62ebcb6 100644
--- a/tests/Testcontainers.Cassandra.Tests/CassandraContainerTest.cs
+++ b/tests/Testcontainers.Cassandra.Tests/CassandraContainerTest.cs
@@ -8,7 +8,8 @@ public abstract class CassandraContainerTest(CassandraContainerTest.CassandraDef
     public void ConnectionStateReturnsOpen()
     {
         // Given
-        using DbConnection connection = fixture.CreateConnection();
+        var connectionString = new CassandraConnectionStringBuilder(fixture.ConnectionString) { ClusterName = GetType().Name }.ConnectionString;
+        using DbConnection connection = new CqlConnection(connectionString);
 
         // When
         connection.Open();
  1. We update the CassandraContainer to include a unique cluster name per host/port in the connection string.
diff --git a/src/Testcontainers.Cassandra/CassandraContainer.cs b/src/Testcontainers.Cassandra/CassandraContainer.cs
index 3e44d61..2709655 100644
--- a/src/Testcontainers.Cassandra/CassandraContainer.cs
+++ b/src/Testcontainers.Cassandra/CassandraContainer.cs
@@ -18,7 +18,9 @@ public sealed class CassandraContainer : DockerContainer, IDatabaseContainer
     {
         var properties = new Dictionary<string, string>();
         properties.Add("Contact Points", Hostname);
-        properties.Add("Port", GetMappedPublicPort(CassandraBuilder.CqlPort).ToString());
+        var port = GetMappedPublicPort(CassandraBuilder.CqlPort).ToString();
+        properties.Add("Port", port);
+        properties.Add("Cluster Name", $"{Hostname}:{port}");
         return string.Join(";", properties.Select(property => string.Join("=", property.Key, property.Value)));
     }

Both strategies work. I think the second one is a bit better but I'm not a Cassandra user so I'm not 100% sure.

Because the `CqlConnection` class uses a static concurrent dictionary of clusters keyed by cluster name. Using multiple containers (which actually happens in the tests) can fail because the `CqlConnection` will try to connect to a potentially disposed container.
@0xced
Copy link
Contributor Author

0xced commented Apr 29, 2025

I just pushed solution #2 in ac743ce.

✅ The tests are now all passing!

@HofmeisterAn
Copy link
Collaborator

Thanks for the adjustments.

I prefer setting the cluster name property too. Leaving it unset prevents Testcontainers users from running multiple containers. If any issues come up afterward, we can deal with them as they arise.

I also prepared an issue I'd like to share with DataStax. Unfortunately, I can't create a GitHub or Jira issue at the moment, I'll send the following to their mailing list later:

We're running into a race-condition with the DataStax C# driver for Apache Cassandra, and I believe this is a general issue with the client implementation.

We're running multiple Apache Cassandra containers on the same host. It looks like the client caches connections based on the cluster name. If the connection string doesn't specify a cluster name, it uses the default cluster name.

Here's what happens:

  1. Create an Apache Cassandra instance listening on localhost:51732 (container_1).
  2. Create an Apache Cassandra instance listening on localhost:51742 (container_2).
  3. Create a CqlConnection instance for container_1.
    • This creates a connection using the default cluster name: Cassandra Cluster.
  4. Run tests against container_1, and tear it down afterwards.
  5. Create a CqlConnection instance for container_2.
    • This is where the race-condition happens. It looks like the client implementation still tries to connect to container_1, even though the connection string is correctly set for container_2.
    • This could even lead to false assumptions. For example, if we didn't tear down container_1 beforehand, the client might end up connecting to the wrong instance.
    • From a quick look at the code, it seems like the cluster name is never removed from the internal dictionary during Close() or Dispose().

@HofmeisterAn HofmeisterAn changed the title Introduce a new UntilDatabaseIsAvailable wait strategy feat: Add a wait strategy that waits until the ADO.NET database is available Apr 30, 2025
@HofmeisterAn HofmeisterAn added the enhancement New feature or request label Apr 30, 2025
Copy link
Collaborator

@HofmeisterAn HofmeisterAn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 🙇

@HofmeisterAn HofmeisterAn merged commit 4788ed6 into testcontainers:develop Apr 30, 2025
70 checks passed
@0xced 0xced deleted the UntilDatabaseIsAvailable branch April 30, 2025 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants