Skip to content

Interest in bulk / batch insert operations when not using post-insert identifiers? #11977

@Ocramius

Description

@Ocramius

Context

I'm currently working on a codebase where I identified a repeated INSERT operation being performed.

As you may know, I've been a long time advocate for using UUIDs (and specifically, UUID5 / UUID6) for synthetic entity identifiers.

Due to recent advancements in profiling technology (Sentry, Tideways, etc.), I've managed to find a typical N+1 query:

INSERT INTO foo (bar, baz) VALUES (?, ?);
INSERT INTO foo (bar, baz) VALUES (?, ?);
-- ...
-- repeats until entities of the same type are gone

What can be done?

When AUTO_INCREMENT, SERIAL or post-insert identifiers in general aren't necessary, we can perform bulk / batch INSERT operations: this reduces the number of roundtrips to the DB, which is mostly network time.

Current behavior

I've traced the operation as follows:

  1. UnitOfWork computes insert order:

    orm/src/UnitOfWork.php

    Lines 1038 to 1040 in 05c8c5f

    private function executeInserts(): void
    {
    $entities = $this->computeInsertExecutionOrder();
  2. computeInsertExecutionOrder produces a sorted list of INSERT operations:

    orm/src/UnitOfWork.php

    Lines 1191 to 1246 in 05c8c5f

    private function computeInsertExecutionOrder(): array
    {
    $sort = new TopologicalSort();
    // First make sure we have all the nodes
    foreach ($this->entityInsertions as $entity) {
    $sort->addNode($entity);
    }
    // Now add edges
    foreach ($this->entityInsertions as $entity) {
    $class = $this->em->getClassMetadata($entity::class);
    foreach ($class->associationMappings as $assoc) {
    // We only need to consider the owning sides of to-one associations,
    // since many-to-many associations are persisted at a later step and
    // have no insertion order problems (all entities already in the database
    // at that time).
    if (! $assoc->isToOneOwningSide()) {
    continue;
    }
    $targetEntity = $class->getFieldValue($entity, $assoc->fieldName);
    // If there is no entity that we need to refer to, or it is already in the
    // database (i. e. does not have to be inserted), no need to consider it.
    if ($targetEntity === null || ! $sort->hasNode($targetEntity)) {
    continue;
    }
    // An entity that references back to itself _and_ uses an application-provided ID
    // (the "NONE" generator strategy) can be exempted from commit order computation.
    // See https://github.com/doctrine/orm/pull/10735/ for more details on this edge case.
    // A non-NULLable self-reference would be a cycle in the graph.
    if ($targetEntity === $entity && $class->isIdentifierNatural()) {
    continue;
    }
    // According to https://www.doctrine-project.org/projects/doctrine-orm/en/2.14/reference/annotations-reference.html#annref_joincolumn,
    // the default for "nullable" is true. Unfortunately, it seems this default is not applied at the metadata driver, factory or other
    // level, but in fact we may have an undefined 'nullable' key here, so we must assume that default here as well.
    //
    // Same in \Doctrine\ORM\Tools\EntityGenerator::isAssociationIsNullable or \Doctrine\ORM\Persisters\Entity\BasicEntityPersister::getJoinSQLForJoinColumns,
    // to give two examples.
    $joinColumns = reset($assoc->joinColumns);
    $isNullable = ! isset($joinColumns->nullable) || $joinColumns->nullable;
    // Add dependency. The dependency direction implies that "$entity depends on $targetEntity". The
    // topological sort result will output the depended-upon nodes first, which means we can insert
    // entities in that order.
    $sort->addEdge($entity, $targetEntity, $isNullable);
    }
    }
    return $sort->sort();
    }
  3. EntityPersister#addInsert() is called in a loop:

    orm/src/UnitOfWork.php

    Lines 1038 to 1048 in 05c8c5f

    private function executeInserts(): void
    {
    $entities = $this->computeInsertExecutionOrder();
    $eventsToDispatch = [];
    foreach ($entities as $entity) {
    $oid = spl_object_id($entity);
    $class = $this->em->getClassMetadata($entity::class);
    $persister = $this->getEntityPersister($class->name);
    $persister->addInsert($entity);
  4. in the same loop, EntityPersister#executeInserts() is called:

    orm/src/UnitOfWork.php

    Lines 1048 to 1052 in 05c8c5f

    $persister->addInsert($entity);
    unset($this->entityInsertions[$oid]);
    $persister->executeInserts();
  5. EntityPersister#executeInserts() creates a prepared statement, and fires it at the DB:
    public function executeInserts(): void
    {
    if (! $this->queuedInserts) {
    return;
    }
    $uow = $this->em->getUnitOfWork();
    $idGenerator = $this->class->idGenerator;
    $isPostInsertId = $idGenerator->isPostInsertGenerator();
    $stmt = $this->conn->prepare($this->getInsertSQL());
    $tableName = $this->class->getTableName();
    foreach ($this->queuedInserts as $key => $entity) {
    $insertData = $this->prepareInsertData($entity);
    if (isset($insertData[$tableName])) {
    $paramIndex = 1;
    foreach ($insertData[$tableName] as $column => $value) {
    $stmt->bindValue($paramIndex++, $value, $this->columnTypes[$column]);
    }
    }
    $stmt->executeStatement();
    if ($isPostInsertId) {
    $generatedId = $idGenerator->generateId($this->em, $entity);
    $id = [$this->class->identifier[0] => $generatedId];
    $uow->assignPostInsertId($entity, $generatedId);
    } else {
    $id = $this->class->getIdentifierValues($entity);
    }
    if ($this->class->requiresFetchAfterChange) {
    $this->assignDefaultVersionAndUpsertableValues($entity, $id);
    }
    // Unset this queued insert, so that the prepareUpdateData() method knows right away
    // (for the next entity already) that the current entity has been written to the database
    // and no extra updates need to be scheduled to refer to it.
    //
    // In \Doctrine\ORM\UnitOfWork::executeInserts(), the UoW already removed entities
    // from its own list (\Doctrine\ORM\UnitOfWork::$entityInsertions) right after they
    // were given to our addInsert() method.
    unset($this->queuedInserts[$key]);
    }
    }

The plan

The idea is to either change UnitOfWork#executeInserts() or UnitOfWork#computeInsertExecutionOrder() to produce a multidimensional array of micro-batches, where batches are only possible:

  • for the same entity type
  • for entities with an already assigned identifier

I would then extract the EntityPersister#executeInserts() call, so that it does not get performed in a loop (where applicable), and therefore internally reuses the same prepared statement.

I can then also optimize the EntityPersister#executeInserts() logic to generate SQL for a full batch (single statement).

The plan does NOT involve any DBAL API addition/change.

The question

Is there interest in this optimization?
Specifically, can I work on it, or is the plan flawed? This is mostly probing whether there is interest in this.

Past issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions