im triying to batch insert a few Million Entitys. The Batch insert kind of works, but my programm executes a few JDBC Statements in the background which i dont want. } my Repository: my Entity: my JPA Settings: The Batch Insert Does work, but if i try to Upload 100 Entitys i have 33 JDBC Statements which are …

JPA Batch inserts with non auto generated id

im triying to batch insert a few Million Entitys. The Batch insert kind of works, but my programm executes a few JDBC Statements in the background which i dont want.

List < IceCream > iceList = new ArrayList < IceCream > ();

for (CSVRecord record: records) {
if (flushCounter > 40000) {

    iceCreamRepository.saveAll(iceList);
    iceList= new ArrayList < IceCream > ();
    flushCounter = 0;
}
flushCounter++;

IceCream iceCream = new IceCream();

int id = getIdFromCSV();
iceCream.setId(id);
iceCream.set...
    ...
iceList.add(iceCream);

}

my Repository:

public interface IceCreamRepository extends JpaRepository<IceCream, Long>
{
}

my Entity:

@Entity
@Table(name="IceCream")
public class IceCream 
{
   private static final long serialVersionUID = 1L;

   @OneToMany(targetEntity=entity.OtherEntity.class, mappedBy="IceCream")
   private Set<OtherEntity> otherEntitys = new HashSet<OtherEntity>();

   @Id
   private int id;

   @Basic
   @Column(name="import_tstamp")
   @Temporal(TemporalType.TIMESTAMP)
   private Date importTstamp;

   @Basic
   @Column(name="import_type", length=2147483647)
   private String importType;

   @Basic
   @Column(length=2147483647)
   private String text;

 ...

}

my JPA Settings:

spring.jpa.properties.hibernate.batch_versioned_data: true
spring.jpa.properties.hibernate.order_updates: true
spring.jpa.properties.hibernate.order_inserts: true
spring.jpa.properties.hibernate.generate_statistics: true
spring.jpa.properties.hibernate.jdbc.format_sql: true
spring.jpa.properties.hibernate.jdbc.batch_size: 1000

The Batch Insert Does work, but if i try to Upload 100 Entitys i have 33 JDBC Statements which are checking the id.

This is the output for 33 Entitys:

2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(1)]
2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(2)]
2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(3)]
2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(4)]
2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(5)]

… My Programm is trying to load the entitys but idk why, i havent inserted them yet. he does this for 32 id’s. For every id except the first one (0) after this output, There’s a batch insert for all 33 entitys …

2020-03-25 09:25:50.334 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:11, Success:True, Type:Prepared, Batch:True, QuerySize:1, BatchSize:33, Query:["insert into iceCream(import_tstamp, import_type, text, id) values (?, ?, ?, ?)"], Params:[(2020-03-25 09:25:50.127,ice,teext,0),(2020-03-25 09:25:50.127,ice,teext,1),(2020-03-25 09:25:50.127,ice,teext,2)...]

.. after that i get this summary:

2020-03-25 09:25:50.359 [scheduling-1] INFO  org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
    2222222 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    21234400 nanoseconds spent preparing 33 JDBC statements;
    40600005 nanoseconds spent executing 32 JDBC statements;
    27859771 nanoseconds spent executing 1 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    100978099 nanoseconds spent executing 1 flushes (flushing a total of 34 entities and 33 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

if i only use 1 entity the output is:

2020-03-25 11:17:40.119 [scheduling-1] INFO  org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
    1375995 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    12024409 nanoseconds spent preparing 1 JDBC statements;
    0 nanoseconds spent executing 0 JDBC statements;
    5597005 nanoseconds spent executing 1 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    38446070 nanoseconds spent executing 1 flushes (flushing a total of 1 entities and 1 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

for 2 Entitys it shows the following (my id starts at 0, so he only JDBC executes the second entity):

2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(1)]

2020-03-25 11:25:00.180 [scheduling-1] INFO  org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
    1446363 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    13101435 nanoseconds spent preparing 2 JDBC statements;
    11427142 nanoseconds spent executing 1 JDBC statements;
    3762785 nanoseconds spent executing 1 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    22309803 nanoseconds spent executing 1 flushes (flushing a total of 2 entities and 2 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

the output for 3 is

2020-03-25 11:47:00.277 [scheduling-1] INFO  org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
    1010843 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    31706133 nanoseconds spent preparing 3 JDBC statements;
    57180996 nanoseconds spent executing 2 JDBC statements;
    3839505 nanoseconds spent executing 1 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    23923340 nanoseconds spent executing 1 flushes (flushing a total of 3 entities and 3 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

… So i have two questions:

Why do I have all this JDBC statements when I only want one Batch insert? (and how do I fix this)
I tried this for a few Million entitys but I cant see any updates in the Database until the Programm is done. I do call the iceCreamRepository.saveAll(iceList); function every 4000 lines. I thought that this will write all the Entityts into the Database. My ram isnt hughe, i have a 10 gb file of Data and only 2 gb of ram. If the programm waits to write all the data till the end, why dont i run out of ram?

Answer

The answer is going to be a bit convoluted, but bear with me.

I do call the iceCreamRepository.saveAll(iceList)

From the above, I’m assuming you’re using Spring Data with JPA.

Why do I have all this JDBC statements when I only want one Batch insert? (and how do I fix this)

The implementation of JpaRepository.saveAll() is to call save() on each entity in the list, while the implementation of save() is as follows:

if (entityInformation.isNew(entity)) {
    em.persist(entity);
    return entity;
} else {
    return em.merge(entity);
}

The default implementation of EntityInformation ‘considers an entity to be new whenever EntityInformation.getId(Object) returns null’, meaning your entities fall into the second branch of the if ... else ... statement.

Effectively, Spring Data is telling JPA to merge the entities with their existing version in the DB. Thus, JPA needs to load that existing version first, and this is why you’re seeing all the additional querying.

To solve this issue, either:

make your entity implement Persistable, and return true from isNew() (note that this may affect the persisting logic elsewhere; see this link for more information)
OR inject and interact with EntityManager directly, calling persist() instead of merge()

I tried this for a few Million entitys but I cant see any updates in the Database until the Programm is done

For actual queries to get executed, you need to call EntityManager.flush() after each batch (if you choose not to interact with EntityManager directly, use JpaRepository.flush() instead)

(As a side note, JPA comes with a lot of overhead involving caching, conversions etc. and is generally a poor choice for batch operations. I’d consider switching to Spring Batch with JDBC if I were you)

Advertisement

Answer