I'm trying to batch insert a few million entities. The batch insert kind of works, but my program executes a few JDBC statements in the background which I don't want.
```java
List<IceCream> iceList = new ArrayList<IceCream>();
for (CSVRecord record : records) {
    if (flushCounter > 40000) {
        iceCreamRepository.saveAll(iceList);
        iceList = new ArrayList<IceCream>();
        flushCounter = 0;
    }
    flushCounter++;
    IceCream iceCream = new IceCream();
    int id = getIdFromCSV();
    iceCream.setId(id);
    iceCream.set...
    ...
    iceList.add(iceCream);
}
```
My repository:

```java
public interface IceCreamRepository extends JpaRepository<IceCream, Long> {
}
```
My entity:

```java
@Entity
@Table(name = "IceCream")
public class IceCream {

    private static final long serialVersionUID = 1L;

    @OneToMany(targetEntity = entity.OtherEntity.class, mappedBy = "IceCream")
    private Set<OtherEntity> otherEntitys = new HashSet<OtherEntity>();

    @Id
    private int id;

    @Basic
    @Column(name = "import_tstamp")
    @Temporal(TemporalType.TIMESTAMP)
    private Date importTstamp;

    @Basic
    @Column(name = "import_type", length = 2147483647)
    private String importType;

    @Basic
    @Column(length = 2147483647)
    private String text;

    ...
}
```
My JPA settings:

```properties
spring.jpa.properties.hibernate.batch_versioned_data: true
spring.jpa.properties.hibernate.order_updates: true
spring.jpa.properties.hibernate.order_inserts: true
spring.jpa.properties.hibernate.generate_statistics: true
spring.jpa.properties.hibernate.jdbc.format_sql: true
spring.jpa.properties.hibernate.jdbc.batch_size: 1000
```
The batch insert does work, but if I try to upload 100 entities I get 33 JDBC statements that check the ID.
This is the output for 33 entities:
```
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(1)]
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(2)]
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(3)]
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(4)]
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(5)]
```
… My program is trying to load the entities, but I don't know why; I haven't inserted them yet. It does this for 32 IDs, i.e. for every ID except the first one (0). After this output, there's a batch insert for all 33 entities …
```
2020-03-25 09:25:50.334 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:11, Success:True, Type:Prepared, Batch:True, QuerySize:1, BatchSize:33, Query:["insert into iceCream(import_tstamp, import_type, text, id) values (?, ?, ?, ?)"], Params:[(2020-03-25 09:25:50.127,ice,teext,0),(2020-03-25 09:25:50.127,ice,teext,1),(2020-03-25 09:25:50.127,ice,teext,2)...]
```
… after that I get this summary:
```
2020-03-25 09:25:50.359 [scheduling-1] INFO org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
    2222222 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    21234400 nanoseconds spent preparing 33 JDBC statements;
    40600005 nanoseconds spent executing 32 JDBC statements;
    27859771 nanoseconds spent executing 1 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    100978099 nanoseconds spent executing 1 flushes (flushing a total of 34 entities and 33 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}
```
If I only use 1 entity, the output is:
```
2020-03-25 11:17:40.119 [scheduling-1] INFO org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
    1375995 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    12024409 nanoseconds spent preparing 1 JDBC statements;
    0 nanoseconds spent executing 0 JDBC statements;
    5597005 nanoseconds spent executing 1 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    38446070 nanoseconds spent executing 1 flushes (flushing a total of 1 entities and 1 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}
```
For 2 entities it shows the following (my IDs start at 0, so only the second entity is looked up via JDBC):
```
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(1)]
2020-03-25 11:25:00.180 [scheduling-1] INFO org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
    1446363 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    13101435 nanoseconds spent preparing 2 JDBC statements;
    11427142 nanoseconds spent executing 1 JDBC statements;
    3762785 nanoseconds spent executing 1 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    22309803 nanoseconds spent executing 1 flushes (flushing a total of 2 entities and 2 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}
```
The output for 3 is:
```
2020-03-25 11:47:00.277 [scheduling-1] INFO org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
    1010843 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    31706133 nanoseconds spent preparing 3 JDBC statements;
    57180996 nanoseconds spent executing 2 JDBC statements;
    3839505 nanoseconds spent executing 1 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    23923340 nanoseconds spent executing 1 flushes (flushing a total of 3 entities and 3 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}
```
… So I have two questions:
Why do I have all these JDBC statements when I only want one batch insert? (And how do I fix this?)
I tried this for a few million entities, but I can't see any updates in the database until the program is done. I do call iceCreamRepository.saveAll(iceList) every 4000 lines. I thought this would write those entities to the database. My RAM isn't huge: I have a 10 GB file of data and only 2 GB of RAM. If the program waits to write all the data until the end, why don't I run out of RAM?
Answer
The answer is going to be a bit convoluted, but bear with me.
> I do call `iceCreamRepository.saveAll(iceList)`
From the above, I’m assuming you’re using Spring Data with JPA.
> Why do I have all these JDBC statements when I only want one batch insert? (And how do I fix this?)
The implementation of `JpaRepository.saveAll()` is to call `save()` on each entity in the list, while the implementation of `save()` is as follows:

```java
if (entityInformation.isNew(entity)) {
    em.persist(entity);
    return entity;
} else {
    return em.merge(entity);
}
```
The default implementation of `EntityInformation` "considers an entity to be new whenever `EntityInformation.getId(Object)` returns `null`", meaning your entities fall into the second branch of the `if ... else ...` statement. (For primitive id types like your `int`, Spring Data treats only a value of 0 as new, which is why the first entity, with id 0, did not trigger a select.)
Effectively, Spring Data is telling JPA to merge the entities with their existing versions in the DB. Thus, JPA needs to load those existing versions first, and this is why you're seeing all the additional querying.
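To illustrate the difference in isolation, here is a hypothetical snippet (using the `IceCream` entity and an injected `EntityManager`; the ids are made up):

```java
// merge(): Hibernate must first select the current row for this id
// to reconcile state, even if the row does not exist yet.
IceCream detached = new IceCream();
detached.setId(42);
IceCream managed = em.merge(detached); // -> select ... where id=42, then insert

// persist(): the entity is simply scheduled for insertion,
// no lookup is issued.
IceCream fresh = new IceCream();
fresh.setId(43);
em.persist(fresh); // -> just an insert at flush time
```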
To solve this issue, either:

- make your entity implement `Persistable` and return `true` from `isNew()` (note that this may affect persisting logic elsewhere; a sketch follows right after this list), OR
- inject and interact with `EntityManager` directly, calling `persist()` instead of `merge()` (sketched further below)
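For the first option, here is a minimal sketch. It assumes the `IceCream` entity from the question; the `Persistable` type parameter is the boxed form of the `int` id:

```java
import java.util.Date;
import javax.persistence.*;
import org.springframework.data.domain.Persistable;

@Entity
@Table(name = "IceCream")
public class IceCream implements Persistable<Integer> {

    @Id
    private int id;

    @Basic
    @Column(name = "import_tstamp")
    @Temporal(TemporalType.TIMESTAMP)
    private Date importTstamp;

    // ... remaining fields, getters and setters as in the question ...

    @Override
    public Integer getId() {
        return id;
    }

    // Reporting every instance as new makes save()/saveAll() call
    // persist() instead of merge(), which eliminates the per-entity
    // select. Caution: saving an id that already exists in the table
    // will now fail with a constraint violation instead of updating.
    @Override
    public boolean isNew() {
        return true;
    }
}
```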
> I tried this for a few million entities, but I can't see any updates in the database until the program is done
For actual queries to get executed, you need to call `EntityManager.flush()` after each batch (if you choose not to interact with `EntityManager` directly, use `JpaRepository.flush()` instead).
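If you do go through `EntityManager`, a minimal sketch of the batched import could look like this (the service class name and `BATCH_SIZE` constant are illustrative, not from your code):

```java
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class IceCreamImportService {

    private static final int BATCH_SIZE = 1000; // keep in sync with hibernate.jdbc.batch_size

    @PersistenceContext
    private EntityManager em;

    @Transactional
    public void importAll(Iterable<IceCream> iceCreams) {
        int count = 0;
        for (IceCream iceCream : iceCreams) {
            em.persist(iceCream); // persist, not merge: no extra select
            if (++count % BATCH_SIZE == 0) {
                em.flush(); // send the batched inserts to the DB
                em.clear(); // detach flushed entities so they can be garbage collected
            }
        }
        em.flush(); // flush the final, partial batch
    }
}
```

The periodic `clear()` is also what keeps the persistence context (and your 2 GB of RAM) from filling up with millions of managed entities. Note that even flushed rows only become visible to other connections once the surrounding transaction commits.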
(As a side note, JPA comes with a lot of overhead involving caching, conversions, etc., and is generally a poor choice for batch operations. I'd consider switching to Spring Batch with plain JDBC if I were you.)
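For comparison, here is what a chunk write could look like with Spring's `JdbcTemplate` and no JPA involved at all (column names are taken from the entity above; the getters are assumed to exist):

```java
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

public class IceCreamJdbcWriter {

    private final JdbcTemplate jdbcTemplate;

    public IceCreamJdbcWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Sends the whole chunk as a single JDBC batch: no entity state
    // tracking, no merge-triggered selects, no persistence context.
    public void writeChunk(List<IceCream> chunk) {
        jdbcTemplate.batchUpdate(
            "insert into IceCream (id, import_tstamp, import_type, text) values (?, ?, ?, ?)",
            chunk,
            chunk.size(),
            (ps, iceCream) -> {
                ps.setInt(1, iceCream.getId());
                ps.setTimestamp(2, new java.sql.Timestamp(iceCream.getImportTstamp().getTime()));
                ps.setString(3, iceCream.getImportType());
                ps.setString(4, iceCream.getText());
            });
    }
}
```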