Loading data fast – regular insert vs. bulk insert

In my last post I talked about how persisting data can become the bottleneck on large high scale systems nowadays. I also talked about that more and more people tend to think that databases are simply slow, seeing them as just big I/O systems. And I talked about how lots of applications are still inserting data as they used to do years ago rather to use bulk inserts.

In this post I will show you how bulk inserts can actually boost your inserts and therefore you systems. I will use a simple example showing the difference between:

Single row insert with commit
Single row insert with only one final commit
Bulk insert with final commit

Let’s assume you have a java program that needs to load some data from a file into a single table. Each line in the file represents a row in the database. I won’t go into how to read from the file and build your data together. This is out of scope for this post and not relevant to show the benefit of bulk inserts over regular ones. First let’s build a simple two column table, including an “id” column as NUMBER and a “text” column as VARCHAR2:

CREATE TABLE TESTLOADTABLE (id NUMBER, text VARCHAR2(255));

table TESTLOADTABLE created.

For each test I truncate the table first just to make sure that I always load the same amount of data into the same empty table. I add a comment in the statements, so that I can separate them out later on in the trace file.

The first example loads 10,000 rows into the table. It will simply insert an incrementing counter and a string into the table followed by a commit.

conn.prepareStatement("TRUNCATE TABLE testloadtable").execute();
conn.prepareStatement("ALTER SESSION SET SQL_TRACE=TRUE").execute();

PreparedStatement stmt = conn.prepareStatement("INSERT /* conventional insert with commit */ INTO testloadtable (id, text) VALUES (?,?)");

for (int i=0;i<10000;i++)
{
  stmt.setInt(1, i);
  stmt.setString(2, "test" + i);
  stmt.execute();
  conn.commit();
}

conn.prepareStatement("ALTER SESSION SET SQL_TRACE=FALSE").execute();

Looking at the trace file it took the program 2.21 seconds to load these 10,000 rows. You can also see that the statement got actually executed 10,000 times – the commits unfortunately don’t show up in the formatted trace file but they would be listed in the raw data trace file.

 SQL ID: 337xy5qc84nsq Plan Hash: 0

INSERT /* conventional insert with commit */ INTO testloadtable (id, text)
 VALUES
 (:1 ,:2 )

call     count       cpu    elapsed       disk      query    current        rows
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 Parse        1      0.00       0.00          0          0          0           0
 Execute  10000      2.06       2.21          2         90      20527       10000
 Fetch        0      0.00       0.00          0          0          0           0
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 total    10001      2.06       2.21          2         90      20527       10000

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 111
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
 ---------- ---------- ----------  ---------------------------------------------------
 0          0          0  LOAD TABLE CONVENTIONAL  (cr=5 pr=2 pw=0 time=1637 us)

The next test puts the commit outside of the loop. So I still add the data row by row to the table but the commit itself happens only once after all rows were loaded:

conn.prepareStatement("TRUNCATE TABLE testloadtable").execute();
conn.prepareStatement("ALTER SESSION SET SQL_TRACE=TRUE").execute();

PreparedStatement stmt = conn.prepareStatement("INSERT /* conventional insert */ INTO testloadtable (id, text) VALUES (?,?)");

for (int i=0;i<10000;i++)
{
  stmt.setInt(1, i);
  stmt.setString(2, "test" + i);
  stmt.execute();
}
conn.commit();

conn.prepareStatement("ALTER SESSION SET SQL_TRACE=FALSE").execute();

As the results show the statement was still executed 10000 times. However, this time it took only 1.19 seconds to insert all the data. So by just moving the commit to the end, after all inserts were done, I gained already 57% more performance! Although it seems that commits are light weighted and don’t do much, the database still has some tasks to accomplish to make sure that your transaction is saved and visible. And of course instead of having only 1 transaction, I have 10,000 in this case.

 SQL ID: drsv4dw4037zj Plan Hash: 0

INSERT /* conventional insert */ INTO testloadtable (id, text)
 VALUES
 (:1 ,:2 )

call     count       cpu    elapsed       disk      query    current        rows
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 Parse        1      0.00       0.00          0          0          0           0
 Execute  10000      1.09       1.19          2        114      10562       10000
 Fetch        0      0.00       0.00          0          0          0           0
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 total    10001      1.09      1.19          2        114      10562       10000

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 111
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
 ---------- ---------- ----------  ---------------------------------------------------
 0          0          0  LOAD TABLE CONVENTIONAL  (cr=5 pr=2 pw=0 time=1507 us)

The Oracle JDBC driver does support bulk inserts. What you can do is that you add all your data to a “batch” and then execute the entire batch. This gives you the advantage that there is only 1 INSERT statement executed which inserts all your data into the table at once! So instead of 10,000 round-trips, I only have 1 sending all the data over in one big chunk:

conn.prepareStatement("TRUNCATE TABLE testloadtable").execute();
conn.prepareStatement("ALTER SESSION SET SQL_TRACE=TRUE").execute();

PreparedStatement stmt = conn.prepareStatement("INSERT /* addBatch insert */ INTO testloadtable (id, text) VALUES (?,?)");

for (int i=0;i<10000;i++)
{
  stmt.setInt(1, i);
  stmt.setString(2, "test" + i);
  stmt.addBatch();
}
stmt.executeBatch();
conn.commit();

conn.prepareStatement("ALTER SESSION SET SQL_TRACE=FALSE").execute();

Now these results are pretty amazing! There was only 1 execution of that insert statement and that loaded the entire batch in only 0.06 seconds!

 SQL ID: gfkg1d43va20y Plan Hash: 0

INSERT /* addBatch insert */ INTO testloadtable (id, text)
 VALUES
 (:1 ,:2 )

call     count       cpu    elapsed       disk      query    current        rows
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 Parse        1      0.00       0.00          0          0          0           0
 Execute      1      0.05       0.06          0        129        394       10000
 Fetch        0      0.00       0.00          0          0          0           0
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 total        2      0.05       0.06          0        129        394       10000

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 111
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
 ---------- ---------- ----------  ---------------------------------------------------
 0          0          0  LOAD TABLE CONVENTIONAL  (cr=139 pr=0 pw=0 time=51971 us)

As you can see, bulk inserting can give your system a huge performance boost! In this example here we are not talking about an improvement of percents anymore, but about factors! Imagine what this could gain you when you have even more data!

This post shows that persisting data into a database is not just simply slow but it also matters a lot how you actually insert that data!

11 thoughts on “Loading data fast – regular insert vs. bulk insert”

AndroidResearch says:

March 31, 2012 at 13:43

That is something new, might give it a try.

LikeLike

fcosfc says:

April 9, 2012 at 07:23

Hi Gerald,

Great post, but I think that just the last two cases are comparable from a logical point of view, because different statements are grouped in one transaction if they are part of a single unit of work and not because performance reasons.

Best regards,

Paco.

LikeLike

1. Venzi says:
  
  April 9, 2012 at 16:43
  
  Hi Paco,
  
  You are right, from a logical point of view the two scenarios are identical but not only those two, but all three of them. From a logical point of view all three statements insert 10,000 rows into that table. Either way does the job semantically correct. While the first scenario uses a lot of transactions to accomplish the task, the other two use only a single transaction. But the third one also only executed 1 INSERT statement rather than 10,000.
  
  My point is: Many ways lead to Rome but the question (from a performance point of view) is: Which one is the most efficient?
  There are plenty of programs out there that use the first or second scenario approach and struggle with performance problems. Here I show programmers another, much more efficient way of how to accomplish the job!
  
  Thanks,
  
  LikeLike
  
  1. fcosfc says:
    
    April 10, 2012 at 03:41
    
    Hi Gerald,
    
    But, what would happen if you get an exception in the first scenario? You could only rollback one insert, not the 10,000 ones, that’s what I mean.
    
    Regards,
    
    Paco.
    
    LikeLike
    
Venzi says:

April 10, 2012 at 18:03

Hi Paco,

Well, that is correct but why would you want to roll back the entire transaction anyways? Rows are independent from each other (there are some cases where they aren’t but those are really rare), why would you like to roll back 9,999 valid rows just because the 10,000th row had e.g. an unique key violation because the same data came already in from some previous record? It is actually that scenario that leads programmers to write code like in the first scenario or hopefully an enhanced second scenario where they catch the exception within the loop…

Thx,

LikeLike

Adrian says:

April 11, 2012 at 05:44

If you set autocommit on the statement prior to the final insert, you’ll save yourself one network round trip too. 😉

LikeLike

1. Venzi says:
  
  April 15, 2012 at 16:09
  
  Hi Adrian,
  
  No you won’t! Turning autocommit on on the connection also sends a commit to the database. Only the point in the code where it’s send is different! Furthermore, now you would have to introduce a check in the loop if you are at the last iteration which basically means that you will have something like “if (i+1 == 10000) { conn.setAutoCommit(true); }” inside your loop. So instead of reducing the work you introduce even more!
  
  LikeLike
  
Pingback: Loading data fast – Batch inserts and errors (DML error logging) « Venzi's Tech-Blog
Pingback: PL/SQL performance tuning with bulk SQL/binding and parallelism
Pingback: How to add my data to a temporary file so I can do a BULK INSERT with php for SQL server? – Ask your php questions
Pingback: Best Database specifications for my IOT solution

Loading data fast – regular insert vs. bulk insert

Published by Gerald

11 thoughts on “Loading data fast – regular insert vs. bulk insert”

Leave a reply to fcosfc Cancel reply

Share this:

Related

Published by Gerald

11 thoughts on “Loading data fast – regular insert vs. bulk insert”

Leave a reply to fcosfc Cancel reply