I recently ran into a problem at work: how to import a large amount of data (100MB+) into a remote MySQL server.
Attempt 1:
Use Statement with executeBatch(), importing 1000 records per batch. Measured speed: 12s per 1000 rows, which is slow.
For 1M rows that means over 4 hours, and network conditions or database load can push the latency to 85s per 1000 rows or worse.
Poor results.
Attempt 2:
Use PreparedStatement, which requires specifying the "format" of the INSERT statement in advance.
Measured this way, the insert rate was only a few dozen rows per second.
However, after setting rewriteBatchedStatements to true, all 780,000 rows were imported in under a minute. This approach works.
Code:
```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PreparedStatementTestMain {
    private static PreparedStatement ps;

    public static void main(String[] args) {
        try {
            Class.forName("com.mysql.jdbc.Driver");
            // rewriteBatchedStatements=true is what makes the batch fast
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://remote-host/test?user=xxx&password=xxx&rewriteBatchedStatements=true");
            String sql = "insert into test values(?,?,?,?,?,?,?,?,?,?,?)";
            ps = conn.prepareStatement(sql);

            BufferedReader in = new BufferedReader(new FileReader("xxxx"));
            String line;
            int count = 0;
            while ((line = in.readLine()) != null) {
                count += 1;
                String[] values = line.split("\t", -1);
                // bind every column; JDBC parameter indexes are 1-based
                for (int i = 0; i < values.length; i++) {
                    ps.setString(i + 1, values[i]);
                }
                ps.addBatch();
                System.out.println("Line " + count);
            }
            ps.executeBatch();
            ps.close();
            in.close();
            conn.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
Attempt 3:
Use the mysqlimport tool. In my tests its speed is close to that of Attempt 2 with rewriteBatchedStatements enabled. The catch is that the data must first be saved to a file.
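For reference, a mysqlimport invocation for a tab-separated file might look like this (host, credentials, and file path are placeholders; this needs a live server, so it is a sketch only):

```shell
# Load /tmp/test.txt into table `test` of database `test`.
# mysqlimport names the target table after the file (minus the extension).
mysqlimport --local \
  --host=remote-host --user=xxx --password=xxx \
  --fields-terminated-by='\t' \
  test /tmp/test.txt
```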
Another idea:
Multi-threaded insertion.
Test:
```java
import java.sql.*;
import java.util.Properties;
import java.util.concurrent.*;

public class TestMultiThreadInsert {
    private static final String dbClassName = "com.mysql.jdbc.Driver";
    private static final String CONNECTION = "jdbc:mysql://host/";
    private static final String USER = "x";
    private static final String PASSWORD = "xxx";
    private static final int THREAD_NUM = 10;

    private static void executeSQL(Connection conn, String sql) throws SQLException {
        Statement stmt = conn.createStatement();
        stmt.execute(sql);
    }

    private static void ResetEnvironment() throws SQLException {
        Properties p = new Properties();
        p.put("user", USER);
        p.put("password", PASSWORD);
        Connection conn = DriverManager.getConnection(CONNECTION, p);
        for (String query : new String[]{
                "USE test",
                "CREATE TABLE IF NOT EXISTS MTI (ID INT AUTO_INCREMENT PRIMARY KEY, MASSAGE VARCHAR(9) NOT NULL)",
                "TRUNCATE TABLE MTI"}) {
            executeSQL(conn, query);
        }
    }

    private static void worker() {
        Properties properties = new Properties();
        properties.put("user", USER);
        properties.put("password", PASSWORD);
        try {
            Connection conn = DriverManager.getConnection(CONNECTION, properties);
            executeSQL(conn, "USE test");
            int inserted = 0; // per-thread counter for the progress message
            while (!Thread.interrupted()) {
                executeSQL(conn, "INSERT INTO MTI VALUES (NULL,'hello')");
                inserted++;
                System.out.println("Inserting " + inserted + " finished.");
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws ClassNotFoundException, SQLException, InterruptedException {
        Class.forName(dbClassName);
        ResetEnvironment();
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_NUM);
        for (int i = 0; i < THREAD_NUM; i++) {
            executor.submit(new Runnable() {
                public void run() {
                    worker();
                }
            });
        }
        Thread.sleep(20000);
        executor.shutdownNow();
        if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
            System.err.println("Pool did not terminate");
        }
    }
}
```
With 20 threads each inserting one row at a time: 2,923 rows in 20 seconds.
With 10 threads: 1,699 rows in 20 seconds.
With 1 thread: 330 rows in 20 seconds.
Prediction: combining multiple threads with PreparedStatement should raise the insert rate further.
However, multi-threaded insertion inevitably runs into one issue: write locks.
Although the program above does demonstrate that multi-threaded insertion is feasible, what is the logic behind it? It deserves a closer look.
The multiple threads in the code above correspond to multiple connections (see https://dev.mysql.com/doc/refman/5.5/en/connection-threads.html). The threads mainly increase the rate at which statements are submitted; they are not multiple execution threads on the server. To understand how the inserts are actually executed, we need to look at how InnoDB (the storage engine in use here) handles data insertion.
Searching for an answer turned up these possibly related keywords:
1. io_threads (https://dev.mysql.com/doc/refman/5.5/en/innodb-performance-multiple_io_threads.html),
2. locks,
3. the insert buffer (https://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-performance-change_buffering.html).
For the insert buffer, the key is to understand the following:
InnoDB uses the insert buffer to "trick" the database: changes to non-unique secondary indexes do not update the index leaf pages right away. Instead, multiple updates destined for the same page are buffered and merged into a single update, turning random I/O into sequential I/O. This avoids the performance cost of random I/O and improves the database's write performance.
To understand that, you first need to know what data structure InnoDB uses to store data.
The B+Tree!
There are plenty of explanations of B+Trees online, so I won't repeat them here.
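The merging idea behind the insert buffer can be illustrated with a toy sketch (plain Java, not InnoDB internals): buffer index changes keyed by page, then apply them page by page, so each touched page is written once instead of once per change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of the insert-buffer idea: many random per-entry updates
// are buffered and collapsed into one merge per page, in page order.
public class InsertBufferSketch {
    private final Map<Integer, List<String>> buffer = new TreeMap<>();
    private int pageWrites = 0; // how many "page writes" actually happened

    // Record a change to a secondary-index page without touching the page.
    public void bufferChange(int pageId, String entry) {
        buffer.computeIfAbsent(pageId, k -> new ArrayList<>()).add(entry);
    }

    // Apply all buffered changes: one combined write per distinct page.
    public int merge() {
        for (Map.Entry<Integer, List<String>> e : buffer.entrySet()) {
            pageWrites++; // one write absorbs e.getValue().size() changes
        }
        int pages = buffer.size();
        buffer.clear();
        return pages;
    }

    public int getPageWrites() { return pageWrites; }

    public static void main(String[] args) {
        InsertBufferSketch ib = new InsertBufferSketch();
        // six random updates that hit only two distinct pages
        ib.bufferChange(7, "k1"); ib.bufferChange(3, "k2");
        ib.bufferChange(7, "k3"); ib.bufferChange(3, "k4");
        ib.bufferChange(7, "k5"); ib.bufferChange(3, "k6");
        System.out.println("pages merged: " + ib.merge()); // 2 writes, not 6
    }
}
```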
As for io_threads: InnoDB uses multiple background threads to handle reads and writes of data pages.
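The number of these background threads is configurable; for example (values here are illustrative, not tuning advice):

```ini
# my.cnf -- InnoDB background I/O threads (MySQL 5.5+)
[mysqld]
innodb_read_io_threads=8   # threads servicing page read requests
innodb_write_io_threads=8  # threads servicing page write requests
```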
One question is still unanswered: locks!
(From the official MySQL glossary) locking
The system of protecting a transaction from seeing or changing data that is being queried or changed by other transactions. The locking strategy must balance reliability and consistency of database operations (the principles of the ACID philosophy) against the performance needed for good concurrency. Fine-tuning the locking strategy often involves choosing an isolation level and ensuring all your database operations are safe and reliable for that isolation level.
To improve read performance, InnoDB implements its own read-write lock. Its design rules are:
1. At any moment, multiple threads may read a variable in memory at the same time.
2. At any moment, only one thread may modify a variable in memory.
3. While any thread is reading a variable, no thread may write it.
4. While a thread is modifying a variable, no other thread may read it, and no thread other than the holder may write it (the holding thread may acquire the lock recursively).
5. When an rw_lock is held in read mode and a writer is waiting, any new read request must wait behind that pending write.
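Rules 1 to 3 map directly onto the JDK's ReentrantReadWriteLock, which can serve as a small sketch (fair mode also queues late readers behind a waiting writer, in the spirit of rule 5):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of read-write locking with the JDK primitive: many concurrent
// readers, but writers get exclusive access.
public class RwLockDemo {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair mode
    private int value = 0;

    public int read() {
        lock.readLock().lock();      // shared: many readers may hold this at once
        try { return value; }
        finally { lock.readLock().unlock(); }
    }

    public void write(int v) {
        lock.writeLock().lock();     // exclusive: blocks all readers and writers
        try { value = v; }
        finally { lock.writeLock().unlock(); }
    }

    public static void main(String[] args) throws InterruptedException {
        RwLockDemo d = new RwLockDemo();
        d.write(42);
        Runnable reader = () -> System.out.println("read: " + d.read());
        Thread t1 = new Thread(reader), t2 = new Thread(reader); // readers overlap freely
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```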
Given these locks, how can multi-threaded writes still improve efficiency?
One angle: make lock acquisition and hand-off cheaper!
How?
See http://www.2cto.com/database/201411/352586.html
and https://dev.mysql.com/doc/refman/5.5/en/innodb-performance-latching.html:
On many platforms, atomic operations can often be used to synchronize the actions of multiple threads more efficiently than Pthreads. Each operation to acquire or release a lock can be done in fewer CPU instructions, wasting less time when threads contend for access to shared data structures. This in turn means greater scalability on multi-core platforms. On platforms where the GCC, Windows, or Solaris functions for atomic memory access are not available, InnoDB uses the traditional Pthreads method of implementing mutexes and read/write locks.
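The same atomic-versus-mutex idea exists in Java: AtomicInteger synchronizes through a compare-and-swap instruction rather than by acquiring a lock. A minimal sketch:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Lock-free counter: each increment is a single CAS-based operation,
// with no mutex acquired, illustrating why atomics scale better under contention.
public class AtomicCounter {
    private final AtomicInteger n = new AtomicInteger();

    public int increment() {
        return n.incrementAndGet(); // internally a compare-and-swap retry loop
    }

    public int get() { return n.get(); }

    public static void main(String[] args) throws InterruptedException {
        AtomicCounter c = new AtomicCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < 1000; j++) c.increment(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get()); // 4000: no updates lost, no explicit lock used
    }
}
```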
mutex
Informal abbreviation for "mutex variable". (Mutex itself is short for "mutual exclusion".) The low-level object that InnoDB uses to represent and enforce exclusive-access locks to internal in-memory data structures. Once the lock is acquired, any other process, thread, and so on is prevented from acquiring the same lock. Contrast with rw-locks, which InnoDB uses to represent and enforce shared-access locks to internal in-memory data structures. Mutexes and rw-locks are known collectively as latches.
rw-lock
The low-level object that InnoDB uses to represent and enforce shared-access locks to internal in-memory data structures following certain rules. Contrast with mutexes, which InnoDB uses to represent and enforce exclusive access to internal in-memory data structures. Mutexes and rw-locks are known collectively as latches.
rw-lock types include s-locks (shared locks), x-locks (exclusive locks), and sx-locks (shared-exclusive locks).
An s-lock provides read access to a common resource.
An x-lock provides write access to a common resource while not permitting inconsistent reads by other threads.
An sx-lock provides write access to a common resource while permitting inconsistent reads by other threads. sx-locks were introduced in MySQL 5.7 to optimize concurrency and improve scalability for read-write workloads.
The following matrix summarizes rw-lock type compatibility.
|    | S          | SX         | X        |
|----|------------|------------|----------|
| S  | Compatible | Compatible | Conflict |
| SX | Compatible | Conflict   | Conflict |
| X  | Conflict   | Conflict   | Conflict |
Addendum:
Why exactly does rewriteBatchedStatements improve speed so much?
One explanation: it lets the driver pack multiple INSERT statements into a single packet when communicating with the MySQL server, greatly reducing network overhead.
Another explanation:
Rewriting Batches
“rewriteBatchedStatements=true”
Affects (Prepared)Statement.add/executeBatch()
Core concept - remove latency
Special treatment for prepared INSERT statements
——Mark Matthews - Sun Microsystems
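The rewriting itself can be pictured with a toy sketch (this is an illustration, not the actual Connector/J code): a batch of single-row INSERTs collapses into one multi-row INSERT on the wire.

```java
import java.util.List;

// Toy illustration of what rewriteBatchedStatements does conceptually:
// N batched parameter tuples become one multi-row INSERT statement.
public class BatchRewriteSketch {
    static String rewrite(String insertPrefix, List<String> valueTuples) {
        return insertPrefix + " VALUES " + String.join(",", valueTuples);
    }

    public static void main(String[] args) {
        String sql = rewrite("INSERT INTO test (a,b)",
                List.of("(1,'x')", "(2,'y')", "(3,'z')"));
        System.out.println(sql);
        // INSERT INTO test (a,b) VALUES (1,'x'),(2,'y'),(3,'z')
    }
}
```

One round trip then carries the whole batch, which is why the latency savings dominate on a remote server.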
PreparedStatement vs. Statement
If the JDBC driver supports it, the database precompiles the SQL statement in advance; the precompiled query can then be reused across future executions.