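Problem: there are two 10 GB databases, each storing a set of strings, and some of the data is duplicated between the two. Merge them into a single database and remove the duplicates.
Constraint: only 4 GB of memory is available.
Example: DB1: cmu, ucb, stanford, nyu
DB2: ucsb, ucb, ucsd, cmu
After merging, the result should be: DB: cmu, ucb, stanford, nyu, ucsb, ucsd.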
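Approach: split DB1 into 5 smaller databases: DB11, DB12, DB13, DB14, DB15.
Split DB2 into 5 smaller databases: DB21, DB22, DB23, DB24, DB25.
UNION DB11 with each of DB21, DB22, DB23, DB24, DB25 in turn to produce DB11Merge.
UNION DB12 with each of DB21, DB22, DB23, DB24, DB25 in turn to produce DB12Merge.
......
Finally, combine DB11Merge, DB12Merge, DB13Merge, DB14Merge, DB15Merge into one database. This last step must also remove duplicates, because every DB1xMerge contains the rows that came from DB2.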
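The splitting idea above can also be done with hash partitioning, which is a different technique from the pairwise UNION the post describes: every string is routed to a bucket chosen by its hash, so equal strings always land in the same bucket and each bucket can be deduplicated on its own within the memory limit. The following is only a minimal sketch under assumptions: the two databases are modeled as text files with one string per line, and `bucket_of`, `merge_dedup`, and the bucket count `k` are hypothetical names chosen for illustration.

```python
import hashlib
import os
import tempfile


def bucket_of(s: str, k: int) -> int:
    """Stable bucket index: equal strings always map to the same bucket."""
    digest = hashlib.md5(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % k


def merge_dedup(db1_path: str, db2_path: str, out_path: str, k: int = 8) -> None:
    """Merge two one-string-per-line files into out_path without duplicates.

    Assumption: k is chosen large enough that each bucket fits in memory.
    """
    with tempfile.TemporaryDirectory() as tmp:
        bucket_paths = [os.path.join(tmp, f"bucket{i}.txt") for i in range(k)]
        buckets = [open(p, "w", encoding="utf-8") for p in bucket_paths]
        try:
            # Pass 1: stream both inputs and route each string to its bucket file.
            for path in (db1_path, db2_path):
                with open(path, encoding="utf-8") as src:
                    for line in src:
                        s = line.rstrip("\n")
                        if s:
                            buckets[bucket_of(s, k)].write(s + "\n")
        finally:
            for f in buckets:
                f.close()

        # Pass 2: deduplicate one bucket at a time using an in-memory set.
        with open(out_path, "w", encoding="utf-8") as out:
            for p in bucket_paths:
                with open(p, encoding="utf-8") as f:
                    seen = {line.rstrip("\n") for line in f}
                for s in sorted(seen):
                    out.write(s + "\n")
```

Because duplicates can never straddle two buckets, the per-bucket results can simply be concatenated, and no final deduplication pass is needed.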
The following statement is all that is needed:
mysql> insert into merge select * from persons2;
The experiment results are shown below:
A UNION query returns only distinct rows. (There is also UNION ALL, but that would include duplicate rows, so you don't want it here.)
mysql> select * from persons2;
+-----------+
| FirstName |
+-----------+
| zelin     |
| qihao     |
+-----------+
2 rows in set (0.00 sec)

mysql> select * from persons;
+-----------+
| FirstName |
+-----------+
| yu        |
| zhixu     |
| zelin     |
+-----------+
3 rows in set (0.00 sec)

mysql> select * from persons union select * from persons2;
+-----------+
| FirstName |
+-----------+
| yu        |
| zhixu     |
| zelin     |
| qihao     |
+-----------+
4 rows in set (0.00 sec)
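To try the UNION experiment without a MySQL server, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in. This is an assumption made for convenience: SQLite is not MySQL, although `UNION` removes duplicate rows in both. The function name `merge_with_union` and the table name `merged` are hypothetical; `persons` and `persons2` mirror the tables in the transcript.

```python
import sqlite3


def merge_with_union(rows1, rows2):
    """Load two name lists into tables and merge them with UNION (distinct rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE persons (FirstName TEXT)")
    conn.execute("CREATE TABLE persons2 (FirstName TEXT)")
    conn.execute("CREATE TABLE merged (FirstName TEXT)")
    conn.executemany("INSERT INTO persons VALUES (?)", [(r,) for r in rows1])
    conn.executemany("INSERT INTO persons2 VALUES (?)", [(r,) for r in rows2])

    # UNION (unlike UNION ALL) keeps only one copy of each duplicated row.
    conn.execute(
        "INSERT INTO merged "
        "SELECT FirstName FROM persons UNION SELECT FirstName FROM persons2"
    )
    result = [
        row[0]
        for row in conn.execute("SELECT FirstName FROM merged ORDER BY FirstName")
    ]
    conn.close()
    return result


if __name__ == "__main__":
    print(merge_with_union(["yu", "zhixu", "zelin"], ["zelin", "qihao"]))
```

The `ORDER BY` is there only to make the result deterministic; `UNION` itself does not guarantee any row order.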
While we are at it, here are a few statements commonly used for merging data in databases:
http://www.w3schools.com/sql/sql_join.asp
An SQL JOIN clause is used to combine rows from two or more tables, based on a common field between them.
The most common type of join is the SQL INNER JOIN (simple join). An SQL INNER JOIN returns all rows from multiple tables where the join condition is met.
Let's look at a selection from the "Orders" table:
OrderID | CustomerID | OrderDate |
---|---|---|
10308 | 2 | 1996-09-18 |
10309 | 37 | 1996-09-19 |
10310 | 77 | 1996-09-20 |
Then, have a look at a selection from the "Customers" table:
CustomerID | CustomerName | ContactName | Country |
---|---|---|---|
1 | Alfreds Futterkiste | Maria Anders | Germany |
2 | Ana Trujillo Emparedados y helados | Ana Trujillo | Mexico |
3 | Antonio Moreno Taquería | Antonio Moreno | Mexico |
Notice that the "CustomerID" column in the "Orders" table refers to the "CustomerID" in the "Customers" table. The relationship between the two tables above is the "CustomerID" column.
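Then, if we run the following SQL statement (that contains an INNER JOIN):
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate FROM Orders INNER JOIN Customers ON Orders.CustomerID=Customers.CustomerID;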
it will produce something like this:
OrderID | CustomerName | OrderDate |
---|---|---|
10308 | Ana Trujillo Emparedados y helados | 9/18/1996 |
10365 | Antonio Moreno Taquería | 11/27/1996 |
10383 | Around the Horn | 12/16/1996 |
10355 | Around the Horn | 11/15/1996 |
10278 | Berglunds snabbköp | 8/12/1996 |
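As a runnable illustration of the INNER JOIN described above, the sketch below loads only the three sample rows shown for each table into an in-memory SQLite database (an assumption: `sqlite3` stands in for the database, and the function name `inner_join_orders` is hypothetical). With just these rows, only order 10308 has a matching customer, so the output is much smaller than the table above, which comes from the full data set.

```python
import sqlite3


def inner_join_orders():
    """Join the sample Orders and Customers rows on CustomerID."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER, OrderDate TEXT)"
    )
    conn.execute(
        "CREATE TABLE Customers (CustomerID INTEGER, CustomerName TEXT, "
        "ContactName TEXT, Country TEXT)"
    )
    conn.executemany(
        "INSERT INTO Orders VALUES (?, ?, ?)",
        [
            (10308, 2, "1996-09-18"),
            (10309, 37, "1996-09-19"),
            (10310, 77, "1996-09-20"),
        ],
    )
    conn.executemany(
        "INSERT INTO Customers VALUES (?, ?, ?, ?)",
        [
            (1, "Alfreds Futterkiste", "Maria Anders", "Germany"),
            (2, "Ana Trujillo Emparedados y helados", "Ana Trujillo", "Mexico"),
            (3, "Antonio Moreno Taquería", "Antonio Moreno", "Mexico"),
        ],
    )
    # INNER JOIN keeps only the rows whose CustomerID appears in both tables.
    rows = conn.execute(
        "SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate "
        "FROM Orders INNER JOIN Customers "
        "ON Orders.CustomerID = Customers.CustomerID "
        "ORDER BY Orders.OrderID"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in inner_join_orders():
        print(row)
```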
Before we continue with examples, here are the different types of SQL JOINs you can use: INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.
MySQL has no FULL JOIN statement, so we need to emulate it with UNION:
mysql> SELECT * FROM persons LEFT JOIN persons2 ON persons.firstName=persons2.firstName UNION SELECT * FROM persons RIGHT JOIN persons2 ON persons.firstName=persons2.firstName;
+-----------+-----------+
| FirstName | FirstName |
+-----------+-----------+
| zelin | zelin |
| yu | NULL |
| zhixu | NULL |
| NULL | qihao |
+-----------+-----------+
4 rows in set (0.00 sec)
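The same full-join emulation can be reproduced in a short script. This is a sketch under assumptions: it uses Python's `sqlite3` instead of MySQL, and because older SQLite builds lack `RIGHT JOIN`, the second half is written as a `LEFT JOIN` with the tables swapped, which returns the same rows. The function name `full_join_names` is hypothetical.

```python
import sqlite3


def full_join_names(rows1, rows2):
    """Emulate FULL JOIN as (left join) UNION (left join with tables swapped)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE persons (FirstName TEXT)")
    conn.execute("CREATE TABLE persons2 (FirstName TEXT)")
    conn.executemany("INSERT INTO persons VALUES (?)", [(r,) for r in rows1])
    conn.executemany("INSERT INTO persons2 VALUES (?)", [(r,) for r in rows2])

    query = """
        SELECT p.FirstName, p2.FirstName
        FROM persons AS p
        LEFT JOIN persons2 AS p2 ON p.FirstName = p2.FirstName
        UNION
        SELECT p.FirstName, p2.FirstName
        FROM persons2 AS p2
        LEFT JOIN persons AS p ON p.FirstName = p2.FirstName
    """
    # UNION drops the matched pairs that both halves produce.
    rows = set(conn.execute(query).fetchall())
    conn.close()
    return rows


if __name__ == "__main__":
    for pair in full_join_names(["yu", "zhixu", "zelin"], ["zelin", "qihao"]):
        print(pair)
```

Unmatched names show up paired with `None`, which corresponds to the `NULL` cells in the MySQL output.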
The REPLACE statement can also be used to remove duplicates, provided that the column we want to deduplicate on is set as the PRIMARY KEY.
REPLACE [LOW_PRIORITY | DELAYED]
    [INTO] tbl_name [(col_name,...)]
    {VALUES | VALUE} ({expr | DEFAULT},...),(...),...
Or:
REPLACE [LOW_PRIORITY | DELAYED]
    [INTO] tbl_name
    SET col_name={expr | DEFAULT}, ...
Or:
REPLACE [LOW_PRIORITY | DELAYED]
    [INTO] tbl_name [(col_name,...)]
    SELECT ...
REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted. See Section 13.2.5, “INSERT Syntax”.
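A minimal sketch of deduplicating with REPLACE and a primary key follows. It again relies on Python's `sqlite3` as a stand-in (an assumption: SQLite accepts `REPLACE INTO` as an alias for `INSERT OR REPLACE`, which behaves like the MySQL statement quoted above for this simple case). The function name `merge_with_replace` and the table name `merged` are hypothetical.

```python
import sqlite3


def merge_with_replace(*sources):
    """Insert every name from every source; the PRIMARY KEY keeps one row per name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE merged (FirstName TEXT PRIMARY KEY)")
    for rows in sources:
        # A row whose key already exists is deleted and re-inserted,
        # so the table never holds two rows with the same FirstName.
        conn.executemany(
            "REPLACE INTO merged (FirstName) VALUES (?)", [(r,) for r in rows]
        )
    result = [
        row[0]
        for row in conn.execute("SELECT FirstName FROM merged ORDER BY FirstName")
    ]
    conn.close()
    return result


if __name__ == "__main__":
    print(merge_with_replace(["yu", "zhixu", "zelin"], ["zelin", "qihao"]))
```

Compared with UNION, this pushes the deduplication into the index on the key, so the sources can be loaded one batch at a time.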