mongo_sync

A MongoDB real-time synchronizer, similar to py-mongo-sync.

Features

Support

MongoDB 3.6+ (the official MongoDB driver only supports MongoDB 3.6+).

Install

The recommended way to install mongo_sync is using cargo:

```shell
cargo +nightly install mongo_sync
```

You can also download a released binary.

Running tests

To run the integration tests, set SYNCER_TEST_SOURCE to a test MongoDB URI; otherwise mongodb://localhost:27017 will be used.
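
For example, a minimal invocation might look like the following sketch (it assumes the integration tests are driven by cargo, matching the nightly toolchain used in the install step):

```shell
# Point the integration tests at a throwaway MongoDB instance.
export SYNCER_TEST_SOURCE="mongodb://localhost:27017"
cargo +nightly test
```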

Example

To run the synchronizer, first start oplog_syncer to keep a real-time copy of the MongoDB oplog.

Then run db_sync to sync a database in real time.

oplog_syncer

```shell
./target/release/oplog_syncer --src-uri "mongodb://localhost:27017" --oplog-storage-uri "mongodb://localhost:27018/"
```

db_sync

```shell
db_sync --src-uri "mongodb://localhost:27017/?authSource=admin" --oplog-storage-uri "mongodb://localhost:27018/?authSource=admin" --target-uri "mongodb://localhost:27019" --db test_db
```

Note that the --oplog-storage-uri passed to oplog_syncer and db_sync must be the same.
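
One way to guarantee the URIs match is to put the oplog storage URI into a shell variable shared by both commands. The sketch below reuses the URIs and binary paths from the examples above; adjust them to your own layout:

```shell
# Shared oplog storage URI so oplog_syncer and db_sync always agree.
OPLOG_STORAGE_URI="mongodb://localhost:27018/?authSource=admin"

# Keep the oplog mirrored in real time (backgrounded here for brevity).
./target/release/oplog_syncer \
    --src-uri "mongodb://localhost:27017" \
    --oplog-storage-uri "$OPLOG_STORAGE_URI" &

# Sync the database itself.
./target/release/db_sync \
    --src-uri "mongodb://localhost:27017/?authSource=admin" \
    --oplog-storage-uri "$OPLOG_STORAGE_URI" \
    --target-uri "mongodb://localhost:27019" \
    --db test_db
```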

Usage help

oplog_syncer

```shell
USAGE:
    oplog_syncer [OPTIONS] --src-uri <src-uri> --oplog-storage-uri <oplog-storage-uri>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
        --log-path <log-path>                      log file path, if not specified, all log information will be output to stdout
    -o, --oplog-storage-uri <oplog-storage-uri>    target oplog storage uri
    -s, --src-uri <src-uri>                        source database uri, must be a mongodb cluster
```

db_sync

```shell
USAGE:
    db_sync [OPTIONS] --src-uri <src-uri> --target-uri <target-uri> --oplog-storage-uri <oplog-storage-uri> --db <db>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
        --collection-concurrent <collection-concurrent>    how many threads to sync a database
    -c, --colls <colls>...                                  collections to sync, default sync all collections inside a database
    -d, --db <db>                                           database to sync
        --doc-concurrent <doc-concurrent>                   how many threads to sync a collection
        --log-path <log-path>                               log file path, if not specified, all log information will be output to stdout
    -o, --oplog-storage-uri <oplog-storage-uri>             mongodb uri which save oplogs, it's saved by `oplog_syncer` binary
    -s, --src-uri <src-uri>                                 source mongodb uri
    -t, --target-uri <target-uri>                           target mongodb uri
```

The basic architecture diagram

```
┌───────────────┐
│   target db   │
└───────┬───────┘
        │
        ▼
   ┌─────────┐
   │ db_sync │
   └─────────┘
    ▲       ▲
    │       │
    │       │ Incr dump
    │       │ (Real time)
    │       │
    │       │   ┌──────────────────┐
    │       └───┤ oplog storage db │
    │           └────────▲─────────┘
    │                    │
Full│             ┌──────┴──────┐
dump│             │ Oplog syncer│  Sync oplog from source cluster
    │             └──────▲──────┘  to oplog storage in real time
    │                    │
    │             ┌──────┴───────┐
    └─────────────┤Source cluster│
                  └──────────────┘
```

As the diagram shows, mongo_sync provides two basic programs:

1. oplog_syncer: syncs the source MongoDB cluster's oplog to the target oplog storage db.
2. db_sync: syncs data from the source cluster to the target db.

Benchmark

This is not a strict benchmark; I just tested it manually.

Scenario:

When the source cluster inserts 50,000 records, how long does it take for the target db to catch up with those 50,000 inserts?
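
A load of this shape can be generated with a few lines of mongosh against the source cluster. The database and collection names below are placeholders for illustration, not anything mongo_sync requires:

```shell
# Insert 50,000 small documents into the source cluster (names are illustrative only).
mongosh "mongodb://localhost:27017/test_db" --eval '
  const docs = Array.from({ length: 50000 }, (_, i) => ({ i, payload: "x".repeat(64) }));
  db.items.insertMany(docs);
'
```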

My testing result:

db_sync takes about 50 seconds to sync these updates, while py-mongo-sync takes about 225 seconds. In general, db_sync is about 3.5x faster than py-mongo-sync.

Note that the 50 seconds is not an accurate figure; it depends heavily on your database and the performance of the machine running the sync.

How does the core work?

Notes

  1. During incremental sync, for now only the following commands are supported (which is enough for my personal use):
     - rename collection
     - drop collection
     - create collection
     - drop indexes
     - create indexes
  2. I haven't tested MongoDB sharding as the target, but it should work.
  3. While oplog_syncer is running, the oplog storage instance will create and use a database named source_oplog, with a collection also named source_oplog. For now this is hardcoded.
  4. While db_sync is running, the target database gets a new collection named oplog_records, which stores the latest oplog timestamp applied to the database (see the example below).
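
A quick way to peek at both of these collections is to query them with mongosh. The ports and the test_db name below follow the earlier examples, and the document layout is simply whatever the binaries write; adjust to your setup:

```shell
# Latest oplog entry copied by oplog_syncer into the hardcoded source_oplog db/collection.
mongosh "mongodb://localhost:27018/source_oplog" --eval 'db.source_oplog.find().sort({ $natural: -1 }).limit(1)'

# Bookmark collection written by db_sync into the target database.
mongosh "mongodb://localhost:27019/test_db" --eval 'db.oplog_records.find().limit(1)'
```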