1. Linux File Systems
MongoDB uses large files for storing data, and preallocates these. These filesystems seem to work well:
ext4 (kernel version >= 2.6.23)
xfs (kernel version >= 2.6.25)
In addition to the file systems above, you might also want to explicitly disable file/directory access-time updates by using this mount option:
noatime (also enables nodiratime)
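As a sketch, a hypothetical /etc/fstab entry for a dedicated data volume (the device name and mount point are illustrative):
# hypothetical fstab entry for a dedicated MongoDB data volume
/dev/sdb1    /data    ext4    defaults,noatime    0    2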
We have found ext3 to be very slow at allocating files (and at removing them), and access within large files is also poor.
2. What Hardware?
MongoDB tends to run well on virtually all hardware. In fact, it was designed specifically with commodity hardware in mind (to facilitate cloud computing); that said, it works well on very large servers too. If you are about to buy hardware, here are a few suggestions:
1. Fast CPU clock speed is helpful.
2. Many cores help, but do not provide a high level of marginal return, so don't spend money on them. (This is both a consequence of the design of the program and of the fact that memory bandwidth can be a limiter; there isn't necessarily a lot of computation happening inside a database.)
3. NUMA is not very helpful, as memory access is not very localized in a database. Thus non-NUMA is recommended; or configure NUMA as detailed elsewhere in this document (see the sketch after this list).
4. RAM is good.
5. SSD is good. We have had good results and seen good price/performance with SATA SSDs; the (typically) more upscale PCI SSDs work fine too.
6. Commodity (SATA) spinning drives are often a good option, as the speed increase for random I/O from more expensive drives is not that dramatic (only on the order of 2x); spending that money on SSDs or RAM may be more effective.
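As a sketch of the NUMA configuration mentioned in point 3 (assuming numactl is installed; the paths are illustrative), the commonly cited approach is to interleave memory allocation across nodes and disable zone reclaim:
# start mongod with memory interleaved across NUMA nodes
numactl --interleave=all mongod --dbpath /data/db --logpath /var/log/mongod.log --fork
# disable zone reclaim
echo 0 > /proc/sys/vm/zone_reclaim_mode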
3. PCI vs. SATA
SSD is available both as PCI cards and as SATA drives. PCI is oriented towards the high end of products on the market. Some SATA SSD drives now support 6Gbps SATA transfer rates, yet at the time of this writing many controllers shipped with servers are 3Gbps. For random-IO-oriented applications this is likely sufficient, but it is worth considering regardless.
4. RAM vs. SSD
Even though SSDs are fast, RAM is still faster. Thus, for the highest possible performance, having enough RAM to contain the working set of data from the database is optimal. However, it is common to have a request rate that is easily met by the speed of random IOs on SSDs, and SSD cost per byte is lower than RAM (and persistent too).
A system with less RAM and SSDs will likely outperform a system with more RAM and spinning disks. For example, a system with SSD drives and 64GB RAM will often outperform a system with 128GB RAM and spinning disks. (Results will vary by use case, of course.)
One helpful characteristic of SSDs is that they can facilitate a fast "preheat" of RAM after a hardware restart. On a restart, a system's RAM file system cache must be repopulated. On a box with 64GB RAM or more this can take a considerable amount of time: reading 64GB at 100MB/sec takes roughly ten minutes, and much longer when the requests are random IO to spinning disks.
5. FlashCache
FlashCache is a write-back block cache for Linux, created by Facebook. Installation is a bit of work, as you have to build and install a kernel module. (Sep 2011: if you use this, please report results in the mongo forum, as it's new and everyone will be curious how well it works.)
6. OS scheduler
One user reports good results with the noop IO scheduler under certain configurations of their system. As always, caution is recommended with nonstandard configurations, as such configurations never get as much testing.
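For reference, a sketch of checking and changing a device's scheduler at runtime (the device name sdb is illustrative; a kernel boot parameter such as elevator=noop makes the choice persistent):
cat /sys/block/sdb/queue/scheduler
echo noop > /sys/block/sdb/queue/scheduler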
7. Virtualization
Generally MongoDB works very well in virtualized environments, with the exception of OpenVZ.
8. Replication
Two forms of replication are available, Replica Sets and Master-Slave. Use Replica Sets; replica sets are a functional superset of master/slave and are handled by much newer, more robust code.
9. Replica Set Design Concepts
If a majority of the set is not up or reachable, no member will be elected primary. There is no way to tell (from the set's point of view) the difference between a network partition and nodes going down, so members left in a minority will not attempt to become master (to prevent a set from ending up with masters on either side of a partition).
This means that, if there is no majority on either side of a network partition, the set will be read only (thus, we suggest an odd number of servers: e.g., two servers in one data center and one in another). The upshot of this strategy is that data is consistent: there are no multi-master conflicts to resolve.
There are several important concepts concerning data integrity with replica sets that you should be aware of:
1. A write is (cluster-wide) committed once it has replicated to a majority of members of the set.
For important writes, the client should request acknowledgement of this with a getLastError({w:...}) call. (If you do not call getLastError, the servers do exactly the same thing; the getLastError call is simply to get confirmation that committing is finished.) Use getLastError to ensure cluster-wide commits of critical writes; see the shell sketch after this list.
2. Queries in MongoDB and replica sets have "READ UNCOMMITTED" semantics.
Writes which are committed at the primary of the set may be visible before the cluster-wide commit completes. Read-uncommitted semantics (an option on many databases) are more relaxed and make the theoretically achievable performance and availability higher (for example, we never have an object locked in the server where the locking is dependent on network performance).
3. On a failover, if there are writes which have not replicated from the primary, those writes are rolled back. Thus, we use getLastError as in #1 above when we need to confirm a cluster-wide commit.
The rolled-back data is backed up to files in the rollback directory, although the assumption is that in most cases this data is never recovered, as that would require operator intervention. However, it is not "lost"; it can be manually applied at any time with mongorestore.
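As referenced in points 1 and 3, a minimal shell sketch of requesting acknowledgement of a majority commit (the collection name and wtimeout value are illustrative; on older servers a numeric w such as 2 can be used instead of "majority"):
> db.orders.insert({ _id: 1, status: "paid" })
> db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })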
Rationale
Merging back old operations later, after another node has accepted writes, is a hard problem. One then has multi-master replication, with potential for conflicting writes. Typically that is handled in other products by manual version-reconciliation code written by developers. We think that is too much work: we want MongoDB usage to be less developer work, not more. Multi-master can also make atomic operation semantics problematic.
It is possible (as mentioned above) to manually recover these events via manual DBA effort, but we believe that in a large system with many, many nodes such efforts become impractical.
Comments
Some drivers support 'safe' write modes for critical writes, for example via setWriteConcern in the Java driver.
Additionally, defaults for the { w : ... } parameter to getLastError can be set in the replica set's configuration.
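A sketch of setting such a default in the replica set configuration (the values shown are illustrative):
> config = rs.conf()
> config.settings = { getLastErrorDefaults: { w: "majority", wtimeout: 5000 } }
> config.version++
> rs.reconfig(config)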
Note that a call to getLastError will cause the client to wait for a response from the server. This can slow the client's write throughput if large numbers of calls are made, because of the client/server network turnaround times. Thus, for "non-critical" writes it often makes sense to make no getLastError check at all, or only a single check after many writes.
10. Replica Set Commands
11. Forcing a Member to be Primary
In v2.0+: for example, if we had members A, B, and C, with A the current primary, and we wanted B to be primary, we could give B a higher priority like so:
> config = rs.conf()
{
    "_id" : "foo",
    "version" : 1,
    "members" : [
        { "_id" : 0, "host" : "A" },
        { "_id" : 1, "host" : "B" },
        { "_id" : 2, "host" : "C" }
    ]
}
> config.version++
> // the default priority is 1
> config.members[1].priority = 2
> rs.reconfig(config)
Assuming B is synced to within 10 seconds of A, A will step down, B (and C) will catch up to where A is, and B will be elected primary.
If B is far behind A, A will not step down until B is within 10 seconds of its optime. This minimizes the amount of time there will be no primary on failover. If you do not care about how long the set is primary-less, you can force A to step down by running:
> db.adminCommand({ replSetStepDown: 1000000, force: 1 })
12. Changing Config Servers
Best Practices
A. Use three config servers.
B. Give them logical (DNS) names. Do not refer to them (on the mongos command line) by IP address; see the sketch after this list.
C. The DNS name should be logical (e.g. "mycluster_config2"), not a name that is VM-specific. Use a short DNS TTL (20 seconds, for example).
D. Act slowly when replacing a config server; your system should be fine with one down. Better to not have an operational typo and to let one be down for a while. If one is down, take some time, read this page, consider opening a support ticket, etc., before taking action. The metadata for the cluster is very important.
E. When replacing a config server, make sure the old process/machine is truly down before starting up the new one. Otherwise there may be legacy sockets still open to the old config server process, and that would be bad.
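A sketch of starting mongos against three config servers referenced by DNS name, as suggested in points A-C (the hostnames and port are illustrative):
mongos --configdb mycluster_config1.example.net:27019,mycluster_config2.example.net:27019,mycluster_config3.example.net:27019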
13. Simple Initial Sharding Architecture
Goal
Two datacenters (East=primary, West=backup/DR)
Data Tier (MongoDB)
3 shards
3 nodes per shard
9 hosts total
Application Tier
4 application servers
Machine Locations
e1-e6 are in one datacenter (East)
w1-w3 are in another datacenter (West)
Datacenter Roles
We'll use datacenter East as the primary, and datacenter West as disaster recovery.
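One possible way to lay the shards out across those hosts is as three replica sets, each with two members in East and one in West. This is only a sketch: the shard names, port, and exact host-to-shard mapping are illustrative, not prescribed above.
> sh.addShard("shard1/e1:27018,e2:27018,w1:27018")
> sh.addShard("shard2/e3:27018,e4:27018,w2:27018")
> sh.addShard("shard3/e5:27018,e6:27018,w3:27018")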
14. Licensing
Database: Free Software Foundation's GNU AGPL v3.0. Commercial licenses are also available from 10gen, including free evaluation licenses.
Drivers: mongodb.org supported drivers: Apache License v2.0. Third parties have created drivers too; licenses will vary there.