1. Linux File Systems
MongoDB uses large files for storing data, and preallocates these. These filesystems seem to work well:
ext4 (kernel version >= 2.6.23)
xfs (kernel version >= 2.6.25)
In addition to the file systems above, you might also want to explicitly disable file/directory access-time updates by using this mount option:
noatime (also enables nodiratime)
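As a sketch, a hypothetical /etc/fstab entry for a dedicated data volume (the device name and mount point are illustrative):
# hypothetical fstab entry for a dedicated MongoDB data volume
/dev/sdb1    /data    ext4    defaults,noatime    0    2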
We have found ext3 to be very slow at allocating files (and at removing them), and access within large files is also poor.
2. What Hardware?
MongoDB tends to run well on virtually all hardware. In fact, it was designed specifically with commodity hardware in mind (to facilitate cloud computing); that said, it works well on very large servers too. If you are about to buy hardware, here are a few suggestions:
1. Fast CPU clock speed is helpful.
2. Many cores help, but do not provide a high level of marginal return, so don't spend money on them. (This is both a consequence of the design of the program and of the fact that memory bandwidth can be a limiter; there isn't necessarily a lot of computation happening inside a database.)
3. NUMA is not very helpful, as memory access is not very localized in a database. Thus non-NUMA is recommended; or configure NUMA as detailed elsewhere in this document (see the sketch after this list).
4. RAM is good.
5. SSD is good. We have had good results and seen good price/performance with SATA SSDs; the (typically) more upscale PCI SSDs work fine too.
6. Commodity (SATA) spinning drives are often a good option, as the speed increase for random I/O from more expensive drives is not that dramatic (only on the order of 2x); spending that money on SSDs or RAM may be more effective.
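As a sketch of the NUMA configuration mentioned in point 3 (assuming numactl is installed; the paths are illustrative), the commonly cited approach is to interleave memory allocation across nodes and disable zone reclaim:
# start mongod with memory interleaved across NUMA nodes
numactl --interleave=all mongod --dbpath /data/db --logpath /var/log/mongod.log --fork
# disable zone reclaim
echo 0 > /proc/sys/vm/zone_reclaim_mode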
3. PCI vs. SATA
SSD is available both as PCI cards and as SATA drives. PCI is oriented towards the high end of products on the market. Some SATA SSD drives now support 6Gbps SATA transfer rates, yet at the time of this writing many controllers shipped with servers are 3Gbps. For random-IO-oriented applications this is likely sufficient, but it is worth considering regardless.
4. RAM vs. SSD
Even though SSDs are fast, RAM is still faster. Thus, for the highest possible performance, having enough RAM to contain the working set of data from the database is optimal. However, it is common to have a request rate that is easily met by the speed of random IOs on SSDs, and SSD cost per byte is lower than RAM (and persistent too).
A system with less RAM and SSDs will likely outperform a system with more RAM and spinning disks. For example, a system with SSD drives and 64GB RAM will often outperform a system with 128GB RAM and spinning disks. (Results will vary by use case, of course.)
One helpful characteristic of SSDs is that they can facilitate a fast "preheat" of RAM after a hardware restart. On a restart, a system's RAM file system cache must be repopulated. On a box with 64GB RAM or more this can take a considerable amount of time: reading 64GB at 100MB/sec takes roughly ten minutes, and much longer when the requests are random IO to spinning disks.
5. FlashCache
FlashCache is a write-back block cache for Linux, created by Facebook. Installation is a bit of work, as you have to build and install a kernel module. (Sep 2011: if you use this, please report results in the mongo forum, as it's new and everyone will be curious how well it works.)
6. OS scheduler
One user reports good results with the noop IO scheduler under certain configurations of their system. As always, caution is recommended with nonstandard configurations, as such configurations never get as much testing.
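For reference, a sketch of checking and changing a device's scheduler at runtime (the device name sdb is illustrative; a kernel boot parameter such as elevator=noop makes the choice persistent):
cat /sys/block/sdb/queue/scheduler
echo noop > /sys/block/sdb/queue/scheduler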
7. Virtualization
Generally MongoDB works very well in virtualized environments, with the exception of OpenVZ.
8. Replication
Two forms of replication are available, Replica Sets and Master-Slave. Use Replica Sets; replica sets are a functional superset of master/slave and are handled by much newer, more robust code.
9. Replica Set Design Concepts
If a majority of the set is not up or reachable, no member will be elected primary. There is no way to tell (from the set's point of view) the difference between a network partition and nodes going down, so members left in a minority will not attempt to become master (to prevent a set from ending up with masters on either side of a partition).
This means that, if there is no majority on either side of a network partition, the set will be read only (thus, we suggest an odd number of servers: e.g., two servers in one data center and one in another). The upshot of this strategy is that data is consistent: there are no multi-master conflicts to resolve.
There are several important concepts concerning data integrity with replica sets that you should be aware of:
1. A write is (cluster-wide) committed once it has replicated to a majority of members of the set.
For important writes, the client should request acknowledgement of this with a getLastError({w:...}) call. (If you do not call getLastError, the servers do exactly the same thing; the getLastError call is simply to get confirmation that committing is finished.) Use getLastError to ensure cluster-wide commits of critical writes; see the shell sketch after this list.
2. Queries in MongoDB and replica sets have "READ UNCOMMITTED" semantics.
Writes which are committed at the primary of the set may be visible before the cluster-wide commit completes. Read-uncommitted semantics (an option on many databases) are more relaxed and make the theoretically achievable performance and availability higher (for example, we never have an object locked in the server where the locking is dependent on network performance).
3. On a failover, if there are writes which have not replicated from the primary, those writes are rolled back. Thus, we use getLastError as in #1 above when we need to confirm a cluster-wide commit.
The rolled-back data is backed up to files in the rollback directory, although the assumption is that in most cases this data is never recovered, as that would require operator intervention. However, it is not "lost"; it can be manually applied at any time with mongorestore.
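As referenced in points 1 and 3, a minimal shell sketch of requesting acknowledgement of a majority commit (the collection name and wtimeout value are illustrative; on older servers a numeric w such as 2 can be used instead of "majority"):
> db.orders.insert({ _id: 1, status: "paid" })
> db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })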
Rationale
Merging back old operations later, after another node has accepted writes, is a hard problem. One then has multi-master replication, with potential for conflicting writes. Typically that is handled in other products by manual version-reconciliation code written by developers. We think that is too much work: we want MongoDB usage to be less developer work, not more. Multi-master can also make atomic operation semantics problematic.
It is possible (as mentioned above) to manually recover these events via manual DBA effort, but we believe that in a large system with many, many nodes such efforts become impractical.
Comments
Some drivers support 'safe' write modes for critical writes, for example via setWriteConcern in the Java driver.
Additionally, defaults for the { w : ... } parameter to getLastError can be set in the replica set's configuration.
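A sketch of setting such a default in the replica set configuration (the values shown are illustrative):
> config = rs.conf()
> config.settings = { getLastErrorDefaults: { w: "majority", wtimeout: 5000 } }
> config.version++
> rs.reconfig(config)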
Note that a call to getLastError will cause the client to wait for a response from the server. This can slow the client's write throughput if large numbers of calls are made, because of the client/server network turnaround times. Thus, for "non-critical" writes it often makes sense to make no getLastError check at all, or only a single check after many writes.
10. Replica Set Commands
11. Forcing a Member to be Primary
In v2.0+: for example, if we had members A, B, and C, with A the current primary, and we wanted B to be primary, we could give B a higher priority like so:
> config = rs.conf()
{
    "_id" : "foo",
    "version" : 1,
    "members" : [
        { "_id" : 0, "host" : "A" },
        { "_id" : 1, "host" : "B" },
        { "_id" : 2, "host" : "C" }
    ]
}
> config.version++
> // the default priority is 1
> config.members[1].priority = 2
> rs.reconfig(config)
Assuming B is synced to within 10 seconds of A, A will step down, B (and C) will catch up to where A is, and B will be elected primary.
If B is far behind A, A will not step down until B is within 10 seconds of its optime. This minimizes the amount of time there will be no primary on failover. If you do not care about how long the set is primary-less, you can force A to step down by running:
> db.adminCommand({ replSetStepDown: 1000000, force: 1 })
12. Changing Config Servers
Best Practices
A. Use three config servers.
B. Give them logical (DNS) names. Do not refer to them (on the mongos command line) by IP address; see the sketch after this list.
C. The DNS name should be logical (e.g. "mycluster_config2"), not a name that is VM-specific. Use a short DNS TTL (20 seconds, for example).
D. Act slowly when replacing a config server; your system should be fine with one down. Better to not have an operational typo and to let one be down for a while. If one is down, take some time, read this page, consider opening a support ticket, etc., before taking action. The metadata for the cluster is very important.
E. When replacing a config server, make sure the old process/machine is truly down before starting up the new one. Otherwise there may be legacy sockets still open to the old config server process, and that would be bad.
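A sketch of starting mongos against three config servers referenced by DNS name, as suggested in points A-C (the hostnames and port are illustrative):
mongos --configdb mycluster_config1.example.net:27019,mycluster_config2.example.net:27019,mycluster_config3.example.net:27019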
13. Simple Initial Sharding Architecture
Goal
Two datacenters (East=primary, West=backup/DR)
Data Tier (MongoDB)
3 shards
3 nodes per shard
9 hosts total
Application Tier
4 application servers
Machine Locations
e1-e6 are in one datacenter (East)
w1-w3 are in another datacenter (West)
Datacenter Roles
We'll use datacenter East as the primary, and datacenter West as disaster recovery.
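One possible way to lay the shards out across those hosts is as three replica sets, each with two members in East and one in West. This is only a sketch: the shard names, port, and exact host-to-shard mapping are illustrative, not prescribed above.
> sh.addShard("shard1/e1:27018,e2:27018,w1:27018")
> sh.addShard("shard2/e3:27018,e4:27018,w2:27018")
> sh.addShard("shard3/e5:27018,e6:27018,w3:27018")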
14. Licensing
Database: Free Software Foundation's GNU AGPL v3.0. Commercial licenses are also available from 10gen, including free evaluation licenses.
Drivers: mongodb.org supported drivers: Apache License v2.0. Third parties have created drivers too; licenses will vary there.