1. Optimizing Object IDs (using the mongodb ObjectId() method; BSON ObjectIds begin with a timestamp)
The _id field in a MongoDB document is very important and is always indexed for normal collections. This page lists some recommendations. Note that it is common to use the BSON ObjectId datatype for _id's, but the value of an _id field can be of any type.
Use the collection's natural primary key in the _id field. _id's can be any type, so if your objects have a natural unique identifier, consider using that in _id to both save space and avoid an additional index.
When possible, use _id values that are roughly in ascending order. If the _id's are in a somewhat well defined order, the entire b-tree for the _id index need not be loaded on inserts. BSON ObjectIds have this property.
Store binary GUIDs as BinData, rather than as hex-encoded strings. BSON includes a binary data datatype for storing byte arrays. Using it makes the _id values, and their respective keys in the _id index, half the size.
Note that unlike the BSON ObjectId type (see above), most UUIDs are not in roughly ascending order, which creates additional caching needs for their index.
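For example, the mongo shell's UUID() helper wraps a hex string in a BinData value (the collection and value here are illustrative, a minimal sketch):
// 16 bytes of BinData instead of a 36-character hex string
> db.items.insert( { _id : UUID("0123456789abcdef0123456789abcdef") } )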
2. Optimizing Storage
a. Using the _id Field Explicitly
Mongo automatically adds an object ID to each document and sets it to a unique value. Additionally, this field is indexed. For tiny objects this takes up significant space.
The best way to optimize for this is to use _id explicitly. Take one of your fields which is unique for the collection and store its values in _id. By doing so, you have explicitly provided IDs, which effectively eliminates the creation of a separate _id field. If your previously separate field was indexed, this eliminates an extra index too. An example follows.
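A minimal sketch (the collection and fields are hypothetical): instead of storing a unique account number alongside an auto-generated ObjectId, store it in _id directly:
// before: { _id : ObjectId(...), acct : "A1000", balance : 0 }, plus an index on acct
// after: acct becomes the _id, eliminating the extra field and the extra index
> db.accounts.insert( { _id : "A1000", balance : 0 } )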
b. Using Small Field Names
Consider a record:
{ last_name : "Smith", best_score : 3.9 }
The strings "last_name" and "best_score" will be stored in each object's BSON. Using shorter strings would save space:
{ lname : "Smith", score : 3.9 }
Field names are not stored in indexes, as indexes have a predefined structure. Thus, shortening field names will not reduce the size of indexes. In general it is not necessary to use short field names.
3. Querying and nulls
> db.foo.insert( { x : 1, y : 1 } )
> db.foo.insert( { x : 2, y : "string" } )
> db.foo.insert( { x : 3, y : null } )
> db.foo.insert( { x : 4 } )
// Query #1
> db.foo.find( { "y" : null } )
{ "_id" : ObjectId("4dc1975312c677fc83b5629f"), "x" : 3, "y" : null }
{ "_id" : ObjectId("4dc1975a12c677fc83b562a0"), "x" : 4 }
// Query #2
> db.foo.find( { "y" : { $type : 10 } } )
{ "_id" : ObjectId("4dc1975312c677fc83b5629f"), "x" : 3, "y" : null }
// Query #3
> db.foo.find( { "y" : { $exists : false } } )
{ "_id" : ObjectId("4dc1975a12c677fc83b562a0"), "x" : 4 }
To summarize the three queries:
Query #1 matches documents where y has the value null or where y does not exist.
Query #2 matches documents where y has the value null.
Query #3 matches documents where y does not exist.
4. Field Negation
// get all posts about 'tennis' but without the comments field
db.posts.find( { tags : 'tennis' }, { comments : 0 } );
5. Covered Indexes
Mongod will automatically use a covered index when it can. But be sure that:
a. you provide the list of fields to return, so that it can determine that the query can be covered by the index
b. you explicitly exclude the _id field by using {_id : 0} (unless the index includes it)
c. as soon as you insert one array value for one of the index keys, the index immediately becomes a multikey index, and this disables covered index functionality
d. you use explain() to determine whether the covered index is used: the indexOnly field should be true
An example follows.
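A minimal sketch, assuming a collection foo with an index on a field x (names are illustrative):
// an index on x can cover queries that return only x
> db.foo.ensureIndex( { x : 1 } )
> db.foo.find( { x : { $gt : 10 } }, { x : 1, _id : 0 } ).explain()
// in the explain output, "indexOnly" : true indicates the query was covered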
6. Retrieving a Subrange of Array Elements
You can use the $slice operator to retrieve a subrange of elements in an array.
db.posts.find({}, {comments:{$slice: 5}}) // first 5 comments
db.posts.find({}, {comments:{$slice: -5}}) // last 5 comments
db.posts.find({}, {comments:{$slice: [20, 10]}}) // skip 20, limit 10
db.posts.find({}, {comments:{$slice: [-20, 10]}}) // 20 from end, limit 10
Filtering with $slice does not affect the inclusion or exclusion of other fields; it only applies within the array being sliced.
db.posts.find({}, {_id:1, comments:{$slice: 5}}) // first 5 comments, and the _id field only
7. Sharding scales reads better than replication
Replication doesn't scale writes at all, but it can add some read scalability, as different servers can service queries. Use sharding to scale out: sharding scales reads just as well as writes, and it tends to make better use of the cluster's total RAM than replication would.
8. When in doubt, query the primary only
Consider as an example a replica set with Svr1 and Svr2 in the primary data center (Data Center 1) and a third member in a second data center.
A client in the primary data center could use slaveOk for reads, querying from both Svr1 and Svr2. However, there are several things to consider in this situation:
Are eventually consistent reads OK when you hit Svr2?
Do you expect to be able to continue to run in Data Center 1 if one of the (Svr1/Svr2) pair is down? If so, and you need both for query load normally, won't you be over capacity if one box goes down? That is, there is not n+1 capacity in Data Center 1. Likewise, if two are needed in DC1, you probably need two for DR also. One could counter that one server is sufficient and that using two is just "a little faster" in terms of latency; however, it is then very hard to notice whether one server truly has enough capacity to handle the entire load.
Do you have real-time alerts if replication is lagging? Imagine a bad ethernet cable on Svr2 leaving it 2 hours behind realtime on its data. You probably do not want to be querying it then.
One argument for using slaveOk would be graceful degradation in failure scenarios. During replica set failover, 20+ seconds could pass before a new primary is elected; with slaveOk you can read during that period. In addition, if your system were badly broken (two servers down), the third would be readable but could not become primary, as a majority is needed to elect a primary. One best practice is to query primaries only, but to build into your app a "read only mode" you can enable in an emergency (such as a disaster) so that reads are still possible; in that mode slaveOk would be used.
Good uses of slaveOk
slaveOk does have its uses:
Secondaries are great for doing dumps and backups and for running reports without slowing down your front-end app / production system.
Secondaries in remote data centers are useful for low-latency local queries.
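In the mongo shell, slaveOk is enabled per connection; a minimal sketch (the collection and query are hypothetical):
// allow this connection to read from a secondary
> db.getMongo().setSlaveOk()
> db.reports.find( { day : "2011-05-04" } ) // may return slightly stale data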
9. BSON Types
db.things.find( { a : { $type : 2 } } ); // matches if a is a string
db.things.find( { a : { $type : 16 } } ); // matches if a is an int
10. Regular Expressions
i - Case insensitive. Letters in the pattern match both upper and lower case letters.
m - Multiline. By default, Mongo treats the subject string as consisting of a single line of characters (even if it actually contains newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline. When m is set, the "start of line" and "end of line" constructs match immediately following or immediately before internal newlines in the subject string, respectively, as well as at the very start and end. If there are no newlines in a subject string, or no occurrences of ^ or $ in a pattern, setting m has no effect.
x - Extended. If set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class. Whitespace does not include the VT character (code 11). In addition, characters between an unescaped # outside a character class and the next newline, inclusive, are also ignored. This option makes it possible to include comments inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.
s - Dot all. New in 1.9.0. Allows the dot (.) to match all characters, including newlines.
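For example (collection and pattern are illustrative), options can be given with a regex literal or via $regex/$options; options like s that the shell's regex literals do not support must be passed via $options:
db.posts.find( { title : /^mongo/i } ) // case-insensitive prefix match
db.posts.find( { title : { $regex : '^mongo', $options : 'im' } } ) // same, plus multiline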
11. Mongo Meta Query Operators
Only return the index key:
db.foo.find()._addSpecial( "$returnKey", true )
Show disk location of results:
db.foo.find()._addSpecial( "$showDiskLoc", true )
Force query to use the given index:
db.foo.find()._addSpecial( "$hint", { _id : 1 } )
12. Matching with $elemMatch
// Document 1
{ "foo" : [
  { "shape" : "square", "color" : "purple", "thick" : false },
  { "shape" : "circle", "color" : "red", "thick" : true }
] }
// Document 2
{ "foo" : [
  { "shape" : "square", "color" : "red", "thick" : true },
  { "shape" : "circle", "color" : "purple", "thick" : false }
] }
You want to query for a purple square, so you write the following:
db.foo.find( { "foo.shape" : "square", "foo.color" : "purple" } )
or
db.foo.find( { foo : { "shape" : "square", "color" : "purple" } } )
The problem is that the first query matches the second document in addition to the first: the standard dot-notation syntax won't restrict itself to a single element within the foo array, so the two conditions can be satisfied by different elements. The second query matches neither document, because subobjects have to match exactly (here, including the thick field).
To require that a single element within the foo array satisfies all the criteria, you need to use $elemMatch. To properly query for a purple square, you'd use $elemMatch like so:
db.foo.find( { foo : { "$elemMatch" : { shape : "square", color : "purple" } } } )
The query will return the first document, which contains the purple square you're looking for.
13. Atomic Operations
MongoDB supports atomic operations on single documents. MongoDB does not support traditional locking and complex transactions, for a number of reasons:
First, in sharded environments, distributed locks could be expensive and slow. MongoDB's goal is to be lightweight and fast.
We dislike the concept of deadlocks. We want the system to be simple and predictable, without these sorts of surprises.
We want MongoDB to work well for realtime problems. If an operation that locks large amounts of data may execute, it might stop some small light queries for an extended period of time. (We don't claim MongoDB is perfect yet in regards to being "real-time", but we certainly think locking would make it even harder.)
The MongoDB update command supports several modifiers, all of which atomically update an element in a document. They include:
$set - set a particular value
$unset - delete a particular field (v1.3+)
$inc - increment a particular value by a certain amount
$push - append a value to an array
$pushAll - append several values to an array
$pull - remove a value(s) from an existing array
$pullAll - remove several values from an existing array
$bit - bitwise operations
These modifiers are convenient ways to perform certain operations atomically. An example follows.
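A minimal sketch (the collection and fields are hypothetical):
// atomically increment a counter and append to an array in one document
> db.pages.update( { _id : "home" }, { $inc : { views : 1 }, $push : { visitors : "jsmith" } } )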
Another strategy for atomic updates is "update if current", which is what an OS person would call compare-and-swap. For this we:
1. Fetch the object.
2. Modify the object locally.
3. Send an update request that says "update the object to this new value if it still matches its old value".
Should the operation fail, we might then want to try again from step 1. A sketch follows.
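A minimal sketch of update-if-current in the shell (the collection and fields are hypothetical):
// fetch, modify locally, then update only if qty is unchanged
> var doc = db.inventory.findOne( { _id : 123 } )
> db.inventory.update( { _id : 123, qty : doc.qty }, { $set : { qty : doc.qty - 1 } } )
> db.getLastErrorObj().n // 0 means nothing matched: another writer got there first, so retry from the fetch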
14. The ABA Nuance
However, what if the object has been modified since our read? We would then overwrite that change and lose it!
There are several ways to avoid this problem; it's mainly just a matter of being aware of the nuance.
a. Use the entire object in the update's query expression, instead of just the _id and qty fields.
b. Use $set to set the field we care about. If other fields have changed, they won't be affected.
c. Put a version variable in the object, and increment it on each update (see the sketch after this list).
d. When possible, use a $ operator instead of an update-if-current sequence of operations.
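A minimal sketch of option (c), assuming a hypothetical version field:
// read the current document, then guard the update with its version
> var doc = db.things.findOne( { _id : 1 } )
> db.things.update( { _id : 1, version : doc.version }, { $set : { qty : doc.qty - 1 }, $inc : { version : 1 } } )
// the update applies only if no other writer has bumped version since our read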
15. Operations Lock Type
MongoDB uses a read/write lock: read operations such as queries take a shared read lock, while inserts, updates, and removes take an exclusive write lock.
16. How many times is data written to disk when replication and journaling are both on?
In v1.8, for an insert, four times. The object is written to the main collection, and also to the oplog collection (so that is twice). Both of those writes are journaled as a single mini-transaction in the journal file (the files in /data/db/journal). Thus, four times total.
There is an open item to reduce this by having the journal be compressed: that would reduce it from 4x to probably ~2.5x.
The above applies to collection data and inserts, which is the worst-case scenario. Index updates are written to the index and the journal, but not the oplog, so they should be 2x today, not 4x. Likewise, updates with things like $set, $addToSet, $inc, etc. are compactly logged all around, so those are generally small.
17. Prealloc Files
mongod will create prealloc files in the journal directory under some circumstances to minimize journal write latency. On some filesystems, appending to a file and making it larger can be slower than writing to a file of a predefined size. mongod checks this at startup, and if it finds this to be the case it will use preallocated journal files. If found to be helpful, a small pool of prealloc files will be created in the journal directory before startup begins. This is a one-time initialization and does not occur on future invocations. Approximately 3GB of files will be preallocated (and truly prewritten, not sparse-allocated); thus, in this situation, expect roughly a 3-minute delay on the first startup while these files are written.
If you don't want to wait three minutes on startup, you can preallocate the files using another instance of mongod and then move them to your normal dbpath before starting with journaling.
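A sketch of that procedure, assuming the scratch dbpath, port, and real dbpath shown here (and that mongod decides preallocation is worthwhile on your filesystem):
# run a throwaway mongod with journaling against a scratch dbpath,
# let it finish preallocating its journal files, then shut it down
mkdir -p /tmp/prealloc
mongod --port 10001 --dbpath /tmp/prealloc --journal
# move the preallocated files into the real journal directory
mv /tmp/prealloc/journal/prealloc.* /data/db/journal/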