Sunday, March 4, 2012

YQL and Vitess

1. YQL: Yahoo! Query Language

http://developer.yahoo.com/yql/

Example query:

USE "http://your_domain_name/bart.xml" AS bart_table;
SELECT name, eta FROM bart_table WHERE eta.destination LIKE "%SF%"

2. Google Vitess

https://code.google.com/p/vitess/wiki/ProjectGoals

Motivation and vision

Vtocc is the first usable product of Vitess. It acts as a front end to MySQL, providing an RPC interface that accepts and transmits SQL commands. It can efficiently multiplex a large number of incoming connections (10K+) over a small number of database connections at reasonable throughput (~10k qps). It also has an SQL parser, which lets the server understand and intelligently reshape the queries it receives.
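To make the multiplexing idea concrete, here is a minimal Python sketch of many requests sharing a small pool of database connections. It is not Vtocc's code or API: `FakeDbConnection`, `ConnectionPool`, and `handle_request` are hypothetical names, and the "connection" only records the SQL it was given.

```python
import queue
from contextlib import contextmanager


class FakeDbConnection:
    """Hypothetical stand-in for a real MySQL connection."""

    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.queries_run = 0

    def execute(self, sql):
        self.queries_run += 1
        return f"conn {self.conn_id} ran: {sql}"


class ConnectionPool:
    """A fixed number of db connections shared by many callers."""

    def __init__(self, size):
        self.connections = [FakeDbConnection(i) for i in range(size)]
        self._idle = queue.Queue()
        for conn in self.connections:
            self._idle.put(conn)

    @contextmanager
    def borrow(self, timeout=1.0):
        # Block until a connection is free, then hand it back afterwards.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)


def handle_request(pool, sql):
    # Each request holds a connection only for the duration of one query.
    with pool.borrow() as conn:
        return conn.execute(sql)


pool = ConnectionPool(size=2)
results = [handle_request(pool, f"SELECT {n}") for n in range(10)]
```

Because a request returns its connection as soon as the query finishes, ten requests are served here by two connections. This only works when requests do not rely on per-connection session state.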

Vtocc is already being used in a large scale production environment. It is the core of YouTube's new MySQL serving infrastructure.

Relational databases (like MySQL) were initially built and optimized for non-web OLTP systems. However, they have still managed to fulfill most of the needs of large scale web applications. Many of their legacy requirements have held them back from being able to optimally meet the needs of today’s applications. This has led to the development of alternate storage solutions (NoSQL) which in some respects have thrown the baby out with the bathwater.

With Vitess, we take a different approach. We think that databases like MySQL have what it takes to provide an efficient data storage layer. What is missing is the ability to easily scale out and then coordinate many instances of a single logical schema. We plan to achieve this by providing a front end that supports a subset of SQL with limited guarantees, plus a loosely coupled distributed system that automates complex management scenarios.

  • Sessionless connections: A MySQL connection carries a lot of context. This is usually unnecessary for web apps. Most Vitess connections are sessionless, which makes them lightweight and allows us to pool a very small number of connections to MySQL.
  • ACID (Atomicity): This feature is overkill for most web apps. Instead of honoring system-wide atomicity, Vitess gives you atomic guarantees for a given entity addressed by a “key” (like a user id). This allows us to transparently scale the database by splitting on key ranges.
  • ACID (Consistency): Vitess relaxes this rule towards data being eventually consistent. This allows us to use replication to distribute read traffic in situations where the app does not need up-to-date data. You can always request data to be read from the master db if you explicitly need up-to-date data.
  • Buffer cache vs. row cache: The MySQL buffer cache is optimized for range scans over indices and tables, particularly when data is densely packed. Unfortunately, it is not well suited to tables that are accessed randomly. Vitess will allow you to designate certain tables as random access. For such tables, it will maintain row-based caches and keep them consistent by fielding all DMLs that could potentially affect them.
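The row cache idea in the last bullet can be sketched as follows. This is an illustration under assumptions, not Vitess's implementation: a plain dict stands in for the MySQL table, `RowCache` is a hypothetical name, and invalidating on write is just one simple way to keep the cache consistent when every DML passes through the same front end.

```python
class RowCache:
    """Caches rows by primary key in front of a backing table."""

    def __init__(self, table):
        self._table = table  # dict of pk -> row, stands in for MySQL
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, pk):
        if pk in self._cache:
            self.hits += 1
            return self._cache[pk]
        self.misses += 1
        row = self._table.get(pk)
        if row is not None:
            self._cache[pk] = dict(row)  # cache a copy of the row
        return row

    def update(self, pk, row):
        # All writes go through here, so the cached row can be dropped
        # at the same moment the table changes.
        self._table[pk] = dict(row)
        self._cache.pop(pk, None)

    def delete(self, pk):
        self._table.pop(pk, None)
        self._cache.pop(pk, None)


table = {1: {"name": "alice"}}
cache = RowCache(table)
first = cache.get(1)    # miss, loads from the table
second = cache.get(1)   # hit, served from the cache
cache.update(1, {"name": "bob"})
third = cache.get(1)    # miss again, the write invalidated the entry
```

The sketch depends on the cache seeing every write. A write that bypassed `update` would leave a stale row behind, which is why the text says Vitess fields all DMLs that could affect cached tables.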

Vitess feature set

Sharding

  • All tables in a sharded database need to contain a “key” column. Vitess will use these values to decide the target shard for such data.
  • All tables that are indexed by a set of keys are known as a keyspace, which basically represents the logical database that combines all the shards that store them.
  • We are going with range based sharding. The main advantage of this scheme is that the shard map is a simple in-memory lookup. The downside of this scheme is that it creates hot-spots for sequentially increasing keys. In such cases, we recommend that the application hash the keys so they distribute more randomly.
  • The implied requirement from the above constraints is that all DMLs are expected to specify the key that is being updated. This also implies that the key will usually be the leading primary key column.
  • The shard key data type is expected to be a natural number or a varbinary. We do not support varchar for keys because of character set complications that contradict range-based sharding schemes.
  • When a shard becomes too ‘hot’, Vitess can decide to split it into two. This is done through filtered replication.
  • Vitess will also have the ability to merge shards.
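The range-based scheme described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration (the `RangeShardMap` class, the shard names, the choice of MD5 for hashing keys); the source only says that the shard map is a simple in-memory lookup, that applications may hash sequential keys, and that hot shards can be split.

```python
import bisect
import hashlib


class RangeShardMap:
    """In-memory map from key ranges to shard names.

    With split_points [100, 200] and three shards, shard 0 holds keys
    below 100, shard 1 holds 100..199, and shard 2 holds 200 and above.
    """

    def __init__(self, split_points, shard_names):
        if len(shard_names) != len(split_points) + 1:
            raise ValueError("need exactly one more shard than split points")
        self.split_points = list(split_points)
        self.shard_names = list(shard_names)

    def shard_for(self, key):
        # Binary search over the sorted split points.
        return self.shard_names[bisect.bisect_right(self.split_points, key)]

    def split(self, shard_name, at_key, new_left, new_right):
        """Replace one shard with two, divided at at_key."""
        i = self.shard_names.index(shard_name)
        lower = self.split_points[i - 1] if i > 0 else None
        upper = self.split_points[i] if i < len(self.split_points) else None
        if (lower is not None and at_key <= lower) or (
            upper is not None and at_key >= upper
        ):
            raise ValueError("split key is outside the shard's range")
        self.split_points.insert(i, at_key)
        self.shard_names[i:i + 1] = [new_left, new_right]


def hashed_key(user_id):
    """Spread sequential ids over the 64-bit key space (MD5 is an assumption)."""
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big")


shard_map = RangeShardMap([100, 200], ["shard0", "shard1", "shard2"])
before = shard_map.shard_for(150)
shard_map.split("shard1", 150, "shard1a", "shard1b")
after_low = shard_map.shard_for(120)
after_high = shard_map.shard_for(150)
```

Only the map update is modeled here. Moving the rows themselves, which the text says is done through filtered replication, is outside the sketch.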

Replication

  • You can specify a replication factor for a keyspace. This will make Vitess create that many replicas for each master database.
  • A shard map is a single list of databases that spans the entire keyspace. For example, a master shard map will contain all the master databases for the keyspace.
  • There will be a server (wrangler) that can serve these shard maps to the application based on what it requires. For example, the application can request a “random” replica shard map for reading data, or it could request the master shard map if it wishes to write data, or needs to read up-to-date info.
  • Vitess will support multiple data centers. It assumes that there is only one master data center, where all the master databases reside. Each data center will have a 'wrangler' server that will serve information about the local shard map in that data center.
  • Vitess is capable of electing a new master for maintenance (or due to failure). It will also provide a scheme to correspondingly reparent all downstream replicas automatically.
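As a rough illustration of how an application might ask for a master or a “random” replica shard map, here is a hypothetical Python sketch. The `Wrangler` class, its method names, and the host names are assumptions rather than the real server's interface, and `reparent` simply demotes the old master to a replica, which would not apply if the master had failed outright.

```python
import random


class Wrangler:
    """Serves shard maps for one keyspace (illustrative only)."""

    def __init__(self, shards, rng=None):
        # shards: list of {"master": str, "replicas": [str, ...]},
        # ordered by key range.
        self._shards = shards
        self._rng = rng or random.Random()

    def master_shard_map(self):
        # For writes, or reads that need up-to-date data.
        return [shard["master"] for shard in self._shards]

    def replica_shard_map(self):
        # One randomly chosen replica per shard, for reads that
        # can tolerate replication lag.
        return [self._rng.choice(shard["replicas"]) for shard in self._shards]

    def reparent(self, shard_index, new_master):
        # Promote a replica; the old master becomes a replica (assumption).
        shard = self._shards[shard_index]
        if new_master not in shard["replicas"]:
            raise ValueError("new master must be an existing replica")
        shard["replicas"].remove(new_master)
        shard["replicas"].append(shard["master"])
        shard["master"] = new_master


wrangler = Wrangler(
    [
        {"master": "db0-m", "replicas": ["db0-r1", "db0-r2"]},
        {"master": "db1-m", "replicas": ["db1-r1", "db1-r2"]},
    ],
    rng=random.Random(0),
)
masters = wrangler.master_shard_map()
replicas = wrangler.replica_shard_map()
wrangler.reparent(0, "db0-r1")
masters_after = wrangler.master_shard_map()
```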

Schema rollout

Many DDLs cannot be performed on high-traffic live systems because of their locking requirements. Vitess will allow you to coordinate such rollouts with barely noticeable downtime by applying the DDL to offline replicas and then going through a reparenting process.


