Horizontal vs Vertical scaling
- Vertical: enhance a specific node - increase the node's computing power (e.g. increase memory)
- Horizontal: increase the number of servers - more units to do the processing
Vertical scaling is generally easier, but limited by how big a single machine can get
Load Balancing
- helps the system distribute load evenly so that no single server is overwhelmed and brings the whole system down
- done by building a network of cloned servers that share the same code and have access to the same DB (round-robin sketch below)
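A minimal round-robin sketch of the idea, with a hypothetical Server.handle(); real deployments put something like nginx or HAProxy in front of the pool:

```python
import itertools

class Server:
    def __init__(self, name):
        self.name = name

    def handle(self, request):
        return f"{self.name} handled {request}"

# pool of cloned servers: same code, same DB access
pool = [Server("clone1"), Server("clone2"), Server("clone3")]
rotation = itertools.cycle(pool)   # spread requests evenly across the clones

def route(request):
    return next(rotation).handle(request)

for i in range(4):
    print(route(f"req{i}"))   # the 4th request wraps back to clone1
```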
Database denormalization and NoSQL
- joins can get costly in a relational DB - very slow as the system grows bigger
- denormalization: storing data that is already joined (sketch below)
- (e.g. SQL tables for task and project: instead of joining at query time, maintain a pre-joined table - space costly but speed efficient)
- NoSQL does not support joins but scales better
- it scales better since it does not provide all the functionality that SQL provides
- e.g. NoSQL typically cannot lock tables for atomic operations, enforce referential integrity, run transactions, etc.
- the lower functionality in NoSQL allows for simpler data that can be easily partitioned
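A minimal sketch of the task/project example above, using plain Python dicts as stand-ins for rows/documents (table and field names are assumed):

```python
# Normalized: reading a task's project name requires a join on project_id.
projects = {1: {"id": 1, "name": "Website"}}
tasks = [{"id": 10, "title": "Fix login", "project_id": 1}]

task = tasks[0]
print(task["title"], "->", projects[task["project_id"]]["name"])  # the "join"

# Denormalized: the project name is copied into each task, so reads skip the
# join entirely - extra space, and the copies must be kept in sync on update.
tasks_denorm = [{"id": 10, "title": "Fix login", "project_name": "Website"}]
print(tasks_denorm[0]["title"], "->", tasks_denorm[0]["project_name"])
```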
Database partitioning (Sharding)
- split data across different machines while keeping track of which machine holds what data
- a smaller DB has a faster return time
- Typically uses a distributed hash table (DHT) - essentially an optimized hash table spread across nodes
- built on consistent hashing, the key technique that lets nodes continuously join, leave, and fail (sketch below)
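A minimal consistent-hashing sketch (a hypothetical helper, not a production DHT): nodes sit on a hash ring and a key lives on the first node clockwise from its hash, so a node joining or leaving only moves the keys in its arc:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._ring = []   # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        bisect.insort(self._ring, (self._hash(node), node))

    def remove_node(self, node):
        self._ring.remove((self._hash(node), node))

    def node_for(self, key):
        # first node clockwise from the key's position on the ring
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
print(ring.node_for("user:42"))   # the same key always maps to the same node
```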
Vertical Partitioning (column splitting)
- Partition based on features (columns)
- If you build a social network, have one partition for profile tables, one for message tables, etc.
- problem: if a table gets very large, you need to repartition with a different partitioning scheme
- Frequently used columns can sit on different DBs from infrequently used columns
Key-based (Hash-based) partitioning
- take a key in the table (e.g. a user ID)
- take the number of servers, say n
- hash the key and mod by n to pick the server that holds the data: hash(key) % n
- Problem: adding a server changes n => most data must be redistributed (costly; sketch below)
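A minimal sketch of the hash(key) % n scheme and why growing the cluster is costly (server names assumed; Python's built-in hash is per-process, so a real shard router would use a stable hash such as one from hashlib):

```python
servers = ["db0", "db1", "db2"]   # n = 3

def server_for(key: str) -> str:
    # hash the key and mod by the server count to pick a shard
    return servers[hash(key) % len(servers)]

print(server_for("user:42"))

# Adding a 4th server changes len(servers), so hash(key) % n now points most
# existing keys at a different shard - all of that data has to move.
servers.append("db3")
print(server_for("user:42"))   # may land on a different server than before
```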
Directory-based partitioning
- a lookup table that records which data is stored in which shard is maintained in the cluster
- makes it easy to add additional servers (sketch below)
- Problem 1: the lookup table is a single point of failure
- Problem 2: constantly accessing the table hits performance
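A minimal directory-based sketch with a hypothetical in-memory lookup table and an assumed least-loaded placement policy; in practice the directory would itself be replicated to avoid the single point of failure noted above:

```python
directory = {}                    # key -> shard name (the lookup table)
shards = ["db0", "db1", "db2"]

def assign(key: str) -> str:
    # simple placement policy for the sketch: pick the least-loaded shard
    shard = min(shards, key=lambda s: sum(1 for v in directory.values() if v == s))
    directory[key] = shard
    return shard

def shard_for(key: str) -> str:
    # every lookup consults the directory first (hence Problems 1 and 2 above)
    return directory.get(key) or assign(key)

print(shard_for("user:42"))   # assigned on first access, remembered after
shards.append("db3")          # adding a server is just a list append;
print(shard_for("user:99"))   # existing keys stay put - no mass redistribution
```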
Cache
- in-memory caches provide very rapid results
- a simple key-value store that sits between the application layer and the data store
- can cache:
- can cache a query mapped to its result (read-through sketch below)
- can cache objects that might be required again
- e.g. cached web pages, cached links, etc.
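A minimal read-through cache sketch with a hypothetical slow_db_query(); real systems would use something like memcached or Redis with an eviction policy rather than an unbounded dict:

```python
cache = {}   # query -> result

def slow_db_query(query: str) -> str:
    return f"result for {query}"    # stand-in for the real data store

def get(query: str) -> str:
    if query in cache:              # cache hit: skip the data store entirely
        return cache[query]
    result = slow_db_query(query)   # cache miss: fetch, then remember
    cache[query] = result
    return result

print(get("comments:page1"))   # first call hits the data store
print(get("comments:page1"))   # second call is served from memory
```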
Processing operations asynchronously and queueing
- Some processes can run really slowly; these should be done async (you don't want the user to wait)
- Async essentially means the process happens later / in the background, not on the main thread
- E.g. dealing with stale cache data (sketch after this list)
- say you cache the comments that are to be displayed on a page
- someone else has commented since, so your cache is stale
- you could synchronously re-query and refresh the cache before displaying, but making the user wait is bad UX
- instead, show the currently cached data and re-query/refresh the cache asynchronously
- when the fresh data is back, update the page
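A minimal sketch of that flow with a hypothetical fetch_comments(); a real system would push the refresh onto a task queue rather than spawning a raw thread:

```python
import threading
import time

cache = {"comments": ["first!"]}   # possibly stale

def fetch_comments():
    time.sleep(1)                  # stand-in for a slow re-query
    return ["first!", "new comment"]

def refresh():
    cache["comments"] = fetch_comments()   # update once fresh data is back

def show_page():
    print("render:", cache["comments"])        # show cached data immediately
    threading.Thread(target=refresh).start()   # re-query in the background

show_page()         # the user sees the page right away
time.sleep(1.5)
print("cache now:", cache["comments"])   # later, the cache is fresh again
```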
Read heavy vs Write heavy
- write heavy: queue up writes
- write-back caches (persist to the data store later, in batches) can be used instead of write-through logic (persist on every write) - sketch after this list
- read heavy: cache reads
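A minimal write-back sketch with a hypothetical db_write(); a write-through cache would call the data store on every write instead:

```python
cache = {}
dirty = set()            # keys written in memory but not yet persisted

def db_write(key, value):
    print(f"persisting {key}={value}")   # stand-in for the real data store

def write(key, value):
    cache[key] = value   # fast: memory only
    dirty.add(key)       # remember that this key still needs persisting

def flush():
    for key in dirty:
        db_write(key, cache[key])   # one batch of writes hits the DB
    dirty.clear()

write("counter", 1)
write("counter", 2)      # both writes absorbed in memory, only latest kept
flush()                  # a single persisted write: counter=2
```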
Networking
Bandwidth
- Max data transferred per unit time
- bits/second
Throughput
- Actual data transferred per unit time
- bits/second
Latency
- Time taken for data to get from sender to receiver
Example
- Say you have a conveyor belt from A to B
- fatter belt => more bandwidth & throughput
- shorter distance => lower latency (good)
- faster belt => better for all three (worked example below)
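A worked example under assumed numbers: rough transfer time is latency plus size divided by throughput:

```python
size_bits = 8 * 10**9       # a 1 GB file
throughput = 100 * 10**6    # 100 Mbit/s actually achieved
latency = 0.05              # 50 ms for the first bit to arrive

transfer_time = latency + size_bits / throughput
print(f"{transfer_time:.2f} s")   # ~80.05 s: throughput dominates for big files
```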
MapReduce
- Used to process large amounts of data
- two parts/functions to define/write, Map and Reduce
- Map: takes in data and emits (key, value) pairs
- Reduce: takes Map's output, grouped by key, and reduces it to a new pair
- the result may be fed back into the Reduce function for more processing (word-count sketch below)
- Great for parallel processing of large amounts of data
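A minimal word-count sketch of the idea, run locally in one process; a real job would distribute the map and reduce calls across many machines:

```python
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield (word, 1)            # Map: emit a (key, value) pair per word

def reduce_fn(word, counts):
    return (word, sum(counts))     # Reduce: collapse all values for one key

documents = ["the cat sat", "the cat ran"]

grouped = defaultdict(list)        # "shuffle": group mapped values by key
for doc in documents:              # each doc could be mapped in parallel
    for word, count in map_fn(doc):
        grouped[word].append(count)

results = [reduce_fn(w, c) for w, c in grouped.items()]
print(results)   # [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]
```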
Other considerations
- Machine failure - dealing with machines that may fail
- Reliability - probability that the system will operate without failure for a certain time 't'
- Availability - percentage of time that the system is operational/usable (quick calculation below)
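A quick availability calculation under assumed numbers, using the standard uptime / (uptime + downtime) form via MTBF and MTTR:

```python
mtbf = 500.0   # mean time between failures, hours (assumed)
mttr = 0.5     # mean time to repair, hours (assumed)

availability = mtbf / (mtbf + mttr)
print(f"{availability:.4%}")   # ~99.90%, i.e. roughly "three nines"
```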