Horizontal vs Vertical scaling
- Vertical: enhance a specific node - increase the node's computing power (e.g. increase memory)
- Horizontal: increase the number of servers - more units to do the processing
Vertical scaling is generally easier, but limited by how big a single machine can get
Load Balancing
- helps the system distribute load evenly so that no single server is overwhelmed and brings the whole system down
- done by building a network of cloned servers that share the same code and have access to the same DB (round-robin sketch below)
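A minimal round-robin sketch of the idea, with a hypothetical Server.handle(); real deployments put something like nginx or HAProxy in front of the pool:

```python
import itertools

class Server:
    def __init__(self, name):
        self.name = name

    def handle(self, request):
        return f"{self.name} handled {request}"

# pool of cloned servers: same code, same DB access
pool = [Server("clone1"), Server("clone2"), Server("clone3")]
rotation = itertools.cycle(pool)   # spread requests evenly across the clones

def route(request):
    return next(rotation).handle(request)

for i in range(4):
    print(route(f"req{i}"))   # the 4th request wraps back to clone1
```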
Database denormalization and NoSQL
- joins can get costly in a relational DB - very slow as the system grows bigger
- denormalization: storing data that is already joined (sketch below)
- (e.g. SQL tables for task and project: instead of joining at query time, maintain a pre-joined table - space costly but speed efficient)
- NoSQL does not support joins but scales better
- it scales better since it does not provide all the functionality that SQL provides
- e.g. NoSQL typically cannot lock tables for atomic operations, enforce referential integrity, run transactions, etc.
- the lower functionality in NoSQL allows for simpler data that can be easily partitioned
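A minimal sketch of the task/project example above, using plain Python dicts as stand-ins for rows/documents (table and field names are assumed):

```python
# Normalized: reading a task's project name requires a join on project_id.
projects = {1: {"id": 1, "name": "Website"}}
tasks = [{"id": 10, "title": "Fix login", "project_id": 1}]

task = tasks[0]
print(task["title"], "->", projects[task["project_id"]]["name"])  # the "join"

# Denormalized: the project name is copied into each task, so reads skip the
# join entirely - extra space, and the copies must be kept in sync on update.
tasks_denorm = [{"id": 10, "title": "Fix login", "project_name": "Website"}]
print(tasks_denorm[0]["title"], "->", tasks_denorm[0]["project_name"])
```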
Database partitioning (Sharding)
- split data across different machines while keeping track of which machine holds what data
- a smaller DB has a faster return time
- Typically uses a distributed hash table (DHT) - essentially an optimized hash table spread across nodes
- built on consistent hashing, the key technique that lets nodes continuously join, leave, and fail (sketch below)
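A minimal consistent-hashing sketch (a hypothetical helper, not a production DHT): nodes sit on a hash ring and a key lives on the first node clockwise from its hash, so a node joining or leaving only moves the keys in its arc:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._ring = []   # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        bisect.insort(self._ring, (self._hash(node), node))

    def remove_node(self, node):
        self._ring.remove((self._hash(node), node))

    def node_for(self, key):
        # first node clockwise from the key's position on the ring
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
print(ring.node_for("user:42"))   # the same key always maps to the same node
```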
Vertical Partitioning (column splitting)
- Partition based on features (columns)
- If you build a social network, have one partition for profile tables, one for message tables, etc.
- problem: if a table gets very large, you need to repartition with a different partitioning scheme
- Frequently used columns can sit on different DBs from infrequently used columns
Key-based (Hash-based) partitioning
- take a key in the table (e.g. a user ID)
- take the number of servers, say n
- hash the key and mod by n to pick the server that holds the data: hash(key) % n
- Problem: adding a server changes n => most data must be redistributed (costly; sketch below)
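A minimal sketch of the hash(key) % n scheme and why growing the cluster is costly (server names assumed; Python's built-in hash is per-process, so a real shard router would use a stable hash such as one from hashlib):

```python
servers = ["db0", "db1", "db2"]   # n = 3

def server_for(key: str) -> str:
    # hash the key and mod by the server count to pick a shard
    return servers[hash(key) % len(servers)]

print(server_for("user:42"))

# Adding a 4th server changes len(servers), so hash(key) % n now points most
# existing keys at a different shard - all of that data has to move.
servers.append("db3")
print(server_for("user:42"))   # may land on a different server than before
```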
Directory-based partitioning
- a lookup table that records which data is stored in which shard is maintained in the cluster
- makes it easy to add additional servers (sketch below)
- Problem 1: the lookup table is a single point of failure
- Problem 2: constantly accessing the table hits performance
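A minimal directory-based sketch with a hypothetical in-memory lookup table and an assumed least-loaded placement policy; in practice the directory would itself be replicated to avoid the single point of failure noted above:

```python
directory = {}                    # key -> shard name (the lookup table)
shards = ["db0", "db1", "db2"]

def assign(key: str) -> str:
    # simple placement policy for the sketch: pick the least-loaded shard
    shard = min(shards, key=lambda s: sum(1 for v in directory.values() if v == s))
    directory[key] = shard
    return shard

def shard_for(key: str) -> str:
    # every lookup consults the directory first (hence Problems 1 and 2 above)
    return directory.get(key) or assign(key)

print(shard_for("user:42"))   # assigned on first access, remembered after
shards.append("db3")          # adding a server is just a list append;
print(shard_for("user:99"))   # existing keys stay put - no mass redistribution
```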
Cache
- in-memory caches provide very rapid results
- a simple key-value store that sits between the application layer and the data store
- can cache:
- can cache a query mapped to its result (read-through sketch below)
- can cache objects that might be required again
- e.g. cached web pages, cached links, etc.
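A minimal read-through cache sketch with a hypothetical slow_db_query(); real systems would use something like memcached or Redis with an eviction policy rather than an unbounded dict:

```python
cache = {}   # query -> result

def slow_db_query(query: str) -> str:
    return f"result for {query}"    # stand-in for the real data store

def get(query: str) -> str:
    if query in cache:              # cache hit: skip the data store entirely
        return cache[query]
    result = slow_db_query(query)   # cache miss: fetch, then remember
    cache[query] = result
    return result

print(get("comments:page1"))   # first call hits the data store
print(get("comments:page1"))   # second call is served from memory
```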
Processing operations asynchronously and queueing
- Some processes can run really slowly; these should be done async (you don't want the user to wait)
- Async essentially means the process happens later / in the background, not on the main thread
- E.g. dealing with stale cache data (sketch after this list)
- say you cache the comments that are to be displayed on a page
- someone else has commented since, so your cache is stale
- you could synchronously re-query and refresh the cache before displaying, but making the user wait is bad UX
- instead, show the currently cached data and re-query/refresh the cache asynchronously
- when the fresh data is back, update the page
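A minimal sketch of that flow with a hypothetical fetch_comments(); a real system would push the refresh onto a task queue rather than spawning a raw thread:

```python
import threading
import time

cache = {"comments": ["first!"]}   # possibly stale

def fetch_comments():
    time.sleep(1)                  # stand-in for a slow re-query
    return ["first!", "new comment"]

def refresh():
    cache["comments"] = fetch_comments()   # update once fresh data is back

def show_page():
    print("render:", cache["comments"])        # show cached data immediately
    threading.Thread(target=refresh).start()   # re-query in the background

show_page()         # the user sees the page right away
time.sleep(1.5)
print("cache now:", cache["comments"])   # later, the cache is fresh again
```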
Read heavy vs Write heavy
- write heavy: queue up writes
- write-back caches (persist to the data store later, in batches) can be used instead of write-through logic (persist on every write) - sketch after this list
- read heavy: cache reads
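A minimal write-back sketch with a hypothetical db_write(); a write-through cache would call the data store on every write instead:

```python
cache = {}
dirty = set()            # keys written in memory but not yet persisted

def db_write(key, value):
    print(f"persisting {key}={value}")   # stand-in for the real data store

def write(key, value):
    cache[key] = value   # fast: memory only
    dirty.add(key)       # remember that this key still needs persisting

def flush():
    for key in dirty:
        db_write(key, cache[key])   # one batch of writes hits the DB
    dirty.clear()

write("counter", 1)
write("counter", 2)      # both writes absorbed in memory, only latest kept
flush()                  # a single persisted write: counter=2
```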
Networking
Bandwidth
- Max data transferred per unit time
- bits/second
Throughput
- Actual data transferred per unit time
- bits/second
Latency
- Time taken for data to get from sender to receiver
Example
- Say you have a conveyor belt from A to B
- fatter belt => more bandwidth & throughput
- shorter distance => lower latency (good)
- faster belt => better for all three (worked example below)
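A worked example under assumed numbers: rough transfer time is latency plus size divided by throughput:

```python
size_bits = 8 * 10**9       # a 1 GB file
throughput = 100 * 10**6    # 100 Mbit/s actually achieved
latency = 0.05              # 50 ms for the first bit to arrive

transfer_time = latency + size_bits / throughput
print(f"{transfer_time:.2f} s")   # ~80.05 s: throughput dominates for big files
```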
MapReduce
- Used to process large amounts of data
- two parts/functions to define/write, Map and Reduce
- Map: takes in data and emits (key, value) pairs
- Reduce: takes Map's output, grouped by key, and reduces it to a new pair
- the result may be fed back into the Reduce function for more processing (word-count sketch below)
- Great for parallel processing of large amounts of data
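A minimal word-count sketch of the idea, run locally in one process; a real job would distribute the map and reduce calls across many machines:

```python
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield (word, 1)            # Map: emit a (key, value) pair per word

def reduce_fn(word, counts):
    return (word, sum(counts))     # Reduce: collapse all values for one key

documents = ["the cat sat", "the cat ran"]

grouped = defaultdict(list)        # "shuffle": group mapped values by key
for doc in documents:              # each doc could be mapped in parallel
    for word, count in map_fn(doc):
        grouped[word].append(count)

results = [reduce_fn(w, c) for w, c in grouped.items()]
print(results)   # [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]
```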
Other considerations
- Machine failure - dealing with machines that may fail
- Reliability - probability that the system will operate without failure for a certain time 't'
- Availability - percentage of time that the system is operational/usable (quick calculation below)
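A quick availability calculation under assumed numbers, using the standard uptime / (uptime + downtime) form via MTBF and MTTR:

```python
mtbf = 500.0   # mean time between failures, hours (assumed)
mttr = 0.5     # mean time to repair, hours (assumed)

availability = mtbf / (mtbf + mttr)
print(f"{availability:.4%}")   # ~99.90%, i.e. roughly "three nines"
```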