On the most fundamental level, a web application consists of client-side code and markup, server-side code and a database. Many small websites and small-to-medium enterprise-level browser-based applications consist of nothing other than these components. However, on its own, this basic setup is not suitable for a big data application.
Any web application that is intended to manage large quantities of data needs to meet the following requirements:
It must be available at any time, regardless of how many users are using it simultaneously. This is why it is absolutely crucial that a big data application doesn't have any single points of failure. Therefore, it would not be possible to achieve high availability if either all of the code or the entire database exists on a single server as a single instance.
When you start typing a new email in Gmail, it instantly saves a draft on the server and constantly updates it while you are making further changes. When you write a post on Facebook, all of your friends can see it instantly. This level of performance is achieved despite the fact that both of the apps have millions of user interactions at any given time.
To achieve this level of performance, a web application shouldn't have any bottlenecks. Therefore, once again, having a single instance of either the code base or the database is not suitable. Also, there is a limit to how much hardware you can keep adding to a single server to increase its performance. As the usage of an app grows, the response time of a single server does not simply grow in proportion: once the server approaches its capacity, requests start to queue and the response time rises sharply. Meanwhile, each hardware upgrade buys less than the previous one, so at some point adding more hardware to one machine results in diminishing returns.
As an organisation processes large quantities of user-generated data, it needs to take full advantage of it, both to help the organisation make effective business decisions and to deliver the most relevant content to the users of its products.
These are some of the best techniques that developers of big data applications use to achieve the goals of high availability, high performance and analytics:
Instead of having a single server execute the logic of a web application, the same code may be duplicated on several servers. When a request from a user's browser is received, the system sends it to a server that is not overloaded with other requests. This technique is known as load balancing. As the load is distributed between many servers, the system should not get into a state where response time increases to an unacceptable level.
There are many ways to implement load balancing. Microsoft Azure, for example, has a service that regularly sends probe requests to each server within a cluster to check whether the server would be able to accept an incoming request from a user. The exact parameters of a good server state can be configured. For example, a server may be set to not respond with an OK status code if its current CPU usage is above 90%. If a particular server does not respond with an OK status code within an acceptable time frame, the system may be set to try to resolve any issues with the server, but the immediate action would be to stop sending any user requests to that server and choose a different one instead.
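The health-check approach described above can be sketched in a few lines of Python. This is only an illustration and not how Azure implements it: the Server and LoadBalancer classes, the probe method and the rule of picking the server with the fewest active requests are all assumptions made for the example. The 90% CPU threshold is the one mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical backend server as seen by the load balancer."""
    name: str
    cpu_usage: float          # fraction between 0.0 and 1.0
    active_requests: int = 0
    reachable: bool = True

    def probe(self, cpu_threshold: float = 0.9) -> bool:
        """Simulated health probe: OK only if reachable and not above the CPU threshold."""
        return self.reachable and self.cpu_usage <= cpu_threshold


class LoadBalancer:
    """Illustrative balancer: skips unhealthy servers, then picks the least busy one."""

    def __init__(self, servers: list[Server], cpu_threshold: float = 0.9):
        self.servers = servers
        self.cpu_threshold = cpu_threshold

    def healthy_servers(self) -> list[Server]:
        # Only servers that answer the probe with "OK" may receive traffic.
        return [s for s in self.servers if s.probe(self.cpu_threshold)]

    def route(self) -> Server:
        """Send the next request to the healthy server with the fewest active requests."""
        healthy = self.healthy_servers()
        if not healthy:
            raise RuntimeError("no healthy servers available")
        target = min(healthy, key=lambda s: s.active_requests)
        target.active_requests += 1
        return target
```

In a real system the probe would be an HTTP or TCP request sent on a timer rather than a method call, and the balancer would also need to decrement the request counter when a response is returned.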
Load balancing provides both performance and availability. If implemented properly, none of the servers within a load-balanced system are allowed to process more than they can handle without significantly slowing down. Likewise, if any server becomes unavailable, the system would still be able to cope with remaining servers.
Databases can also be load-balanced, but this is mainly done for availability rather than performance. Microsoft SQL Server Always On availability groups, for example, can be combined with an execution strategy in a .NET application that uses Entity Framework. Such a strategy can be set to resubmit a SQL command to the database cluster if the server that the application was connected to suddenly becomes unavailable during the initial execution of the command.
If a system has a very large number of users that generate a very large amount of content on a constant basis, you cannot just use one huge database. Writing data to a disk and reading data from it are relatively expensive operations. So is performing a scan on a huge database table. Therefore, even if you had a very powerful server, your system would grind to a halt due to the sheer number of simultaneous reads, writes and table scans.
Load balancing provides only a partial solution to this problem. Although it would reduce the number of simultaneous reads and writes on each server, it will not eliminate the need to scan large tables or indexes. Therefore, as more and more data is generated, the performance will still eventually decrease to unacceptable levels.
One of the best solutions for this is database sharding, which is also known as "shared nothing" architecture. This is a technique of having several databases with the same schema, but different data rows. For example, if an application has millions of users, the users will be split into subsets, each of which will be stored in a separate database on its own server, while each of the databases will be a part of the same cluster and have the same data structure. Just like a row in a database table may have a primary key, each database in the cluster would have its own key to enable the system to quickly identify which cluster node needs to be queried when a particular request is submitted. For example, users may be grouped by geographic location, so the location code would form the unique identifier of a cluster node. This model is known as KKV or Key Key Value and is widely implemented by database systems that have been specifically designed for sharding, such as Cassandra.
Although the name of this architecture suggests that nothing is shared between the nodes, this doesn't accurately describe a sharded database structure. To increase the performance of queries, there would be some degree of duplication, which mainly relates to lookup tables and reference data records.
Of course, sharding doesn't eliminate the need for load balancing, as it improves the performance but has virtually no effect on the availability. Therefore, software architects would need to ensure that every node within the cluster is sufficiently replicated.
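The idea of resubmitting a command after a sudden failover can be illustrated with a generic retry wrapper. The sketch below is written in Python and is not the Entity Framework API: the TransientDatabaseError exception, the execute_with_retry function and its parameters are hypothetical names chosen for this example, and the exponential backoff between attempts is an assumption rather than something prescribed above.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical error raised when the connected node becomes unavailable."""


def execute_with_retry(command, max_retries=3, base_delay=0.1, sleep=time.sleep):
    """Run `command`, resubmitting it if a transient failure occurs.

    `command` is any zero-argument callable that performs the database work.
    `sleep` is injectable so that the waiting can be replaced in tests.
    """
    attempt = 0
    while True:
        try:
            return command()
        except TransientDatabaseError:
            attempt += 1
            if attempt > max_retries:
                # Give up and let the caller see the original error.
                raise
            # Exponential backoff: wait longer after each failed attempt,
            # giving the cluster time to promote another node.
            sleep(base_delay * 2 ** (attempt - 1))
```

Resubmitting is only safe when the command can be repeated without side effects, for example when it runs inside a transaction that was rolled back or when it is idempotent.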
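As a rough illustration of the two-key lookup described above, the following Python sketch uses one in-memory dictionary per shard in place of a real database server. The ShardedUserStore class and its method names are hypothetical, and using the location code as the shard key simply follows the example in the text. Real systems such as Cassandra typically hash the key and handle the routing internally.

```python
class ShardedUserStore:
    """Toy sharded store: one dict per shard stands in for one database per server."""

    def __init__(self, location_codes):
        # Every shard has the same "schema" (user_id -> record) but holds different rows.
        self.shards = {code: {} for code in location_codes}

    def _shard_for(self, location_code):
        """Use the first key to find the only shard that needs to be queried."""
        try:
            return self.shards[location_code]
        except KeyError:
            raise KeyError(f"no shard for location {location_code!r}") from None

    def put(self, location_code, user_id, record):
        self._shard_for(location_code)[user_id] = record

    def get(self, location_code, user_id):
        # First key selects the shard, second key selects the row inside it.
        return self._shard_for(location_code).get(user_id)


# Usage: the same user_id can exist in different shards without conflict.
store = ShardedUserStore(["EU", "US"])
store.put("EU", 1, {"name": "Ada"})
store.put("US", 1, {"name": "Bob"})
```

Because every request carries the shard key, a lookup touches only one node and never has to scan the data held by the others.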
Making sense of the data
Any organisation that manages big data would be aiming to take full advantage of it. Therefore, the process of big data analytics would be required to uncover hidden patterns in the users' behaviour. This is what makes you see personalised content on various websites and ads that are relevant to what you have been searching for on the web. On a larger scale, big data analytics is what drives business decisions for the companies that manage the data.
The field of big data analysis is vast. Some of it is automated via various business intelligence tools. HBase, for example, a NoSQL database that forms a part of the Apache Hadoop database clustering solution, comes with its own set of analytic tools. So do more traditional relational database systems such as SQL Server and Oracle. However, as well as using these tools, many organisations develop bespoke in-house solutions with the help of specially qualified data scientists.
This article was only a brief introduction to how big data applications are built. If you are interested in finding out more, you will find the following links useful:
Written by Fiodar Sazanavets
Posted on 25 Jan 2017