What is Big Data Architecture?
A Big Data architecture is the foundation for Big Data analysis. It is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems. Big Data solutions typically involve one or more of the following workload types: batch processing of Big Data sources at rest; real-time processing of Big Data in motion; interactive exploration of Big Data; and predictive analytics and machine learning.
Most architectures for Big Data include some or all of the following components:
- Data sources: Every Big Data solution starts with one or more data sources. Examples include application data stores such as relational databases, static files produced by applications such as web server log files, and real-time data sources such as IoT devices.
- Data storage: Data for batch processing operations is typically stored in a distributed file store (a data lake) that can hold high volumes of large files in various formats.
- Batch processing: Because the datasets are so large, a Big Data solution often has to process data files with long-running batch jobs that filter, aggregate, and otherwise prepare the data for analysis (see the batch-processing sketch after this list).
- Real-time message ingestion: If the solution includes real-time sources, the architecture must provide a way to capture and store real-time messages for stream processing.
- Stream processing: After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis (a stream-processing sketch follows this list).
- Analytical data store: Many Big Data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried with analytical tools (a query sketch appears after this list).
- Analysis and reporting: Big Data solutions can surface insights from the data through analysis and reporting tools.
- Orchestration: Most Big Data solutions consist of repeated data-processing operations, encoded in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results directly to a report or dashboard (a simple orchestration sketch follows this list).
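
To make the batch-processing step more concrete, here is a minimal sketch, assuming PySpark as the processing engine; the file paths and field names are hypothetical placeholders, and any comparable batch engine could play the same role.

```python
# Minimal batch-processing sketch (assumes PySpark is installed; paths and
# field names are hypothetical placeholders).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-log-batch").getOrCreate()

# Read a day's worth of raw web server logs from the distributed file store.
raw = spark.read.json("/data/raw/weblogs/2024-01-01/")

# Filter out malformed records and aggregate page views per URL.
prepared = (
    raw.filter(F.col("status").isNotNull())
       .groupBy("url")
       .agg(F.count("*").alias("views"))
)

# Write the prepared data to the analytical data store in a columnar format.
prepared.write.mode("overwrite").parquet("/data/curated/page_views/2024-01-01/")

spark.stop()
```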
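The stream-processing step can be sketched in plain Python: the in-memory generator below stands in for a real message broker such as Kafka or Event Hubs, and the device fields are made up for illustration. The loop filters malformed readings and aggregates the rest over a tumbling time window.

```python
# Minimal stream-processing sketch: filter and aggregate a simulated
# real-time message stream over a tumbling time window.
import time
from collections import Counter
from typing import Dict, Iterator


def message_stream() -> Iterator[Dict]:
    """Stand-in for a real-time source (e.g. IoT devices)."""
    samples = [
        {"device": "sensor-1", "temp": 21.5},
        {"device": "sensor-2", "temp": None},   # malformed reading
        {"device": "sensor-1", "temp": 22.0},
    ]
    for msg in samples:
        yield msg
        time.sleep(0.1)


WINDOW_SECONDS = 1.0
window_start = time.time()
counts = Counter()

for msg in message_stream():
    # Filtering: drop readings without a temperature value.
    if msg["temp"] is None:
        continue
    # Aggregation: count valid readings per device in the current window.
    counts[msg["device"]] += 1
    if time.time() - window_start >= WINDOW_SECONDS:
        print(f"window result: {dict(counts)}")
        counts.clear()
        window_start = time.time()

# Flush the final (partial) window so no data is lost.
if counts:
    print(f"final window result: {dict(counts)}")
```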
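Serving data from an analytical data store amounts to exposing the prepared data in a structured, queryable form. In the sketch below a local SQLite database stands in for a real analytical store such as a data warehouse; the table and column names are hypothetical.

```python
# Minimal sketch of querying an analytical data store; SQLite is used here
# only as a self-contained stand-in for a real analytical store.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", 120), ("/products", 75), ("/contact", 12)],
)

# Analytical and reporting tools typically issue queries like this one.
for url, views in conn.execute(
    "SELECT url, views FROM page_views ORDER BY views DESC LIMIT 2"
):
    print(url, views)
```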
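Finally, orchestration ties these repeated operations together into a workflow. The sketch below hand-rolls a tiny dependency graph with Python's standard library; a production solution would normally rely on a dedicated orchestrator (for example Apache Airflow, Oozie, or Azure Data Factory), and the task names here are illustrative only.

```python
# Minimal orchestration sketch using only the standard library; each task
# body is a placeholder for a real processing job.
from graphlib import TopologicalSorter


def ingest_raw_logs():
    print("copy raw log files into the data lake")


def batch_prepare():
    print("filter and aggregate the raw logs")


def load_analytical_store():
    print("load the prepared data into the analytical data store")


def refresh_dashboard():
    print("refresh the reporting dashboard")


# Map each task to the set of tasks it depends on.
workflow = {
    batch_prepare: {ingest_raw_logs},
    load_analytical_store: {batch_prepare},
    refresh_dashboard: {load_analytical_store},
}

# Run the tasks in an order that respects the declared dependencies.
for task in TopologicalSorter(workflow).static_order():
    task()
```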