HIVE
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. A data warehouse provides a central store of information that can easily be analyzed to make informed, data-driven decisions. Hive allows users to read, write, and manage petabytes of data using SQL.
Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and process large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to work quickly on petabytes of data. What makes Hive unique is the ability to query large datasets, leveraging Apache Tez or MapReduce, with a SQL-like interface.
No one can better explain what Hive in Hadoop is than the creators of Hive themselves: "The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. The structure can be projected onto data already in storage."
In other words, Hive is an open-source system that processes structured data in Hadoop, sitting on top of it to summarize Big Data and to make analysis and queries easier.
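To make the SQL angle concrete, here is a minimal HiveQL sketch, assuming a working Hive installation; the page_views table and its columns are hypothetical names chosen for illustration.

```sql
-- Define a table over data stored in HDFS (hypothetical names).
CREATE TABLE IF NOT EXISTS page_views (
  user_id BIGINT,
  url     STRING,
  view_ts TIMESTAMP
);

-- Query it with ordinary SQL syntax; behind the scenes, Hive compiles
-- this into Tez or MapReduce jobs that run across the cluster.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```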
Now that we have investigated what Hive in Hadoop is, let's look at its features and characteristics.
Hive chiefly consists of three core parts:
- Hive Clients: Hive offers a variety of drivers designed for communication with different applications. For example, Hive provides Thrift clients for Thrift-based applications. These clients and drivers then communicate with the Hive server, which falls under Hive services.
- Hive Services: Hive services perform client interactions with Hive. For example, if a client wants to perform a query, it must talk with Hive services.
- Hive Storage and Computing: Hive services such as the file system, job client, and metastore then communicate with Hive storage and store things like metadata table information and query results.
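As a rough illustration of the client-to-metastore path, the statements below are answered largely from metadata held in the metastore rather than from the data files themselves (page_views is the hypothetical table from the earlier sketch).

```sql
-- Listing databases and tables reads metadata from the metastore.
SHOW DATABASES;
SHOW TABLES;

-- DESCRIBE FORMATTED returns the stored table metadata: columns,
-- HDFS location, file format, owner, and so on.
DESCRIBE FORMATTED page_views;
```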
Hive's Features
These are Hive's chief characteristics:
- Hive is designed for querying and managing only structured data stored in tables
- Hive is scalable, fast, and uses familiar concepts
- The schema gets stored in a database, while processed data goes into the Hadoop Distributed File System (HDFS)
- Tables and databases get created first; then data gets loaded into the proper tables
- Hive supports several file formats, including ORC, SEQUENCEFILE, RCFILE (Record Columnar File), and TEXTFILE
- Hive uses a SQL-inspired language, sparing the user from dealing with the complexity of MapReduce programming. It makes learning more accessible by utilizing familiar relational-database concepts such as columns, tables, rows, and schemas. The most significant difference between the Hive Query Language (HQL) and SQL is that Hive executes queries on Hadoop's infrastructure instead of on a traditional database
- Since Hadoop works on flat files, Hive uses directory structures to "partition" data, improving performance on specific queries
- Hive supports partitions and buckets for fast and simple data retrieval (see the sketch after this list)
- Hive supports custom user-defined functions (UDFs) for tasks like data cleansing and filtering; UDFs can be defined according to programmers' requirements
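The sketch below shows partitioning, bucketing, and one of the supported file formats (ORC) together; the events table and its columns are hypothetical.

```sql
-- Each value of the partition column becomes an HDFS subdirectory,
-- e.g. .../events/event_date=2024-01-01/.
CREATE TABLE events (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS  -- buckets within each partition
STORED AS ORC;

-- Filtering on the partition column lets Hive prune directories and
-- scan only the matching partition instead of the whole table.
SELECT COUNT(*)
FROM events
WHERE event_date = '2024-01-01';
```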
- Hive supports Online Analytical Processing (OLAP), but not Online Transaction Processing (OLTP).
- Its subquery support is limited.
- It has high latency.
- Hive tables traditionally don't support delete or update operations (recent versions add limited ACID support on transactional tables).
How Does Data Flow in Hive?
1. The data analyst executes a query with the User Interface (UI).
2. The driver interacts with the query compiler to retrieve the plan, which consists of the query execution process and metadata information. The driver also parses the query to check syntax and requirements.
3. The compiler creates the job plan (metadata) to be executed and communicates with the metastore to retrieve a metadata request.
4. The metastore sends the metadata information back to the compiler.
5. The compiler relays the proposed query execution plan to the driver.
6. The driver sends the execution plan to the execution engine.
7. The execution engine (EE) processes the query by acting as a bridge between Hive and Hadoop. The job process executes in MapReduce.
8. The execution engine sends the job to the JobTracker, found in the Name node, and assigns it to the TaskTracker, in the Data node. While this is happening, the execution engine executes metadata operations with the metastore.
9. The results are retrieved from the data nodes.
10. The results are sent to the execution engine, which, in turn, sends them back to the driver and the front end (UI).
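One way to see the plan the compiler hands to the driver is Hive's EXPLAIN statement, sketched here against the hypothetical page_views table; the output describes the stages (map/reduce or Tez tasks) rather than a traditional database plan.

```sql
-- Print the query execution plan instead of running the query.
EXPLAIN
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```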
Since we have gone on at length about what Hive is, we should also touch on what Hive is not:
- Hive isn't a language for row-level updates and real-time queries
- Hive isn't a relational database
- Hive isn't designed for Online Transaction Processing
Now that we have looked into what Hive is, let us learn about the Hive modes.
Hive Modes
Depending on the size of the data and the number of Hadoop data nodes, Hive can operate in two different modes:
- Local mode
- MapReduce mode
Use Local mode when:
- Hadoop is installed in pseudo mode, with only one data node
- The data size is smaller and limited to a single local machine
- Users expect faster processing because the local machine contains smaller datasets
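As a sketch, the standard Hive properties below nudge sufficiently small jobs into local execution; exact defaults and thresholds vary by Hive version.

```sql
-- Let Hive automatically run small jobs on the local machine
-- instead of submitting them to the cluster.
SET hive.exec.mode.local.auto=true;

-- Jobs qualify for local mode only below these input limits.
SET hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128 MB
SET hive.exec.mode.local.auto.input.files.max=4;
```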
Use MapReduce mode when:
- Hadoop has multiple data nodes, and the data is distributed across these different nodes
- Users must deal with larger datasets
MapReduce is Hive's default mode.
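For completeness, the execution engine itself is a per-session setting; which values are available depends on how Hive was built and on the cluster, so treat this as a sketch.

```sql
-- Run queries on classic MapReduce (the historical default).
SET hive.execution.engine=mr;
-- SET hive.execution.engine=tez;  -- Tez, the default in newer distributions
```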