
(Java language version) spark + hive + hbase offline data analysis, statistics and efficiency experiment

2022-09-23 10:18:51 · Actually I'm real

First, let me introduce the premise of my experiment: all of my data lives in HBase, and my requirement now is to run big-data analysis on that HBase data.

There is one precondition here: all of the data in my HBase is of type String, which raises an HBase sorting problem. HBase sorts lexicographically by ASCII code, so, for example, if speed is a numeric field and I store it as a String, then speed 20 actually sorts as smaller than speed 3. If this is unclear, look up HBase's data-ordering rules. Because of this, it is difficult to query all rows where speed < 20, and querying with only an HBase scan is also very slow. Other big-data technologies are therefore needed to meet my requirements.
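To make the ordering problem concrete, here is a minimal Java sketch; HBase's default binary comparator compares raw bytes, which for ASCII digits behaves like the byte-wise String comparison shown here:

public class LexOrderDemo {

	public static void main(String[] args) {
		// Strings compare byte by byte, so "20" sorts before "3" ('2' < '3').
		System.out.println("20".compareTo("3")); // negative: "20" < "3"
		// Numbers compare by value, which is what we actually want.
		System.out.println(Integer.compare(20, 3)); // positive: 20 > 3
	}

}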

Below are the experiments I ran; the goal is to find a fast big-data analysis solution. At the current stage I only need to generate reports, with no data export.

Of course, the final solution is to use Spark for the big-data analysis; after all, it is the current mainstream technology.

First, my experimental environment: a minimal cluster built from 3 machines. The real production environment is at least 10x faster than my test environment, but testing efficiency in production is not possible at this stage. I am using Linux CentOS 6.5 virtual machines with 16 GB of RAM and 8 CPU cores.

Cluster versions: I am using an HDP environment integrated via Ambari. I list the versions of the components I use because code and caveats differ across versions; if your environment differs from mine, treat this only as a reference.

HDFS 2.7.1.2.4

MapReduce2 2.7.1.2.4

YARN 2.7.1.2.4

Hive 1.2.1.2.4

HBase 1.1.2.2.4

Spark 1.6.0.2.4

Here I first initialized 10 million rows as the dataset.
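The post does not show the loader, but for context, here is a minimal sketch of how such a dataset could be written to HBase. The row key format, the VIN values, and the soc values are assumptions; the family "data" and the columns vin and soc match the queries used later. Connection settings are omitted; in practice set the ZooKeeper quorum as in the scan code further below.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataLoader {

	public static void main(String[] args) throws IOException {
		Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
		Table table = connection.getTable(TableName.valueOf("lwb_test1"));
		List<Put> batch = new ArrayList<Put>();
		for (long i = 0; i < 10000000L; i++) {
			// Zero-padded row key; soc is stored as a String, which is exactly
			// what causes the lexicographic ordering problem described above.
			Put put = new Put(Bytes.toBytes(String.format("%010d", i)));
			put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("vin"), Bytes.toBytes("VIN" + i));
			put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("soc"), Bytes.toBytes(String.valueOf(i % 100)));
			batch.add(put);
			if (batch.size() == 10000) { // write in batches to avoid per-row RPCs
				table.put(batch);
				batch.clear();
			}
		}
		if (!batch.isEmpty()) {
			table.put(batch);
		}
		table.close();
		connection.close();
	}

}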

Experiment 1: Hive combined with HBase

Hive can solve the HBase sorting problem. Since using HBase is a hard precondition, I created an EXTERNAL table, i.e. a Hive external table. For the exact Hive table-creation statement, refer to my other post:

https://blog.csdn.net/lwb314/article/details/80346993
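I won't repeat the full DDL here, but for orientation, here is a minimal sketch of what an HBase-backed external table definition looks like when issued over JDBC. The column list is an assumption based on the queries below; the exact statement is in the linked post.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateExternalTable {

	public static void main(String[] args) throws Exception {
		Class.forName("org.apache.hive.jdbc.HiveDriver");
		Connection con = DriverManager.getConnection("jdbc:hive2://10.10.171.169:10000", "hive", "hive");
		Statement stmt = con.createStatement();
		// EXTERNAL plus the HBaseStorageHandler means Hive reads the existing
		// HBase table in place; dropping the Hive table leaves the HBase data alone.
		stmt.execute("CREATE EXTERNAL TABLE lwb_test1 (key string, vin string, soc string) "
				+ "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
				+ "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,data:vin,data:soc') "
				+ "TBLPROPERTIES ('hbase.table.name' = 'lwb_test1')");
		stmt.close();
		con.close();
	}

}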

The experiment here is to query rows where the SOC field is < 20.

First, Hive; the code is as follows:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class HiveTest2 {

	/*
	 * Some older examples online load the driver class
	 * org.apache.hadoop.hive.jdbc.HiveDriver (with an extra "hadoop");
	 * that appears to be from an older version. My Hive version is 1.2.1.
	 */
	private static String driverName = "org.apache.hive.jdbc.HiveDriver";

	public static void main(String[] args) {
		try {
			Class.forName(driverName);
			Connection con = null;
			con = DriverManager.getConnection("jdbc:hive2://10.10.171.169:10000", "hive", "hive"); // older versions may have used jdbc:hive: instead
			Statement stmt = con.createStatement();
			ResultSet res = null;
			// String sql =
			// "select * from lwb_test1 where vin = 'LBVHY1101JMK00005' and soc<10 limit 10";
			String sql = " select count(1) from lwb_test1 where soc<20";
			System.out.println("");
			System.out.println("Running: " + sql);
			long startTime = System.currentTimeMillis();
			res = stmt.executeQuery(sql);
//			ResultSetMetaData rsm = res.getMetaData(); // get the column names
//			for (int i = 0; i < rsm.getColumnCount(); i++) {
//				System.out.print(rsm.getColumnName(i + 1).split("\\.")[1] + "\t"); // print the column names
//			}
//			}
			System.out.println();
			long k = 0;
			while (res.next()) {
				k++;
				//System.out.println(res.getString(1));
//				System.out.println(res.getString(1) + "\t" + res.getString(2) + "\t" + res.getInt(3) + "\t" + res.getInt(4) + "\t"
//						+res.getString(5) + "\t" + res.getString(6) );
			}
			long endTime = System.currentTimeMillis();
			System.out.println("用时:" + (endTime - startTime));
			System.out.println("总共条数:" + k);
			stmt.close();
			con.close();
		} catch (Exception e) {
			e.printStackTrace();
			System.out.println("error");
		}

	}

}

Result:

Running:  select count(1) from lwb_test1 where soc<20

Record count: 2001000
Time taken: 50430

Here, the conditional query counted 2 million matching rows out of Hive's 10 million, taking about 50 seconds. That was the count query.

I then modified the statement as follows, selecting all fields so the result set is actually traversed:

select * from lwb_test1 where soc<23

I also kept the print statements inside the loop commented out, since printing reduces execution efficiency.

The program output is as follows:

Running:  select * from lwb_test1 where soc<22
Time taken: 160525
Total rows: 2201156

Below is my HBase query code:

import java.io.IOException;
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.util.Bytes;


public class SocScan {

	public static Configuration configuration;
	public static String zkHost = "devhadoop3,devhadoop2,devhadoop1";
	public static String zkPort = "2181";
	public static String zkParent = "/hbase-unsecure";
	private static Connection connection;

	static {
		configuration = HBaseConfiguration.create();
		configuration.set("hbase.zookeeper.quorum", zkHost);
		configuration.set("hbase.zookeeper.property.clientPort", zkPort);
		configuration.set("zookeeper.znode.parent", zkParent);
		try {
			connection = ConnectionFactory.createConnection(configuration);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	public static void main(String[] args) throws Exception {
		try {

			Table table = connection.getTable(TableName.valueOf("lwb_test1"));
			Scan scan = new Scan();
			// scan.setStartRow(start.getBytes());
			// scan.setStopRow(stop.getBytes());
			// scan.setReversed(true);
			// Reversed scan: the start and stop row keys must be swapped as well.
			// For example, if start was 1 and stop was 3 before, then with the
			// reversed scan, start is 3 and stop is 1.
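			// Note: SingleColumnValueFilter with a byte[] value uses HBase's
			// default binary comparator, so "soc" LESS "29" is a lexicographic
			// byte comparison, not a numeric one.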
			Filter filter = new SingleColumnValueFilter("data".getBytes(), "soc".getBytes(), CompareOp.LESS, "29".getBytes());
			scan.setFilter(filter);
			scan.setCaching(1000);
			scan.setCacheBlocks(false);
			long startTime = System.currentTimeMillis();
			ResultScanner scanner = table.getScanner(scan);
			byte[] vin = "vin".getBytes();
			byte[] soc = "soc".getBytes();
			long k=0;
			for (Result result : scanner) {
//				NavigableMap<byte[], byte[]> map = result.getFamilyMap("data".getBytes());
//				System.out.println(new String(map.get(vin)) + "," + new String(map.get(soc)));
				k++;
			}
			long endTime = System.currentTimeMillis();
			System.out.println("用时:"+(endTime-startTime));
			System.out.println("总共条数:"+k);
			scanner.close();
			table.close();
			connection.close();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}

HBase query result:

Time taken: 83657
Total rows: 2200369

Experiment 1 conclusion:

For a simple conditional query over the same base data, returning the same result set under the same conditions, Hive takes roughly twice as long as HBase. The Hive results, however, are correct: Hive implicitly casts the String soc to a number for the comparison, whereas the HBase filter compares raw bytes.

Experiment 2: querying a Hive internal table

First, create a table owned by Hive itself, using just the column-definition line of the DDL that created the HBase-associated table (that is, without the HBase storage-handler clauses). The table contains no data after creation; the statement to copy the data over is as follows:

insert into lwb_test3 select * from lwb_test1;

This selects all rows directly from the lwb_test1 table and stores them into lwb_test3.
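For reference, here is a minimal sketch of that setup issued over JDBC; the column list is again an assumption, mirroring the external table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateInternalTable {

	public static void main(String[] args) throws Exception {
		Class.forName("org.apache.hive.jdbc.HiveDriver");
		Connection con = DriverManager.getConnection("jdbc:hive2://10.10.171.169:10000", "hive", "hive");
		Statement stmt = con.createStatement();
		// A native Hive table: same columns, but the data is stored by Hive
		// itself in HDFS instead of being read from HBase on every query.
		stmt.execute("CREATE TABLE lwb_test3 (key string, vin string, soc string)");
		// Copy everything out of the HBase-backed table; in this Hive 1.2
		// setup, this runs as a MapReduce job.
		stmt.execute("insert into lwb_test3 select * from lwb_test1");
		stmt.close();
		con.close();
	}

}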

After that, the same Hive code works; just change the table name. As you can see, the time is slightly faster than the HBase scan, and across multiple tests it is basically the same as the HBase scan time.

Running:  select * from lwb_test3 where soc<22

Time taken: 79297

Total rows: 2201156

Experiment 2 conclusion:

Using the internal table, Hive's query time is half that of the external table, with basically the same efficiency as an HBase scan, and the results are correct.
