
In this Snowflake data warehouse article, I will explain how to read a Snowflake table into a Spark DataFrame and walk through the different connection properties, with examples in Scala.

Pre-requisites

  • Snowflake data warehouse account
  • Basic understanding of Spark and an IDE to run Spark programs

If you are reading this tutorial, I believe you already know what the Snowflake database is. In case you are not aware, in simple terms Snowflake is a purely cloud-based data storage and analytics data warehouse provided as Software-as-a-Service (SaaS).

Snowflake is architected and designed as an entirely new SQL database engine built to work with cloud…
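To give a feel for what the read looks like, here is a minimal sketch using the spark-snowflake connector; it assumes the snowflake-jdbc and spark-snowflake packages are on the classpath, and all connection values and the table name are placeholders for your own account.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// Connection options -- every value below is a placeholder for your own account
val sfOptions = Map(
  "sfURL"       -> "<account>.snowflakecomputing.com",
  "sfUser"      -> "<user>",
  "sfPassword"  -> "<password>",
  "sfDatabase"  -> "<database>",
  "sfSchema"    -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

// Read a Snowflake table into a DataFrame via the spark-snowflake connector
val df: DataFrame = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "EMPLOYEE") // hypothetical table name
  .load()

df.show(false)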


In this Spark article, you will learn how to convert a Parquet file to the Avro file format with a Scala example. In order to convert, we will first read a Parquet file into a DataFrame and then write it out as an Avro file.

What is Apache Parquet

Apache Parquet is a columnar file format that provides optimizations to speed up queries and is a far more efficient file format than CSV or JSON, supported by many data processing systems.

It is compatible with most of the data processing frameworks in the Hadoop ecosystem. …
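As a rough sketch of the conversion described above, assuming Spark 2.4+ with the external spark-avro package on the classpath; the file paths are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// Read an existing Parquet file into a DataFrame (path is a placeholder)
val parquetDF = spark.read.parquet("/tmp/data/person.parquet")

// Write the same data back out in Avro format
parquetDF.write
  .format("avro")
  .mode("overwrite")
  .save("/tmp/data/person.avro")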


We often need to create an empty RDD in Spark, and an empty RDD can be created in several ways, for example, with partitions, without partitions, and as a pair RDD. In this article, we will see these with Scala, Java and PySpark examples.

Spark sc.emptyRDD — Creates empty RDD with no partition

In Spark, using the emptyRDD() function on the SparkContext object creates an empty RDD with no partitions or elements. The examples below create an empty RDD.

val spark: SparkSession = SparkSession.builder()
  .master("local[3]")
  .appName("SparkByExamples.com")
  .getOrCreate()

val rdd = spark.sparkContext.emptyRDD // creates EmptyRDD[0]
val rddString = spark.sparkContext.emptyRDD[String] // creates EmptyRDD[1]

println(rdd)
println(rddString)
println("Num of Partitions: " + rdd.getNumPartitions) // returns 0 partitions

From the above…
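The article also covers empty RDDs with partitions and empty pair RDDs; here is a minimal sketch of those two variants, reusing the spark session created above (the variable names are my own).

// Empty RDD with (default) partitions, created by parallelizing an empty Seq
val rddWithPartitions = spark.sparkContext.parallelize(Seq.empty[String])
println("Num of Partitions: " + rddWithPartitions.getNumPartitions) // > 0

// Empty pair RDD (key/value), useful when a downstream API expects (K, V)
val emptyPairRDD = spark.sparkContext.emptyRDD[(String, Int)]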


Spark SQL provides Encoders to convert a case class to a Spark schema (StructType object). If you are using older versions of Spark, you can create the schema from a case class using a Scala hack. Both options are explained here with examples.

First, let’s create the case classes “Name” and “Employee”.

case class Name(first:String,last:String,middle:String) 
case class Employee(fullName:Name,age:Integer,gender:String)

Using Spark Encoders to convert case class to schema

The Encoders object has a product method that takes the Scala case class "Employee" as a type parameter, and the schema method on the resulting encoder returns the schema of the Employee class.

val encoderSchema = Encoders.product[Employee].schema
encoderSchema.printTreeString()

printTreeString() on the schema object outputs the schema below.

root 
|-- fullName…
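For the older-version “Scala hack” mentioned above, here is a minimal sketch using Catalyst’s ScalaReflection; note this is an internal API, so treat it as unofficial.

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Derive the schema for the Employee case class via Catalyst reflection
val schemaFromReflection = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]
schemaFromReflection.printTreeString()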

This tutorial explains different Spark connectors and libraries for interacting with the HBase database, and provides a Hortonworks connector example of how to create a DataFrame from an HBase table and insert a DataFrame into a table.

On the internet, you will find several ways and APIs to connect Spark to HBase, and some of these are outdated or not maintained properly. Here, I will explain some of the libraries and what they are used for, and later we will see some Spark SQL examples.

Let’s see these in detail.

Apache HBase Client

The Apache hbase-client API comes with the HBase distribution, and you can find this jar in the /lib directory of your installation…
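The Hortonworks connector example is covered later in the article (truncated here); as a rough sketch of what writing a DataFrame with the shc connector typically looks like, assuming the shc-core package is on the classpath and a running HBase cluster; the table, column family and column names are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
import spark.implicits._

// Catalog mapping DataFrame columns to an HBase table (names are illustrative)
val catalog =
  """{
    |"table":{"namespace":"default", "name":"employee"},
    |"rowkey":"key",
    |"columns":{
    |  "key":{"cf":"rowkey", "col":"key", "type":"string"},
    |  "fName":{"cf":"person", "col":"firstName", "type":"string"},
    |  "city":{"cf":"address", "col":"city", "type":"string"}
    |}
    |}""".stripMargin

val df = Seq(("1", "James", "Newark")).toDF("key", "fName", "city") // sample rows

// Write the DataFrame to HBase through the Hortonworks shc connector
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "4"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()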


Parquet is a columnar file format that provides optimizations to speed up queries and is a far more efficient file format than CSV or JSON, supported by many data processing systems. Spark SQL provides support for both reading and writing Parquet files and automatically preserves the schema of the original data. Parquet also reduces data storage by 75% on average. Below are some advantages of storing data in Parquet format.

  • Reduces IO operations.
  • Fetches specific columns that you need to access.
  • Consumes less space.
  • Supports type-specific encoding.

Let’s see some examples using Parquet.

val data = Seq(("James ","","Smith","36636","M",3000), ("Michael…


Like the SQL “case when” statement, Spark supports similar syntax using when/otherwise, or we can also use a “case when” expression. So let’s see an example of how to check for multiple conditions and replicate the SQL CASE statement in Spark.

First, let’s do the needed imports and create the SparkSession and DataFrame.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{when, _}

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
import spark.sqlContext.implicits._

val data = List(("James ","","Smith","36636","M",60000),
  ("Michael ","Rose","","40288","M",70000),
  ("Robert ","","Williams","42114","",400000),
  ("Maria ","Anne","Jones","39192","F",500000),
  ("Jen","Mary","Brown","","F",0))

val cols = Seq("first_name","middle_name","last_name","dob","gender","salary")
val df = spark.createDataFrame(data).toDF(cols:_*)

Using “when otherwise” on a DataFrame.

val df2 = df.withColumn("new_gender", when(col("gender") === "M","Male").when(col("gender") === "F","Female").otherwise("Unknown"))

“when” is a Spark function; we should…
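The article also mentions the “case when” form; here is a minimal sketch of the equivalent expression using expr(), reusing the df created above.

import org.apache.spark.sql.functions.expr

// Same logic expressed as a SQL-style CASE WHEN inside expr()
val df3 = df.withColumn("new_gender", expr(
  "case when gender = 'M' then 'Male' " +
  "when gender = 'F' then 'Female' " +
  "else 'Unknown' end"))
df3.show(false)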


This post explains different approaches to creating a DataFrame ( createDataFrame() ) in Spark with Scala examples, e.g., how to create a DataFrame from an RDD, List, Seq, TXT, CSV, JSON, or XML file, a database, etc. Though we have covered most of the examples in Scala here, the same concepts can be used to create a DataFrame in PySpark (Python Spark).

DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. …
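To set the scene, here is a minimal sketch of two common approaches, creating a DataFrame from a local Seq and from an RDD with an explicit schema; the column names and data are illustrative.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
import spark.implicits._

// 1. From a local Seq using toDF()
val dfFromSeq = Seq(("James", "Smith"), ("Michael", "Rose")).toDF("first_name", "last_name")

// 2. From an RDD of Rows with an explicit schema using createDataFrame()
val schema = StructType(Seq(
  StructField("first_name", StringType, nullable = true),
  StructField("last_name", StringType, nullable = true)))
val rowRDD = spark.sparkContext.parallelize(Seq(Row("Robert", "Williams")))
val dfFromRDD = spark.createDataFrame(rowRDD, schema)

dfFromSeq.show()
dfFromRDD.show()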


Kafka Delete Topic — every message Apache Kafka receives is stored in a log and, by default, messages are kept for 168 hours, which is 7 days. Deleting a topic or all of its messages can be done in several ways, and the rest of the article explains these.

Before you try the examples below, make sure you have at least one Kafka topic created and some messages in it. And when you run these commands in production, be extra cautious, as they are permanent and you can’t retrieve the messages back.

1. Delete all messages from the topic

The Apache Kafka distribution comes with the bin/kafka-configs.sh script…
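Besides the CLI scripts the article covers, the same cleanup can be done programmatically; here is a minimal Scala sketch using Kafka’s AdminClient, where the broker address and topic name are placeholders. This deletes the whole topic rather than just its messages.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

// Connect to the cluster (broker address is a placeholder)
val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val admin = AdminClient.create(props)

// Permanently deletes the topic and all its messages
// (requires delete.topic.enable=true on the brokers)
admin.deleteTopics(Collections.singletonList("my-topic")).all().get()
admin.close()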

NNK
