Dataframes

The first thing you will come across when you start learning Apache Spark is DataFrames.

DataFrames in Apache Spark are similar to tables in an RDBMS, just better optimized. A DataFrame is a table-like structure with named columns, a concept also used in R and in Python's pandas library.

Let's see how they work.

I am using the file available here for demonstration.
To start the Apache Spark shell with the Python API, go to the bin directory of your Apache Spark installation and run the command:

.\pyspark

Something like this should show up in the command prompt.

To import the data from the file, use the command:

>>> employee_dataframe = spark.read.csv("D:/apache-course/employees.csv", header=True)

The parameter header=True tells Spark that the first row of the file contains the column names.
The command will give no output. Spark evaluates DataFrames lazily, so the file is not actually read until an action is called on the DataFrame.
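As a quick sanity check, you can trigger an action yourself; show() and count() are two common ones:

# show() prints the first n rows in tabular form, forcing Spark to read the file.
employee_dataframe.show(5)

# count() returns the total number of rows.
employee_dataframe.count()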

To see the DataFrame just created, use the command:

>>> employee_dataframe

The above command will show the structure of the DataFrame just created: its column names and their types.
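For example, if employees.csv contained employee_id, name, and salary columns (hypothetical names, since the file's actual layout is not reproduced here), the output would look something like this:

DataFrame[employee_id: string, name: string, salary: string]

Note that every column is read as a string by default; passing inferSchema=True to spark.read.csv makes Spark scan the file and infer more specific types.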

Other ways to see the structure are:

employee_dataframe.schema

To get a slightly more readable output, use

employee_dataframe.printSchema()
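With the same hypothetical columns, the output looks roughly like this:

root
 |-- employee_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: string (nullable = true)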

To get a random sample of the data, use

sample = employee_dataframe.sample(False, 0.1)

The first argument, False, tells Spark that we want sampling without replacement, and the second argument tells Spark that we want a sample of approximately 10% of the total data.
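Since the sampling is random, repeated calls return different rows. A minimal sketch, using the optional third argument, seed, to make the sample reproducible:

# Sample roughly 10% of rows without replacement; a fixed seed makes it repeatable.
sample = employee_dataframe.sample(False, 0.1, seed=42)
sample.show()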

We also have a filter function to filter records based on a given condition.
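A minimal sketch, again assuming the hypothetical salary column from above; since inferSchema was not used, the column is a string and needs a cast before a numeric comparison:

from pyspark.sql.functions import col

# Keep only rows where salary (cast to an integer) exceeds 50000.
high_earners = employee_dataframe.filter(col("salary").cast("int") > 50000)
high_earners.show()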

For a complete list of functions available on DataFrames, see this page.

Thank you for reading this article. Many more are on their way, so stay tuned.
