Stock Analysis with PySpark

Gianp
4 min read · Aug 31, 2020


A short stock analysis using PySpark. The pyspark.sql module will be used to illustrate how to use the SQL language in PySpark.

Here a stock analysis is performed to show what a powerful technology Spark is. The analyses of interest are based on the following data set/file.

Prices.txt:
* Prices.txt is a text file containing historical information about the prices of stocks on several financial markets
* The sampling rate is 5 minutes (i.e., every 5 minutes the system collects the prices of the stocks under analysis and a new line for each stock is appended to Prices.txt)
* Each line of the input text file has the following format:
stockID, date, hour:minute, price

The tasks will be explained in a while. Let’s set up the environment. First of all, we have to initialise a SparkContext object:

A SparkContext object is a bridge between the application and the cluster. It is used to create RDDs, accumulators and broadcast variables on that cluster. The SparkSession object allows us to work with DataFrames (it is the DataFrame entry point).

Now, let’s read the dataset and create a DataFrame with the input data.

  • inputPath is the string containing the path to the dataset. It will be useful to have a new column with just the year of each stock reading, so that rows can be filtered by year. The datetime Python library helps us here. Since the date column contains dates in the format yyyy/mm/dd, we can use the strptime/strftime methods to convert the input date to the wanted output date: from “%Y/%m/%d” to “%Y”. Wonderful! This is a case in which the user needs to apply a custom function to a Spark DataFrame. In order to apply a user-defined function to a DataFrame, that function has to be registered. This is done using Spark’s User Defined Function (UDF) mechanism: we assign a name to the function, “toYear”, together with its definition. Now, we can create a new DataFrame starting from the initial df, containing the columns “id”, “date”, “year”, “hour” and “price”, where the new column “year” is obtained by applying the registered UDF to each row of the date column.

The first task is about selecting stocks frequently characterized by a daily price greater than 10 euros. In particular, considering only the historical data of year 2016, the application must compute for each stock in how many distinct dates its price was greater than 10 euros. The application selects and stores in an HDFS folder only those stocks associated with at least 5 distinct dates with a price greater than 10 euros. The output contains one line for each of the selected stocks. Specifically, each output line contains a stock id and the number of distinct dates with a price greater than 10 euros for that stock (e.g., FCAU,34). The DataFrame dfFiltered contains only the readings collected in 2016 with a price higher than 10 euros. So, it is enough to select only id and date, group by id and count how many rows there are. Finally, we filter out the rows with a count lower than five.

Now, let’s suppose that the finance company is also interested in identifying the stocks that are frequently characterized by a “positive weekly trend”. That is, the company asks for all stocks for which the price at the end of a week is greater than the price at the beginning of the same week. In particular, the company wants to select only the stocks for which this event happens at least NW times.

Specifically, given a week of the year and a stock, the weekly trend of the stock in that week is classified as a “positive weekly trend” if the difference between the highest stock price of the last day of the week and the highest stock price of the first day of the week is greater than 0. The application must select those stocks that are characterized by at least NW “positive weekly trends” (i.e., at least NW weeks with a positive trend for each of the selected stocks). NW is an integer number and is a parameter of the application. The analysis is based on the historical data stored in Prices.txt, considering only year 2016. Don’t worry, we all know how crazy financial markets are.

The stock FCAU has got three positive weekly trends, while GOOG has two.
