Wednesday, 04 September 2019 17:57

SAP HANA ML Python APIs : Sequential Pattern Mining Algorithm (SPM)

Source https://blogs.sap.com/2019/09/05/sap-hana-ml-python-apis-sequential-pattern-mining-algorithm-spm/
“© 2020. SAP SE or an SAP affiliate company. All rights reserved.” “Used with permission of SAP SE”
Ensure your environment with HANA ML is up and running; if not, please follow the steps mentioned in the previous blog post.

Introduction and Algorithm description:

The sequential pattern mining algorithm (SPM) searches for frequent patterns in sequence databases. A sequence database consists of ordered elements or events. For example, a customer first buys bread, then eggs and cheese, and then milk; this forms a sequence of three ordered events. An event or sub-sequence is considered frequent if its support, which is the number of sequences that contain this event or sub-sequence, is greater than a certain value. The algorithm finds all patterns in the input sequences that satisfy a user-defined minimum support.
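The idea of "support of an ordered sub-sequence" can be sketched in a few lines of plain Python. This is only an illustration of the definition above with made-up purchase sequences, not the PAL implementation:

```python
def contains_subsequence(sequence, pattern):
    """True if all events of `pattern` appear in `sequence` in order
    (not necessarily contiguously)."""
    it = iter(sequence)
    return all(event in it for event in pattern)

def support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as an ordered sub-sequence."""
    hits = sum(contains_subsequence(s, pattern) for s in sequences)
    return hits / len(sequences)

# Hypothetical customer purchase sequences
sequences = [
    ["bread", "eggs", "milk"],
    ["bread", "milk"],
    ["eggs", "cheese"],
    ["bread", "eggs", "cheese", "milk"],
]

print(support(sequences, ["bread", "milk"]))  # 0.75
```

With a minimum support of, say, 0.5, the pattern (bread -> milk) would be reported as frequent, since it occurs in 3 of the 4 sequences.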

Understand Sequential Pattern Mining before going into practice:

  • Find all subsets of items that occur in a specific sequence across transactions:
    e.g. {Playing cricket -> high ECG -> Sweating}.
  • Find all rules that correlate the order of one set of items following another set of items in the transaction database:
    e.g. 72% of users who perform a web search and then make a long eye gaze
    over the ads follow that with a successful ad-click.

Prerequisites:

  • The input data does not contain null values.
  • There are no duplicated items in each transaction.

Let’s Start:

Import Packages

##First, import packages needed in the data loading.
from hana_ml import dataframe
from data_load_utils import DataSets, Settings

Setup Connection

In our case, the data is loaded from a csv file into a table called “PAL_SPM_DATA_TBL” in HANA. To do that, a connection to HANA is created and then passed to the data loader. To create such a connection, a config file, config/e2edata.ini, is used to control the connection parameters. A sample section of the config file is shown below, which includes the HANA url, port, user and password information.

***************************

[hana]
url=host-url
user=username
passwd=userpassword
port=3xx15

***************************
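Under the hood, a loader like Settings.load_config typically just parses this ini file. A minimal sketch with Python's standard configparser, using hypothetical values (including a made-up port number):

```python
import configparser

# Hypothetical [hana] section, mirroring the format shown above
sample_ini = """
[hana]
url=host-url
user=username
passwd=userpassword
port=30015
"""

cfg = configparser.ConfigParser()
cfg.read_string(sample_ini)

url = cfg.get("hana", "url")
port = cfg.getint("hana", "port")   # parsed as an integer
user = cfg.get("hana", "user")
pwd = cfg.get("hana", "passwd")
print(url, port, user)
```

Keeping credentials in a config file like this (rather than hard-coding them in the notebook) is what allows the same notebook to run against different HANA systems.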

Maintain the login information in one config file and have it ready in your root folder.

url, port, user, pwd = Settings.load_config("../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
print(connection_context.connection.isconnected())

If the connection is successful, this prints “True”.

Data-Set:

We will analyze store data for frequent pattern mining; this is the sample data available on SAP’s help webpage.

Attribute Information:

  1.  CUSTID –   Customer ID
  2.  TRANSID – Transaction ID
  3.  ITEMS – Item of Transaction
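To make the schema concrete, here is a small pandas dataframe with hypothetical rows in the same three-column shape (each row is one item within one transaction of one customer; this is not the actual SAP sample data):

```python
import pandas as pd

# Hypothetical rows mirroring the CUSTID / TRANSID / ITEMS schema
sample = pd.DataFrame({
    "CUSTID":  ["A", "A", "A", "B", "B"],
    "TRANSID": [1, 1, 2, 1, 1],
    "ITEMS":   ["Apple", "Blueberry", "Apple", "Cherry", "Apple"],
})
print(sample)
```

Customer A's first transaction contains two items (Apple and Blueberry), so it spans two rows with the same CUSTID and TRANSID.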

Load Data for SPM:

Check if the table already exists in your schema:

SELECT * FROM PAL_SPM_DATA_TBL

data_tbl = DataSets.load_spm_data(connection_context)

The function DataSets.load_spm_data() decides whether to load or reload the data from scratch. If the data is loaded for the first time, the return message looks like this:

ERROR:hana_ml.dataframe:Failed to get row count for the current Dataframe, (259, ‘invalid table name: Could not find table/view

PAL_SPM_DATA_TBL in schema DM_PAL: line 1 col 37 (at pos 36)’)
Table PAL_SPM_DATA_TBL doesn’t exist in schema DM_PAL

Creating table PAL_SPM_DATA_TBL in schema DM_PAL ….
Drop unsuccessful

Creating table DM_PAL.PAL_SPM_DATA_TBL

Data Loaded:100%

#####################

If the data is already loaded into HANA:

data_tbl = DataSets.load_spm_data(connection_context)
print("Table Name is: " + str(data_tbl))
import pandas as pd

Table PAL_SPM_DATA_TBL exists and data exists
Table Name is: PAL_SPM_DATA_TBL

Create a HANA dataframe from a Pandas dataframe for the data in SAP HANA

##Create a dataframe df_spm from the SPM table for the following steps.
df_spm = pd.DataFrame(columns=['CUSTID', 'TRANSID', 'ITEMS'])
df_spm = dataframe.create_dataframe_from_pandas(connection_context=connection_context,
                                                pandas_df=df_spm,
                                                table_name=data_tbl,
                                                force=False,
                                                replace=True)

Re-execute the command below and check what message it returns:

data_tbl = DataSets.load_spm_data(connection_context)
print("Table Name is: " + str(data_tbl))

Creating table DM_PAL.PAL_SPM_DATA_TBL
Data Loaded:100%
Table Name is: PAL_SPM_DATA_TBL
df = df_spm

Display dataframe records

df.collect().head(100) ##Display Data

df = df.dropna()  ##Drop rows with nulls, if any blank records are present in your dataset (dropna returns a new dataframe)
print("Total Number of Records : " + str(df.count()))
print("Columns:")
df.columns

Data Operations:

Get insights into the dataset using the following operations; you can explore further by applying more functions.

Filter Data:

Fetch all records with CUSTID = ‘A’

df.filter("CUSTID = 'A'").head(10).collect()

Fetch all customers having transaction id = 1

df.filter('TRANSID = 1').head(10).collect()

Fetch all customers and transactions having item = ‘Apple’

df.filter("ITEMS = 'Apple'").head(10).collect()

Group by on Data:

GroupBy ITEMS

df.agg([('count' , 'ITEMS' , 'TOTAL TRANSACTIONS')] , group_by='ITEMS').head(100).collect()

GroupBy CUSTID

df.agg([('count' , 'CUSTID', 'TOTAL TRANSACTIONS')] , group_by='CUSTID').head(100).collect()

GroupBy TRANSID

df.agg([('count' , 'TRANSID', 'TOTAL TRANSACTIONS')] , group_by='TRANSID').head(100).collect()
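The agg calls above correspond to a SQL COUNT(...) GROUP BY pushed down to HANA. The same idea can be illustrated locally in pandas, on a few hypothetical rows (not the actual table contents):

```python
import pandas as pd

# Hypothetical rows in the CUSTID / TRANSID / ITEMS shape
df_local = pd.DataFrame({
    "CUSTID":  ["A", "A", "B"],
    "TRANSID": [1, 2, 1],
    "ITEMS":   ["Apple", "Blueberry", "Apple"],
})

# Equivalent of agg([('count', 'ITEMS', 'TOTAL TRANSACTIONS')], group_by='ITEMS')
counts = (df_local.groupby("ITEMS")["ITEMS"]
          .count()
          .rename("TOTAL TRANSACTIONS")
          .reset_index())
print(counts)
```

The difference is that the hana_ml version runs inside the database and only the aggregated result is transferred to the client when collect() is called.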

Display the most popular items:

Import the matplotlib library to display a plot of the most popular items in our dataset

import matplotlib.pyplot as plt
from wordcloud import WordCloud

plt.rcParams['figure.figsize'] = (10, 10)
wordcloud = WordCloud(background_color='white', width=500, height=500,
                      max_words=120).generate(str(df_spm.head(100).collect()))
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most Popular Items', fontsize=10)
plt.show()

The most popular items in the plot are Apple and Blueberry; if we find the count of transactions for each of them:

Count of Apple – df.filter("ITEMS = 'Apple'").head(100).count() = 10

Count of Blueberry – df.filter("ITEMS = 'Blueberry'").head(100).count() = 10

Import SPM Method from HANA ML Library:

from hana_ml.algorithms.pal.association import SPM

Setup SPM instance:

sp = SPM(conn_context=connection_context, min_support=0.5, relational=False, ubiquitous=1.0, max_len=10, min_len=1, calc_lift=True)

Method

fit(data, customer=None, transaction=None, item=None, item_restrict=None, min_gap=None)

Performs sequential pattern mining on the input data.

Train the model with the dataset:

sp.fit(data=df_spm, customer='CUSTID', transaction='TRANSID', item='ITEMS')

Collect the result

sp.result_.collect()

Result Analysis:

  • Item-set Apple has support 1.0, which indicates the item appears in every transaction; it is the most frequent item. Confidence and lift are 0 for all single items, since a single item has no antecedent and consequent.
  • Consider (Apple, Blueberry): support is 0.88 (these items occur together in 88% of sequences), and confidence is 88%, meaning that if someone buys Apple there is an 88% chance they will also have Blueberry in their basket. The lift of 0.89 is close to 1, indicating the items occur together at roughly the rate expected if they were independent; a lift well above 1 would indicate a strong positive association.
  • The result shows that most item-sets have support and confidence above 70%, indicating highly frequent item-sets that can be considered strong association rules.
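The three metrics above are simple ratios over sequence counts. A back-of-the-envelope check with hypothetical counts (the real values come from sp.result_; these numbers are made up for illustration):

```python
# Hypothetical counts over customer sequences
n_sequences = 10   # total customer sequences
n_apple = 10       # sequences containing Apple
n_blueberry = 9    # sequences containing Blueberry
n_both = 8         # sequences containing the pattern (Apple, Blueberry)

support = n_both / n_sequences                    # how common the pattern is
confidence = n_both / n_apple                     # P(Blueberry | Apple)
lift = confidence / (n_blueberry / n_sequences)   # 1.0 means independence
print(round(support, 2), round(confidence, 2), round(lift, 2))
```

Lift divides the observed confidence by the expected rate of the consequent, so values near 1 mean the antecedent does not change the likelihood of the consequent, while values well above 1 signal genuine association.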

PAL Reading:

SAP HANA Predictive Analysis Library (PAL)

Note – I will be attaching the python notebook and dataset soon, stay tuned.

Feedback/Suggestions/Questions are most welcome!

Thanks,
Shivam
