Discovering Patterns in Pattern Mining

Freddy Domínguez
7 min read · Dec 16, 2022


How to use emerging patterns and skypatterns in the real world

The challenge of decentralization: Mining and development, powered by Adobe Firefly

The main reason to use pattern mining in government agencies is to identify trends and patterns that inform policy decisions and improve the efficiency and effectiveness of government programs and services.

As many of you know, Peru's agriculture loses $100 million a day due to the latest riots. Government institutions constantly seek to understand and make use of the collected data in order to take action on different agricultural programs and initiatives.

Our goal for this article is to identify the most profitable crops for each province in the agriculture sector.

About the data

The siembra dataset is from the Peruvian government’s open database, and it contains statistical information on planting intentions that was gathered through a national survey from 2020 to 2021. The fields in the dataset include the department, province, district, crop, campaign, and the months from August 2020 to July 2021, each containing a number that reflects the planting intentions of each farmer.

First 5 items from dataset

First, install the following libraries: the paretoset package, which finds the Pareto front of a multi-objective optimization problem; scikit-plot, to quickly produce data visualizations; pyfim, for finding frequent item sets (also known as frequent patterns) in transaction data using algorithms such as Apriori and FP-growth; and finally hypernetx, which provides data structures and algorithms for analyzing and manipulating hypergraphs.

from IPython.display import clear_output 
import pandas as pd
import numpy as np
import scikitplot as skplt
import matplotlib.pyplot as plt
from paretoset import paretoset
from fim import *
import plotly.express as px

Next, we read the Excel file, rename the PROVINICA column to PROVINCIA, and update the DISTRITO column by concatenating the department, province, and district names to obtain a unique ID, because many districts and provinces share the same name; the result is stored in a new variable. To prepare the data, we keep only the DISTRITO and CULTIVO columns and remove duplicated rows.

df_siembra = pd.read_excel('https://www.datosabiertos.gob.pe/sites/default/files/Estadisticas%20Intension%20de%20Siembra.xlsx')
df_siembra.rename(columns={'PROVINICA':'PROVINCIA'},inplace=True)
df_siembra_freq = df_siembra.copy()
df_siembra_freq['DISTRITO'] = df_siembra_freq['DEPARTAMENTO'] + '-' + df_siembra_freq['PROVINCIA'] + '-' + df_siembra_freq['DISTRITO']
df_siembra_freq = df_siembra_freq[['DISTRITO','CULTIVO']]
df_siembra_freq.drop_duplicates(inplace=True)

For the purpose of extracting patterns, we need to transform our frequency dataset into a transactional one, so we create a to_transactional function. It iterates over each unique value of the column specified by column_trans; for each value it collects the columns_items values of all matching rows into a list, and finally returns the list of transactions:

def to_transactional(df, column_trans, columns_items):
    transactions = []
    for v in df[column_trans].unique():
        transactions.append(list(df[df[column_trans] == v][columns_items].values))
    return transactions
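To make the transformation concrete, here is a minimal, self-contained sketch on made-up data (the district and crop names are hypothetical):

```python
import pandas as pd

# The article's helper, repeated here so the example is self-contained
def to_transactional(df, column_trans, columns_items):
    transactions = []
    for v in df[column_trans].unique():
        transactions.append(list(df[df[column_trans] == v][columns_items].values))
    return transactions

# Toy frequency dataset: one row per (district, crop) pair
toy = pd.DataFrame({
    'DISTRITO': ['A', 'A', 'B'],
    'CULTIVO':  ['PAPA', 'MAIZ', 'PAPA'],
})
toy_trans = to_transactional(toy, 'DISTRITO', 'CULTIVO')
print(toy_trans)  # [['PAPA', 'MAIZ'], ['PAPA']]
```

Each district becomes one transaction whose items are the crops sown there.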

The fpgrowth function from fim finds frequent item sets in transactional data using the FP-growth algorithm, which is efficient because it builds a compact data structure. The function takes a list of transactions, the type of frequent item sets ('a' means all, 'c' means closed), and a minimum support threshold, and returns a list of frequent item sets whose support is greater than or equal to the supp parameter. The last parameter, report='aS', indicates that it returns the (a)bsolute item set support and (S) the relative support as a percentage.

def all_itemsets(trans_, supp_=-1, target_='a'):
    # Perform the fpgrowth algorithm on the given transactions, with the specified target item(s) and minimum support
    r = fpgrowth(trans_, target=target_, supp=supp_, report='aS')
    # Convert the result (a list of tuples) into a dataframe, with columns 'Items', 'Freq', and 'Freq(%)'
    df_items = pd.DataFrame(r)
    df_items.columns = ['Items', 'Freq', 'Freq(%)']
    # Sort the dataframe in descending order by the 'Freq' column
    df_items.sort_values(by='Freq', ascending=False, inplace=True)
    # Create a new column called 'Itemset' which contains the items in each frequent itemset sorted in alphabetical order
    df_items['Itemset'] = [str(sorted(x)) for x in df_items['Items'].tolist()]
    # Return the resulting dataframe
    return df_items
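To see what "closed" means without pyfim, here is a brute-force sketch (a hypothetical helper, only practical for tiny data, not the library's actual implementation): an itemset is closed when no proper superset has the same support, which is what fpgrowth(target='c') returns.

```python
from itertools import combinations

def brute_force_closed(transactions, min_supp=1):
    """Enumerate closed frequent itemsets by brute force (tiny data only)."""
    items = sorted({i for t in transactions for i in t})
    support = {}
    # Count the support of every candidate itemset
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            supp = sum(1 for t in transactions if set(cand) <= set(t))
            if supp >= min_supp:
                support[cand] = supp
    # An itemset is closed when no proper superset has the same support
    return {s: c for s, c in support.items()
            if not any(set(s) < set(o) and c == co for o, co in support.items())}

demo_trans = [['a', 'b'], ['a', 'b', 'c'], ['a']]
print(brute_force_closed(demo_trans))
# {('a',): 3, ('a', 'b'): 2, ('a', 'b', 'c'): 1}
```

Note that ('b',) is absent: its support (2) equals that of its superset ('a', 'b'), so it is not closed.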

Let the games begin

We will transform the frequency dataset into transactional. The new dataset has 1503 transactions:

trans = to_transactional(df_siembra_freq, 'DISTRITO', 'CULTIVO')
print(len(trans)) # prints 1503

Imagine the government wants to create an agricultural program for the Cajamarca department. We need to analyze transactions from there and from everywhere else, so we split the data into two dataframes:

df_siembra_cajamarca = df_siembra[df_siembra['DEPARTAMENTO'] == 'CAJAMARCA']
trans_cajamarca = to_transactional(df_siembra_cajamarca, 'DISTRITO','CULTIVO')
print(len(trans_cajamarca)) # 126
df_siembra_not_cajamarca = df_siembra[df_siembra['DEPARTAMENTO'] != 'CAJAMARCA']
trans_not_cajamarca = to_transactional(df_siembra_not_cajamarca, 'DISTRITO','CULTIVO')
print(len(trans_not_cajamarca)) # 1291

Now let's create a dataframe with the frequent itemsets, their frequency, and their frequency as a percentage of the total number of transactions. supp=-1 sets the minimum support used to select itemsets (in pyfim, negative values are interpreted as absolute counts), and 'c' means only closed frequent item sets are returned:

df_c_itemsets = all_itemsets(trans_cajamarca, -1, 'c')

Transaction-frequent items produced by the fpgrowth algorithm

To compare the planting intention of the Cajamarca department with the rest of Peru, we will calculate the intention ratio. To do this, the frequent itemsets outside of Cajamarca must also be computed.

df_nc_itemsets = all_itemsets(trans_not_cajamarca,-1,'c')

Creating emerging dataset

An emerging pattern is a pattern whose support differs significantly between two datasets (for example, two time periods or, as here, two regions); scientists often use emerging patterns to identify distinctive trends and opportunities in a particular field.

emerging = df_c_itemsets.join(df_nc_itemsets.set_index('Itemset'),
                              on='Itemset', lsuffix='_c', rsuffix='_nc', how='outer').fillna(0)
emerging['GrowthRate_c'] = emerging['Freq(%)_c'] / emerging['Freq(%)_nc']
emerging.sort_values(by='GrowthRate_c', ascending=False, inplace=True)
emerging.reset_index(inplace=True, drop=True)
emerging['Freq'] = emerging['Freq_c'] + emerging['Freq_nc']
list_items = []
for i in range(emerging.shape[0]):
    if emerging.iloc[i, 0] != 0:
        list_items += [emerging.iloc[i, 0]]
    elif emerging.iloc[i, 4] != 0:
        list_items += [emerging.iloc[i, 4]]
emerging['Items_c'] = list_items
emerging['Size'] = [len(x) for x in emerging['Items_c'].tolist()]

In the above code, we create an "emerging" dataset:

  1. Join both dataframes on the "Itemset" column. The resulting dataframe has two frequency columns for each itemset, for the Cajamarca and non-Cajamarca departments, suffixed with '_c' and '_nc', respectively.
  2. Create 'GrowthRate_c', the ratio of the frequency of each itemset in Cajamarca to its frequency outside of Cajamarca.
  3. Sort the dataframe in descending order by 'GrowthRate_c'.
  4. Reset the index (setting consecutive integers).
  5. Create a 'Freq' column, the sum of the Cajamarca and non-Cajamarca frequencies.
  6. Build a list by iterating through the rows and checking which of the 'Items_c' or 'Items_nc' columns is not 0; whichever is non-zero is appended to the list, which is then assigned back to the 'Items_c' column.
  7. Create a 'Size' column containing the number of items in each itemset.
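The join and growth-rate steps above can be sketched on made-up supports (the itemset names and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical relative supports (%) of itemsets inside ('_c') and outside ('_nc') a region
df_c = pd.DataFrame({'Itemset': ["['papa']", "['maiz']"], 'Freq(%)_c': [40.0, 10.0]})
df_nc = pd.DataFrame({'Itemset': ["['papa']", "['arroz']"], 'Freq(%)_nc': [5.0, 20.0]})

# Outer join keeps itemsets seen in either group; missing supports become 0
merged = df_c.set_index('Itemset').join(df_nc.set_index('Itemset'), how='outer').fillna(0)
# Ratio > 1 means the itemset is over-represented in the region of interest
merged['GrowthRate_c'] = merged['Freq(%)_c'] / merged['Freq(%)_nc']
print(merged)
```

Note that an itemset never seen outside the region gets an infinite growth rate (division by zero yields inf in pandas), which is why such itemsets rise to the top of the sorted output.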

The emerging dataframe displays:

Emerging dataset

We can now identify which itemsets are the most distinct and relevant for farmers from Cajamarca department compared to the rest of the country.

Now let's see paretoset, which creates a boolean mask selecting the rows on the Pareto front, i.e. rows not dominated in any of the given columns. This mask lets us select rows from emerging, and then we create a new column called is_skyitem to mark the skyitems.

“In database systems, skyline items (skyitems) are data points or subsets of data that are not dominated by any other data point in a multi-dimensional space.”
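paretoset performs this dominance check for us; as an illustration only (not the library's actual implementation), a brute-force mask for maximizing every column can be written as:

```python
import numpy as np

def pareto_mask(points):
    """True for rows not dominated by any other row (maximizing every column)."""
    pts = np.asarray(points, dtype=float)
    mask = np.ones(len(pts), dtype=bool)
    for i in range(len(pts)):
        for j in range(len(pts)):
            # Row j dominates row i if it is >= everywhere and > somewhere
            if i != j and np.all(pts[j] >= pts[i]) and np.any(pts[j] > pts[i]):
                mask[i] = False
                break
    return mask

print(pareto_mask([[3, 1, 2], [1, 1, 1], [2, 3, 1]]))
```

Here the middle point is dominated by the first, so only the first and last rows survive as "sky" points.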

mask = paretoset(emerging[['Freq','Size','GrowthRate_c']], sense=['max','max','max'])
sky_itemsets = emerging[mask].copy()  # copy to avoid a SettingWithCopyWarning below
emerging.loc[:, 'is_skyitem'] = 'no'
sky_itemsets.loc[:, 'is_skyitem'] = 'yes'
print(emerging.shape)
print(sky_itemsets.shape)
df = pd.concat([emerging,sky_itemsets]).reset_index()
fig = px.scatter_3d(df, x='Freq', y='Size', z='GrowthRate_c',color='is_skyitem')
# camera = dict( eye=dict(x=2, y=2, z=0.1) ) # use it in case
# fig.update_layout(scene_camera=camera,) # wants to fix the camera in 3D
fig.show()
Detecting the skyitems by Freq, Size and GrowthRate (1)
Detecting the skyitems by Freq, Size and GrowthRate (2)

A parallel coordinates plot can be used to identify patterns in the data without the need for a 3D plot. To create it, the function parallel_coordinates can be called.

fig = px.parallel_coordinates(df[['Freq','Size','GrowthRate_c']], color='Size',
                              labels={'Freq': 'Freq', 'Size': 'Size', 'GrowthRate_c': 'GrowthRate_c'},
                              color_continuous_scale=px.colors.diverging.Tealrose)
fig.show()
Parallel coordinates plot by Freq, Size and GrowthRate (3)

Inspecting the skyitems, we can make the following assertions:

Itemsets that are the most relevant & skyitems
  1. The first itemset (Yuca, Maiz choclo, Arveja grano verde, Papa blanca, Frijol grano seco, Arveja grano seco, Maiz amilaceo) has a growth rate of 16.25, meaning farmers in Cajamarca show a far greater intention to sow these crops than farmers elsewhere.
  2. The second itemset (Yuca, Maiz choclo, Arveja grano verde, Papa blanca, Frijol grano seco, Arveja grano seco, Maiz amilaceo) has a growth rate of 15.69.

The regional government can offer incentives to new farmers in nearby areas to sow these types of crops and take advantage of the department's fertile lands, thereby increasing their profit at the end of the season.

Additionally, it can create seasonal food campaigns in homes or open-air markets to keep the market moving and minimize losses due to spoilage.

Conclusion

Data mining helps us discover frequent patterns in transactions; FP-growth does so by building a compact FP-tree and extracting patterns from it, and it is an efficient algorithm commonly used in this kind of process.

Discovering patterns and trends allows organizations to extract valuable insights from their data sources, enabling them to make informed decisions, optimize processes, and enhance outcomes.

You can find the Colaboratory notebook here.

Thank you for being here. Please share your views in the comments; if any mistakes are found, the article will be updated.


Written by Freddy Domínguez

Peruvian #Software #Engineer CIP 206863, #Business #Intelligence #Data #Science. I work with people to create ideas, deliver outcomes, and drive change
