Introduction

A bank introduces a new financial product: a type of current (checking) account with certain fees and interest rates that differ from those in the other types of current accounts offered by the same bank. Some time after the product is released to the market, a number of customers have opened accounts of the new type, but many others have not yet done so.
The bank's marketing department wants to push sales of the new account by sending direct mail to customers who have not yet opted for it. However, in order not to waste effort on customers who are unlikely to buy, they would like to address only the 20 percent of customers with the highest affinity for the new product. This chapter explains how to address this business task using data mining. We follow a simple methodology, the Cross-Industry Standard Process for Data Mining (CRISP-DM), whose six stages will be applied to our task in the following subsections in their natural order, although you would typically switch back and forth between the stages when developing your own application.
We will walk step by step through the fictitious sample data, which is based on real data structures in a standard data warehouse design, and through the RapidMiner processes provided with this chapter, to explain our solution. So how can we determine whether a customer has a high affinity for our new product? We can only reason indirectly. We assume those customers who have already bought the product (the buyers) to be representative of those who have a high affinity toward it.
Therefore, we search for customers who have not yet bought it (the non-buyers) but who are similar to the buyers in other respects. Our hope is that the more similar they are, the higher their affinity. Our main challenge, therefore, is to identify customer properties that can help us find such similarities and that are available in the bank's data; we return to this in Section 7. Assuming that we have good data, we can use a standard data mining method, namely binary classification, to try to differentiate between buyers and non-buyers.
Trying to keep buyers and non-buyers apart in order to find their similarities may sound paradoxical; however, "difference" and "similarity" are two ends of the same scale. It is therefore crucial that our data mining algorithm be able to provide that scale. Fortunately, most algorithms can, by delivering a ranking of customers in which higher-ranked customers are predicted to be buyers with higher confidence (or probability) than lower-ranked ones.
Thus, in what follows, we develop a number of mining models that each deliver a ranking of the non-buyers, in which the top-ranked customers are those for which the model is most confident that they ought, in fact, to be buyers (if only they knew it!). We will also see how to decide which model is most useful. We can then serve the marketing department by delivering the top 20 percent of non-buyers from our final ranking.
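The ranking idea above can be sketched outside of RapidMiner in a few lines of plain Python. This is an illustrative analogue only: the customer IDs and confidence values below are invented, and `top_fraction` is a hypothetical helper, not part of the chapter's processes.

```python
# Illustrative sketch (not RapidMiner): given model confidences for the
# non-buyers, rank them and keep the top 20 percent for the mailing.

def top_fraction(scores, fraction=0.2):
    """Return customer IDs ranked by descending confidence, cut to `fraction`."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    return [customer_id for customer_id, _ in ranked[:cutoff]]

# Hypothetical confidences produced by some classifier for five non-buyers:
confidences = {"C01": 0.91, "C02": 0.15, "C03": 0.55, "C04": 0.78, "C05": 0.33}
print(top_fraction(confidences))  # top 20% of five customers -> ['C01']
```

Whatever classifier is used, only this final ordering (and the chosen cutoff) matters for the mailing list.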
Finally, in the last part of this chapter, we discuss how to apply the same methodology to other business tasks.

Business Understanding

The purpose of the CRISP-DM Business Understanding phase is to thoroughly understand the task at hand from the perspective of the business users: what terminology do they use, what do they want to achieve, and how can we assess after the project whether its goals have been met?
The bank offers several types of current accounts, CH01 through CH04; CH04 is the new type that is to be pushed by our marketing action. Basically, each type of account comes with certain fixed monthly fees and interest rates for credit and debit amounts, but some customers can have deviating rates, or can be freed from the monthly fee, due to VIP status or other particularities. When an account ends, its balance is zero and no further money transactions can occur. An open account whose ending date has not yet passed is called active, and a customer with at least one active account is also called active.
The money transactions fall into many categories, such as "cash withdrawal", "salary", and "insurance premium", including an "unknown" category. While data from other business branches is a valuable source of information, we do not include such data in our simplified example application.
From this, we quickly develop the idea of using information about the customers' behavior, derived from the money transaction data, to characterize our customers; we delve into this in Section 7. For now, we decide to exclude inactive customers from the analysis, because their behavior data might be out of date.
Thus, all active customers who have ever had a CH04 account are buyers, and all other active customers are non-buyers. To evaluate whether the mailing was successful, we can compare, some time after the mailing, the sales rates of the CH04 account among recipients of the mail and among the other current non-buyers. While this evaluation is an important part of our project, we will not discuss it further in this chapter, as it involves no data mining techniques.
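The labelling rule just stated can be written down compactly. The sketch below is plain Python with invented account records; the field names (`customer_id`, `type`, `active`) are assumptions for illustration, not the chapter's warehouse schema.

```python
# Sketch of the labelling rule: active customers who ever had a CH04
# account are buyers; all other active customers are non-buyers.

accounts = [
    {"customer_id": 1, "type": "CH01", "active": True},
    {"customer_id": 1, "type": "CH04", "active": False},  # a closed CH04 account
    {"customer_id": 2, "type": "CH02", "active": True},
    {"customer_id": 3, "type": "CH04", "active": True},
]

def label_customers(accounts):
    active = {a["customer_id"] for a in accounts if a["active"]}
    ever_ch04 = {a["customer_id"] for a in accounts if a["type"] == "CH04"}
    return {c: ("buyer" if c in ever_ch04 else "non-buyer") for c in active}

print(label_customers(accounts))  # {1: 'buyer', 2: 'non-buyer', 3: 'buyer'}
```

Note that customer 1 counts as a buyer although the CH04 account is closed: "has ever had" looks at the whole account history, while "active" looks only at the current state.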
Data Understanding

Continuing our interviews with the bank's staff, we now turn to the IT department to learn about the available data. This phase of the CRISP-DM process is central to planning the project details, as we should never mine data whose structure and contents we have not fully understood.
Otherwise, we run the risk of misinterpreting the results. Not surprisingly, the bank has implemented a central data warehouse (DWH), that is, a data source that is separate from the operational information systems and provides an integrated view of their data, built specifically for analysis purposes. While data warehouses are typically not aimed at supporting data mining directly, data mining projects can benefit greatly from a well-designed data warehouse, because many issues concerning data quality and integration need to be solved for both.
From the perspective of a data miner, data warehouses can be seen as an intermediate step on the way from heterogeneous operational data to a single, integrated analysis table as required for data mining.
It is best practice to use a dimensional design for data warehouses, in which a number of central fact tables collect measurements (such as the number of articles sold, or a temperature) that are given context by dimension tables (such as the point of sale or the calendar date). Fact tables are large because many measurements need to be collected, so they store no context information other than the required references to the dimension tables.
This is the well-known star schema design. We consider a particular star schema, depicted in Figure 7. The central fact table holds money transactions, linked to three dimensions: calendar date, customer, and account.
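To make the star schema idea concrete, here is a toy version in plain Python: the fact table stores only measures plus foreign keys into the dimension tables, and resolving those keys is the star-schema join. All table contents and field names are invented for illustration.

```python
# Toy star schema: a fact table of money transactions referencing three
# dimension tables (calendar date, customer, account) by ID.

dim_customer = {10: {"occupation": "03"}, 11: {"occupation": "07"}}
dim_account = {100: {"type": "CH01"}, 101: {"type": "CH04"}}
dim_date = {1: {"iso": "2005-01-01"}, 2: {"iso": "2005-01-02"}}

fact_transactions = [
    {"customer_id": 10, "account_id": 101, "date_id": 1, "amount": 250.0},
    {"customer_id": 11, "account_id": 100, "date_id": 2, "amount": -40.0},
]

def denormalize(fact):
    """Resolve the dimension references of one fact row (a star-schema join)."""
    return {
        "amount": fact["amount"],
        "account_type": dim_account[fact["account_id"]]["type"],
        "date": dim_date[fact["date_id"]]["iso"],
    }

print([denormalize(f) for f in fact_transactions])
```

The fact rows stay narrow (one measure plus three keys), while all descriptive context lives once in the dimension tables; this is exactly the trade-off described above.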
The customer dimension table includes personal data: first and last name, date of birth (as a reference to the Dim CalendarDates table), occupation (one of ten internal codes, 00 through 09, whose interpretation is irrelevant for our application), sex (M or F), family status (single, married, or divorced, with a few missing values), and income.
The income attribute is set to zero for young customers, but has many missing values for adults. Each account belongs to one of the types CH01 through CH04 and has a start date and an end date (both are references to the Dim CalendarDates table). The monthly fee, the overdraft limit, and the interest rates for credit and debit amounts are stored individually with each account because, as we saw in the previous section, individual fees or rates may apply for some customers.
In our case, it holds one row for each day from January 1st through November 22nd of the covered range, plus one special row that represents "infinity". This last row is used as the end date of accounts whose end date is not yet fixed (active accounts). Because the Date ID attribute numbers the days consecutively, we can use it to compute the number of days between two given dates directly, without needing to join to this table, as long as we do not use the special date "infinity".
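The consecutive numbering trick can be sketched in a few lines. This is an illustrative helper, not part of the chapter's processes; the concrete ID values are made up, and the "infinity" ID is passed in as a parameter because its actual value depends on the warehouse.

```python
# Sketch: because Date IDs number the days consecutively, the difference
# of two IDs is the number of days between the dates, with no join needed.

def days_between(date_id_a, date_id_b, infinity_ids=()):
    # The special "infinity" row must never take part in this arithmetic.
    if date_id_a in infinity_ids or date_id_b in infinity_ids:
        raise ValueError("cannot compute day differences with 'infinity'")
    return abs(date_id_b - date_id_a)

print(days_between(120, 150))  # 30: the two dates lie 30 days apart
```

Guarding against the "infinity" row matters: an account that is still open would otherwise appear to have an absurdly long duration.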
The columns of this table can be used in reporting or other applications to qualify a given calendar date; we will not need most of them for our use case. A reference to the account, the owner (customer), and the transaction date is stored with each fact (transaction). The type of transaction holds categories like "salary" that are found automatically, as explained in Section 7.
For predictive data mining projects, such a mechanism can be valuable because it may be necessary to restore data values from past points in time, in order to learn from relationships between those values and subsequent developments. This is not needed for the application discussed here, however. In addition, a customer may not represent a natural person but a married couple or a corporate entity. These issues would need to be considered to find an appropriate definition of a mining example in a real application.
However, our sample data includes only natural persons. These simplifications allow us to focus on the data mining issues of our task, yet still include some standard data preparation subtasks and a RapidMiner solution to them in the following section. You should easily be able to extend the methods that we discuss in this chapter to your real data.
Data Preparation

Virtually all data mining algorithms use a single table with integrated and cleaned data as input. The table should provide as much information about the examples (here, the customers) as is feasible, so that the modelling algorithm can choose what is relevant. Therefore, it is a good idea to bring background knowledge about the business into the data. In this section, we examine two RapidMiner processes; the first assembles the data from our warehouse schema into a single example set (Section 7.).
Assembling the Data

Bringing data from the warehouse into a form suitable for data mining is a standard task that can be solved in many ways. Let us take a look at how to do it with RapidMiner, as exemplified by the first sample process from the material for this chapter, Ch7 01 Create-MiningTable.
After opening the process in RapidMiner, you can display the order of operator execution via the Process menu (Operator Execution Order); we will use these numbers for reference in what follows. Each of our four tables is used as one of the four inputs of the process. You could replace the four Retrieve operators (numbers 1, 3, 8, and 9) that read our sample data with operators like Read Database if you were using a real warehouse as your source.
Each input is fed into a substream of our process consisting of a number of operators, and we will discuss the substreams in turn below.
One area of consideration in this phase (see Figure 7.) is data types. RapidMiner associates each attribute with one of several types: real, integer, date, binominal, or nominal.
The attributes in our four source datasets have automatically determined types, which may or may not be appropriate; ensuring that our resulting example set is correctly typed makes the later data mining steps easier for RapidMiner to support.
Substream: customer data: The first operator we apply to our customer data (number 2 in the execution order; compare Figure 7.) adjusts the type of one of its attributes.
The operator Nominal to Binominal, which we use here, normally creates a new binominal attribute for each value of the input attribute, but in this case it recognizes that the input attribute is already binominal, so it just changes the type. We have set the parameter attribute filter type to single because we only want to change one attribute. Next, we want to include a customer's current age in our data mining basis. In our application we assume that the data is current as of December 31st, so we compute the current age as of this date.
Because the customer table refers to the calendar date table in the birth date attribute, rather than using a date attribute, we join the calendar date table to the customer data using the Join operator (number 5) with the join type "left". Our example sets have no attribute with the RapidMiner ID role yet, so we uncheck the option use id attribute as key in the Join operator; instead, we choose the join key attributes explicitly.
Note that, in order to avoid taking over all the attributes from Dim CalendarDates, we use the Select Attributes operator (number 4) before the join. After the join, we can compute a customer's current age (operator 6) by subtracting the year of birth (year is an attribute from the calendar data) from the reference year, so we set up a simple function expression of the form -year as the definition of the new attribute current age in the Generate Attributes operator. Finally, we remove three attributes from the customer data (operator 7), as they are useless for data mining because they take on nearly unique values for every customer: first and last name, and the birth date ID.
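The customer substream (join on the birth date reference, derive the age, drop the near-unique attributes) can be mirrored in plain Python. Everything below is invented for illustration: the records, the field names, and the reference year 2007, which stands in for whatever reference date the application fixes.

```python
# Sketch of the customer substream: resolve the birth date reference
# against the calendar table, compute the age, drop near-unique columns.

customers = [
    {"customer_id": 1, "first_name": "Ada", "last_name": "L.", "birth_date_id": 7001},
    {"customer_id": 2, "first_name": "Max", "last_name": "M.", "birth_date_id": 7002},
]
calendar = {7001: {"year": 1970}, 7002: {"year": 1985}}

REFERENCE_YEAR = 2007  # hypothetical; the chapter fixes its own reference date

def prepare_customer(row):
    year_of_birth = calendar[row["birth_date_id"]]["year"]  # the (left) join
    return {
        "customer_id": row["customer_id"],
        "current_age": REFERENCE_YEAR - year_of_birth,  # "reference year - year"
        # first/last name and the birth date ID are dropped: they take on
        # nearly unique values per customer and would not generalize.
    }

print([prepare_customer(c) for c in customers])
```

Dropping identifier-like attributes is the important step here: a model trained on near-unique values would memorize individual customers instead of learning patterns.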
Substream: account data: Now we turn to the account data (see Figures 7.). One important goal of this process is to label each customer as buyer or non-buyer; all owners of a CH04 account as of December 31st are buyers. A customer can have more than one account, even more than one CH04 account, so we group by customer ID with the Aggregate operator. The resulting example set has only one attribute, the customer ID; we do not need any aggregation attributes, since all customers in this substream are buyers.
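The effect of that aggregation is simply to collapse duplicate customer IDs. A minimal plain-Python analogue, with invented records, looks like this:

```python
# Sketch of the labelling aggregation: a customer may own several CH04
# accounts, so we reduce the account rows to one row per buyer, i.e. a
# group-by on the customer ID with no aggregation attributes.

ch04_accounts = [
    {"customer_id": 3},
    {"customer_id": 3},  # a second CH04 account of the same customer
    {"customer_id": 7},
]

buyers = sorted({a["customer_id"] for a in ch04_accounts})
print(buyers)  # [3, 7]: one entry per buyer, duplicates collapsed
```

Without this step, customers owning several CH04 accounts would appear several times in the buyer list and distort the later training data.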
Affinity-Based Marketing, by Timm Euler (p. 77).