Data Searching and Data Mining
This is an Insight article, written by a selected partner as part of GIR's co-published content.
As technology evolves, so does terminology. 'Data mining' is a term that originated in the 1990s. It was followed by 'big data', and both terms are now grouped within the field of 'data science'. The concept of data mining remains the same: to use hardware, programming systems and algorithms to solve problems. This chapter will use the phrase 'data mining' when describing these concepts.
More specifically, data mining is the process of finding anomalies, patterns, and correlations within large datasets stored in repositories. Data mining can consist of a simple document search, but it can also be more complex, employing pattern recognition technologies and the use of statistical and mathematical techniques.
There are five steps in the data mining process:
- data collection;
- understanding of data properties;
- preparation of data to clean, construct and format it appropriately;
- setting up a data model to identify the relationships between the various pieces of information in the dataset; and
- results evaluation.
This chapter will first discuss the use of data mining in the context of a fraud investigation. Next, it will draw an important distinction between structured and unstructured data. This chapter will then outline the techniques which should be used to build a data strategy to:
- identify the data that is required;
- prepare the data for analysis; and
- analyse the data.
Finally, this chapter will explain how to ensure that any digital evidence gathered is both legally and procedurally compliant.
The use of data in a fraud investigation
Not all investigations need to incorporate data mining in order to be resolved, but it is increasingly required given the digitalisation of processes, including the use of electronic devices to communicate. Obtaining and analysing data is a core part of the investigation and of the investigators' strategy, and any data mining will require the investigator's skills to form hypotheses and to define potential scenarios and a strategy. Without input from an investigator, it would not be possible to set up an algorithm to process the data effectively. However, to get the best results from a dataset, the investigator will need to understand the potential of the data in order to define their approach.
In this section we will first consider the benefits of data mining and processing in the context of an investigation. We will then discuss what strategies can be utilised by investigators. Finally, we will consider the types of data sources available.
Benefits of using data searching and data mining
Whilst often used for marketing or scientific purposes, data mining and data searching techniques can also be very useful in fraud investigations. It is important to ensure that the techniques are adapted to suit the circumstances of the case.
The first objective should be to gather useful information for the investigation. Through the processing of data such as transactions and emails, you should be able to understand key features of a company’s activities and employees’ roles, to build a picture of:
- when operations are being processed, by looking at the date, time and frequency of the processing;
- who is doing what, by looking at the users who are processing the transactions or sending the emails;
- how information is processed, by checking the flows of data in the system and their interactions;
- what volumes of information are processed, by analysing the volumes of transactions; and
- with whom people are in contact, by assessing the interaction of the operations and the emails.
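As a hedged sketch of how these questions can be answered from a transaction log, the following pandas example profiles who posts, when, and in what volumes. The column names, sample figures and office-hours window are all assumptions for illustration:

```python
# Hypothetical sketch: profiling a transaction log to answer the
# who/when/how-much questions above. Column names are assumptions.
import pandas as pd

transactions = pd.DataFrame({
    "user": ["alice", "alice", "bob", "alice", "bob"],
    "timestamp": pd.to_datetime([
        "2023-03-01 09:15", "2023-03-01 23:40", "2023-03-02 10:05",
        "2023-03-02 23:55", "2023-03-03 11:20",
    ]),
    "amount": [1200.0, 9800.0, 450.0, 9900.0, 300.0],
})

# When: distribution of postings by hour of day
by_hour = transactions.groupby(transactions["timestamp"].dt.hour).size()

# Who and how much: volume and value per user
by_user = transactions.groupby("user")["amount"].agg(["count", "sum"])

# Flag activity outside normal office hours (assumed 08:00-18:00)
after_hours = transactions[~transactions["timestamp"].dt.hour.between(8, 18)]
print(after_hours[["user", "timestamp", "amount"]])
```

The same grouping logic scales from a toy frame like this to millions of rows extracted from an ERP system.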
The second objective should be to collect outputs that can be used as both evidence and clues leading to further avenues of enquiry. The data mining process can produce results that will help investigators progress in their analysis by:
- checking information obtained from a testimony, e.g. checking whether the described control is really in place and there is no gap;
- checking the integrity and completeness of Key Performance Indicators (KPIs) and reporting by re-computing them with data you can trust;
- finding trends and clues showing that something irregular is happening, allowing the investigator to narrow down the area of focus;
- obtaining factual evidence such as an email, a video or a text document showing a misdemeanour or a faulty entry in the system; and
- finding a lead to physical evidence: for instance, the analysis may give a series of invoice references that will enable you to match with physical proofs such as signed documents.
The third objective should be to test the efficacy of your investigation and identify any further steps needed. For example:
- if you have the right scenario, you can test the whole period you are covering and make sure you did not miss anything;
- you can aggregate all your results to compute the overall impact of the fraud, check its global materiality and decide on the next steps of the investigation; and
- you can track what you have done with the scripts of your analysis, so that you can document where your results came from and provide this documentation as part of the evidence in case of a prosecution (see the section 'Implications of digital results as legal evidence' on page 23).
Data mining can be very effective, but its success will also depend on your strategy. You will need to include the data mining process in your usual investigation methods, adapting it as necessary to make the most of it. You will also need to adapt to different situations, as all fraud cases are different.
Types of data in fraud cases
Before planning a data search, the investigation team should consider which types of data are potentially relevant and where the data is actually stored.
A forensic investigation needs to search across different data types, which can include unstructured and structured data, and to consider the source from which the data will be acquired. This aspect is important not only to ensure that the acquisition of data is done properly, without compromising its integrity, but also because the source of the data can serve as a criterion for an initial triage between structured and unstructured data.
Diagram 1: Data types and sources
Prior to processing data, it can be useful to gather general information about the data acquired, such as:
- data retrieval information, such as the system or server from which the data was copied, including how the copy was performed. This information should be considered at the time of the collection of data (see Chapter 9 'Data Collection') and will be indicative of the best tools for processing;
- separate data that needs to be processed and exclude data that is not needed; and
- for unstructured data, ask the organisation, e.g. the CIO, to collect information about the type of data stored, as well as technical information about the media on which the data is stored.
This information will help the investigation team to assess whether the data acquired is structured or unstructured and to choose the most appropriate tools for processing it.
The options available for processing of data mainly depend on whether data is structured or unstructured.
Unstructured data lacks a pre-defined model and structure and can present information in varied ways. Typical examples of unstructured data include the following:
- Text documents, e.g. Word and PDF files;
- Video, picture, and audio files; and
- Presentation documents, e.g. PowerPoint.
Even files belonging to the category of unstructured data, such as e-mails or PDF files, may contain structured information. Daily bank e-mail statements, for example, contain structured information that could be analysed by means of algorithms.
Unstructured data has gained a lot of importance in the context of investigations. International Data Corporation (IDC) published a White Paper that estimates that over 80% of data globally will be unstructured by 2025.
The volume of data that investigators need to manage today is already challenging. In the case of unstructured data, the volume challenge is compounded by the variety of the data, which makes its handling much more complex: not only does unstructured data include different file formats, e.g. textual versus video, but also different content, e.g. mere business transactions versus business secrets.
Choosing a strategy based on the circumstances of your case
When you start an investigation, the level of information you have can vary considerably depending on your starting point: you may know who is suspected, what happened, and where and when it happened, or you may not have much information at all.
If you know what you are looking for, subject, scheme etc. and want to use data searching and data mining, you can start understanding processes and systems in place. Next you can define what the scenario is or what patterns were used by the subject and look at where you can identify the activity in the system. You can then collect traces of the scheme used by the subject. This will help you in the next phase to prepare your model and source your data.
If you do not know anything precisely, you will have to carry out a fraud risk assessment of what could have happened and identify whether there are any gaps in the system that could have been exploited. You will need to define which hypotheses to test. This can be complex: ideally you should review a whole process based on one assumption, but you may find a gap that you wish to use as the starting point for your tests, or which causes your hypothesis to evolve. Remember that if there is a robust control in place, no fraud should be possible and data analysis would not be needed. Data will be useful only if there is a gap in the controls or a failure in a set control.
For instance, you may have a very strict control over invoice approval, with segregation of duties in place. Nevertheless, you may identify that the threshold set in the control is very high and could be abused if users broke invoice values down to stay under the threshold. This is a gap you can test. You will also need to test whether people tried to dodge this control using a near-threshold scheme.
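A near-threshold test of this kind can be sketched in code. The example below is a minimal, hypothetical pandas illustration; the threshold, tolerance band, column names and figures are all assumptions:

```python
# Hypothetical sketch of a near-threshold test: flag suppliers with
# clusters of invoices just below an approval threshold.
import pandas as pd

THRESHOLD = 10_000        # assumed approval threshold
BAND = 0.10               # flag invoices within 10% below the threshold

invoices = pd.DataFrame({
    "supplier": ["S1", "S1", "S1", "S2", "S2"],
    "invoice_no": ["A1", "A2", "A3", "B1", "B2"],
    "amount": [9_500.0, 9_800.0, 9_900.0, 2_000.0, 15_000.0],
})

near = invoices[
    invoices["amount"].between(THRESHOLD * (1 - BAND), THRESHOLD, inclusive="left")
]

# Suppliers with repeated near-threshold invoices deserve a closer look
suspects = near.groupby("supplier").size()
suspects = suspects[suspects >= 2]
print(suspects)
```

The band and repeat count are parameters the investigator tunes to the case, trading false positives against missed splits.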
When you know what you should do (who, what, how, why and when), you can start to build your strategy to find leads and evidence. Artificial intelligence concepts such as machine learning can seem hard to understand when presented by data scientists, but they are testing strategies that investigators can use to achieve results. Artificial intelligence cannot exist without strategies defined and set up by humans. Once you understand that, you are ready to work with data science.
Testing data can seem complex, but it can be summarised into five categories of testing strategies that follow a different logic:
- search your database for key features that can be contained in documents;
- compare data with what is expected by your policies, procedures or contracts;
- compare data with what is consistent in your organisation, based on your knowledge and understanding of it;
- compare data with information from other sources that you can rely on, matching your data with reality; and
- compare data with what you believe is unusual in terms of results or behaviour.
These strategies can be used individually or combined where necessary, and will change dynamically as the case evolves. From this strategy phase you will be able to identify what kind of data you need to search and to define a search scope: period, areas to cover, types of data, sources and custodians. For the analysis you are planning, you will also need to consider how to link information and how your dataset can be enriched. Data analytics procedures provide meaningful results only if the data structure is properly mapped.
Once you have defined your tests, you can turn to your IT team to assess what data to extract and how to process it. For confidentiality reasons, we recommend that you manage this yourself, but if that would be too complex, you will need to involve a data engineer who can do the specific data processing. The different tools that can be used, based on your objectives and the types of data, are covered in 'Tools for structured data' below. We discuss how to build the dataset in the sections 'Working with unstructured data' and 'Working with structured data' below.
Working with unstructured data
To work with unstructured data, you need to prepare your data to make your analysis easier.
Data preparation for unstructured data
Unstructured data can have different degrees of complexity, from plain text contained in a notepad file to more complex containers. More importantly, quantitative analysis can be applied to unstructured data only to a limited extent; most of the information can be obtained only through qualitative analysis.
With digital data growing in an exponential manner, the investigator needs to manage situations where data sits in more complex containers, like cloud storage containers. To add more complexity, there might be a number of different file formats, not all of them readable with commonly used software e.g. bespoke software applications, and different encryption methods.
Image 2: Complexity of unstructured data (source: www.nuix.com)
Handling bulk datasets involves several activities: mapping documents from different sources and custodians, tagging documents based on their relevance, and distinguishing duplicates from near duplicates. Another issue is that the nature of the information in the data cannot be known in advance, although it still needs to be planned for. For instance, you will need to consider at the outset how you will ensure compliance with applicable GDPR provisions when processing e-mail files or personal folders, and how you will manage the protection of business secrets and legally privileged documents.
One solution is to use an e-discovery platform, such as Nuix, which offers Technology Assisted Review (TAR) designed to find and understand data in unstructured bulk datasets. E-discovery platforms have algorithms to cluster data, create family trees, and pinpoint confidential information, intellectual property, contracts and other sensitive data types.
When moving on to processing unstructured data, the following steps are relevant to consider:
- Define the scope of data sources
- Restore deleted files
- Unwrap containers
- Convert to text (OCR)
- Extract metadata
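The 'extract metadata' step can be sketched with standard Python: walk a directory and record basic file properties before loading documents into an e-discovery tool. The paths and output layout are assumptions, and real pipelines would add format-specific parsers:

```python
# A minimal sketch of the metadata extraction step: walk a directory
# and record basic file metadata before loading documents into an
# e-discovery tool.
import os
import hashlib
from datetime import datetime, timezone

def file_metadata(root):
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            with open(path, "rb") as fh:
                digest = hashlib.md5(fh.read()).hexdigest()
            records.append({
                "path": path,
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": stat.st_size,
                "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
                "md5": digest,   # also useful later for duplicate detection
            })
    return records
```

Recording an MD5 digest at this stage pays off later, when duplicates need to be identified across custodians.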
Analysis of unstructured data
The analysis of unstructured data must be tackled in a different way to structured data. The key differences between unstructured and structured data are set out in the following table:
| Structured data | Unstructured data |
| --- | --- |
| Data type and format is generally known in advance | Data type and format is generally unknown |
| Data is mainly analysed using quantitative methods, e.g. sums, averages, etc. | Data is mainly analysed using qualitative methods, i.e. documents responding to certain features |
| Except for input errors, each record can be considered unique | Lots of duplicates are to be expected |
| Data is usually grouped based on the system source, e.g. totals by G/L account | Documents are grouped in threads, e.g. a conversation thread |
| Key-word search is applied in limited cases, i.e. searching key terms in transaction descriptions | Key-word search is applied extensively |
E-discovery platforms can help the investigation team to obtain information from bulk datasets:
- general statistics about the datasets, e.g. number of processed files, raw file extensions, and irregular files;
- document kind, e.g. e-mail, text files, spreadsheets, presentations, system files, multimedia, and contact files;
- document codification, e.g. relevant, non-relevant, privileged, business secret;
- custodian of the document;
- duplicates, near duplicates; and
- conversation threads, e.g. e-mail threads.
E-discovery platforms are a preferred tool as they offer a sound approach to data processing and review. This is important when considering that the investigation may be subject to scrutiny, e.g. from regulatory or prosecuting agencies.
In 2003, the Federal Energy Regulatory Commission (FERC) published more than 1.6 million e-mails exchanged by 158 executives involved in the Enron scandal. Almost two decades later, professionals experienced in high-profile corporate investigations are well aware that the volume of data that needs to be processed and analysed has increased exponentially. To complicate things further, regulators have raised the bar of expectations in relation to the standard of corporate investigations.
The question of how to deal with a huge amount of data cannot be tackled in quantitative terms alone. It is rather the quality of the review workflow that determines the defensibility of an investigation. Further, there are circumstances where an early case assessment may be required on a huge amount of data in order to quickly identify the facts of the matter under investigation.
The review process, aided by e-discovery platforms, can meaningfully reduce the quantity of documents to be manually reviewed without compromising on the quality of the review. This is achieved in the following ways:
Keyword search is the most traditional and widely used approach to analysing investigation data.
Keyword searches can cover synonyms, common spelling mistakes, potential mistyping, acronyms, abbreviated wording, and the use of foreign-language words or slang. This is what we could define as a 'fuzzy' search. If you are searching in structured dataset lists, you can make multiple selections rather than a series of individual selections, improving your search by inserting additional information that enriches the criteria you can use. The success of a keyword search will also rely on the preparation of your investigation, as you will try to find everything that could be linked to your case, such as tasks, tools, places, persons, document names and brands.
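The 'fuzzy' search idea can be illustrated with a short sketch using Python's standard difflib module, which matches words that nearly resemble a keyword. The keyword list, similarity cut-off and sample text are invented for illustration:

```python
# A minimal sketch of a 'fuzzy' keyword search: match document words
# against a keyword list while tolerating misspellings.
import difflib
import re

KEYWORDS = {"kickback", "commission", "offshore"}

def fuzzy_hits(text, keywords=KEYWORDS, cutoff=0.8):
    hits = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        for match in difflib.get_close_matches(word, keywords, cutoff=cutoff):
            hits.add(match)
    return hits

print(fuzzy_hits("Please route the comission through the offshor account"))
```

Here the misspellings 'comission' and 'offshor' still hit their keywords; lowering the cut-off widens the net at the cost of more false positives.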
There are several e-discovery products. Intella by Vound is an example of a product suitable for an in-house investigation; more complex platforms include Relativity and Nuix. These offer more powerful keyword search functions, as follows:
- uploading word lists that can be used as a filter against the dataset;
- identifying duplicates and near duplicates by automatically running MD5 hash checks and w-shingling methods, which can save the review team a lot of time;
- producing reports with the search hits and exporting results;
- keeping track of the review process;
- tagging documents; and
- prioritising documents for review based on their responsiveness to certain criteria, e.g. certain keywords or a time period.
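The w-shingling method mentioned above can be conveyed with a hedged sketch: break each document into overlapping word sequences (shingles) and score similarity with the Jaccard index. Real platforms use tuned variants; the shingle size and sample texts are illustrative assumptions:

```python
# A simplified sketch of near-duplicate detection via w-shingling
# and Jaccard similarity.
def shingles(text, w=3):
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

doc1 = "please approve the attached invoice for payment this week"
doc2 = "please approve the attached invoice for payment next week"
print(round(jaccard(doc1, doc2), 2))
```

Documents scoring above a chosen threshold can be batched together so a reviewer reads only one representative.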
The main advantage of the keyword search lies in its simplicity: you just need to think of a keyword and search for it. Determining the 'right' list of keywords can, however, be a challenging task, with the risk of searches being either over-inclusive or under-inclusive. While over-inclusiveness implies excessive costs, as it increases the number of false positives requiring manual review, under-inclusiveness exposes the investigation to the risk of not detecting relevant documents.
Along with keyword searches, data can be tagged on the basis of other criteria like item type. The following are a few common examples of criteria that can be applied:
- file type, e.g. audio, video, or text;
- document properties, also called document metadata, e.g. author, document length, timestamp, size, path names;
- for e-mail documents, the sender, the people in copy, the time sent and the subject line; and
- the version of a document, which allows an original to be distinguished from an amended version.
E-discovery platforms are able to import document metadata and create their own.
Once the investigation team has set the strategy, e-discovery platforms can automatically tag the bulk of documents during the processing phase, serving as a starting point for the investigation team, who will then commence a manual review.
Instead of searching for specific keywords, concept searches can be applied. E-discovery platforms like Brainspace offer a concept search function, which allows inferences to be drawn from multiple concepts. This has the advantage that, even if a specific key term is not applied, words with similar meanings can be isolated in response to a single, conceptual search.
Social network analysis
Certain data, e.g. e-mail files, provide information about social interactions between people and organisations. Accordingly, for e-mails, data analysis will be focused more on establishing time range and frequency of relevant interactions. Analysis of all relevant interactions may be particularly important when seeking to establish the role of middlemen between two main counterparties for example.
Certain e-discovery platforms, e.g. Brainspace, offer a predictive model that can be used to rank documents and expedite the review.
Predictive ranking works on the basis of documents that are manually tagged as relevant and can be used to ‘train’ the e-discovery machine. The machine will digest the documents and create a predictive ranking that will be used to organise the documents in review batches. As new, relevant documents are identified, the machine learning will continue in an iterative way, providing new predictive ranks and new review batches.
The predictive model is particularly useful in situations where:
- the number of items is too high to be manually reviewed;
- keyword search cannot be applied effectively, for example because it would give an excessively high number of false positives; and
- early data analysis is required to form an early assessment of the case.
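The iterative idea behind predictive ranking can be conveyed with a deliberately simplified sketch: score unreviewed documents by their term overlap with documents already tagged relevant, and review the highest-scoring batch first. Production platforms use far richer models; everything below is illustrative:

```python
# A simplified sketch of predictive ranking: rank unreviewed documents
# by cosine similarity of word counts against tagged-relevant seeds.
from collections import Counter
import math

def vectorise(text):
    return Counter(text.lower().split())

def cosine(a, b):
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(unreviewed, tagged_relevant):
    seed = vectorise(" ".join(tagged_relevant))
    scores = {doc: cosine(vectorise(doc), seed) for doc in unreviewed}
    # highest-scoring documents go into the next review batch
    return sorted(scores, key=scores.get, reverse=True)

relevant = ["payment routed to offshore account", "cash payment no invoice"]
batch = rank(
    ["quarterly offshore payment summary", "canteen menu for next week"],
    relevant,
)
print(batch[0])
```

As reviewers tag more documents, the seed set grows and the ranking is recomputed, mirroring the iterative training loop described above.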
Working with structured data
Data science is a multi-dimensional field which uses a variety of scientific methods, including but not limited to, the collection, preparation, aggregation and manipulation of data for the purpose of data analysis. Data mining tools are software applications that assist the discovery of patterns, trends and grouping among large sets of data, which can transform data into more refined information such as exception reports or charts. The use of data mining tools and subsequent categorisation of data, helps you to perform different types of data mining analysis.
Tools for structured data
By definition, data is structured when it is categorised and formatted in a pre-defined way, so that analytical and processing mechanisms can be applied to its components. Microsoft Excel is a useful data visualisation and analysis software which uses spreadsheets to organise, store and track data sets through formulas and functions.
The software will help you:
- Extract the data when it is directly linked to the sources
- Prepare the data
- Classify the information
- Process and display data
- Conduct mathematic analysis with built-in formulas
Why use specific tools and what are the key features of these tools
Software like Microsoft Excel cannot read more than 1,048,576 rows per sheet, so for larger quantities of data alternative software will need to be deployed. Furthermore, during the data mining process, you will need to join sources of data together. Excel only allows you to join two sources of data on a single criterion, using the 'VLOOKUP' formula. Other, more specialised tools, such as those listed and discussed later in this section, will enable you to use more criteria to connect the data.
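The kind of multi-criteria join that a single VLOOKUP cannot perform directly can be sketched in pandas; the column names and figures below are invented for illustration:

```python
# Illustration of joining two data sources on multiple criteria and
# flagging mismatched amounts.
import pandas as pd

payments = pd.DataFrame({
    "supplier_id": ["S1", "S2", "S1"],
    "invoice_no": ["A1", "B1", "A2"],
    "paid": [100.0, 250.0, 90.0],
})
invoices = pd.DataFrame({
    "supplier_id": ["S1", "S1", "S2"],
    "invoice_no": ["A1", "A2", "B1"],
    "billed": [100.0, 95.0, 250.0],
})

# Join on two keys at once, then flag mismatched amounts
merged = payments.merge(invoices, on=["supplier_id", "invoice_no"], how="left")
mismatch = merged[merged["paid"] != merged["billed"]]
print(mismatch[["supplier_id", "invoice_no", "paid", "billed"]])
```

The same pattern extends to any number of join keys, which is precisely where spreadsheet lookups become unwieldy.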
Structured data is based on a pre-defined data model and is recorded and arranged methodically, meaning virtually 100% of the data can be processed and analysed.
Computer-Assisted Audit Techniques (CAATs) can be applied to identify the ‘automated audit techniques, such as Generalized Audit Software (GAS), test data generators, computerized audit programs and specialized audit utilities’ that can be used in performing various testing procedures on a mass of structured data. One of the greatest benefits of CAATs is that the same testing procedure is repeatable, with obvious advantages in terms of efficiency and effectiveness.
CAATs have been used for decades now with several tools available on the market. These tools range from software with a typical windows interface, such as IDEA and ACL, to programming tools such as MKS Toolkit and more sophisticated extractors and data analysis tools designed for specific ERP systems, i.e. SAP, Microsoft Dynamics NAV.
There are many different types of data mining software available, some more user-friendly and others requiring an understanding of coding and a specific coding language. In order to use more advanced data mining software, you will need specific coding instructions. These instructions are often free of charge because the software is open source. Currently, many programs use the R or Python coding languages which, whilst widely used, require specific training. There are also libraries of ready-made code, such as pandas for Python, which can be applied to specific tasks. Software can be used to define your dataset, and you can export the results into a visualisation tool to facilitate visual analysis.
There is alternative software available, e.g. IDEA and ACL, which does not require the use of a specific coding language, as you can define what you want to do in simple language in a script. Other software uses a visual interface where you can drag and drop your instructions. Training will be needed, but it is not as technical as the coding training programmes. These tools have ready-to-use functions that help you perform mathematical analysis, such as a 'Benford's law' test, very easily, or conduct logical analysis such as searching for duplicates. Such software is also very useful for defining your own formulas and algorithms, which can be saved as scripts and, where possible, reused on a different file.
Some software will also enable visualisation. Visual charts can be used to analyse data and find issues, but can also be used in reports. You can also find ready-made charts with country maps if you want to make a presentation by geographical area.
Other software can generate standard lists of data that will help you add information in your dataset by providing for instance names of towns if you have postal codes, business information coming from databases or demographics by area. This can help you build a stronger dataset.
Preparing structured data
The methodology to work with structured data comprises three phases: extraction, preparation and analysis.
In order to proceed successfully through each of these phases, the following questions need to be addressed.
Relevant data to collect
To guarantee the integrity of data, you need to work on 'raw' data that has come from the relevant company's IT system. For this reason, we recommend extracting the information from the databases yourself, rather than using data from business intelligence tools, and obtaining details of all transactions as a starting point rather than being provided with aggregated data (for instance, daily sales by item versus weekly sales by customer). If you extract the information from the database yourself, you know where it is coming from and which parameters you used to obtain it, and you can document this by keeping screenshots of those parameters. By contrast, if you use a business intelligence database such as Oracle, you will not know from where and how the data has been extracted, nor whether it has been altered. This is an end-to-end process, so you need to be able to guarantee the origin of your work for it to stand as evidence.
While data can come from internal sources, it can also be sourced from third parties. It is crucial that the integrity of third-party data is checked to ensure accuracy and reliability.
Sources of data
As seen in the section 'Benefits of using data searching and data mining' on page 2, there are different types of structured data and different sources of information. Nevertheless, most of the time, the data will come from ERP tables. ERP tables will contain transactional data, such as sales; permanent data, such as suppliers' details; and a history of changes, including changes in a document's features. The investigation objectives define the information requirements, which in turn need to be translated into data requirements. Only when the data requirements are defined can data queries be created to extract the data. Support from a systems analyst will be needed to bridge the gap between how the system hosts and organises its data and the data requirements.
There are numerous options for extracting data from ERP systems or other data warehouses. All ERP systems offer users an interface from which data can be exported into an Excel file or another structured format, e.g. csv. This requires only a resource, say a financial analyst, with read-only access rights and a basic knowledge of where the ERP data is logically stored, e.g. in ERP tables. Investigators must be aware that this procedure is exposed to the risk of human error; for example, the person tasked with extracting data may run incorrect queries.
The alternative is to extract data directly from the databases, which may be required in regulatory investigations because it gives more assurance in terms of the independence and traceability of the investigation.
Provided that the necessary access rights have been obtained, extractions can be executed through either:
- running queries via client software installed on the PC, e.g. SAP Hana Studio, SQL Server Management Studio; or
- configurable extraction scripts.
Connecting with the data warehouse to directly perform the extractions necessarily requires the involvement of the in-house corporate infrastructure department, as well as those responsible for the data security, e.g. CIO.
From an organisational perspective, this requires having a logon name with permissions to access the database tables and views. Once an agreement is reached, it is important that the investigation team maintains proper supervision of the extraction process to ensure the completeness of the extraction. This can be ensured by obtaining the log files related to the extraction or by other means, such as screenshots of the extraction execution and confirmation of success.
Before you consider extracting the data, you will need to consider the time it will take to process the data and the volume of information it will represent, to check it is feasible in the context of a particular investigation.
Completeness and accuracy
It is necessary to ensure that the data source is accurate and complete, and to then validate data extracted from the source system. While the validation of data extracted can be checked by comparing totals, for example reconciling totals of the ledger accounts, the accuracy of input data source can present more of a challenge, as consideration needs to be given to organisational aspects such as whether the organisation has implemented adequate IT general and input controls.
The quality of the data also needs to be checked to identify whether any information is missing in some columns. Data may be missing due to a technical issue, such as a missing date in an accounting transaction, or a process issue, such as a missing address in supplier master data. Missing data could also be due to formatting issues, to be checked with your IT systems analyst. If data is not correct, you will need to exclude it from your dataset.
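The completeness and quality checks described in this section can be sketched in pandas; the control total, column names and sample entries are assumptions for illustration:

```python
# A minimal sketch of completeness (reconciliation) and quality
# (missing value) checks on an extracted ledger.
import pandas as pd

ledger = pd.DataFrame({
    "entry_date": pd.to_datetime(["2023-01-05", None, "2023-01-20"]),
    "account": ["4000", "4000", "5000"],
    "amount": [100.0, 250.0, -350.0],
})

# Completeness: extracted entries should reconcile to a control total
CONTROL_TOTAL = 0.0          # assumed figure from the source system
assert ledger["amount"].sum() == CONTROL_TOTAL

# Quality: locate records with missing values, e.g. a missing date
missing = ledger[ledger.isna().any(axis=1)]
print(missing)
```

A failed reconciliation points to an incomplete extraction; records flagged as missing data are reviewed and, if uncorrectable, excluded from the dataset as described above.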
Preparation and data analysis
Preparation is the phase that includes loading and transforming the data acquired to meet your objective. This involves two key aspects: (1) addressing any formatting issues, e.g. different data and accounting formats; and (2) re-arranging the data to prepare it for analysis.
In relation to formatting, you need to check the format of the fields you want to compare to make sure the system can compare the two sets of data. You will then need to sort and organise data appropriately ready for analysis.
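As a hedged sketch of this formatting step, the snippet below harmonises date and number formats from two hypothetical source systems before comparing them; the formats and column names are assumptions for illustration:

```python
# Hypothetical sketch: harmonising date and number formats from two
# systems before comparison. Formats and column names are assumptions.
import pandas as pd

# System A exports ISO dates and dot decimals; system B uses
# day/month/year dates and comma decimals (common in continental Europe).
system_a = pd.DataFrame({"invoice": ["A1"], "date": ["2023-03-01"], "amount": ["1250.50"]})
system_b = pd.DataFrame({"invoice": ["A1"], "date": ["01/03/2023"], "amount": ["1.250,50"]})

system_a["date"] = pd.to_datetime(system_a["date"], format="%Y-%m-%d")
system_b["date"] = pd.to_datetime(system_b["date"], format="%d/%m/%Y")

system_a["amount"] = system_a["amount"].astype(float)
system_b["amount"] = (system_b["amount"]
                      .str.replace(".", "", regex=False)   # strip thousands separator
                      .str.replace(",", ".", regex=False)  # decimal comma to dot
                      .astype(float))

# After normalisation the two records can be compared directly.
match = system_a.merge(system_b, on="invoice", suffixes=("_a", "_b"))
match["same_amount"] = match["amount_a"] == match["amount_b"]
```

Without the normalisation step, the same invoice would appear as a mismatch purely because of formatting.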
Tools to transform data
There are plenty of tools available to load and transform data. Some can be used by anyone with a sound grasp of common IT tools, e.g. Microsoft Excel or Power BI; other licensed tools are commonly used by certain professionals, e.g. IDEA for audit or Tableau; and others are more popular with IT specialists, e.g. SQL transformation scripts.
Microsoft Office’s Power Query can be a good starting point. Power Query is now integrated into all the data analysis and business intelligence tools from Microsoft, such as Excel, Analysis Services, and Power BI. It allows users to discover, combine, and refine their data from various sources.
The next step is the data transformation phase where the quality of data can be improved by adding information that is useful to your analysis.
Data transformation is virtually unlimited and can be applied on a single data set or when multiple sources are merged into a single data set, a process known as data blending.
Data analytics procedures provide meaningful results only if data structure is properly mapped. For instance, in the example above, account mapping is only helpful where purchases and invoices are properly posted in their respective sub-ledgers.
An example would be the ethics hotline of a construction company, which received several allegations of conflicts of interest involving its employees and the suppliers of one of its subsidiaries. The investigation team obtained the ERP's vendor and employee master data and, by linking the two datasets, discovered vendors sharing a residential address or bank account details with some of the company's employees.
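The linking step in that example could be sketched as an inner join on bank account details; all names, accounts and column labels below are invented:

```python
# Illustrative sketch of the conflict-of-interest check described above:
# joining vendor and employee master data on bank account details.
# All names, accounts and column labels are invented.
import pandas as pd

vendors = pd.DataFrame({
    "vendor_id":    ["V10", "V11", "V12"],
    "vendor_name":  ["Alpha Build", "Beta Supplies", "Gamma Ltd"],
    "bank_account": ["GB11-0001", "GB22-0002", "GB33-0003"],
})
employees = pd.DataFrame({
    "employee_id":   ["E01", "E02"],
    "employee_name": ["J. Smith", "K. Jones"],
    "bank_account":  ["GB22-0002", "GB44-0004"],
})

# An inner join keeps only vendors sharing a bank account with an employee.
conflicts = vendors.merge(employees, on="bank_account", how="inner")
```

The same join could be repeated on residential address or any other shared attribute.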
As mentioned, it is important that the format of the data is carefully checked to ensure comparisons can be created and a technical match can be carried out using the software.
You can enrich your dataset in various ways:
- assigning attributes to the initial data by adding existing information about a data item, e.g. for an invoice record, you can add the date on which it was recorded or the user who recorded the transaction;
- assigning new attributes to the initial data by deriving information from the data, e.g. for a date, you can add which day of the week it falls on;
- assigning attributes to the initial data by adding existing internal stratifying information about the data, e.g. defining in which business unit an employee works based on the organisation chart;
- stratifying the population based on quantitative or qualitative attributes, e.g. you can create a 'top customers' list, or identify those who receive a discount of over 10 per cent;
- assigning attributes to the initial data by adding existing external stratifying information, e.g. adding a rating for a customer using an external database; and
- creating links between two or more datasets through data blending, e.g. linking purchase invoices with purchase order information.
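Two of the enrichment steps above can be sketched as follows: deriving a weekday attribute from a date, and stratifying customers by revenue. The figures and the 400 GBP threshold are invented for illustration:

```python
# A minimal sketch of two enrichment steps: adding a derived weekday
# attribute and stratifying customers by revenue. Values are invented.
import pandas as pd

sales = pd.DataFrame({
    "customer": ["C1", "C2", "C3", "C1"],
    "date":     pd.to_datetime(["2023-05-06", "2023-05-08",
                                "2023-05-09", "2023-05-13"]),
    "amount":   [500.0, 80.0, 120.0, 700.0],
})

# New attribute: day of the week (weekend postings may be unusual).
sales["weekday"] = sales["date"].dt.day_name()

# Stratification: flag 'top customers' above a revenue threshold.
revenue = sales.groupby("customer")["amount"].sum()
top_customers = revenue[revenue > 400].index.tolist()
```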
Your analysis will be more efficient if your final dataset comprises the smallest possible units of data. For example, if you analyse sales by looking only at invoices, you will not be able to process the details of the sales line items, such as quantity, unit price and discounts, features that could be reused in your analysis. Invoice-level data gives you only the total amount of the invoice, so you cannot identify unusual items in the unit price, the weight by unit, the way the price was computed or the list-price discount. You therefore cannot rely on invoices alone; you need to go further to find relevant variances. Another example is payroll data, which will need to be broken down into its individual lines to enable, for example, wages for individual employees to be presented and compared.
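The point about granularity can be illustrated with an invented example in which an inflated unit price is invisible in invoice totals but obvious at line level:

```python
# Hedged illustration of why line-item data beats invoice totals:
# an inflated unit price is invisible at invoice level but obvious
# per line. All values are invented.
import pandas as pd

lines = pd.DataFrame({
    "invoice":    ["INV1", "INV1", "INV2", "INV2"],
    "item":       ["bolt", "panel", "bolt", "panel"],
    "quantity":   [100, 2, 100, 2],
    "unit_price": [0.50, 75.0, 5.00, 75.0],  # INV2 bolts at 10x the price
})
lines["line_total"] = lines["quantity"] * lines["unit_price"]

# Invoice totals alone do not reveal *why* the invoices differ...
totals = lines.groupby("invoice")["line_total"].sum()

# ...but comparing unit prices per item across invoices does.
price_spread = lines.groupby("item")["unit_price"].agg(["min", "max"])
suspicious_items = price_spread[
    price_spread["max"] > 2 * price_spread["min"]
].index.tolist()
```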
How to conduct a data mining search
Data mining can be deployed to meet the same key objectives as more basic data searches, e.g. finding evidence or identifying suspicious trends. However, a data mining search is more advanced than a basic search because it allows for the application of formulas and for comparisons to be made between data. This will usually be done with structured data, as it is easier to handle. But, as noted in ‘Choosing a strategy based on the circumstances of your case’ on page 5, unstructured data can be turned into structured data.
Data mining as a mini-IT project
Data mining is essentially a small IT project. It should involve testing a small dataset to check your model is working, then extending to the full scope once you are satisfied with the model. During the test phase, you may need to modify either your model or your dataset to refine the analysis. Based on your hypothesis and objectives, you can start working on your analysis. It is often helpful, before starting a data mining exercise, to have a clear idea of what needs to be done and how long it is likely to take. This will enable you to assess whether a data mining exercise is proportionate in the context of your investigation.
Objectives of the data mining
| Objective | Description |
| --- | --- |
| Key features search | Identify a piece of information containing key features you are using for your investigation. |
| Trends between two series of information | Unusual trends between two series of information, or a correlation, can lead you to a conclusion or useful hard evidence. |
| Outliers in an expected trend | Identify outliers you need to investigate in an expected trend analysis. |
| Outliers in an expected process | Identify outliers you need to investigate in an expected process: there are missing steps in it, some KPIs are not in line with expectations, or the pattern of tasks is unusual. |
Best practice using the peeling method: if you do not know what you are looking for, or if there are too many possibilities or exceptions to put into your model, you can adopt the peeling method, which involves peeling back layers of results to find what has gone wrong beneath them.
Main methods of data mining
We identify three main methods of data mining that can be used to conduct your tests according to the strategies you have defined. They can be used in isolation or in conjunction with one another if more analysis is needed.
Method 1: Visualisation of the results
You can analyse the data you select with a tool, e.g. Power BI, Tableau or Qlik, that provides visual results through a chart or a dashboard. You then interpret the results to identify unusual trends or outliers from an expected trend. There are various types of charts that give you different views of the results. You can compare two series of data using a classic 2D model, or three series using a 3D model. Based on the visualisation, you should then be able to identify either a case or an area of interest that will enable you to refine your searches. Testing through a visual tool can be fast and really helpful, but it can also be less precise than classification, as you will need to spot issues yourself on a screen. A good approach may be to start with data visualisation and move on to other methods to investigate further.
The following techniques can be used when presenting results visually:
- Mathematics using regression on a graph. You can analyse your data by comparing two or three series of information within your dataset and checking whether there is a correlation between them. From this analysis, you should be able to define a formula that shows the relationship between the series of data.
- Mathematics using a dashboard. You can set up a dashboard using a set of data and change criteria to see which ones have an impact.
A key benefit of these techniques is that you can quickly test different hypotheses by simply changing the parameters.
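The regression technique can be sketched in a few lines; the data points below are invented, roughly following a linear trend so the fit is easy to interpret:

```python
# Sketch of the regression technique described above: fitting a line to
# two series to test for correlation. Data points are invented.
import numpy as np

hours_open = np.array([4, 6, 8, 10, 12])
losses_gbp = np.array([41, 59, 82, 99, 121])  # roughly 10 GBP per hour

# Least-squares line: losses ≈ slope * hours + intercept.
slope, intercept = np.polyfit(hours_open, losses_gbp, 1)
correlation = np.corrcoef(hours_open, losses_gbp)[0, 1]

# A correlation close to 1 suggests losses scale with opening hours,
# a hypothesis the investigator would then test against other criteria.
```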
An example: your organisation is suffering a high rate of theft of products in its warehouses and you want to know the cause, so you compare the losses by product with other criteria to see whether there is a correlation with other events, e.g. the day of the week or the staff in place.
Beware of bias
Interpreting charts and drawing conclusions from them can be challenging, as biases can interfere and you could misinterpret the visual results as a consequence. Some of the main biases that can occur include:
- Survivorship bias: drawing conclusions from an incomplete set of data because only certain data has survived applicable selection criteria (see best practice below).
- False causality: falsely assuming, when two related events occur, that one must have caused the other.
- Sampling bias: drawing conclusions from a dataset that is not representative of the population you are trying to analyse.
Best practice for survivorship bias: do not consider only the data you obtained; also consider missing data, and identify gaps in series or variances.
Method 2: Data search
Data mining can consist of a simple keyword search run within the prepared dataset, or it can involve more complex search criteria. The search can be based on information that has been defined in the dataset, and can also be expanded as the review progresses to mix criteria.
The most famous method is Benford’s law, which is based on distribution laws, but there are other, more basic number patterns, such as:
- Round numbers, which can be unusual in some cases, e.g. a 10,000 GBP transaction, and which can be used as a search criterion, e.g. you would expect products with variable weight, such as meat, fish or cheese, to have unit weights that do not end in .00, so a high percentage of such round numbers can be suspicious. This is a good example of a behaviour test strategy.
- Threshold numbers: amounts just below a threshold, whose frequency can be suspicious and which can be used as a search criterion, e.g. if you have an approval threshold of 1,000 GBP, you will check all the amounts between 999 and 999.99.
- Duplicate numbers: if your dataset is clean and well prepared, any duplicates you find may be suspicious, and you can use a built-in formula to identify duplicates based on different criteria, e.g. one credit note may have been issued several times for the same invoice, which you can detect by counting how many credit notes are linked to each invoice.
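Two of these number-pattern tests can be sketched together: a Benford's-law first-digit count and a just-below-threshold count. The amounts are invented; real tests would run on the full prepared dataset:

```python
# A hedged sketch of a Benford's-law first-digit count alongside a
# just-below-threshold test. Amounts are invented for illustration.
import math
from collections import Counter

amounts = [999.50, 999.00, 412.30, 87.10, 1432.00, 999.90, 23.75, 999.99]

def leading_digit(x):
    """Return the first significant digit of a positive amount."""
    return int(str(int(abs(x)))[0])

# Observed first-digit frequencies, to compare against Benford's law,
# under which digit d should appear with frequency log10(1 + 1/d).
observed = Counter(leading_digit(a) for a in amounts)
expected_share = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Threshold test: count amounts just under a 1,000 GBP approval limit.
just_below_limit = [a for a in amounts if 999 <= a < 1000]
```

In this toy sample the digit 9 is heavily over-represented and half the amounts sit just under the approval limit, exactly the kind of pattern these tests are designed to surface.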
Data searching will give way to a further review of the selected elements.
An example would be an investigation into an allegation that several million dollars were funnelled through bogus invoices, a daunting task when there are thousands of potentially relevant transactions. Deploying CAATs across the population of invoices allows the transactions to be analysed and sorted based on pre-defined criteria, e.g. round amounts, missing or brief descriptions, and other suspicious features. Manual review of the most suspicious transactions can then be prioritised.
Method 3: Classification or production of exception reports
Based on your analysis and the series of tests you design, you will generate a list of potentially unusual items you will need to investigate.
You then define tests within your dataset that will classify the data as either expected or anomalous. The most common algorithms are based on decision trees with several steps, using different questions that exclude data so that you keep only the unusual items. These tests include comparisons with other data.
A basic example, for investigating procurement fraud, is a three-way match based on the purchase order, the goods receipt document and the invoice.
The match procedure may give three different results: (1) no differences, indicating that all information was reconciled; (2) differences needing to be investigated; and (3) unavailable documentation, meaning that the match failed for reasons to be investigated, e.g. because the invoice was posted outside the sub-ledger. You can set up an algorithm to classify paid invoices, identifying whether they were paid in line with the goods receipt documents by comparing the amounts in both sources.
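A minimal sketch of that three-way match and its three result categories, on invented document numbers and amounts:

```python
# Illustrative three-way match (purchase order vs goods receipt vs
# invoice), as described above. Document numbers and amounts are invented.
import pandas as pd

orders   = pd.DataFrame({"po": ["PO1", "PO2", "PO3"], "po_amount":  [100.0, 250.0, 300.0]})
receipts = pd.DataFrame({"po": ["PO1", "PO2"],        "gr_amount":  [100.0, 250.0]})
invoices = pd.DataFrame({"po": ["PO1", "PO2", "PO3"], "inv_amount": [100.0, 275.0, 300.0]})

# Outer joins keep unmatched documents visible instead of discarding them.
match = orders.merge(receipts, on="po", how="outer").merge(invoices, on="po", how="outer")

def classify(row):
    if row.isna().any():
        return "missing documentation"  # e.g. no goods receipt posted
    if row["po_amount"] == row["gr_amount"] == row["inv_amount"]:
        return "reconciled"
    return "difference to investigate"

match["result"] = match.apply(classify, axis=1)
```

Here PO1 reconciles, PO2 shows an invoice overpayment to investigate, and PO3 has no goods receipt, mirroring the three result categories above.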
Reduce the volume of false positives
Exception reports will produce results that need to be analysed, as some of them, usually called ‘false positives’, will need to be identified and excluded from the results. Exception reports can be extremely cumbersome because of the volume of false positives they contain, which is why it is important to anticipate them by reviewing the process and checking why they have arisen, e.g. amending the due date of an invoice can seem suspicious unless it was done for a legal reason because the system cannot manage certain constraints. To produce relevant and efficient exception reports, you need to include a step that automatically identifies the false positives. You can do this by reviewing the nature of the false positives you obtained and setting up a model to sort them out. In the example about changes in due dates, you may find a feature in the payment terms showing that the date needs to be amended, and you can exclude these types of changes from your analysis to focus on the ones that are really unusual.
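The false-positive exclusion step on the due-date example can be sketched as a simple filter; the `payment_terms` flag and all records are invented assumptions:

```python
# Minimal sketch: filtering known false positives out of an exception
# report on due-date changes. The 'legal_constraint' payment-terms flag
# and all records are invented assumptions.
import pandas as pd

exceptions = pd.DataFrame({
    "invoice":          ["I1", "I2", "I3"],
    "due_date_changed": [True, True, True],
    "payment_terms":    ["standard", "legal_constraint", "standard"],
})

# Changes explained by a known system constraint are excluded so the
# review can focus on genuinely unusual items.
false_positives = exceptions[exceptions["payment_terms"] == "legal_constraint"]
to_review = exceptions[exceptions["payment_terms"] != "legal_constraint"]
```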
Best practice would be to break down your data as much as possible and to rely on raw data, giving you more flexibility in the analysis and allowing you to identify the smallest details.
Implications of digital results as legal evidence
When your searches and testing are complete, you can start using your results as evidence, but you need to be aware of the legal considerations.
When an investigation has legal implications, it is essential to obtain professional legal advice, e.g. to advise on and protect material that may be legally privileged, advise on preservation of evidence, the issuance of litigation hold notices, handling of evidence in a forensic manner and to advise on legal obligations and strategy in the context of a criminal or regulatory investigation.
It is important to preserve original material and sources of data in their original form in order to avoid compromising potential evidence. EnCase™ is an accepted forensic system. It is a standard in many court systems for the purposes of finding, decrypting, collecting and preserving forensic data from a wide variety of devices, including laptops, mobile devices and cloud servers. EnCase also deals with files that, although logically deleted, may still be physically stored in a sector of the machine’s disk. Once a copy is obtained, it is common practice, to prevent the loss of evidence, to make additional copies for storage.
Data production is the product of the investigation that is made available to third parties, which may include regulators, law enforcement bodies or private parties involved in a criminal or civil trial.
Data production can therefore include a variety of documents, e.g. documents retrieved and documents with markups, as well as the analysis of the data mining and data analytics conducted by the investigation team. Importantly, before documents can be disclosed, one must consider whether it is necessary to redact parts that are confidential or that contain personal data and should not be disclosed.
In presenting the data to court, it is essential to ensure:
- Data production is introduced in accordance with the rules of the relevant court system. This includes, among other things, document numbering, the method of de-duplication and privilege designations.
- The integrity of the evidence used as part of the court case. This can be satisfied by preserving native files and by collecting the evidence in a forensic manner, e.g. including preservation of the chain of custody.
- A proven connection between the hypothesis of a case and the outcome of the data analysis, e.g. in a corruption case involving a middleman acting on behalf of an organisation to corrupt a public official, applying social network analysis techniques to email correspondence or other communications could establish a link between the organisation and the corrupted official. With respect to accounting data, data mining may show inflated invoices for the services rendered by the middleman involved.
When introducing evidence obtained through data mining or technology-assisted review techniques, it is important to verify beforehand whether such evidence will be accepted pursuant to the rules of the court. This consideration is particularly important because the court may reject evidence produced if its expectations are not met.
While each case needs an individual assessment, in recent years there have been signs that rules of civil procedures are more open to accepting evidence produced based on data mining or technology assisted review techniques.
Data science is useful for investigations because it can enable the identification of digital evidence, or lead to physical evidence, through the processing of digital information. Despite the evolution of tools, technology does not work alone: investigators still make the hypotheses used to set up the models for the analysis. Testing strategies can be summed up in five categories: key features search, compliance, consistency, comparison and behaviour. These tests are conducted on the two types of data that can be used: (1) structured data, i.e. listed data such as ERP transactions; and (2) unstructured data, such as emails and files. The processing of these two data types differs but follows the same logic: extraction, preparation and analysis. With respect to extraction, you will need to make sure your data is complete and clean. The preparation phase will help you gather and link relevant information for use in the third phase, analysis, to identify results that are unbiased and free of false positives. Data mining with structured data can be very complex, with robust algorithms. Consideration will need to be given to how the results will be used in the case. Results and investigations concerning personal data will need to be processed cautiously, in line with the GDPR in the EU and the UK and in accordance with applicable local data protection legislation. To be valid before the court, digital evidence will need to be documented and detailed, and will need to be checked by lawyers from the start.
Legally reviewed by Polly Sprenger (Addleshaw Goddard LLP).
‘Mining of Massive Datasets’, Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman (Stanford).
 Page 2, Meeting the New Unstructured Storage Requirements for Digitally Transforming Enterprises, Eric Burgener, June 2021 (https://www.delltechnologies.com/asset/en-us/products/storage/industry-market/h18811-wp-idc-new-unstructured-storage-requirements-for-enterprises.pdf).
 The Immortal Life of the Enron E-mails, Jessica Leber, July 2, 2013 (https://www.technologyreview.com/2013/07/02/177506/the-immortal-life-of-the-enron-e-mails/).
 Nuix eDiscovery User Guide.
For a guideline on the implementation of CAATs, see ‘G3 Use of Computer-Assisted Audit Techniques (CAATs)’.
 Learn Power Query (L. Foulkes, W. Sparrow, 2020).
 “Dusting your data for fraud’s fingerprints: six number patterns that fraudsters use”, Fraud magazine, November/ December 2020.
In Da Silva Moore v. Publicis Groupe, the court accepted the use of a predictive coding protocol for document review.
In McConnell Dowell Constructors (Aust) Pty Ltd v Santam Ltd & Ors (No. 1), Vickery J considered the appropriate process for managing discovery in a large dispute concerning the design and construction of a natural gas pipeline in Queensland (http://www6.austlii.edu.au/cgi-bin/viewdoc/au/cases/vic/VSC/2016/734.html?stem=0&synonyms=0&query=2016%5d%20VSC%20734).