Cross-border overview: electronic data in forensic investigations

This is an Insight article, written by a selected partner as part of GIR's co-published content.

Nowadays, close to 100 per cent of documents are generated and kept in electronic format, while only a fraction (some say less than 1 per cent) makes it to paper. Given the duplication of content, the fact that we have all become electronic pack rats and the way corporate disaster recovery planning is handled, data that would (should?) normally have been destroyed will remain available. Under these circumstances, no serious investigation can overlook electronic data.

This chapter will explore the different phases of the electronic data discovery process, following the Electronic Discovery Reference Model (EDRM){{+footnote}} in the context of a cross-border investigation.

Identification: what are you looking for?

The first step is perhaps the most important from a completeness, thoroughness and defensibility standpoint. During the identification phase, the investigation team must define the scope of their investigation by identifying:

  • Who? The persons (natural and legal) involved in the investigated facts.

  • Which? The types and sources of potentially relevant information.

  • When? The relevant period(s) of time over which these facts occurred or the information pertaining to them has been created or received.

This identification phase, if completed properly, will enable the team to understand the what, how and why, which are often the objectives of the investigation.

Historically, the investigation team tended to be composed of lawyers (in-house and outside counsel), internal audit and forensic accountants. Now, in order to properly scope their work, they should generally include some or all of the following professionals:

  • forensic technicians;

  • data analysts;

  • information technology and security personnel;

  • projects and records managers; and

  • litigation support specialists.

Preservation: tomorrow, you’ll need what you have today

Preservation is the phase during which any potentially relevant data is ‘set aside’ for possible future collection, processing and review. In the paper era, originals would be set aside and copies would be generated for work and investigation purposes. In this day and age of electronically stored information (ESI), new processes must be developed to ensure proper preservation of evidence. ESI can be altered fairly easily, often involuntarily, or destroyed inadvertently, rendering its retrieval more burdensome and expensive, or entirely impossible.

Multiple angles must be covered to ensure preservation: human, process and technology. So far, the best approach has been defined as a ‘legal hold’, which is normally initiated by the issuance of a notice to all custodians of potentially relevant ESI to suspend the alteration or destruction of said ESI, or any process that may have the same effect (computer reimaging or reallocation, data conversion or migration, backup tapes recycling, etc).

Legal hold notices normally provide an overview of the most important facts (when they’re not too confidential) as well as general instructions to preserve paper and electronic documents. These instructions are often supported by a listing of file formats, locations and sources, and relevant concepts (in the form of keywords). Most notices will be accompanied by a mechanism whereby the custodian will confirm receipt, understanding and compliance. These notices will be updated on an ongoing basis to take into consideration the iterative nature of investigations and include newly discovered elements that are now in scope. Even when no updates are required, it is advisable to send reminders to the custodians to make sure they have not forgotten and will not forget about their obligation until the legal hold is officially released, in writing. These legal hold notices are also sent to third parties.

Either with or after the circulation of the legal hold notice, it is customary for the custodians to receive a questionnaire and/or take part in an interview to provide details about their involvement in the case but, more importantly from a discovery standpoint, about their information management practices. The following questions are some examples of what is asked in those questionnaires and interviews:

  • Where do you save your documents (paper and electronic)?

  • Do you use a smartphone? If so, is it synced with the organisation’s e-mail server? Do you use text messaging?

  • Do you use cloud or hosted services (eg, Google Docs, iCloud, Gmail and

  • Do you send documents to your personal account or home devices?

In mature discovery-ready organisations, a legal hold policy with detailed processes, clear roles and responsibilities and model notices, is normally in place and should be used to ensure defensibility and standardisation of resulting efforts. Note that many local policies should be updated to include the extra steps that are made necessary by the international nature of the investigation. Organisations that are experiencing cross-border investigation for the first time or that do not have mature processes should retain an internationally experienced service provider to help them.

Important elements that differentiate preservations in a local investigation or litigation from those in an international context are the applicable laws, regulations, case law and best practices relating to preservation. In some jurisdictions, such an obligation may not exist at all! Accordingly, from the outset of the investigation, it is essential to define the applicable ground rules in each jurisdiction and obtain proper counsel or court guidance to deal with issues that may arise throughout the investigation, including privacy laws; blocking statutes; conflict of laws; and political, cultural or religious sensitivities.

It must be noted that, in cross-border investigations, the preservation process often becomes quite complex and requires sophisticated project management to ensure proper scoping, compliance tracking and remediation, and overall defensibility.

Collection: gathering evidence

After reading the above preservation section, you must be wondering how best to preserve potentially relevant evidence. The secret has been kept for the present collection section because, as illustrated in the EDRM diagram (see endnote 1), collection is often the best means to preserve ESI. In the paper world, no one would have relied exclusively on the original by annotating it or circulating it to third parties (external counsel, regulators, etc). Traditionally, work copies were made and originals were set aside. This is what should happen with ESI as well.

In fact, as further explained below, native file collection is even more important in the electronic world given the important contextual information that is embedded in the ESI itself, as well as its logical or physical support.

DIY collection

In the early days of electronic discovery (e-discovery), many lawyers (mainly in-house including one of the present authors) pleaded for a do-it-yourself approach given the significant cost of forensic collection. Back then, it was common to pay as much as US$2,000 per computer. Today, with the development and democratisation of new technologies, ESI can often be collected forensically for less than US$500 per device in less than one hour. Under such circumstances, DIY collection provides no ROI when one looks at the time (and indirect costs, loss of revenues, burdens and risks) spent by resources to search, find, review and extract potentially relevant data from their devices.

However, the most important justification for leaving DIY collection behind is its negative, and sometimes fatal, effect on the integrity of the evidence (eg, metadata alteration) and the obliteration of many efficiency gains provided by native unaltered ESI (eg, filter and search).

Forensic collection

Mirror imaging

The Sedona Conference defines ‘forensic copy’{{+footnote}} as an exact copy of an entire physical storage media (hard drive, CD-ROM, DVD-ROM, tape, etc), including all active and residual data and unallocated or slack space on the media. Forensic copies are often called ‘images’ or ‘imaged copies’. Under most circumstances, full imaging is not required and may not be justified from a pure collection standpoint. However, the iterative nature of discovery often militates for imaging from a preservation standpoint to prevent the necessity to go back to the original device (if it still exists and works) to re-collect.

Targeted forensic collection

In most cases, a targeted forensic collection of user-generated content (eg, MS Office, PDF, etc) will be sufficient to preserve all the potentially relevant ESI and prevent one from having to return to the source at a later date. Such a collection methodology ensures the retention of all native files and associated metadata. As will be explained below, it also facilitates processing and review, while reducing burden, costs, risks and delays.

Processing: reducing the volume and standardising documents

The processing phase takes all documents, paper and electronic, and standardises them for review in a central repository. It is also used to reduce the volume of documents, by excluding obviously irrelevant files (eg, system or software, ESI outside of the relevant period, file formats), and to identify exceptional documents that will require special handling (eg, corrupted, password-protected and encrypted files, handwritten documents).

Processing unstructured data

Unstructured data is data that is not organised into fields in a database according to a predefined and enforced model. The most common kinds of unstructured data are MS Office and PDF files, images and multimedia files. While not per se structured in a database, they can be organised according to different models, thanks to their metadata. Without entering into the details, let’s just say that metadata are the file properties: author, date of reception, creation or modification, subject, etc. The easiest way to understand metadata is to imagine your e-mail inbox: each column is a metadata field, which enables you to filter and reorganise data according to your preferred model.

Unstructured data processing normally comprises several steps: the expansion of compound files (ie, extraction of the files contained in a .ZIP, .RAR, .PST, etc), the deduplication of identical and/or similar files (the latter known as near-duplicates – eg, the PDF version of a .PPT or a previous version of a .DOC), and the filtering of files by metadata fields, keywords and concept searching.
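The expansion step can be illustrated with a minimal Python sketch. It handles only ZIP containers from the standard library (RAR and PST need specialised tools), and the function names, the depth limit and the sample file names are assumptions made for illustration, not features of any particular processing platform.

```python
import io
import zipfile


def expand_compound(name, data, depth=0, max_depth=5):
    """Yield (path, bytes) for every leaf file, expanding nested .zip containers.

    Assumption: only files named *.zip are expanded; a real tool would rely on
    file-type identification rather than on the extension.
    """
    is_zip = name.lower().endswith(".zip") and zipfile.is_zipfile(io.BytesIO(data))
    if is_zip and depth < max_depth:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                child_path = f"{name}/{info.filename}"
                yield from expand_compound(child_path, archive.read(info), depth + 1, max_depth)
    else:
        yield name, data


def make_zip(members):
    """Build an in-memory ZIP from a {filename: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for filename, content in members.items():
            archive.writestr(filename, content)
    return buffer.getvalue()


# Hypothetical collection: a ZIP that itself contains another ZIP.
inner_zip = make_zip({"memo.txt": b"draft memo"})
outer_zip = make_zip({"report.txt": b"final report", "attachments.zip": inner_zip})

expanded = dict(expand_compound("custodian_a.zip", outer_zip))
for path in sorted(expanded):
    print(path)
```

Keeping the container path in each leaf name preserves the link between an extracted file and the compound file it came from, which matters when the file's origin has to be explained later.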


Paper is a different breed of unstructured data as it cannot be electronically processed in the above-mentioned way before being imaged (scanned) and processed with OCR (optical character recognition) software. Even when scanned, though, paper still has no metadata with which to filter or sort. In some cases, it may be worth manually coding certain or all imaged documents, ie, capturing relevant information from the face of the document (date and signatory of a letter, recipients, etc).

Another challenge with paper documents is the handwritten ones, which normally require manual handling, or the reliance on the still relatively low-quality results of handwriting recognition software. However, this situation is bound to progress with the advent of tablets that offer handwriting capabilities.

Processing structured data

It is now almost impossible to envisage an investigation of reasonable size without taking into consideration what the backbone of most transactional systems is: structured data. This section will highlight what differs between structured and unstructured data, how it affects the investigative approach, how proper planning is key to making the most of transactional data, and key aspects of the methodology to be adopted in the context of cross-border investigations.

While unstructured data requires a specific approach, covered in this chapter, the realm of numbers, text fields, columns and tables, referred to as ‘structured data’, is to be handled in a different fashion and involves different types of expertise, such as computer and financial analysts, actuaries and forensic accountants.

The most typical set of structured data comes from accounting systems – general ledgers and sub-ledgers – which are fairly consistent from one system to the other. For instance, a complete purchase cycle is likely to include at least a supplier master file, purchase order files, invoice files and payment files. The underlying data supporting a purchase in most accounting systems will follow this structure and the usual set of rules and controls that are characteristics of purchase-to-pay leading practices.
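As a rough illustration of the purchase cycle structure described above, the following Python sketch models the four files as simple records and flags entries that break the chain from supplier to purchase order to invoice to payment. All class names, field names and sample values are hypothetical; real accounting systems carry many more fields and controls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseOrder:
    po_id: str
    supplier_id: str


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    po_id: str  # empty string when no purchase order was referenced
    supplier_id: str


@dataclass(frozen=True)
class Payment:
    payment_id: str
    invoice_id: str


def purchase_cycle_exceptions(supplier_ids, orders, invoices, payments):
    """Return (record id, reason) pairs for records that break the purchase chain."""
    order_ids = {order.po_id for order in orders}
    invoice_ids = {invoice.invoice_id for invoice in invoices}
    exceptions = []
    for invoice in invoices:
        if invoice.supplier_id not in supplier_ids:
            exceptions.append((invoice.invoice_id, "supplier not in master file"))
        if invoice.po_id not in order_ids:
            exceptions.append((invoice.invoice_id, "no matching purchase order"))
    for payment in payments:
        if payment.invoice_id not in invoice_ids:
            exceptions.append((payment.payment_id, "no matching invoice"))
    return exceptions


# Hypothetical sample data.
supplier_master = {"S1", "S2"}
orders = [PurchaseOrder("PO1", "S1")]
invoices = [Invoice("INV1", "PO1", "S1"), Invoice("INV2", "", "S9")]
payments = [Payment("PAY1", "INV1"), Payment("PAY2", "INV404")]

exceptions = purchase_cycle_exceptions(supplier_master, orders, invoices, payments)
for record_id, reason in exceptions:
    print(record_id, reason)
```

An exception of this kind is only a lead for the investigator to follow up, since legitimate processes (for example, recurring charges without purchase orders) can produce the same pattern.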

As an investigation may aim to uncover irregularities in the books and records, having a good understanding of the accounting rules applicable to transactional ledgers is key to pushing the structured analysis further and making the most of the structured data available.

However, many investigations now target non-financial information in databases such as CRMS{{+footnote}} and CMS,{{+footnote}} which require similar but sometimes different data analysts and subject-matter experts (SMEs).

Identifying, preserving and collecting structured data

The investigation team must develop a good understanding of the systems in place, the interfaces between such systems and the characteristics of local business units. For instance, if an allegation implicates two business units in different countries, one of which relies on an accounting system that is not integrated with the consolidated entity (a frequent occurrence after the purchase of a business), the data analyst will have to take into consideration the two distinct systems from which data must be collected, the interactions between these systems and any conversion or consolidation issue that may arise, in order to properly analyse the data. Languages and interactions between systems in a cross-border context might also be an issue that needs to be addressed early to be managed in each step of the process (eg, an Indian system not compatible at its most granular level with its American counterpart).

Clarifying communication channels

Investigative teams often work with legal, internal audit, finance and board committees. When dealing with systems and structured data, establishing communication channels with IT professionals is useful in getting quick access to data and prompt answers to technical questions. As systems are usually configured from the back-end by IT teams, having their input in the early phases of the work allows the team to quickly understand probable technical challenges such as interconnectivity issues, data quality issues, available fields or simply getting estimates of the time required to extract the data.

Methodologically processing structured data


In the planning phase, objectives of the analysis are discussed with the investigation team. This is where the interaction between the data analyst and the lead investigator is discussed and expectations are set. Depending on the situation, interactions between the data analyst and the investigative team can be numerous – in some cases, data analysts will be embedded in investigation teams and constantly interacting with the rest of the team. The business value of data, availability of data and contingency plans are also discussed as part of planning. Background information is communicated to the data analyst so that any aspect of the allegations or characteristics of organisations or individuals at hand that might impact the work are known in advance. Preliminary financial information (related to the data to be obtained) will be analysed, charts of accounts (which might be different from one country to another or written in various languages) are obtained and understood – sometimes with the help of local resources, as some accounting practices or types of accounts might vary in different countries.


At this stage, system information is gathered. A preliminary risk assessment is conducted to establish what the analysis will focus on. A determination on availability and quality of data is made. It is at this stage, and based on the analyst’s preliminary understanding of the data at hand and the tests to be performed, that the set of tools to be used will be identified. These could be:

  • data cleansing tools – used before the data is loaded in the work environment to format the data in a way which makes it usable for the analyst;

  • query tools – used when one knows what needs to be extracted from available data. For instance, listing the monthly top five vendors for a specific business unit authorised by a certain employee;

  • visualisation tools – used to explore the data or deliver information with multiple levels of complexity in a visual manner. For instance, creating a map showing coloured dots representing warehouse sales (size of dots) and profit levels (colour of dots);

  • data mining tools – used to uncover unclear or hidden relationships within the data. For instance, 96 per cent of the time a user authorises a transaction, the authorisation relates to three vendors out of a total of 500; and

  • modelling tools – used to project results in predetermined scenarios using existing relationships within available data to understand what a situation would be under certain circumstances. For instance, projecting the level of profit of a business unit if sales were to increase by 5 per cent.
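The query and data mining examples in the list above can be sketched in a few lines of Python. This is an illustration only: the transaction layout (month, business_unit, approver, vendor, amount) and the sample values are assumptions, and dedicated query or mining tools would normally work against the source database rather than an in-memory list.

```python
from collections import Counter, defaultdict


def top_vendors_by_month(transactions, business_unit, approver, n=5):
    """Query: the n vendors with the highest totals per month, for one unit and approver."""
    totals = defaultdict(Counter)
    for t in transactions:
        if t["business_unit"] == business_unit and t["approver"] == approver:
            totals[t["month"]][t["vendor"]] += t["amount"]
    return {month: counter.most_common(n) for month, counter in sorted(totals.items())}


def vendor_concentration(transactions, approver, top_n=3):
    """Mining: share of an approver's transactions that go to their top_n vendors."""
    counts = Counter(t["vendor"] for t in transactions if t["approver"] == approver)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(top_n)) / total


# Hypothetical sample data.
transactions = [
    {"month": "2014-01", "business_unit": "BU1", "approver": "alice", "vendor": "V1", "amount": 100},
    {"month": "2014-01", "business_unit": "BU1", "approver": "alice", "vendor": "V2", "amount": 300},
    {"month": "2014-01", "business_unit": "BU1", "approver": "alice", "vendor": "V1", "amount": 50},
    {"month": "2014-02", "business_unit": "BU1", "approver": "alice", "vendor": "V3", "amount": 75},
    {"month": "2014-01", "business_unit": "BU2", "approver": "bob", "vendor": "V1", "amount": 999},
]

monthly_top = top_vendors_by_month(transactions, "BU1", "alice", n=1)
share = vendor_concentration(transactions, "alice", top_n=1)
print(monthly_top)
print(share)
```

The first function answers a question the analyst already knows how to ask. The second produces a ratio that only becomes meaningful when compared across all approvers, which is where an unusual concentration would stand out.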

Design and implement

This phase consists of the formulation of data requests, the consideration of potential security issues (especially in obtaining the data, secured online channels should be used whenever possible) and the design of tests to be performed on the data to be obtained. The data is then obtained, loaded into the testing environment and validated for completeness (to the extent possible). Depending on sources of data, various levels of data cleansing might be necessary (for instance, reformatting numbers stored as characters, standardising supplier names, removing sub-totals from tables or converting printed reports or other semi-structured text files into structured databases). Cleansing is frequently performed in investigations where data is obtained from multiple sources. When it is stored in uncommon formats or only available in paper format, several steps are to be taken to scan and import the data. In the context of cross-border investigations, data cleansing issues will likely include date format conversion (eg, from dd/mm/yyyy in France to mm/dd/yyyy in the US), number formatting (eg, some countries use spaces as thousands separators instead of commas) or text characters (eg, what to do when dealing with a transaction description written in Mandarin?).
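The date and number conversions mentioned above might look like the following Python sketch. It assumes the analyst already knows, for each source, which date format and which separators were used; the function names and sample values are illustrative only.

```python
from datetime import datetime
from decimal import Decimal


def normalise_date(text, source_format):
    """Parse a date using the KNOWN source format and return yyyy-mm-dd."""
    return datetime.strptime(text.strip(), source_format).date().isoformat()


def normalise_amount(text, thousands_sep, decimal_sep):
    """Convert a number stored as characters into a Decimal."""
    cleaned = text.strip().replace("\u00a0", " ")  # treat non-breaking spaces as spaces
    if thousands_sep:
        cleaned = cleaned.replace(thousands_sep, "")
    if decimal_sep != ".":
        cleaned = cleaned.replace(decimal_sep, ".")
    return Decimal(cleaned)


# The same characters yield different dates depending on the source convention.
french_date = normalise_date("07/10/2014", "%d/%m/%Y")
us_date = normalise_date("07/10/2014", "%m/%d/%Y")

# Spaces as thousands separators and a decimal comma, versus commas and a decimal point.
french_amount = normalise_amount("1 234 567,89", thousands_sep=" ", decimal_sep=",")
us_amount = normalise_amount("1,234,567.89", thousands_sep=",", decimal_sep=".")

print(french_date, us_date, french_amount, us_amount)
```

Passing the source convention explicitly, rather than letting a tool guess it, is the design choice that matters here: a guessed format can silently swap day and month without raising any error.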

Execute and evaluate

The last two stages of the methodology comprise running the tests, the formulation of preliminary observations that will be validated with the investigation team, the identification of root causes and further mining as dictated by the circumstances of the investigation.

Document review

With today’s advanced hosted document review tools, it is possible to conduct complex international investigations, involving multiple teams of reviewers, within the same review tool hosted in one location. All that any one team member needs to be able to perform review is a computer, an internet connection and a compatible browser. Review workspaces can be configured to give each team, even each individual reviewer, a special set of permissions. Some reviewers can perform all or most functions; others will be allowed to see only certain documents and perform only those functions that the review managers assign to them.

Any international investigation should address, in its initial planning discussions, questions such as: What tools do we have? Where do we have the servers that can host the data? Who are our best administrators? Who will manage the data transfers, processing, uploads and exports?

Once decisions like these have been made, a key next step is to decide who will be the reviewers, at what levels, for what investigative purposes, and who needs to receive what kinds of training to ensure best use of the tools available. Team leaders should not assume that all of these things can be worked out, or will be taken care of, at the local level, across multiple offices. Uniform training sessions, with particular emphasis on review protocols, field names, and tool capabilities, will more than pay for themselves in efficiencies and mistakes avoided.

All of the leading review tools provide extensive search capabilities, including standard keyword, Boolean operators, proximity, wildcards, stemming and fuzzy searching. They all allow for nested or ‘deep’ searches (searches that reference other searches, whether they include or exclude the other searches’ results). Almost all will also provide for concept searching.{{+footnote}} Using concept indexes, review teams can cluster documents by theme, perform categorisation of large volumes of documents according to pre-defined topics (and then test the accuracy of the tool’s categorisation), identify ‘similar documents’ (ie, find conceptually similar documents, even when they do not contain certain words or phrases) and, in more complex and structured cases, use predictive coding.{{+footnote}} As new issues arise and new kinds of relevant documents are identified, concept indexes can identify similar documents, quickly yielding new review sets to meet a team’s changing requirements.
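To make three of the search types named above more concrete, here is a deliberately simplified Python sketch of wildcard, fuzzy and proximity matching over a tokenised text. It is not how any review platform implements its index; the tokeniser, the similarity threshold of 0.85 and the sample sentence are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher
from fnmatch import fnmatchcase


def tokenise(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def wildcard_hits(tokens, pattern):
    """Wildcard search: 'authori*' matches authorised, authorized, authority, ..."""
    return [token for token in tokens if fnmatchcase(token, pattern.lower())]


def fuzzy_hits(tokens, term, threshold=0.85):
    """Fuzzy search: tokens whose similarity to the term meets the threshold."""
    term = term.lower()
    return [t for t in tokens if SequenceMatcher(None, t, term).ratio() >= threshold]


def within(tokens, first, second, distance):
    """Proximity search: True if the two terms occur within `distance` words."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= distance for i in first_positions for j in second_positions)


text = "The organisation approved the payment to the vendor after the invoice was authorized."
tokens = tokenise(text)

print(wildcard_hits(tokens, "authori*"))
print(fuzzy_hits(tokens, "organization"))
print(within(tokens, "payment", "vendor", 3))
print(within(tokens, "payment", "invoice", 3))
```

In this sample, an exact keyword search for "organization" would return nothing, while the fuzzy search finds the British spelling. Differences of this kind are easy to overlook when search terms are drafted in one jurisdiction and run against documents from another.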

All of the leading tools support multiple languages, including CJK{{+footnote}} and other non-Western European languages. Not only can these characters or ideograms be displayed; they can be searched for, as long as the user has the appropriate keyboard and settings. Furthermore, advanced tools can auto-classify documents based on their language, easing keyword searches and enabling batching of documents to the right review team.

Where members of the review team are working in other time zones – and even where the activities being examined occur across multiple time zones – modern review tools allow not only standardisation to a single time zone but also a display in which each reviewer ‘sees’ the date-time stamp that was true where the action took place. Thus, a reviewer in New York can see an e-mail that was sent from Berlin in Berlin local time and another e-mail that was sent from Lima in Lima local time.
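One way to picture this is a record that stores the sent time in UTC together with a separate field holding the sender's UTC offset. The Python sketch below assumes exactly that layout; the function name, the offset-in-minutes field and the sample timestamp are hypothetical, and a real tool may store a named time zone instead of a fixed offset.

```python
from datetime import datetime, timedelta, timezone


def local_display(utc_value, offset_minutes):
    """Render a UTC timestamp in the local time recorded for the document."""
    local_zone = timezone(timedelta(minutes=offset_minutes))
    return utc_value.astimezone(local_zone).strftime("%Y-%m-%d %H:%M")


# One moment in time, stored in UTC (hypothetical value).
sent_utc = datetime(2014, 7, 10, 10, 46, tzinfo=timezone.utc)

# Offsets assumed to have been captured at processing: Berlin in July is UTC+2, Lima is UTC-5.
berlin_view = local_display(sent_utc, 120)
lima_view = local_display(sent_utc, -300)

print(berlin_view)
print(lima_view)
```

Because the underlying value stays in UTC, documents from different locations can still be sorted into a single chronology, while each one is displayed as its author would have seen it.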

In multinational reviews, efficiency, the standardisation of settings or protocols and security concerns argue in favour of one ‘admin’ location, with all data being processed and loaded by one team. However, a team may want to allow designated individuals to load data directly into the team’s review tool from their own location. With high-speed internet connections, large volumes can be transferred in reasonable time-frames, and advanced file-transfer tools, when combined with well-designed firewalls and other security measures, allow for ‘distributed admins’ – designated team members sending, processing and uploading their own documents without having to send hard drives by courier.

With multiple data sets being collected from multiple locations, an obvious challenge is duplicates. As long as each local admin can perform initial metadata-only scanning to derive hash values (all admins doing so according to the same specifications), it is possible to run deduplication across multiple datasets simply by sharing the lists of hash values (digests). Centralised deduplication by digest saves time and money and the resulting selected files can then be moved through to processing and upload.
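The idea of deduplicating across locations by exchanging digests only can be sketched as follows in Python. SHA-256 and the 'first site seen wins' rule are assumptions for the example (the choice of hash settings is something the team has to agree on), and the site and file names are invented.

```python
import hashlib


def digest(content):
    """Hash the file content; every admin must use the same algorithm and settings."""
    return hashlib.sha256(content).hexdigest()


def inventory(files):
    """Local step: map each path to its digest. Only this list leaves the site."""
    return {path: digest(content) for path, content in files.items()}


def select_unique(inventories):
    """Central step: keep the first occurrence of each digest across all sites."""
    seen = set()
    selected = []
    for site in sorted(inventories):
        for path, file_digest in sorted(inventories[site].items()):
            if file_digest not in seen:
                seen.add(file_digest)
                selected.append((site, path))
    return selected


# Hypothetical collections at two sites; one file exists at both under different names.
montreal_files = {"a.docx": b"contract v1", "b.msg": b"hello"}
paris_files = {"copy_of_a.docx": b"contract v1", "c.xlsx": b"ledger"}

shared_lists = {"montreal": inventory(montreal_files), "paris": inventory(paris_files)}
to_process = select_unique(shared_lists)
print(to_process)
```

Only the selected files then need to be transferred, processed and uploaded, while the digest lists record which duplicates were suppressed and where they were found.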

Cautions relating to collaborative online review

Any cross-border review team should be careful to discuss and reach agreement in advance on the following:

  • time zone settings (What is the default? Whether and how to populate application and file metadata, as well as back-end system fields, to allow users to see local time for each document);

  • date formats (eg, mm/dd/yyyy, dd/mm/yyyy or yyyy-mm-dd);

  • text encoding to be used from processing through to production (eg, Unicode UTF-8);

  • assignment of tasks; and

  • development of protocols, for such things as:

      • metadata-only inventory of pre-processed files;

      • deduplication (choice of hash settings);

      • processing settings (eg, whether to separate attachments from e-mails);

      • field names in loadfiles;

      • field mappings for upload; and

      • review design choices (agreeing on coding fields, choices, tiered review protocols).

Essential to any successful collaborative review is the development and use of file- and field-naming conventions, dataset tracking systems, metrics and reports.

It is also important to understand the tools being used – their capabilities and limitations. Even when using simple keywords, a failure to understand how the index system works (or was configured) or to anticipate local variations in spelling can result in over-inclusiveness (false positives) and/or under-inclusiveness (false negatives). Team members should not assume that others on the same team will use the same settings, templates and parameters. With the many ways that fields and file associations can be understood, discussion and agreement in advance are essential.{{+footnote}}


Until recently, all exchanges of documents, even when review teams were using highly sophisticated review platforms, required exporting the files to a folder structure, copying those files to media, and sending that media. Today’s tools allow one party to disclose/produce to another party by simply making the designated files available within the same review platform, using special accounts and permissions. For large international investigations, this latter approach is by far the most efficient way of producing files.

Where files are being shipped on physical media, there are two key considerations: compatible formats and the encryption of the data. ‘Compatible formats’ here simply means sending the files in a format that the receiving party can use. A minimal production would consist of non-searchable image files and some kind of loadfile (indicating document unitisation and also providing basic index information and perhaps some coding fields such as date, subject). It is becoming more and more common, at least in litigation settings, to produce native files for everything not deemed privileged; images where redactions are needed; searchable text for all files (altered in the case of redactions); and generous, rather than limited, coding/metadata. A loadfile is still required, in a format suitable to the tool the data is to be loaded into. As always, agreement on loadfile formats and field names is beneficial: some tools work with template field names and, if incoming loadfiles do not use those exact names, manual renaming is needed.
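The field-name problem described above can be illustrated with a short Python sketch that reads a delimited loadfile and renames incoming fields to the names a receiving tool expects. The pipe delimiter, the field names on both sides and the sample row are all hypothetical; actual loadfile formats and template names depend on the tools involved.

```python
import csv
import io

# Hypothetical mapping from the producing party's field names to the receiving tool's names.
FIELD_MAP = {"DocID": "Control Number", "DateSent": "Sent Date", "Subj": "Subject"}


def remap_loadfile(text, field_map, delimiter="|"):
    """Return the rows with renamed fields, plus the incoming fields that had no mapping."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    unmapped = [name for name in reader.fieldnames if name not in field_map]
    rows = [{field_map.get(key, key): value for key, value in row.items()} for row in reader]
    return rows, unmapped


incoming = "DocID|DateSent|Subj|Custodian\nABC0001|2014-07-10|Q2 invoices|J. Smith\n"
rows, unmapped = remap_loadfile(incoming, FIELD_MAP)

print(rows[0])
print(unmapped)
```

Reporting the unmapped fields, instead of silently passing them through, gives the administrators a concrete list to settle with the other side before the next production.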

Where parties are not willing to produce documents by opening up the designated documents within a single review platform, they can still save time and cost by transferring productions by secure file transfer. One side uploading to a secure site and the other side downloading from that site can be faster and cheaper, even with large volumes, than transferring the data to media, encrypting the media, sending the media and extracting the data. As with productions on physical media, it is useful to agree on naming conventions and loadfile formats.

Cautions relating to productions

Whether an investigative team is producing to another party, an investigative body or a regulator, it is essential to discuss as early as possible and reach agreement on the technical specifications that will govern the production or transfer of data. Knowing what your own tools can do (and not do) and what the receiving entity’s tools require in the way of files and fields – as well as what its administrator is expecting – is invaluable. Advance discussion of this kind not only facilitates data sharing in an efficient manner; it also helps to avoid misunderstandings that would have a more substantive impact on the investigation, relating to such crucial factors as date formats,{{+footnote}} local time stamps,{{+footnote}} e-mail address formats,{{+footnote}} the use of non-Western languages,{{+footnote}} and so on.

An investigative team can avoid significant time and cost if it can agree (with the opposing side or its fellow teams) on production formats, including file naming conventions, prefixes, fields to be included, exact field names, field types (numeric, text, date, etc), date formats, time zone settings, text encoding (especially for non-Western European languages) and loadfile formats (tokenised versus text with delimiters versus database formats). Significant costs often arise when non-technical members of the team make decisions on these matters (or do not realise that assumptions are being made by others), leading the technical people to have to spend considerable time remediating the data.
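The cost of leaving date formats unspecified can be quantified with a small Python sketch. It counts the days in a year whose month/day and day/month readings are both valid yet different, and lists the possible readings of a date written with a two-digit year. The function names are illustrative, and the sketch assumes that two-digit years fall in the 2000s.

```python
from datetime import date, timedelta


def ambiguous_days(year):
    """Count days that read as a different valid date when day and month are swapped."""
    count = 0
    day = date(year, 1, 1)
    while day.year == year:
        # Swapping is valid only when the day could be a month (<= 12),
        # and ambiguous only when day and month differ.
        if day.day <= 12 and day.day != day.month:
            count += 1
        day += timedelta(days=1)
    return count


def readings(text):
    """Return each valid date that an a/b/c string with two-digit parts could denote."""
    a, b, c = (int(part) for part in text.split("/"))
    candidates = {
        "mm/dd/yy": (2000 + c, a, b),
        "dd/mm/yy": (2000 + c, b, a),
        "yy/mm/dd": (2000 + a, b, c),
    }
    valid = {}
    for label, (year, month, day_of_month) in candidates.items():
        try:
            valid[label] = date(year, month, day_of_month).isoformat()
        except ValueError:
            pass  # this reading is not a real calendar date
    return valid


print(ambiguous_days(2014))
print(readings("07/10/14"))
```

Agreeing on an unambiguous format such as yyyy-mm-dd in the production specification removes this whole class of error before any data changes hands.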

For any transfer of data on physical media or electronically, strong encryption methods are essential. Not only should appropriate encryption tools be used; so too should best practices relating to the transmission of key files and passwords.


In the end, one must recognise that even with the best lawyers and forensic accountants, the most important part of a successful investigation is the relevant information. E-discovery and data analytics have become major contributors to cross-border investigations in the past decade, and the trend is that this partnership between the worlds of unstructured and structured data will get stronger as time goes by and investigations gain in complexity. Data, in all its forms, surrounds us and grows exponentially. Making the most of it only makes sense – as long as we prepare accordingly.


  1. Available at

  2. See The Sedona Conference Glossary: E-Discovery & Digital Information Management, 3rd Edition, September 2010.

  3. Customer Relationship Management System as defined by Wikipedia at

  4. Content Management System as defined by Wikipedia at

  5. Concept-based indexing (latent semantic indexing and other technologies) identifies and compares the conceptual or semantic content of documents using a mathematical analysis of word occurrence, co-occurrence and proximity.

  6. Predictive coding is not simply a technology; it is the use of a specific kind of technology in a carefully structured workflow in which a human reviewer teaches the tool to perform an automated task (usually to make a binary judgment: relevant or not relevant) so as to replicate the reviewer’s coding decisions and then the results are tested using statistically sound methods. The underlying idea is that (i) the semantic content that makes a document ‘relevant’ can be determined mathematically using advanced indexing techniques; (ii) the machine can learn to identify different content if its initial guess is corrected (it improves its algorithms to more closely replicate the human decision); (iii) a machine can learn to do this extremely quickly and can achieve levels of performance that match or exceed human reviewers; and (iv) the machine does not get tired and does not vary in its judgments or performance; therefore (v) the job of reviewing massive volumes of documents can be performed more quickly (and therefore cheaply) and with greater accuracy through machine learning than through human-only review.

  7. CJK means Chinese-Japanese-Korean.

  8. For example, some speak of family groups; others of associated files. Some refer to Parents; others to Top-level documents. Some use file naming conventions that ‘look up’ (a document has a parent, so the document refers to its ParentID and, above that, to its GroupID), while others use conventions that ‘look down’ (a parent document refers to its AttachIDs). For some, the date of the top-level e-mail that is shared by all family group members is the lead date; for others it is the unified date (or other names).

  9. The m/d/yyyy versus d/m/yyyy ambiguity affects 132 of the 365 days of each year. Where an abbreviation is used for the year, more layers of ambiguity are introduced: 07/10/14 could be 10 July 2014; 7 October 2014; or 14 October 2007.

  10. Some tools, unless configured otherwise, will provide all DateTime values in UTC. Administrators can standardise all values to a single time zone. Administrators can also apply different time zone settings to different datasets according to where the data was collected or where the reviewers will be located. This time zone value can then be stored in its own field and used in a receiving party’s tool to automatically adjust the date time value displayed in the viewer, so that an e-mail sent at 12:46pm in Berlin will show ‘12:46pm’ even to a reviewer in Toronto. Where these settings are not properly used, or their use has not been discussed, reviewers / investigators may misunderstand what they are seeing.

  11. Depending on the e-mail server, the method of collection and the processing settings, an e-mail address field may: be blank, contain both the display name and e-mail address, contain only the display name, contain only the user’s internal username, contain only the e-mail address, or display only the internal server account name.

  12. CJK characters, if converted to ANSI/ASCII/Western European encoding, will appear as squares or other meaningless characters.

