Modelling health-data breaches with application to cyber insurance

Abstract

Data breaches have been increasing noticeably after 2021 notwithstanding the efforts of regulatory bodies to strengthen cybersecurity measures to protect health information. We provide a modelling framework that assesses the risk of private health data breaches, focusing on the data sets compiled by the Privacy Rights Clearinghouse and the U.S. Department of Health and Human Services. We show that the counting process of the data-breach incidents is adequately modelled by the Markov-modulated non-homogeneous Poisson process (MMNPP) whilst the logarithm of the breach sizes is well captured by the generalised Pareto distribution. The cyber insurance premium per institution and two risk measures, Value-at-Risk (VaR) and Average VaR, are obtained. The computed results indicate that cyber insurance policies with longer maturity are more cost-effective. A comprehensive analysis, parameter estimation and implementation of the MMNPP to model cyber risks are underscored as the principal contributions of this research. Some implications for practitioners in handling the modelling of data breaches for a group of institutions are given.

 

Introduction

The digital revolution has led both public and private organisations to rely almost entirely on electronic information, which entails information processing and information technology development. A growing concern about this reliance on information and the progress in digital technology is the risk associated with computer security, also called cybersecurity or information technology security. Cybersecurity risk, or simply cyber risk, refers to the potential failure of information systems that may cause financial loss, operational disruption, and other related damages. Incidents that compromise information systems are called cyber attacks, and the most common result of cyber attacks is data breaches by criminals who copy, transmit, view, steal or use sensitive, protected or confidential data (United States Department of Health and Human Services, 2015).

The past and prevailing rise of data breaches has become a clear and significant threat to hospitals and healthcare systems (Compliance Group, 2022). It must be recognised that advances in information technology (IT) supporting health care information management are signs of societal development and progress. However, the increasing use of platforms for electronic health records (EHRs) (i.e., patients’ complete and up-to-date medical and health history) seems to have brought with it a rise in data breaches. In 2021, cyber attacks were more rampant as cyber criminals took advantage of the IT vulnerability of hospitals and healthcare systems that were too preoccupied with responding to the COVID-19 pandemic (Pino, 2022). The 2020 Health Care Cybersecurity Survey (Healthcare Information and Management Systems Society, 2022) of the Healthcare Information and Management Systems Society (HIMSS), involving 168 U.S.-based industry professionals, asserted that robust cybersecurity is a must for all health care organisations. The U.S. Department of Health and Human Services (HHS) (United States Department of Health and Human Services Office for Civil Rights, 2021b) was created with the mission of enhancing the health and well-being of all Americans. Such a mission could be accomplished through efficient health and human services, and by fostering sound, sustainable scientific advancements geared towards medicine, public health, and social services. National standards for securing sensitive patient health information are embodied in the Health Insurance Portability and Accountability Act (HIPAA) of 1996 (Wikipedia, 2022). The HHS made cybersecurity a priority in 2022 and urged HIPAA-covered healthcare entities to patch up security gaps that give hackers easy access to organisations’ computer servers (Diaz, 2022). Severe penalties for HIPAA violations could be meted out to fortify the security protection of healthcare information. The maximum fine may reach $1.5 million for a HIPAA violation that is due to willful neglect and is not corrected. The largest fine of $5.5 million, for example, was levied against Memorial Healthcare Systems in 2017 for accessing confidential information of 115,143 patients; see Alder (2017). Regulations mandate healthcare providers and entities to notify patients impacted by breaches of protected health information (PHI) if more than 500 individuals are affected. Besides, these reports will be made available to the HHS (United States Department of Health and Human Services Office for Civil Rights, 2021a) and media organisations. With the earliest records tracing back to October 2009, nearly one million individuals were recently affected by the five largest data breaches reported by the HHS in February 2022 (Adams, 2022).

Xie et al. (2020) emphasised the significance of cyber insurance policy design and stated that changes in cyber insurance loss ratios are not driven by the appreciation in premium but by claim-frequency and severity growth. Owing to its importance and timeliness, the modelling of cyber security risk has attracted vigorous research. Unfortunately, there are still limited public resources to test and validate models. In the meantime, researchers in this area implement their frameworks on the data sets collected by the Privacy Rights Clearinghouse (PRC) (2021), which maintains the largest and most extensive publicly available database. A PRC data set was previously studied by Edwards et al. (2016), who concluded that breach sizes could be modelled by the log-normal family of distributions and the daily frequency of breaches could be described by a negative binomial distribution. Eling and Loperfido (2017) used the log-skew-normal distribution to model data breach sizes in conjunction with the use of multidimensional scaling and goodness-of-fit tests. Xu et al. (2018) modelled the inter-arrival times of hacking-data breaches with the autoregressive conditional mean (ACD) model and depicted the breach sizes by the ARMA-GARCH model, with the dependence between the incidents and the breach sizes modelled by the Gumbel copula. In Sun et al. (2021), the breach frequency was modelled by a hurdle Poisson model and the breach severity was fitted to a non-parametric generalised Pareto distribution whilst the dependence between the breach frequency and severity was captured by a Gumbel copula. Bessy-Roland et al. (2021) examined the arrival of cyber events and demonstrated the ability of the Hawkes models in pinning down the self-excitation and interactions of data breaches depending on their type and targets. The sparsity of breaches experienced by individual enterprises over time was circumvented effectively in Fang et al. (2021) by leveraging the inter-entity or inter-enterprise dependence between multiple time series. For an overarching review of cyber-risk modelling and cyber insurance, see Eling (2020), Eling and Schnell (2016), and Zeller and Scherer (2021).

Outside of the US, an increasing trend in medical data breaches is observed as well. According to Verizon’s data-breach investigation reports (Verizon, Verizon), which are based on pooled worldwide regional incident data, the confirmed data-breach occurrences in the health care industry exhibit an overall increasing pattern for the period 01 November 2017–31 October 2021. Although the sources are mostly organisations external to Verizon, the incident counts in Verizon’s report are rather small compared to those in the PRC or HHS data set.

One notable example of a data-breach incident outside of the US is the medical-record breach in Singapore. In this case, some 73,000 patients’ records were leaked due to ransomware attacks at an eye clinic on 06 August 2021 (Haworth, 2021). As a result, Singapore enacted a data-breach notification law in 2021 requiring “notifiable” breaches to be reported to the data protection office. Notifiable in this instance means either significant harm was brought to persons whose information was compromised or at least 501 persons were affected by the data breach. Failure to notify the Cybersecurity Commissioner within three calendar days will result in a fine of up to 10% of the organisation’s annual turnover or SGD 1 million ($742,000), whichever is higher.

In Europe, Dedalus Biologie was fined 1.5 million euros by the French Lead Supervisory Authority for a massive medical-data breach impacting nearly 500,000 people on 23 February 2021 (European Data Protection Board, 2022). Given the unstoppable trend of data breaches, 137 out of 194 countries had put in place legislation geared towards data privacy and protection (United Nations Conference on Trade and Development, 2021). Nonetheless, even if many legislative bodies (e.g. the EU parliament) have mandatory data-breach protection decrees, there are still limited sources that could provide well-structured public data sets focusing on worldwide data breaches (Kierkegaard, 2012). In contrast, there has been considerable progress in the US in terms of public availability of information surrounding data-breach sources and their data-collection process. Challenges in constructing a global database for data breaches include, amongst others, the lack of strong compilation standards and insufficient details on the data sources, as pointed out by Neto et al. (2021), and the unknown number of unique data-breach sources. Indeed, limited data contribute to fewer quantitative research works aimed at establishing dependable frameworks to support the modelling of global data breaches. In their examination of cyber risk events, of which 25% are data breach incidents, Eling and Wirfs (2019) utilised the SAS Global OpRisk and PRC data sets and identified “cyber risks of daily life” and “extreme cyber risks”. In Eling and Wirfs (2019), the peaks-over-threshold method from extreme value theory was employed in conjunction with the actual-cost data analysis.

Some regulatory-enforcement and policy-making organisations do provide guidance and methodology for data-breach record assembly. For example, the European Union Agency for Network and Information Security (ENISA) recommended a methodology for personal data-breach severity computation. To shore up the automatic notification to the relevant regulatory authorities of the organisation’s controller, ENISA promotes the use of three scoring variables: data processing context, ease of identification, and circumstances of breach (Manson and Gorniak, 2013). The models built on the basis of the US data breaches could be applied to model the non-US data breaches to meet the needs and purpose of data-privacy protection authorities and the insurance industry. However, reliable and publicly accessible databases are still wanting, and as such their creation and continual improvements are necessary to address core issues in cyber-risk assessment and management as well as in cyber insurance valuation.

In this paper, we shall tackle the modelling of data breaches in hospitals and medical systems from two sources: the PRC and HHS. It has to be noted that the PRC data set contains data breaches from multiple sources, mostly from the State Attorneys General and the HHS. Data collection by the PRC was, however, suspended in 2019. The HHS data source indicated that the PRC data set is only reliable until 2017. Taking into account regulations and timely updates, we also choose the HHS data for the application of our model.

We looked at the US history of putting in place countermeasures against cyber attacks (Wikipedia, Wikipedia). The creation of a department called Cyber Command in the US in mid-2009 was a macro initiative, with three prominent events acting as catalysts. The first event was the disruption of electricity power supply across multiple regions in January 2008 due to malicious activities aimed at damaging information technology systems. The second was marked by a compromised payment processor of an international bank that led to more than 130 fraudulent transactions within 30 minutes in November 2008. The last event was the data theft reported by the industry in 2008, causing estimated losses of more than one trillion dollars in intellectual property. Laws and regulations have been developed and improved to mitigate cyber threats. These include several Acts such as the Health Insurance Portability and Accountability Act, the Homeland Security Act, the Consumer Data Security and Notification Act, and the Securely Protect Yourself Against Cyber Trespass Act. In particular, the government showed its seriousness by criminalising cyber attacks and striking a balance amongst national security, privacy, and business interests, thereby shrinking the number of cyber crimes. The perilousness of cyber attack events prompted the development of cybersecurity at the national level. We observe a significant drop in the number of cyber attacks in 2015 for the PRC data. Coincidentally, on 01 April 2015, former President Obama issued an Executive Order establishing the first-ever economic sanction to freeze the assets of individuals and entities responsible for cyber attacks, in response to cyber security breaches in major US businesses and financial institutions including Anthem, Sony Pictures, JPMorgan Chase, and Target (PwC Financial Services Regulatory Practice, 2015).

Of particular relevance in motivating the urgent need for resilient cyber measures is the sequence of events on 12 May 2017. There was a worldwide cyber attack by the WannaCry ransomware cryptoworm, which targeted Microsoft Windows operating system users by encrypting their data and demanding ransom payments in the Bitcoin cryptocurrency. It was found that computers running unsupported versions of Microsoft Windows, such as Windows XP and Windows Server 2003, were particularly vulnerable due to a lack of security patches. A few hours later, the spread of the attack was halted by the registration of a kill switch discovered by researcher Marcus Hutchins. Immediately after that, updated versions of the WannaCry ransomware cryptoworm appeared, out-of-band security updates were released for end-of-life products, and more kill switches were registered.

A pattern of significant cyber attacks engenders cybersecurity defence, which must be improved further when novel ways of attack are carried out. Additionally, we find that for both the PRC and HHS data sets, a preliminary test using the function fpois implemented in the R package ‘extRemes’ (Gilleland, 2019) indicates that the daily number of cyber attacks displays over-dispersion. The quantile-quantile (Q-Q) plots demonstrate that the distribution of inter-arrival times of data breaches has a heavier tail than that of the exponential distribution. Our findings inspire the modelling of data breach incident occurrences utilising the Markov-modulated Poisson process (MMPP) or the Markov-modulated non-homogeneous Poisson process (MMNPP) previously examined in Avanzi et al. (2021). The over-dispersion feature is naturally incorporated into the MMPP and the MMNPP, and the inter-arrival times under these processes also have heavier tails than exponentially distributed random variables. Owing to their regime-switching intensity rates, both the MMPP and MMNPP are more flexible than the homogeneous Poisson process. However, which process (i.e., MMPP or MMNPP) to use should be determined by whether the time series of incident arrivals is stationary or not. The MMNPP is a generalised version of the MMPP, and the intensity rate of the MMNPP is influenced by a time-varying exposure component. In comparison to the MMPP, the MMNPP is able to handle non-stationary data. The MMPP and MMNPP have applications in various fields. For example, MMPPs are widely used in the area of internet traffic, where they accurately approximate the long-range dependence characteristics of network traffic traces (e.g., Andersen and Nielsen, 1998; Muscariello, Meillia, Meo, Marsan and Cigno, 2004; Salvador, Valadas and Pacheco, 2003). Chang et al. (2011) priced catastrophe equity put options under an MMPP framework for modelling catastrophic events. In Arts (2017), demand was modelled by an MMPP for a single inventory location at which multiple types of repairable spare parts are kept for the service and maintenance of several different fleets of assets. Avanzi et al. (2021) used the MMNPP to describe auto-insurance claim arrivals. The sightings of marine mammals in shipboard or aerial surveys were modelled by Langrock et al. (2013) employing the MMNPP.
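To illustrate why regime switching produces over-dispersion, the following minimal Python sketch simulates a two-state MMPP and compares the dispersion index (variance divided by mean) of its daily counts with that of a homogeneous Poisson sample with the same mean. The intensity rates, switching rates, horizon, and function name are hypothetical values chosen for illustration only; they are not estimates from the PRC or HHS data, and the dispersion index shown here is a simple diagnostic rather than the fpois test itself.

```python
import numpy as np

def simulate_mmpp(rates, switch_rates, horizon, rng):
    """Simulate arrival times of a two-state MMPP on [0, horizon).

    rates[i]        : Poisson intensity (events per day) while in state i
    switch_rates[i] : rate of leaving state i (1 / mean sojourn time in days)
    """
    t, state, arrivals = 0.0, 0, []
    while t < horizon:
        # The sojourn time in the current state is exponentially distributed.
        stay = rng.exponential(1.0 / switch_rates[state])
        end = min(t + stay, horizon)
        # Given the state, arrivals form a homogeneous Poisson process,
        # so the count is Poisson and the times are uniform on the sojourn.
        n = rng.poisson(rates[state] * (end - t))
        arrivals.extend(rng.uniform(t, end, size=n))
        t, state = end, 1 - state  # switch to the other state
    return np.sort(np.array(arrivals))

rng = np.random.default_rng(0)
horizon = 5000  # days (hypothetical)
arrivals = simulate_mmpp(rates=(0.5, 5.0), switch_rates=(0.05, 0.1),
                         horizon=horizon, rng=rng)

# Daily counts and their dispersion index (equal to 1 for a Poisson variable).
daily_counts = np.histogram(arrivals, bins=np.arange(0, horizon + 1))[0]
dispersion_mmpp = daily_counts.var(ddof=1) / daily_counts.mean()

# Benchmark: homogeneous Poisson counts with the same mean.
poisson_counts = rng.poisson(daily_counts.mean(), size=horizon)
dispersion_poisson = poisson_counts.var(ddof=1) / poisson_counts.mean()

print(f"MMPP dispersion index:    {dispersion_mmpp:.2f}")
print(f"Poisson dispersion index: {dispersion_poisson:.2f}")
```

In this sketch the MMPP counts mix a quiet and a busy regime, so their variance exceeds their mean, whereas the Poisson benchmark stays close to a dispersion index of one.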

We shall demonstrate that the data breach incident arrivals could be adequately captured by the MMPP for the PRC data and by the MMNPP for the HHS data after processing the arrivals data into batches, i.e., summing the number of incidents every 14 days. We consider the model set-up and algorithms based on Avanzi et al. (2021). In contrast to Avanzi et al. (2021), however, our research contribution underscores an exposition of implementation by creating heuristics where necessary, establishing the algorithm’s feasibility in our context, and determining which type of stochastic process to use in accordance with the pre-test results. We explain the inter-arrival-times adjustment, which refers to the 14-day grouping of the data set and the assignment of equal inter-arrival times within each group. Such an adjustment is needed to make a suitable choice between the MMPP and the MMNPP for fitting the breach incidents. Various aspects of model validation are also tackled. Our result reveals that the generalised Pareto distribution (GPD) provides the best fit for the breach sizes, which is closely linked to and supported by the fundamental principles of extreme value theory. Given the independence of the adjusted inter-arrival times from breach sizes, we do not use the widely applied copula method. To complete the development of a cyber insurance product, premiums and values for risk measures are computed numerically.
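The batching and inter-arrival-times adjustment described above can be sketched in Python as follows. This is only one possible reading of the adjustment, under the assumption that each of the k incidents in a 14-day window is assigned the same gap of 14/k days; the function names and the toy daily counts are hypothetical and do not come from the PRC or HHS data.

```python
import numpy as np

def batch_counts(daily_counts, width=14):
    """Sum daily incident counts over consecutive windows of `width` days.

    An incomplete final window is dropped (an assumption of this sketch).
    """
    daily_counts = np.asarray(daily_counts)
    n_full = len(daily_counts) // width
    return daily_counts[: n_full * width].reshape(n_full, width).sum(axis=1)

def adjusted_interarrival_times(batched, width=14):
    """Assign equal inter-arrival times within each window.

    A window containing k > 0 incidents contributes k gaps of width / k days;
    windows with no incidents contribute nothing in this simplified version.
    """
    gaps = [np.full(k, width / k) for k in batched if k > 0]
    return np.concatenate(gaps) if gaps else np.array([])

# Toy example: 28 days of hypothetical daily incident counts.
daily = np.zeros(28, dtype=int)
daily[[0, 3, 10]] = [1, 2, 1]   # 4 incidents in the first 14 days
daily[[14, 20]] = [3, 4]        # 7 incidents in the second 14 days

batched = batch_counts(daily)
gaps = adjusted_interarrival_times(batched)
print(batched)  # counts per 14-day window
print(gaps)     # equalised inter-arrival times in days
```

With this convention the gaps in every non-empty window sum to 14 days, so the adjusted series preserves the window totals while removing within-window timing information.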

We structure the remaining parts of this paper as follows. Section 2 presents the preprocessing of the data before they are modelled. In Section 3, the models are formulated for the evolution of the counting process and sizes of data breaches. Detailed steps and implementation results are discussed in Section 4. Section 5 presents the computation and analysis of risk measures and premiums. Some concluding remarks including certain implications to practitioners are given in Section 6.

 

Section snippets

Preliminary data analysis

In this section, we introduce the PRC and HHS data and illustrate the preprocessing of observations, which include choosing the time range of the raw data set and categorisation. A preliminary analysis of daily incident counts and breach sizes is performed as well.

Model description

In Section 2, we emphasised the statistical features of the data-breach incident counts and breach sizes, which are key quantities in the quantification of the data breaches’ severity. We shall present in this Section the rationale and details for the mathematical expression designed to gauge the total breach sizes. Specifically, the breach-sizes total

Abstract
Data breaches have been increasing noticeably after 2021 notwithstanding the efforts of regulatory bodies to strengthen cybersecurity measures to protect health information. We provide a modelling framework that assesses the risk of private health data breaches focusing on the data sets compiled by the Privacy Rights Clearinghouse and the U.S. Department of Health and Human Services. We show that the counting process of the data-breach incidents is adequately modelled by the Markov-modulated non-homogeneous Poisson process (MMNPP) whilst the logarithm of the breach sizes is well-captured by the generalised Pareto distribution. The cyber insurance premium per institution and two risk measures Value-at-Risk (VaR) and Average VaR are obtained. The computed results indicate that cyber insurance policies with longer maturity are more cost effective. A comprehensive analysis, parameter estimation and implementation of the MMNPP to model cyber risks are underscored as the principal contributions of this research. Some implications to practitioners in handling the modelling of data breaches for a group of institutions are given.

Introduction
The digital revolution has led both public and private organisations to virtually rely on electronic information that entails information processing and information technology development. A growing concern about this information reliance and progress in digital technology is the risk associated with computer security, also called cybersecurity or information technology security. Cybersecurity risk, or simply cyber risk, refers to the potential failure of information systems that may cause financial loss, operational disruption, and other related damages. Incidents that compromise the information systems are called cyber attacks, and the most common result of cyber attacks is data breaches by criminals who copy, transmit, view, steal or use sensitive, protected or confidential data (United States Department of Health and Human Services, 2015).

The past and prevailing rise of data breaches has become a clear and significant threat to hospitals and healthcare systems (Compliance Group, 2022). It must be recognised that advances in internet technology (IT) supporting health care information management are signs of societal development and progress. However, an increasing use of platforms for electronic health records (EHRs) (i.e., patients’ complete and up-to-date medical and health history) seems to have brought as well in the rise of data breaches. In 2021, cyber attacks were more rampant as cyber criminals took advantage of the information technology (IT) vulnerability of hospitals and healthcare systems that are too pre-occupied responding to the COVID-19 pandemic (Pino, 2022). The 2020 Health Care Cybersecurity Survey (Healthcare Information and Management Systems Society, 2022) of the Healthcare Information and Management Systems Society (HIMSS), involving 168 U.S.-based industry professionals, asserted that robust cybersecurity is a must for all health care organisations. The U.S. Department of Health and Human Services (HHS) (United States Department of Health and Human Services Office for Civil Rights, 2021b) was created with the mission of enhancing the health and well-being of all Americans. Such a mission could be accomplished through efficient health and human services, and fostering sound, sustainable scientific advancements geared towards medicine, public health, and social services. National standards for securing sensitive patient health information are embodied in the Health Insurance Portability and Accountability (HIPAA) Act of 1996 (Wikipedia, 2022). The HHS made cybersecurity a priority in 2022 and urged HIPAA-covered healthcare entities to patch up security gaps that enable hackers easy access to organisations’ computer servers (Diaz, 2022). Severe penalties for HIPAA violations could be meted out to fortify the security protection of healthcare information. 
The maximum fine may reach $1.5 million for HIPAA violation due to willful neglect and is not corrected. The largest fine of $5.5 million, for example, was levied against Memorial Healthcare Systems in 2017 for accessing confidential information of 115,143 patients; see Alder (2017). Regulations mandate healthcare providers and entities to notify patients impacted by breaches of protected health information (PHI) if more than 500 individuals are affected. Besides, these reports will be made public to the HHS (United States Department of Health and Human Services Office for Civil Rights, 2021a), and media organisations. With the earliest records tracing back to October 2009, nearly one million individuals recently were affected by the five largest data breaches reported by the HHS in February 2022 (Adams, 2022).

Xie et al. (2020) emphasised the significance of cyber insurance policy design and stated that changes in cyber insurance loss ratios are not driven by the appreciation in premium but by claim-frequency and severity growth. Due to its importance and timeliness, contemporary issues have driven vigorous research for modelling cyber security risk. Unfortunately, there are still limited public resources to test and validate models. In the meantime, researchers in this area implement their frameworks on the data sets collected by the Privacy Rights Clearinghouse (PRC) (2021), which maintains the largest and most extensive publicly available database. A PRC data set was previously studied by Edwards et al. (2016), who concluded that breach sizes could be modelled by the log-normal family of distributions and the daily frequency of breaches could be described by a negative binomial distribution. Eling and Loperfido (2017) used the log-skew-normal distribution to model data breach sizes in conjunction with the use of multidimensional scaling and goodness-of-fit tests. Xu et al. (2018) modelled the inter-arrival times of hacking-data breaches with the autoregressive conditional mean (ACD) model and depicted the breach sizes by the ARMA-GARCH model with the dependence between the incidents and the breach sizes modelled by the Gumbel copula. In Sun et al. (2021), the breach frequency was modelled by a hurdle Poisson model and the breach severity was fitted to a non-parametric generalised Pareto distribution whilst the dependence between the breach frequency and severity was captured by a Gumbel copula. Bessy-Roland et al. (2021) examined the arrival of cyber events and demonstrated the ability of the Hawkes models in pinning down the self-excitation and interactions of data breaches depending on their type and targets. The sparsity of breaches experienced by individual enterprises overtime was circumvented effectively in Fang et al. 
(2021) by leveraging the inter-entity or inter-enterprise dependence between multiple time series. For an overarching review of cyber-risk modelling and cyber insurance, see Eling (2020); Eling and Schnell (2016) and Zeller and Scherer (2021).

Outside of the US, an increasing trend in the medical data breaches is observed as well. According to the Verizon’s data-breach investigation reports (Verizon, Verizon), based on the pooled worldwide regional incident data, the confirmed data-breach occurrences in the health care industry exhibit an overall increasing pattern for the period 01 November 2017–31 October 2021. Although with sources that are mostly organisations external to Verizon, the incident counts in Verizon’s report are rather small compared to those in the PRC or HHS data set.

One notable example of a data-breach incident outside of the US is the medical-record breach in Singapore. In this case, some 73,000 patients’ records were leaked due to ransomware attacks at an eye clinic on 06 August 2021 (Haworth, 2021). As a result, Singapore enacted a data-breach notification law in 2021 requiring “notifiable” breaches to be reported to the data protection office. Notifiable in this instance means either significant harm was brought to persons whose information was compromised or at least 501 persons were affected by the data breach. Failure to notify the Cybersecurity Commissioner within three calendar days will result to a fine of up to 10% of the organisation’s annual turnover or SGD 1 million ($742,000), whichever is higher.

In Europe, Dedalus Biologie was imposed a fine of 1.5 million euros by the French Lead Supervisory Authority, for a massive medical-data breach impacting nearly 500,000 people on 23 February 2021 (European Data Protection Board, 2022). Given the unstoppable trend of data breaches, 137 out of 194 countries had put in place legislation geared towards data privacy and protection (United Nations Conference on Trade and Development, 2021). Nonetheless, even if many legislative bodies (e.g. the EU parliament) have mandatory data-breach protection decrees, there are still limited sources that could provide well-structured public data sets focusing on worldwide data breaches (Kierkegaard, 2012). In contrast, there has been considerable progress in the US in terms of public availability of information surrounding data-breach sources and their data-collection process. Challenges in constructing a global database for data breaches include amongst others the lack of strong compilation standards and insufficient details on the data sources as pointed out by Neto et al. (2021) and the unknown number of unique data-breach sources. Indeed, limited data contribute to fewer quantitative research works aimed at establishing dependable frameworks to support the modelling of global-data breaches. Within the examination of cyber risk events in which 25% are data breach incidents (Eling and Wirfs, 2019) – utilising the SAS Global OpRisk and PRC data sets – identified “cyber risks of daily life” and “extreme cyber risks”. In Eling and Wirfs (2019), the peaks-over-threshold method was employed from the extreme value theory in conjunction with the actual-cost data analysis.

Some regulatory-enforcement and policy-making organisations do provide guidance and methodology for data-breach record assembly. For example, the European Union Agency for Network and Information Security (ENISA) recommended a methodology for personal data-breach severity computation. To shore up the automatic notification to the relevant regulatory authorities of the organisation’s controller, ENISA promotes the use of three scoring variables: data processing context, ease of identification, and circumstances of breach (Manson and Gorniak, 2013). The models built on the basis of the US data breaches could be applied to model the non-US data breaches to meet the needs and purpose of data-privacy protection authorities and the insurance industry. However, reliable and publicly accessible databases are still wanting, and as such their creation and continual improvements are necessary to address core issues in cyber-risk assessment and management as well as in cyber insurance valuation.

In this paper, we shall tackle the modelling of data breaches in hospitals and medical systems from two sources: the PRC and HHS. It has to be noted that the PRC data set contains data breaches from multiple sources, mostly from the State Attorneys General and the HHS. The data collection by the PRC was suspended though in 2019. The HHS data source indicated that the PRC data set is only reliable until 2017. Taking into account regulations and timely updates, we also choose the HHS data for the application of our model.

We looked at the US history of putting counter measures against cyber attacks (Wikipedia, Wikipedia). The creation of a department called Cyber Command in the US is a macro initiative in mid-2009 with three prominent events acting as catalysts. The first event was the disruption of electricity power supply across multiple regions due to malicious activities aimed to damage information technology systems in January 2008. The second was marked by a compromised payment processor of an international bank that led to more than 130 fraudulent transactions within 30 min in November 2008. The last event was the data theft reported by the industry in 2008 causing estimated losses of more than one trillion dollars in intellectual property. Laws and regulations have been developed and improved to mitigate cyber threats. These include several Acts legislated as the Health Insurance Portability and Accountability Act, the Homeland Security Act, the Consumer Data Security and Notification Act, and the Securely Protect Yourself Against Cyber Trespass Act. In particular, the government showed its seriousness by criminalising cyber attacks, and striking a balance amongst national security, privacy, and business interests thereby shrinking the number of cyber crimes. The perilousness of cyber attack events prompted the development of cybersecurity at the national level. We observe a significant drop in the number of cyber attacks in 2015 for the PRC data. Coincidentally, on 01 April 2015, former President Obama issued an Executive Order establishing the first-ever economic sanction to freeze the assets of individuals and entities responsible for cyber attacks in response to cyber security breaches in major US businesses and financial institutions including Anthem, Sony Pictures, JPMorgan Chase, and Target (PwC Financial Services Regulatory Practice, 2015).

Of particular relevance in motivating the urgent need for resilient cyber measures is the sequence of events on 12 May 2017. There was a worldwide cyber attack by the WannaCry ransomware cryptoworm, which targeted users of the Microsoft Windows operating system by encrypting their data and demanding ransom payments in the Bitcoin cryptocurrency. It was found that computers running unsupported versions of Microsoft Windows, such as Windows XP and Windows Server 2003, were particularly vulnerable due to a lack of security patches. A few hours later, the spread of the attack was halted by the registration of a kill switch discovered by researcher Marcus Hutchins. Immediately after that, updated versions of the WannaCry ransomware cryptoworm appeared, out-of-band security updates were released for end-of-life products, and more kill switches were registered.

A pattern of significant cyber attacks engenders cybersecurity defence, which must be improved further when novel ways of attack are carried out. Additionally, we find that for both the PRC and HHS data sets, a preliminary test using the function fpois implemented in the R package ‘extRemes’ (Gilleland, 2019) indicates that the daily number of cyber attacks displays over-dispersion. The quantile-quantile (Q-Q) plots demonstrate that the distribution of inter-arrival times of data breaches has a heavier tail than that of the exponential distribution. Our findings inspire the modelling of data breach incident occurrences utilising the Markov-modulated Poisson process (MMPP) or the Markov-modulated non-homogeneous Poisson process (MMNPP) previously examined in Avanzi et al. (2021). Both features, over-dispersion of the counts and heavier-than-exponential tails of the inter-arrival times, are naturally incorporated into the MMPP and the MMNPP. With their regime-switching intensity rates, both processes are more flexible than the homogeneous Poisson process. However, which process (i.e., MMPP or MMNPP) to use should be determined by whether the time series of incident arrivals is stationary or not. The MMNPP is a generalised version of the MMPP, and its intensity rate is influenced by a time-varying exposure component. In contrast to the MMPP, the MMNPP is able to handle non-stationary data.

The MMPP and MMNPP have ubiquitous applications in various fields. For example, MMPPs are widely used in the area of internet traffic, where they accurately approximate the long-range dependence characteristics of network traffic traces (e.g., Andersen and Nielsen, 1998; Muscariello et al., 2004; Salvador et al., 2003). Chang et al. (2011) priced catastrophe equity put options under an MMPP framework for modelling catastrophic events. Arts (2017) modelled demand by an MMPP for a single inventory location at which multiple types of repairable spare parts are kept for the service and maintenance of several different fleets of assets. Avanzi et al. (2021) used the MMNPP to describe auto-insurance claim arrivals. The sightings of marine mammals in shipboard or aerial surveys were modelled by Langrock et al. (2013) employing the MMNPP.
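To illustrate why regime switching produces over-dispersion, the following minimal Python sketch compares the variance-to-mean ratio of daily counts under a two-regime model with that of a plain Poisson model. It is only an illustration under stated assumptions: the discrete-time regime chain is a simplification of the continuous-time MMPP, the names `simulate_regime_counts` and `dispersion_index` are hypothetical, and the rates and persistence probabilities are made up rather than estimated from the PRC or HHS data. It is not the fpois test mentioned above.

```python
import math
import random
from statistics import mean, pvariance


def poisson_draw(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_regime_counts(n_days, rates, stay_prob, seed=0):
    """Daily counts whose Poisson rate follows a hidden two-state Markov chain.

    Assumption: a discrete-time chain stands in for the continuous-time MMPP.
    """
    rng = random.Random(seed)
    state, counts = 0, []
    for _ in range(n_days):
        counts.append(poisson_draw(rng, rates[state]))
        # Leave the current regime with probability 1 - stay_prob[state].
        if rng.random() > stay_prob[state]:
            state = 1 - state
    return counts


def dispersion_index(counts):
    """Variance-to-mean ratio; roughly 1 for Poisson counts, above 1 under over-dispersion."""
    return pvariance(counts) / mean(counts)


# Illustrative parameters only (not estimates from the paper).
regime_counts = simulate_regime_counts(5000, rates=(0.5, 4.0), stay_prob=(0.95, 0.95))
flat_counts = simulate_regime_counts(5000, rates=(2.0, 2.0), stay_prob=(0.95, 0.95))

print("regime-switching dispersion index:", round(dispersion_index(regime_counts), 2))
print("homogeneous Poisson dispersion index:", round(dispersion_index(flat_counts), 2))
```

When both regimes share the same rate, the model collapses to a homogeneous Poisson process and the index stays near one; distinct rates push it well above one, which is the behaviour the pre-test detects in the breach counts.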

We shall demonstrate that the data breach incident arrivals could be adequately captured by the MMPP for the PRC data and by the MMNPP for the HHS data after processing the arrivals data into batches, i.e., summing the number of incidents every 14 days. We consider the model set-up and algorithms based on Avanzi et al. (2021). In contrast to Avanzi et al. (2021), however, our research contribution underscores an exposition of implementation by creating heuristics if necessary, establishing the algorithm’s feasibility in our context, and determining which type of stochastic process to use in accordance with the pre-test results. We explain the inter-arrival-times adjustment, which refers to the 14-day grouping of the data set and assigning equal inter-arrival times within each group. Such an adjustment is needed to make a suitable choice between the MMPP and the MMNPP for fitting the breach incidents. Various aspects of model validation were also tackled. Our result reveals that the generalised Pareto distribution (GPD) provides the best fit for the breach sizes, which is closely linked to and supported by the fundamental principles of extreme value theory. Given the independence of the adjusted inter-arrival times from breach sizes, we do not use the widely applied copula method. Completing the development of a cyber insurance product, premiums and values for risk measures were computed numerically.

We structure the remaining parts of this paper as follows. Section 2 presents the preprocessing of the data before they are modelled. In Section 3, the models are formulated for the evolution of the counting process and sizes of data breaches. Detailed steps and implementation results are discussed in Section 4. Section 5 presents the computation and analysis of risk measures and premiums. Some concluding remarks including certain implications to practitioners are given in Section 6.

Section snippets
Preliminary data analysis
In this section, we introduce the PRC and HHS data and illustrate the preprocessing of observations, which includes choosing the time range of the raw data set and categorisation. A preliminary analysis of daily incident counts and breach sizes is performed as well.

Model description
In Section 2, we emphasised the statistical features of the data-breach incident counts and breach sizes, which are key quantities in the quantification of the data breaches’ severity. We shall present in this Section the rationale and details for the mathematical expression designed to gauge the total breach sizes. Specifically, the breach-sizes total $S_t$ is modelled as
$$S_t = \sum_{i=1}^{N_t} e^{Y_i},$$
where $N_t$ is the counting process of the data breach incident arrivals and $Y_i$ is the logarithmic

Implementation results
Having introduced the complete theoretical framework of the total breach size model, we present an empirical illustration using the PRC and HHS data sets previously introduced in Section 2. Together with the necessary preliminary analysis, the implementation of the MMNPP and the MMPP in modelling the counting process of the data breaches is described in Section 4.1. The selection of candidate models fitting the breach sizes is presented in Section 4.2.

Insurance premiums and risk metrics for the total dollar breach size losses
Indeed, the total breach sizes can give us an indirect measure of the impact of data breaches. However, in the context of insurance valuation, we shall transform the total breach sizes into dollar-amount losses. In particular, risk measures and cyber risk insurance premiums are computed using dollar-amount losses.
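A Monte Carlo computation of this kind can be sketched in Python as below. Everything specific in it is an assumption made for illustration: a plain Poisson count stands in for the Markov-modulated counting process, the total breach size is taken as the sum of exponentiated GPD-distributed logarithmic breach sizes, the dollar conversion uses a hypothetical flat cost per record, the premium is an expected loss with a hypothetical 10% loading, and none of the parameter values are estimates from the paper.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def gpd_draw(rng, mu, sigma, xi):
    """Inverse-CDF draw from a generalised Pareto distribution."""
    u = rng.random()
    if xi == 0.0:
        return mu - sigma * math.log(1.0 - u)
    return mu + sigma / xi * ((1.0 - u) ** (-xi) - 1.0)


def simulate_losses(n_paths, lam, mu, sigma, xi, cost_per_record, seed=1):
    """Simulated dollar losses over one policy period for a single institution."""
    rng = random.Random(seed)
    losses = []
    for _ in range(n_paths):
        n_incidents = poisson_draw(rng, lam)
        # The GPD is placed on the log scale, so each breach size is exp(draw).
        records = sum(math.exp(gpd_draw(rng, mu, sigma, xi)) for _ in range(n_incidents))
        losses.append(cost_per_record * records)
    return losses


def var_avar(losses, alpha):
    """Empirical VaR (alpha-quantile) and AVaR (mean of losses at or beyond VaR)."""
    ordered = sorted(losses)
    idx = math.ceil(alpha * len(ordered)) - 1
    tail = ordered[idx:]
    return ordered[idx], sum(tail) / len(tail)


# Illustrative parameters only; a negative shape keeps the log sizes bounded.
losses = simulate_losses(20000, lam=3.0, mu=5.0, sigma=2.0, xi=-0.2, cost_per_record=100.0)
var95, avar95 = var_avar(losses, 0.95)
premium = 1.1 * sum(losses) / len(losses)  # expected loss plus a hypothetical 10% loading

print("VaR 95%:", round(var95, 2))
print("AVaR 95%:", round(avar95, 2))
print("Premium:", round(premium, 2))
```

By construction the AVaR is never below the VaR at the same level, which makes it the more conservative of the two risk measures.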

To calculate the predicted losses and premiums, we use the MMPP-3 model to simulate the number of incidents and the GPD model to simulate the breach sizes for the PRC data. The models used for the

Conclusion
We analysed the data breaches encountered by the hospitals and health systems via the PRC data and the HHS data. The MMNPP was found to be an adequate model for the counting process of the data breaches after data pre-processing. As the logarithmic breach sizes could be modelled by the GPD, a complete pricing framework was developed. We put forward and improved specific ways of applying the MMNPP to the cybersecurity area, which was started by Avanzi et al. (2021) in the context of automobile

CRediT authorship contribution statement
Yuying Li: Investigation, Methodology, Data curation, Formal analysis, Conceptualization, Software, Validation, Visualization, Writing – original draft. Rogemar Mamon: Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Resources, Writing – review & editing.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) through R. Mamon’s Discovery Grant (RGPIN-2017-04235).

Yuying Li is currently completing the Ph.D. degree in the Department of Statistical and Actuarial Sciences (DSAS) at the University of Western Ontario (UWO). Her research covers statistical methods and stochastic approaches in cyber risk modelling.

Yuying Li, Rogemar Mamon,
Modelling health-data breaches with application to cyber insurance,
Computers & Security,
Volume 124,
2023,
102963,
ISSN 0167-4048,
https://doi.org/10.1016/j.cose.2022.102963.
(https://www.sciencedirect.com/science/article/pii/S0167404822003558)