This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formatted PDF and full text (HTML) versions will be made available soon.
Effect of spatial resolution on cluster detection: a simulation study
International Journal of Health Geographics 2007, 6:52
doi:10.1186/1476-072X-6-52
Al Ozonoff (aozonoff@bu.edu)
Caroline Jeffery (cjeffery@hsph.harvard.edu)
Justin Manjourides (jmanjour@hsph.harvard.edu)
Laura Forsberg White (lfwhite@bu.edu)
Marcello Pagano (pagano@hsph.harvard.edu)
ISSN: 1476-072X
Article type: Methodology
Submission date: 7 August 2007
Acceptance date: 27 November 2007
Publication date: 27 November 2007
Article URL: http://www.ij-healthgeographics.com/content/6/1/52
This peer-reviewed article was published immediately upon acceptance. It can be downloaded, printed and distributed freely for any purposes (see copyright notice below).
Articles in IJHG are listed in PubMed and archived at PubMed Central.
For information about publishing your research in IJHG or any BioMed Central journal, go to http://www.ij-healthgeographics.com/info/instructions/
For information about other BioMed Central publications go to http://www.biomedcentral.com/
© 2007 Ozonoff et al., licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Effect of spatial resolution on cluster detection: a simulation study
Al Ozonoff1,2, Caroline Jeffery2, Justin Manjourides2, Laura Forsberg White1,2, Marcello Pagano∗2
1 Department of Biostatistics, Boston University School of Public Health, 715 Albany Street, Boston, MA 02118, U.S.A.
2 Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115, U.S.A.
Email: Al Ozonoff - aozonoff@bu.edu; Caroline Jeffery - cjeffery@hsph.harvard.edu; Justin Manjourides - jmanjour@fas.harvard.edu; Laura Forsberg White - lfwhite@bu.edu; Marcello Pagano∗ - pagano@hsph.harvard.edu
∗ Corresponding author
Abstract
Background: Aggregation of spatial data is intended to protect privacy, but some effects of aggregation on spatial methods have not yet been quantified.
Methods: We generated 3,000 spatial datasets and evaluated power of detection at 12 different levels of aggregation using the spatial scan statistic implemented in SaTScan v6.0.
Results: Power to detect clusters decreased from nearly 100% when using exact locations to roughly 40% at the coarsest level of spatial resolution.
Conclusions: Aggregation has the potential for obfuscation.
1 Introduction
The Centers for Disease Control and Prevention (CDC) define surveillance to be the ongoing, systematic collection, analysis, interpretation, and dissemination of data about a health-related event for use in public health action to reduce morbidity and mortality and to improve health [1]. To control and prevent disease, it is surely important to be vigilant for infectious disease outbreaks or geographic areas of notably high chronic disease incidence. Indeed this is a primary aim of public health surveillance, and explains in part why surveillance plays an integral role in public health practice [2].
When caring for a single patient, the clinician understandably desires as much diagnostic information as possible, and at the highest possible level of precision. Analogously, a public health professional is concerned with diagnosing a public ailment, and should similarly desire all available information with the greatest possible level of precision. Thus it is noteworthy, in the context of public health surveillance, that for reasons of privacy, information is sometimes destroyed or intentionally degraded before being proffered to the analyst.
The argument to protect patient data for reasons of privacy could also be used to shield these data from clinicians. In a clinical setting, we choose not to protect the privacy of the patient by hiding relevant information from the clinician, because it is patently silly to do so. However, we often suffer from a similarly framed argument to obscure population level data, even when addressing matters of concern to the public health.
We argue that one important reason to retain important, specific information such as precise location is that the “requisite” aggregation for privacy necessarily reduces the power available for outbreak detection. To balance the cost of this and other troubles for spatial analysis [3], aggregation does indeed make it more difficult to identify individual patients. This is crucial if the data are made publicly available or if there are other reasons to safeguard privacy, but it also makes an already challenging surveillance task even more difficult.
A growing body of literature addresses statistical protection of privacy and its effects on analysis of surveillance data. Cox has written a useful survey of the general problem of confidentiality within small geographic areas, and the impacts of privacy concerns on public health policy and practice [4]. Armstrong et al. thoroughly discuss the design and implementation of several different approaches to protect privacy in the context of spatial analyses [5]. Importantly, methods were evaluated both on the impact on analysis as well as the effectiveness of preserving confidentiality. Yet the restriction of the quantitative assessment to the Cuzick-Edwards test statistic [6], which is no longer commonly used for spatial surveillance [7,8], limits the application of this knowledge to a surveillance setting. Further, data with exact locations were not considered for this evaluation.
Waller and colleagues have written extensively on factors that may influence power of cluster detection methods. For example, they have studied the effects of geographic scale on focused tests of clustering [9,10], and the importance of cluster location amidst a heterogeneous underlying population [11]. Notably, this group has investigated more than one statistical method, using several different measures for evaluation. However these studies generally use focused tests of clustering, where a putative exposure source has been identified a priori, whereas surveillance purposes typically require a general test of clustering [12].
Just as we trust clinicians and hospital personnel with sensitive and confidential information, so too, one can argue, we should find trustworthy individuals to handle surveillance data responsibly. Informatics-based approaches offer a potential compromise to the trade-off between privacy and surveillance utility. For example, development of automated surveillance algorithms might allow sensitive data to be analyzed without human intervention [13]. But in order to evaluate the benefit that such an approach might provide, we must first better understand the costs in performance that the obfuscation or destruction of information may cause.
We reported briefly [14] that there is an undesirable loss of power to detect disease outbreaks when the spatial information provided is degraded from a continuous scale of measurement to a coarser, aggregate level. For example, often only a patient’s ZIP code is available to a surveillance system, instead of the patient’s listed residential address. Similar results have appeared in contemporaneous work [15], and a recent paper by the same group further confirms this basic premise [16]. However, those studies focused solely on exact locations compared to a single level of aggregation.
In our present work, we add to these previous results by considering multiple levels of aggregation. Using synthetic data, we systematically quantify the loss of cluster detection performance as a function of spatial resolution, while limiting confounding influences from a variety of complex factors that affect spatial analyses. We may interpret these results relative to geographic scales we might encounter while surveilling a large metropolitan city. In this way, we attempt to clarify the price one pays for aggregation, and in turn to better inform future policy decision-makers.
2 Methods
2.1 Data
We designed a simulation study to determine the effect of spatial aggregation on power to detect spatial clusters. Random samples of size 90 were drawn from an underlying uniform distribution on the unit disk (i.e. the Euclidean circle of radius one). Atop this background sample, we then superimpose a simulated cluster consisting of 10 points uniformly distributed in a small square at a location randomly determined for each simulated dataset (Figure 1). Thus each simulated dataset consists of a total sample of 100 points. Although the clusters are not defined by circles, for ease of discussion we speak of a cluster “radius” to mean the radius of the circle inscribed within the square cluster boundary. In the occasional instance where the cluster center falls within one radius of the unit disk boundary, we require that all 10 cluster points lie within the intersection of the cluster boundary and the unit disk.
We generated three separate sets of simulated data with cluster radii of 0.025, 0.05 and 0.10, corresponding to disease clusters with a geographical extent equal to 2.5%, 5%, or 10% respectively of the radius of the study area. Although this results in clusters of different intensities, the corresponding relative risks are quite large (greater than 10) for all simulations. For each cluster radius, we generate 1,000 datasets under these conditions, or a total of 3,000 datasets for the entire simulation study.
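As an illustration only, one simulated dataset of this kind could be generated along the following lines. This is a minimal sketch, not the authors' code: the function names, the use of rejection sampling, the seed, and the assumption that the cluster center is itself drawn uniformly from the unit disk are all hypothetical choices made for the example.

```python
import random

def sample_unit_disk(rng):
    """Draw one point uniformly from the unit disk by rejection sampling."""
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return (x, y)

def simulate_dataset(cluster_radius, rng, n_background=90, n_cluster=10):
    """Return (points, center): background points followed by cluster points."""
    background = [sample_unit_disk(rng) for _ in range(n_background)]
    # Assumption: the cluster center is uniform on the unit disk.
    cx, cy = sample_unit_disk(rng)
    cluster = []
    while len(cluster) < n_cluster:
        # Uniform in the square whose inscribed circle has radius cluster_radius.
        x = cx + rng.uniform(-cluster_radius, cluster_radius)
        y = cy + rng.uniform(-cluster_radius, cluster_radius)
        # Keep only points that also fall inside the unit disk.
        if x * x + y * y <= 1.0:
            cluster.append((x, y))
    return background + cluster, (cx, cy)

rng = random.Random(2007)  # arbitrary seed for reproducibility
points, center = simulate_dataset(0.05, rng)
```

Repeating the call 1,000 times for each of the three radii would give the 3,000 datasets described above.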
To simulate spatial aggregation at different geographic scales, we use a sequence of 12 uniform grids of varying spacing, superimposed on the unit disk. The levels of aggregation are chosen according to their corresponding grid spacing, ranging from 15 grid squares per side (length of grid square 0.067) to four grid squares per side (length of grid square 0.25). Throughout, we use the average distance between grid points (equivalently, the average diameter of an aggregation region) as an index of the level of spatial aggregation (Figure 2).
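A grid-based aggregation of this kind, with each point reassigned to its nearest grid point as described in the next paragraph, might be sketched as below. The function names are hypothetical, and two details are assumptions not specified in the text: the grid is laid over the bounding square of the unit disk, and the jitter is drawn from a uniform distribution.

```python
import math
import random

def make_grid_points(spacing, jitter=0.0, rng=None):
    """Grid points covering [-1, 1] x [-1, 1], one per square of side `spacing`."""
    rng = rng or random.Random(0)
    n = math.ceil(2.0 / spacing)
    centers = []
    for i in range(n):
        for j in range(n):
            gx = -1.0 + (i + 0.5) * spacing
            gy = -1.0 + (j + 0.5) * spacing
            # Assumption: uniform jitter of at most `jitter` in each coordinate.
            centers.append((gx + rng.uniform(-jitter, jitter),
                            gy + rng.uniform(-jitter, jitter)))
    return centers

def aggregate(points, centers):
    """Replace each point by its nearest grid point (Euclidean distance)."""
    def nearest(p):
        return min(centers,
                   key=lambda c: (c[0] - p[0]) ** 2 + (c[1] - p[1]) ** 2)
    return [nearest(p) for p in points]

centers = make_grid_points(spacing=0.25, jitter=0.02, rng=random.Random(1))
example = [(0.10, 0.10), (0.12, 0.09), (-0.60, 0.40)]
aggregated = aggregate(example, centers)
```

In this example the first two points are close together and collapse onto the same grid point, which is exactly the loss of spatial detail the study sets out to quantify.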
By assigning all simulated data points to the nearest grid point, these grids thereby define spatial regions of aggregation. Prior to analysis, we modified each grid by adding small amounts of bivariate jitter to each grid point (i.e. region center). Our purpose was to mitigate the high degree of spatial regularity across a uniform grid of assignment points, and in part to reflect the non-uniform nature of administrative regions as they appear in real systems. We note however that the use of a uniform population distribution implies constant population densities across administrative regions, something unlikely to be seen in a real system.
2.2 Statistical analysis
We use SaTScan version 6.0 (2005) with a purely spatial Bernoulli model, with cluster size constrained to be no greater than 25% of the population. Statistical significance of spatial clusters is determined using a nominal Type I error rate of 0.05.
Our primary outcome is the proportion of simulated datasets, under each level of aggregation, for which SaTScan accurately detects the simulated cluster. We denote this proportion as the power to detect clusters. In order to ensure that the cluster detected by SaTScan is sufficiently close in space to the true cluster location, we record a detection as successful if and only if the identified cluster center is within one cluster radius of the true cluster center. We also record the proportion of false detections, defined as any cluster identification with center more than one cluster radius from the true cluster center, or failure of any identified cluster to achieve significance level (i.e. p-value) below 0.05.
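The success criterion can be written down compactly. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function names and the tuple layout of the scan results are hypothetical, the example values are invented, and we assume a detection must also be significant at the 0.05 level to count as successful, in line with the caption of Figure 3.

```python
import math

def is_successful_detection(true_center, detected_center, p_value,
                            cluster_radius, alpha=0.05):
    """True if the detected cluster is significant and within one cluster
    radius of the true cluster center."""
    if detected_center is None or p_value >= alpha:
        return False
    return math.dist(true_center, detected_center) <= cluster_radius

def estimate_power(results, cluster_radius, alpha=0.05):
    """results: list of (true_center, detected_center, p_value), one per dataset."""
    outcomes = [is_successful_detection(t, d, p, cluster_radius, alpha)
                for t, d, p in results]
    return sum(outcomes) / len(outcomes)

# Hypothetical scan outputs for four datasets (illustrative values only).
results = [
    ((0.0, 0.0), (0.01, 0.02), 0.001),   # close and significant: success
    ((0.3, 0.3), (0.60, 0.30), 0.010),   # significant but far from the truth
    ((0.2, -0.1), (0.21, -0.10), 0.200), # close but not significant
    ((-0.4, 0.5), None, 1.0),            # nothing detected
]
power = estimate_power(results, cluster_radius=0.05)
```

Only the first of the four hypothetical datasets meets both conditions, so the estimated power in this toy example is one in four.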
To measure the spatial accuracy of cluster detection, we further consider the identification (correctly or not) of individual data points in a significant disease cluster. Within each simulated dataset, there were 10 points of 100 that comprised the simulated cluster. For these “cluster points”, we calculate the proportion correctly included in a SaTScan-identified cluster with p-value below 0.05. Similarly for the remaining 90 “non-cluster points”, we calculate the proportion incorrectly included in a statistically significant SaTScan-identified cluster. These proportions are analogous to traditional definitions of sensitivity and 1 minus specificity, respectively, where we compare the classification via SaTScan of points involved in a cluster to the “gold standard” of cluster status as determined by simulation design.
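For a single dataset, these two point-level proportions can be computed as in the following sketch. The function name and the flagged pattern are hypothetical and chosen only to illustrate the arithmetic.

```python
def point_level_rates(is_cluster_point, is_flagged):
    """Return (sensitivity, false positive fraction) for one dataset.

    is_cluster_point: True for points placed in the simulated cluster.
    is_flagged: True for points inside a significant detected cluster.
    """
    pairs = list(zip(is_cluster_point, is_flagged))
    true_positives = sum(1 for c, f in pairs if c and f)
    false_positives = sum(1 for c, f in pairs if not c and f)
    n_cluster = sum(is_cluster_point)
    n_background = len(is_cluster_point) - n_cluster
    return true_positives / n_cluster, false_positives / n_background

# 90 background points followed by 10 cluster points, as in the simulation design.
truth = [False] * 90 + [True] * 10
# Hypothetical result: 3 background points and 8 cluster points are flagged.
flagged = [True] * 3 + [False] * 87 + [True] * 8 + [False] * 2
sensitivity, false_positive_fraction = point_level_rates(truth, flagged)
```

Here the sensitivity is 8/10 and the false positive fraction (1 minus specificity) is 3/90.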
3 Results
Figures 3 through 6 illustrate our results. For all three sets of simulations, power decreases as the size of aggregation regions increases. These simulated clusters are sufficiently large so that the power to detect for all three cluster radii is nearly 100% when exact locations are used; this decreases to roughly 40% at the coarsest level of aggregation, which corresponds to a more than halving of the probability of successful detection (Figure 3).
Using exact locations, the false detection rate is approximately 2%. In the presence of any level of aggregation, the false detection rate increases to nearly 20% or higher in all of our simulations (Figure 4). This rate appears to increase slowly for greater levels of aggregation.
We further evaluate the effect of aggregation on the sensitivity and specificity of SaTScan (Figures 5 and 6). While performance is nearly ideal when using exact locations, the proportion of false negatives rises to almost 50% at the coarsest level of aggregation. In concordance with our earlier results, sensitivity tends to decrease as spatial aggregation increases, while the false positive fraction (1 minus specificity) follows an inverse and nearly monotonic association.
4 Discussion
Our results are noteworthy for a number of reasons. First, we have used more than two levels of aggregation in an effort to estimate the incremental effect of this aggregation on the power of cluster detection. Second, we have further investigated the effect of aggregation on the rate of false detection. Finally, when viewed in the context of similar studies, our results add to a body of evidence that the underlying relationships reported appear robust to differing geographies and population distributions.
Our calculation of power and false detection differs from the same measures as otherwise used in an important way. We expect a certain proportion of spurious “clusters” to arise by chance alone. Thus we have placed an additional requirement on what we denote a successful identification of a cluster, namely that the identified cluster be proximal to the true cluster as determined by the simulation design. Because our simulations involve only one cluster per dataset, an identification far from the true cluster is genuinely spurious and must be considered a false detection in this context. Indeed, for practical purposes such an identification might divert resources for investigation to a geographic area not related to the true outbreak or cluster present in the data.
To place our results in context, consider the metropolitan Boston area. The city and adjacent suburbs can be enclosed in a circle of radius roughly 7,500 meters. Although the size of city ZIP codes and census tracts varies, an approximate median radius for Boston ZIP codes is roughly 1,500 meters, or 20% of the region radius. Boston census tracts have an approximate median radius of 500 meters, or 6.7% of the region radius. Thus census tract and ZIP code aggregation of Boston data corresponds roughly to our first and penultimate levels of aggregation respectively. Likewise, the simulated clusters of radii 0.025, 0.05, and 0.10 correspond to disease outbreaks smaller than one census tract, about one census tract, or several census tracts (perhaps a small ZIP code) respectively.
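The correspondence between the Boston scales and the simulated aggregation levels can be checked with a few lines of arithmetic. The sketch below assumes that the 12 levels have grid square side length 1/n for n = 15, 14, ..., 4, which reproduces the endpoint values 0.067 and 0.25 given in the Methods; the intermediate levels are not listed in the text, so this spacing is an assumption, as are the variable and function names.

```python
region_radius_m = 7500.0
zip_radius_m = 1500.0
tract_radius_m = 500.0

# Assumption: side length 1/n for n = 15, 14, ..., 4 squares per side,
# ordered from finest (level 1) to coarsest (level 12).
levels = [1.0 / n for n in range(15, 3, -1)]

def nearest_level(fraction):
    """1-based index of the aggregation level closest to `fraction`."""
    best = min(range(len(levels)), key=lambda i: abs(levels[i] - fraction))
    return best + 1

tract_fraction = tract_radius_m / region_radius_m  # about 0.067
zip_fraction = zip_radius_m / region_radius_m      # 0.2
tract_level = nearest_level(tract_fraction)
zip_level = nearest_level(zip_fraction)
```

Under this assumption, census tracts map to level 1 of 12 and ZIP codes to level 11 of 12, consistent with the "first and penultimate" levels named above.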
The number of false detections rose well above the nominal alpha level when spatial data were aggregated. Interestingly, the level of aggregation does not appear to be a major contributor to false alarms; rather, there is an immediate increase upon aggregation above the nominal false alarm rate, with little additional increase for further aggregation. To our knowledge, this has not been reported previously. Since false alarms form a major limitation to the actionable consequences of cluster detection, this issue should be considered carefully. Even in situations where loss of power is not severe, the increase in false detection rates may impose further limits on the utility of spatial methods when using aggregated data.
Our study is limited in several ways. We have only included an evaluation of SaTScan as a test of clustering, although we have seen similar results using other methods [14]. The use of synthetic data is both helpful and harmful to generalizability of results. There are few populations that even approximate a homogeneous and uniform distribution, and thus the simulated datasets do not reflect a realistic surveillance scenario. However, using a homogeneous distribution removes some of the potentially confounding interactions between cluster location, geography, population distribution, and spatial methods. Thus despite its limitations, our study contributes to an understanding of the complex association between spatial resolution and power of detection.
We chose not to investigate spatio-temporal methods (implemented for example with a space-time scan, also available using SaTScan). Space-time interactions imply greater complexity when considering effects of spatial aggregation (or indeed, temporal aggregation), and the potential parameter space of simulation studies increases greatly as well. For this and other reasons, the effect of spatial aggregation (or indeed, temporal aggregation) in a cluster detection context remains an area for further investigation.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
AO and MP conceived of the study, participated in the design, and drafted the manuscript. AO, CJ, and JM were responsible for statistical programming and data analysis. All authors read and approved the final manuscript.
Acknowledgements
Research partially supported by NIH grants R01-AI51164 and R01-EB006195.
References
1. Teutsch SM, Churchill RE: Principles and Practice of Public Health Surveillance. Oxford Univ Press 2000.
2. Brookmeyer R, Stroup D (Eds): Monitoring the Health of Populations: Statistical principles and methods for public health surveillance. Oxford Univ Press 2004.
3. Grubesic T, Matisziw T: On the use of ZIP codes and ZIP code tabulation areas (ZCTAs) for the spatial analysis of epidemiological data. Intl J Health Geographics 2006, 5:1–15.
4. Cox L: Protecting confidentiality in small population health and environmental statistics. Statistics in Medicine 1996, 15:1895–1905.
5. Armstrong M, Rushton G, Zimmerman D: Geographically masking health data to preserve confidentiality. Statistics in Medicine 1999, 18:497–525.
6. Cuzick J, Edwards R: Spatial clustering for inhomogeneous populations. J Royal Statist Soc B 1990, 52:73–104.
7. Waller L, Gotway C: Applied Spatial Statistics for Public Health Data. Wiley 2004.
8. Lawson A: Statistical Methods in Spatial Epidemiology 2ed. Wiley 2006.
9. Waller L, Lawson A: The power of focused tests to detect disease clustering. Statistics in Medicine 1995, 14:2291–2308.
10. Waller L: Statistical power and design of focused clustering studies. Statistics in Medicine 1996, 15:765–782.
11. Waller L, Hill E, Rudd R: The geography of power: Statistical performance of tests of clusters and clustering in heterogeneous populations. Statistics in Medicine 2006, 25:853–865.
12. Lawson A, Kleinman K (Eds): Spatial and Syndromic Surveillance for Public Health. Wiley 2005.
13. Boulos M, Cai Q, Padget J, Rushton G: Using software agents to preserve individual health data confidentiality in micro-scale geographic analyses. Journal of Biomedical Informatics 2006, 39:160–170.
14. Jeffery C, Ozonoff A, Forsberg L, Nuno M, Pagano M: The cost of obfuscation when reporting locations of cases in syndromic surveillance systems. Advances in Disease Surveillance 2006, 1:36.
15. Cassa C, Grannis S, Overhage J, Mandl K: A novel, context-sensitive approach to anonymizing spatial surveillance data: impact on outbreak detection. Advances in Disease Surveillance 2006, 1:10.
16. Olson K, Grannis S, Mandl K: Privacy protection versus cluster detection in spatial epidemiology. Am J Public Health 2006, 96:2002–2008.
Figures
Figure 1 - Illustration of a simulated cluster
90 points were distributed uniformly on the unit circle, and 10 additional “outbreak” points form the square “cluster” left of center.
Figure 2 - Illustration of spatial aggregation
One of 12 levels of spatial aggregation used in this study. Grid lines define spatial regions of aggregation, and representative points are chosen randomly within each region. All simulated points are reassigned to the representative point of the appropriate region.
Figure 3 - Effect of aggregation on power
As spatial data are aggregated, power to detect clusters decreases. Horizontal axis denotes level of spatial aggregation, determined by radius of aggregation region; vertical axis denotes proportion of simulated clusters correctly identified at significance level α = 0.05.
Figure 4 - Effect of aggregation on false detection rate
Vertical axis denotes proportion of simulations where spurious clusters are detected.
Figure 5 - Effect of aggregation on sensitivity
Identification of cases involved in an outbreak becomes more difficult as data are aggregated. Vertical axis denotes proportion of cases falsely identified as outside the disease cluster (false negatives).
Figure 6 - Effect of aggregation on specificity
Vertical axis denotes proportion of cases falsely identified as inside the cluster (false positives).
[Figure 1: both axes run from −1.0 to 1.0.]
[Figure 2: both axes run from −1.0 to 1.0.]
[Figure 3: Power (vertical axis, 0.0 to 1.0) against Length of Side of Grid Square (horizontal axis, 0.00 to 0.25); series for Cluster Radius = 0.025, 0.05 and 0.1.]
[Figure 4: FDR (vertical axis, 0.0 to 0.4) against Length of Side of Grid Square (horizontal axis, 0.00 to 0.25); series for Cluster Radius = 0.025, 0.05 and 0.1.]
[Figure 5: False Negative Rate (vertical axis, 0.0 to 0.6) against Length of Side of Grid Square (horizontal axis, 0.00 to 0.25); series for Cluster Radius = 0.025, 0.05 and 0.1.]
[Figure 6: False Positive Rate (vertical axis, 0.00 to 0.20) against Length of Side of Grid Square (horizontal axis, 0.00 to 0.25); series for Cluster Radius = 0.025, 0.05 and 0.1.]