
Effect of spatial resolution on cluster detection: a simulation study

International Journal of Health Geographics

This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formatted PDF and full text (HTML) versions will be made available soon.

International Journal of Health Geographics 2007, 6:52

doi:10.1186/1476-072X-6-52

Al Ozonoff (aozonoff@bu.edu)
Caroline Jeffery (cjeffery@hsph.harvard.edu)
Justin Manjourides (jmanjour@hsph.harvard.edu)
Laura Forsberg White (lfwhite@bu.edu)
Marcello Pagano (pagano@hsph.harvard.edu)

ISSN: 1476-072X
Article type: Methodology
Submission date: 7 August 2007
Acceptance date: 27 November 2007
Publication date: 27 November 2007
Article URL: http://www.ij-healthgeographics.com/content/6/1/52

This peer-reviewed article was published immediately upon acceptance. It can be downloaded, printed and distributed freely for any purposes (see copyright notice below).

Articles in IJHG are listed in PubMed and archived at PubMed Central.

For information about publishing your research in IJHG or any BioMed Central journal, go to http://www.ij-healthgeographics.com/info/instructions/

For information about other BioMed Central publications go to http://www.biomedcentral.com/

© 2007 Ozonoff et al., licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Al Ozonoff¹,², Caroline Jeffery², Justin Manjourides², Laura Forsberg White¹,², Marcello Pagano*²

¹ Department of Biostatistics, Boston University School of Public Health, 715 Albany Street, Boston, MA 02118, U.S.A.

² Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115, U.S.A.

Email: Al Ozonoff - aozonoff@bu.edu; Caroline Jeffery - cjeffery@hsph.harvard.edu; Justin Manjourides - jmanjour@fas.harvard.edu; Laura Forsberg White - lfwhite@bu.edu; Marcello Pagano* - pagano@hsph.harvard.edu

* Corresponding author

Abstract

Background: Aggregation of spatial data is intended to protect privacy, but some effects of aggregation on spatial methods have not yet been quantified.

Methods: We generated 3,000 spatial data sets and evaluated power of detection at 12 different levels of aggregation using the spatial scan statistic implemented in SaTScan v6.0.

Results: Power to detect clusters decreased from nearly 100% when using exact locations to roughly 40% at the coarsest level of spatial resolution.

Conclusions: Aggregation has the potential for obfuscation.

1 Introduction

The Centers for Disease Control and Prevention (CDC) define surveillance to be the ongoing, systematic collection, analysis, interpretation, and dissemination of data about a health-related event for use in public health action to reduce morbidity and mortality and to improve health [1]. To control and prevent disease, it is surely important to be vigilant for infectious disease outbreaks or geographic areas of notably high chronic disease incidence. Indeed this is a primary aim of public health surveillance, and explains in part why surveillance plays an integral role in public health practice [2].

When caring for a single patient, the clinician understandably desires as much diagnostic information as possible, and at the highest possible level of precision. Analogously, a public health professional is concerned with diagnosing a public ailment, and should similarly desire all available information with the greatest possible level of precision. Thus it is noteworthy, in the context of public health surveillance, that for reasons of privacy, information is sometimes destroyed or intentionally degraded before being proffered to the analyst.

The argument to protect patient data for reasons of privacy could also be used to shield these data from clinicians. In a clinical setting, we choose not to protect the privacy of the patient by hiding relevant information from the clinician, because it is patently silly to do so. However, we often suffer from a similarly framed argument to obscure population level data, even when addressing matters of concern to the public health.

We argue that one important reason to retain important, specific information such as precise location is that the “requisite” aggregation for privacy necessarily reduces the power available for outbreak detection. To balance the cost of this and other troubles for spatial analysis [3], aggregation does indeed make it more difficult to identify individual patients. This is crucial if the data are made publicly available or if there are other reasons to safeguard privacy, but it also makes an already challenging surveillance task even more difficult.

A growing body of literature addresses statistical protection of privacy and its effects on analysis of surveillance data. Cox has written a useful survey of the general problem of confidentiality within small geographic areas, and the impacts of privacy concerns on public health policy and practice [4]. Armstrong et al. thoroughly discuss the design and implementation of several different approaches to protect privacy in the context of spatial analyses [5]. Importantly, methods were evaluated both on the impact on analysis as well as the effectiveness of preserving confidentiality. Yet the restriction of the quantitative assessment to the Cuzick-Edwards test statistic [6], which is no longer commonly used for spatial surveillance [7,8], limits the application of this knowledge to a surveillance setting. Further, data with exact locations were not considered for this evaluation.

Waller and colleagues have written extensively on factors that may influence power of cluster detection methods. For example, they have studied the effects of geographic scale on focused tests of clustering [9,10], and the importance of cluster location amidst a heterogeneous underlying population [11]. Notably, this group has investigated more than one statistical method, using several different measures for evaluation. However these studies generally use focused tests of clustering, where a putative exposure source has been identified a priori, whereas surveillance purposes typically require a general test of clustering [12].

Just as we trust clinicians and hospital personnel with sensitive and confidential information, so too, one can argue, we should find trustworthy individuals to handle surveillance data responsibly. Informatics-based approaches offer a potential compromise to the trade-off between privacy and surveillance utility. For example, development of automated surveillance algorithms might allow sensitive data to be analyzed without human intervention [13]. But in order to evaluate the benefit that such an approach might provide, we must first better understand the costs in performance that the obfuscation or destruction of information may cause.

We reported briefly [14] that there is an undesirable loss of power to detect disease outbreaks when the spatial information provided is degraded from a continuous scale of measurement to a coarser, aggregate level. For example, often only a patient’s ZIP code is available to a surveillance system, instead of the patient’s listed residential address. Similar results have appeared in contemporaneous work [15], and a recent paper by the same group further confirms this basic premise [16]. However, those studies focused solely on exact locations compared to a single level of aggregation.

In our present work, we add to these previous results by considering multiple levels of aggregation. Using synthetic data, we systematically quantify the loss of cluster detection performance as a function of spatial resolution, while limiting confounding influences from a variety of complex factors that affect spatial analyses. We may interpret these results relative to geographic scales we might encounter while surveilling a large metropolitan city. In this way, we attempt to clarify the price one pays for aggregation, and in turn to better inform future policy decision-makers.

2 Methods

2.1 Data

We designed a simulation study to determine the effect of spatial aggregation on power to detect spatial clusters. Random samples of size 90 were drawn from an underlying uniform distribution on the unit disk (i.e. the Euclidean circle of radius one). Atop this background sample, we then superimpose a simulated cluster consisting of 10 points uniformly distributed in a small square at a location randomly determined for each simulated data set (Figure 1). Thus each simulated data set consists of a total sample of 100 points. Although the clusters are not defined by circles, for ease of discussion we speak of a cluster “radius” to mean the radius of the circle inscribed within the square cluster boundary. In the occasional instance where the cluster center falls within one radius of the unit disk boundary, we require that all 10 cluster points lie within the intersection of the cluster boundary and the unit disk.
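The data-generating procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the study: the function name `simulate_dataset` is hypothetical, the cluster center is assumed to be drawn uniformly on the unit disk (the text says only that its location is random), and rejection sampling is one possible way to keep cluster points inside the disk.

```python
import numpy as np

def simulate_dataset(cluster_radius, rng, n_background=90, n_cluster=10):
    """Return (points, is_cluster, center) for one simulated data set."""
    # Background: uniform on the unit disk. Taking the square root of a
    # uniform variable for the radius makes the density uniform in area.
    r = np.sqrt(rng.uniform(0.0, 1.0, n_background))
    theta = rng.uniform(0.0, 2.0 * np.pi, n_background)
    background = np.column_stack((r * np.cos(theta), r * np.sin(theta)))

    # Cluster center: assumed uniform on the unit disk (an assumption).
    rc = np.sqrt(rng.uniform())
    tc = rng.uniform(0.0, 2.0 * np.pi)
    center = np.array([rc * np.cos(tc), rc * np.sin(tc)])

    # Cluster points: uniform in the square with half-side cluster_radius,
    # redrawing any point that falls outside the unit disk.
    cluster = []
    while len(cluster) < n_cluster:
        p = center + rng.uniform(-cluster_radius, cluster_radius, 2)
        if p @ p <= 1.0:
            cluster.append(p)

    points = np.vstack((background, np.array(cluster)))
    is_cluster = np.arange(len(points)) >= n_background
    return points, is_cluster, center
```

Calling `simulate_dataset(0.05, np.random.default_rng(0))` would return 100 points, the last 10 of which are flagged as cluster points.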

We generated three separate sets of simulated data with cluster radii of 0.025, 0.05 and 0.10, corresponding to disease clusters with a geographical extent equal to 2.5%, 5%, or 10% respectively of the radius of the study area. Although this results in clusters of different intensities, the corresponding relative risks are quite large (greater than 10) for all simulations. For each cluster radius, we generate 1,000 data sets under these conditions, or a total of 3,000 data sets for the entire simulation study.

To simulate spatial aggregation at different geographic scales, we use a sequence of 12 uniform grids of varying spacing, superimposed on the unit disk. The levels of aggregation are chosen according to their corresponding grid spacing, ranging from 15 grid squares per side (length of grid square 0.067) to four grid squares per side (length of grid square 0.25). Throughout, we use the average distance between grid points (equivalently, the average diameter of an aggregation region) as an index of the level of spatial aggregation (Figure 2).
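One way to picture aggregation on a uniform grid is to snap every point to the nearest grid point. The sketch below is illustrative only: the helper names `grid_points` and `aggregate` are hypothetical, the grid is parametrized directly by the length of a grid square, it is assumed to cover the bounding square of the unit disk with grid points at cell centers, and the grid points are left unperturbed.

```python
import numpy as np

def grid_points(spacing):
    """Cell centers of a uniform square grid of the given spacing on [-1, 1]^2."""
    n = int(np.ceil(2.0 / spacing))
    centers_1d = -1.0 + spacing * (np.arange(n) + 0.5)
    gx, gy = np.meshgrid(centers_1d, centers_1d)
    return np.column_stack((gx.ravel(), gy.ravel()))

def aggregate(points, centers):
    """Move each point to its nearest grid point.

    Returns the aggregated coordinates and the index of the region
    (grid point) each original point was assigned to.
    """
    # Squared distance from every point to every grid point.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    region = d2.argmin(axis=1)
    return centers[region], region
```

With a grid square of length 0.25, no point moves more than 0.125 in either coordinate, which gives a feel for how much positional information the coarsest level discards.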

By assigning all simulated data points to the nearest grid point, these grids thereby define spatial regions of aggregation. Prior to analysis, we modified each grid by adding small amounts of bivariate jitter to each grid point (i.e. region center). Our purpose was to mitigate the high degree of spatial regularity across a uniform grid of assignment points, and in part to reflect the non-uniform nature of administrative regions as they appear in real systems. We note however that the use of a uniform population distribution implies constant population densities across administrative regions, something unlikely to be seen in a real system.

2.2 Statistical analysis

We use SaTScan version 6.0 (2005) with a purely spatial Bernoulli model, with cluster size constrained to be no greater than 25% of the population. Statistical significance of spatial clusters is determined using a nominal Type I error rate of 0.05.

Our primary outcome is the proportion of simulated data sets, under each level of aggregation, for which SaTScan accurately detects the simulated cluster. We denote this proportion as the power to detect clusters. In order to ensure that the cluster detected by SaTScan is sufficiently close in space to the true cluster location, we record a detection as successful if and only if the identified cluster center is within one cluster radius of the true cluster center. We also record the proportion of false detections, defined as any cluster identification with center more than one cluster radius from the true cluster center, or failure of any identified cluster to achieve significance level (i.e. p-value) below 0.05.
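The success criterion can be expressed as a small classification rule. The following sketch is an illustration, not the study's code: the function names are hypothetical, the inputs (an identified cluster center and its p-value) are assumed to have been extracted from the scan output beforehand, and a successful detection is assumed to require significance as well as proximity. Because the definition of a false detection can be read in more than one way, the two non-success outcomes are kept separate here rather than merged.

```python
import numpy as np

def classify_detection(found_center, p_value, true_center, cluster_radius, alpha=0.05):
    """Label one scan result as 'success', 'misplaced', or 'not_significant'."""
    if p_value >= alpha:
        return "not_significant"
    dist = np.linalg.norm(np.asarray(found_center) - np.asarray(true_center))
    # Success requires the identified center to lie within one cluster
    # radius of the true center.
    return "success" if dist <= cluster_radius else "misplaced"

def power(labels):
    """Proportion of simulated data sets with a successful detection."""
    return sum(label == "success" for label in labels) / len(labels)
```

Applied to the 1,000 labels obtained for one cluster radius at one level of aggregation, `power` gives a single point on a power curve.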

To measure the spatial accuracy of cluster detection, we further consider the identification (correctly or not) of individual data points in a significant disease cluster. Within each simulated data set, there were 10 points of 100 that comprised the simulated cluster. For these “cluster points”, we calculate the proportion correctly included in a SaTScan-identified cluster with p-value below 0.05. Similarly for the remaining 90 “non-cluster points”, we calculate the proportion incorrectly included in a statistically significant SaTScan-identified cluster. These proportions are analogous to traditional definitions of sensitivity and 1 minus specificity, respectively, where we compare the classification via SaTScan of points involved in a cluster to the “gold standard” of cluster status as determined by simulation design.
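These two point-level proportions are straightforward to compute once each point carries a true label and a detected label. The sketch below assumes two boolean arrays per data set: `is_cluster` (the simulation design) and `flagged` (membership in a significant identified cluster). Both names, and the function name, are hypothetical.

```python
import numpy as np

def point_level_accuracy(is_cluster, flagged):
    """Return (sensitivity, false_positive_fraction) for one data set."""
    is_cluster = np.asarray(is_cluster, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    # Sensitivity: share of true cluster points that were flagged.
    sensitivity = (flagged & is_cluster).sum() / is_cluster.sum()
    # 1 minus specificity: share of non-cluster points that were flagged.
    false_positive_fraction = (flagged & ~is_cluster).sum() / (~is_cluster).sum()
    return sensitivity, false_positive_fraction
```

For a data set in which 8 of the 10 cluster points and 9 of the 90 background points are flagged, this yields a sensitivity of 0.8 and a false positive fraction of 0.1.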

3 Results

Figures 3 through 6 illustrate our results. For all three sets of simulations, power decreases as the size of aggregation regions increases. These simulated clusters are sufficiently large so that the power to detect for all three cluster radii is nearly 100% when exact locations are used; this decreases to roughly 40% at the coarsest level of aggregation, which corresponds to a more than halving of the probability of successful detection (Figure 3).

Using exact locations, the false detection rate is approximately 2%. In the presence of any level of aggregation, the false detection rate increases to nearly 20% or higher in all of our simulations (Figure 4). This rate appears to increase slowly for greater levels of aggregation.

We further evaluate the effect of aggregation on the sensitivity and specificity of SaTScan (Figures 5 and 6). While performance is nearly ideal when using exact locations, the proportion of false negatives rises to almost 50% at the coarsest level of aggregation. In concordance with our earlier results, sensitivity tends to decrease as spatial aggregation increases, while the false positive fraction (1 minus specificity) follows an inverse and nearly monotonic association.

4 Discussion

Our results are noteworthy for a number of reasons. First, we have used more than two levels of aggregation in an effort to estimate the incremental effect of this aggregation on the power of cluster detection. Second, we have further investigated the effect of aggregation on the rate of false detection. Finally, when viewed in the context of similar studies, our results add to a body of evidence that the underlying relationships reported appear robust to differing geographies and population distributions.


Our calculation of power and false detection differs from the same measures as otherwise used in an important way. We expect a certain proportion of spurious “clusters” to arise by chance alone. Thus we have placed an additional requirement on what we denote a successful identification of a cluster, namely that the identified cluster be proximal to the true cluster as determined by the simulation design. Because our simulations involve only one cluster per data set, an identification far from the true cluster is genuinely spurious and must be considered a false detection in this context. Indeed, for practical purposes such an identification might divert resources for investigation to a geographic area not related to the true outbreak or cluster present in the data.

To place our results in context, consider the metropolitan Boston area. The city and adjacent suburbs can be enclosed in a circle of radius roughly 7,500 meters. Although the size of city ZIP codes and census tracts varies, an approximate median radius for Boston ZIP codes is roughly 1,500 meters, or 20% of the region radius. Boston census tracts have an approximate median radius of 500 meters, or 6.7% of the region radius. Thus census tract and ZIP code aggregation of Boston data corresponds roughly to our first and penultimate levels of aggregation respectively. Likewise, the simulated clusters of radii 0.025, 0.05, and 0.10 correspond to disease outbreaks smaller than one census tract, about one census tract, or several census tracts (perhaps a small ZIP code) respectively.
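The correspondence between simulation units and the Boston example is a simple rescaling by the region radius. The short sketch below reproduces that arithmetic using only the approximate figures quoted above; the constant and function names are hypothetical.

```python
REGION_RADIUS_M = 7500.0  # approximate radius of metropolitan Boston, from the text

def to_unit_disk(length_m):
    """Express a length in meters as a fraction of the region radius."""
    return length_m / REGION_RADIUS_M

def to_meters(length_units):
    """Express a length in unit-disk units as meters."""
    return length_units * REGION_RADIUS_M

zip_radius = to_unit_disk(1500.0)    # 0.20 of the region radius
tract_radius = to_unit_disk(500.0)   # about 0.067 of the region radius

# Simulated cluster radii in meters: about 187.5, 375 and 750.
cluster_radii_m = [to_meters(r) for r in (0.025, 0.05, 0.10)]
```

On this scale the smallest simulated cluster (about 190 meters) is well under the 500-meter census tract radius, while the largest (750 meters) spans more than one tract but remains below the 1,500-meter ZIP code radius.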

The number of false detections rose well above the nominal alpha level when spatial data were aggregated. Interestingly, the level of aggregation does not appear to be a major contributor to false alarms; rather, there is an immediate increase upon aggregation above the nominal false alarm rate, with little additional increase for further aggregation. To our knowledge, this has not been reported previously. Since false alarms form a major limitation to the actionable consequences of cluster detection, this issue should be considered carefully. Even in situations where loss of power is not severe, the increase in false detection rates may impose further limits on the utility of spatial methods when using aggregated data.

Our study is limited in several ways. We have only included an evaluation of SaTScan as a test of clustering, although we have seen similar results using other methods [14]. The use of synthetic data is both helpful and harmful to generalizability of results. There are few populations that even approximate a homogeneous and uniform distribution, and thus the simulated data sets do not reflect a realistic surveillance scenario. However, using a homogeneous distribution removes some of the potentially confounding interactions between cluster location, geography, population distribution, and spatial methods. Thus despite its limitations, our study contributes to an understanding of the complex association between spatial resolution and power of detection.


We chose not to investigate spatio-temporal methods (implemented for example with a space-time scan, also available using SaTScan). Space-time interactions imply greater complexity when considering effects of spatial aggregation (or indeed, temporal aggregation), and the potential parameter space of simulation studies increases greatly as well. For this and other reasons, the effect of spatial aggregation (or indeed, temporal aggregation) in a cluster detection context remains an area for further investigation.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AO and MP conceived of the study, participated in the design, and drafted the manuscript. AO, CJ, and JM were responsible for statistical programming and data analysis. All authors read and approved the final manuscript.

Acknowledgements

Research partially supported by NIH grants R01-AI51164 and R01-EB006195.

References

1. Teutsch SM, Churchill RE: Principles and Practice of Public Health Surveillance. Oxford Univ Press 2000.
2. Brookmeyer R, Stroup D (Eds): Monitoring the Health of Populations: Statistical principles and methods for public health surveillance. Oxford Univ Press 2004.
3. Grubesic T, Matisziw T: On the use of ZIP codes and ZIP code tabulation areas (ZCTAs) for the spatial analysis of epidemiological data. Intl J Health Geographics 2006, 5:1–15.
4. Cox L: Protecting confidentiality in small population health and environmental statistics. Statistics in Medicine 1996, 15:1895–1905.
5. Armstrong M, Rushton G, Zimmerman D: Geographically masking health data to preserve confidentiality. Statistics in Medicine 1999, 18:497–525.
6. Cuzick J, Edwards R: Spatial clustering for inhomogeneous populations. J Royal Statist Soc B 1990, 52:73–104.
7. Waller L, Gotway C: Applied Spatial Statistics for Public Health Data. Wiley 2004.
8. Lawson A: Statistical Methods in Spatial Epidemiology 2ed. Wiley 2006.
9. Waller L, Lawson A: The power of focused tests to detect disease clustering. Statistics in Medicine 1995, 14:2291–2308.
10. Waller L: Statistical power and design of focused clustering studies. Statistics in Medicine 1996, 15:765–782.
11. Waller L, Hill E, Rudd R: The geography of power: Statistical performance of tests of clusters and clustering in heterogeneous populations. Statistics in Medicine 2006, 25:853–865.
12. Lawson A, Kleinman K (Eds): Spatial and Syndromic Surveillance for Public Health. Wiley 2005.
13. Boulos M, Cai Q, Padget J, Rushton G: Using software agents to preserve individual health data confidentiality in micro-scale geographic analyses. Journal of Biomedical Informatics 2006, 39:160–170.
14. Jeffery C, Ozonoff A, Forsberg L, Nuno M, Pagano M: The cost of obfuscation when reporting locations of cases in syndromic surveillance systems. Advances in Disease Surveillance 2006, 1:36.
15. Cassa C, Grannis S, Overhage J, Mandl K: A novel, context-sensitive approach to anonymizing spatial surveillance data: impact on outbreak detection. Advances in Disease Surveillance 2006, 1:10.
16. Olson K, Grannis S, Mandl K: Privacy protection versus cluster detection in spatial epidemiology. Am J Public Health 2006, 96:2002–2008.

Figures

Figure 1 - Illustration of a simulated cluster

90 points were distributed uniformly on the unit circle, and 10 additional “outbreak” points form the square “cluster” left of center.

Figure 2 - Illustration of spatial aggregation

One of 12 levels of spatial aggregation used in this study. Grid lines define spatial regions of aggregation, and representative points are chosen randomly within each region. All simulated points are reassigned to the representative point of the appropriate region.

Figure 3 - Effect of aggregation on power

As spatial data are aggregated, power to detect clusters decreases. Horizontal axis denotes level of spatial aggregation, determined by radius of aggregation region; vertical axis denotes proportion of simulated clusters correctly identified at significance level α = 0.05.

Figure 4 - Effect of aggregation on false detection rate

Vertical axis denotes proportion of simulations where spurious clusters are detected.

Figure 5 - Effect of aggregation on sensitivity

Identification of cases involved in an outbreak becomes more difficult as data are aggregated. Vertical axis denotes proportion of cases falsely identified as outside the disease cluster (false negatives).

Figure 6 - Effect of aggregation on specificity

Vertical axis denotes proportion of cases falsely identified as inside the cluster (false positives).


[Figure 1: image not reproduced; both axes range from −1.0 to 1.0]

[Figure 2: image not reproduced; both axes range from −1.0 to 1.0]

[Figure 3: Power (0.0 to 1.0) versus Length of Side of Grid Square (0.00 to 0.25); series: Cluster Radius = 0.025, 0.05, 0.1]

[Figure 4: FDR (0.0 to 0.4) versus Length of Side of Grid Square (0.00 to 0.25); series: Cluster Radius = 0.025, 0.05, 0.1]

[Figure 5: False Negative Rate (0.0 to 0.6) versus Length of Side of Grid Square (0.00 to 0.25); series: Cluster Radius = 0.025, 0.05, 0.1]

[Figure 6: False Positive Rate (0.00 to 0.20) versus Length of Side of Grid Square (0.00 to 0.25); series: Cluster Radius = 0.025, 0.05, 0.1]
