Technology and Infrastructure Support for Large Scale Information

Marcio Faerman
The Brazilian National Education and Research Network - RNP
[email protected]
www.rnp.br

Generating Large Data Collections
• Large data volumes can be generated much faster than they can be analyzed
  – Instrument observations
    • Particle accelerators (CERN LHC)
    • Telescopes, satellites
    • Sensor networks
    • Virtual observatories
  – Large model simulations
    • High resolution, very complex
• Scientific experiments
  – Medical imaging (fMRI): ~1 GByte per measurement (day)
  – Bio-informatics queries: 500 GByte per database
  – Satellite world imagery: ~5 TByte per year
  – Current particle physics: 1 PByte per year
  – LHC physics (2007): 10-30 PByte per year
  – LSST astronomy (2012): 5 PByte per year

Challenges Managing Large Volume Data
• Scalability
  – What works for small datasets does not necessarily work for large collections
• Data Integrity
  – At terabyte scale, failures and data corruption are very likely to occur
  – Is data provenance reliable?
• Efficiency
  – Data should be accessed at a rate that keeps work feasible
  – More data means a need for more speed
• Distributed Access
  – Data can be at a remote (and possibly unknown) location
• Infrastructure Management
  – Heterogeneous
  – Distributed
  – Prone to failures
  – Very complex

Challenges: Getting to Know Your Data
• Extract knowledge from raw data files
  – Data product derivation
    • Visualization
    • Relationships
    • Patterns
    • New derived quantities
  – Cross-institutional and cross-disciplinary collaborations
• What-if experiments
  – Your data with our model?
• Dataset Access
  – Multiple formats
    • Each sensor or simulation has its own storage format
  – Federated collections
  – Discovery by content

Technological Response
• Integration of compute, communication, storage and instrument resources into a powerful infrastructure
  – Information Grids
  – Very powerful infrastructure
  – Economy of scale
• Serves a broad range of customers
  – Biologists, physicists, government, industry
• Infrastructure is heterogeneous, distributed, very complex
• Middleware and data-oriented tools act as facilitators to tackle data management complexities

Open Access and Preservation Functionalities
• Federated Digital Libraries
  – Integration of distributed repositories
  – Access control: you can decide who can see the data
  – Organize the data in collections
  – Describe your data: metadata
• Data Grids
  – Access to efficient parallel I/O systems
  – Hierarchical systems
    • Disk caches, tapes
    • Often distributed
  – Analysis, data mining
  – Visualization
  – Workflow-based systems
  – Transaction-based data ingestion
    • Data provenance, data fingerprinting
  – What-if virtual lab
• End User Oriented Portals
  – "I deal with the data in the way it makes sense to me"

Middleware and Tools
• Data Management
  – Storage Resource Broker (SRB)
  – Globus Data Management
  – L-Store
  – IBP
  – Storage Resource Manager (SRM)
• Data Representation Libraries
  – HDF5
  – NetCDF
• Portals
  – OGCE
  – JSR 168

Today's Reality
• Exceptional achievements by early adopters
• Integration between domain scientists (data users and producers) is still a challenge
  – Need much more cross-disciplinary interaction
• Emphasis on scale and performance
• Failures are still a taboo
  – Frustration factor should be addressed in partnership with users
  – Focus on failure recovery and quality of service is getting more attention

Grid Initiatives around the World
e-Infrastructure Workshop, NUDI/USP, São Paulo, 07.05.2007
[Figure labels: UNAM, OurGrid, EELA, SINAPAD, SPRACE, HEPGrid, Ringrid, CL Grid, UCRAV]

Networking in Latin America
• CUDI-MX
• REACCIUN-VE
• RAAP-PE
• RNP-BR
• REUNA-CL

Brazilian National Research and Education Network - RNP
• In November 2005 the RNP networking infrastructure was entirely renovated. It consists of:
  – A multigigabit core connecting 10 capitals at 2.5 and 10 Gbps
  – Connections at 34 Mbps to 11 capitals
  – Connections of up to 16 Mbps to 6 capitals

Community Metropolitan Networks
• It is not enough to bring high-speed connectivity to each city; it must also reach the university campus or research lab.
• The metropolitan network is the solution
  – Infrastructure sharing to support:
    • Interconnection of the campuses of each partner institution
    • Access to the RNP national network backbone
  – This sharing substantially reduces deployment costs
  – Preferably, the infrastructure will be owned by the partners themselves (reducing operating costs)
• Pilot: the Metrobel project in the city of Belém do Pará in the Amazon region

Infrastructure for e-Science

Metrobel: Belém Metropolitan Network

Redecomep Project (2005-2007)
• Following Metrobel, the Brazilian Ministry of Science and Technology is supporting the Community Networks for Education and Research (Redecomep) Project with R$ 39.7 M (~US$ 19.0 M) through Finep (Dec 2004)
• Goals:
  – Extend the metropolitan optical network to 26 other cities with RNP points of presence
  – Promote integration within each metropolitan area
  – High-speed access to the RNP point of presence

Next Steps
• Integration between network, data repositories, compute, storage resources and applications
  – Identify who needs better connectivity
  – Develop the Brazilian cyberinfrastructure
  – Funding for infrastructure resources is generally uncoordinated
  – Funding agencies and partners need a broad vision of application requirements and cyberinfrastructure integration
• RNP is coordinating with scientific communities and infrastructure providers

e-Science/Infrastructure Initiative in Brazil
JRU-Brazil: 22 members in EELA-2

#   STATE  INSTITUTION                    E-SCIENCE COMMUNITIES
1   SP     CCE / USP                      (e-INFRASTRUCTURE only)
2   RJ     CEFET-RJ                       e-GOVERNMENT, e-INDUSTRY
3   RJ     FCM / UERJ                     BIOMED
4   RJ     FIOCRUZ                        BIOMED, e-EDUCATION
5   SP     IAG / USP                      CLIMATE
6   RJ     IME                            BIOMED
7   SP     INCOR / USP                    BIOMED
8   SP     INPE                           CLIMATE
9   RJ     LNCC                           BIOMED
10  RJ     ON                             PHYSICS
11  BR     RNP (NREN)                     (e-INFRASTRUCTURE only)
12  SP     SPRACE / UNESP                 PHYSICS
13  PB     UFCG                           CLIMATE, EARTH-SCIENCE
14  RJ     UFF                            (e-INFRASTRUCTURE only)
15  MG     UFJF                           BIOMED
16  MS     UFMS                           BIOMED
17  RS     UFRGS                          CLIMATE
18  RJ     UFRJ (coordinator for EELA-2)  BIOMED, PHYSICS, e-EDUCATION, CLIMATE
19  RS     UFSM                           CLIMATE
20  DF     UnB                            BIOMED
21  RJ     UNILASALLE                     e-EDUCATION
22  SP     UNISANTOS                      BIOMED, e-LEARNING, e-GOVERNMENT

Developing Together
• Information infrastructure is being redefined in Brazil and Latin America
• Now is the time to have as much cross-disciplinary interaction as possible to define needs, partnerships and investments
• Please contact us

THANK YOU!