Technology and Infrastructure Support for
Large Scale Information
Marcio Faerman
The Brazilian National Education and Research Network - RNP
[email protected]
www.rnp.br
Generating Large Data Collections
• Large Data Volumes can be generated much faster
than they can be analyzed
– Instrument Observations
• Particle Accelerators (CERN LHC)
• Telescopes, Satellites
• Sensor Networks
• Virtual Observatories
– Large Model Simulations
• High resolution, Very complex
• Scientific Experiments
– Medical imaging (fMRI): ~ 1 GByte per measurement (day)
– Bio-informatics queries: 500 GByte per database
– Satellite world imagery: ~ 5 TByte per year
– Current particle physics: 1 PByte per year
– LHC physics (2007): 10-30 PByte per year
– LSST Astronomy (2012): 5 PByte per year
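To put the yearly volumes above in perspective, the following sketch converts them to the sustained average rate needed merely to move the data once. It assumes decimal units (1 PByte = 10^15 bytes) and a 365-day year; the names `YEARLY_VOLUMES` and `sustained_rate_mbps` are illustrative and not from the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

# Yearly volumes quoted on the slide, in bytes (assumption: decimal units).
TBYTE = 10**12
PBYTE = 10**15
YEARLY_VOLUMES = {
    "Satellite world imagery": 5 * TBYTE,
    "Current particle physics": 1 * PBYTE,
    "LHC physics (low estimate)": 10 * PBYTE,
    "LHC physics (high estimate)": 30 * PBYTE,
    "LSST astronomy": 5 * PBYTE,
}


def sustained_rate_mbps(bytes_per_year: float) -> float:
    """Average rate, in megabits per second, needed to move the data once."""
    return bytes_per_year * 8 / SECONDS_PER_YEAR / 1e6


if __name__ == "__main__":
    for name, volume in YEARLY_VOLUMES.items():
        print(f"{name}: {sustained_rate_mbps(volume):,.0f} Mbps")
```

Under these assumptions, 1 PByte per year corresponds to roughly 250 Mbps sustained around the clock, and the upper LHC estimate to several Gbps, before any reprocessing or replication is counted.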
Challenges
Managing Large Volume Data
• Scalability
– What works for small datasets does not necessarily work for large collections
• Data Integrity
– At a terabyte scale failures and data corruption are very likely to occur
– Is data provenance reliable?
• Efficiency
– Data should be accessed at a rate which keeps work feasible
– More data – need for more speed
• Distributed Access
– Data can be at a remote (and possibly unknown) location
• Infrastructure Management
– Heterogeneous
– Distributed
– Prone to failures
– Very Complex
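The data integrity point can be made concrete with a little arithmetic. The sketch below assumes independent bit errors at a fixed rate; the default of one unrecoverable error per 10^14 bits read is an illustrative assumption, not a number from the talk, and real failure modes (correlated disk failures, network and memory errors) behave differently.

```python
import math


def prob_at_least_one_error(num_bytes: float, bit_error_rate: float = 1e-14) -> float:
    """Probability of at least one bit error when reading num_bytes once.

    Assumes independent errors, so P = 1 - (1 - p)^n. The expression is
    evaluated with log1p/expm1 so that very small rates stay accurate.
    bit_error_rate is an assumed, illustrative value.
    """
    num_bits = num_bytes * 8
    return -math.expm1(num_bits * math.log1p(-bit_error_rate))


if __name__ == "__main__":
    for label, size in [("1 GByte", 1e9), ("1 TByte", 1e12), ("100 TByte", 1e14)]:
        print(f"{label}: {prob_at_least_one_error(size):.4%}")
```

With the assumed rate, a single pass over 1 TByte has a chance of roughly 8% of hitting at least one error, and at 100 TByte an error is nearly certain, which is why checksums and repair procedures matter at this scale.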
Challenges – Getting to Know your Data
• Extract knowledge from raw data files
– Data product derivation
• Visualization
• Relationships
• Patterns
• New derived quantities
– Cross institutional and cross disciplinary collaborations
• What if experiments
– Your data with our model?
• Dataset Access
– Multiple formats
• Each sensor, simulation has its own storage format
– Federated collections
– Discovery by content
Technological Response
• Integration of compute, communication, storage and
instrument resources into a powerful infrastructure –
Information Grids
– Very powerful infrastructure
– Economy of scale
• Serves broad range of customers
– biologists, physicists, government, industry
• Infrastructure is heterogeneous, distributed, very
complex
• Middleware and Data Oriented tools act as facilitators
to tackle data management complexities
Open Access and Preservation Functionalities
• Federated Digital Libraries
– Integration of distributed repositories
– Access control – can decide who can see it
– Organize the data in collections
– Describe your data – Metadata
• Data Grids
– Access to efficient parallel I/O systems
– Hierarchical Systems
• Disk caches, tapes
• Often Distributed
– Analysis, Data Mining
– Visualization
– Workflow based systems
– Transaction based data ingestion
• Data provenance, Data fingerprinting
– What if virtual lab
• End User Oriented Portals
– "I deal with the data in the way it makes sense to me"
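One way to read "data provenance, data fingerprinting" is sketched below: each dataset gets a content fingerprint, and a derived product records the fingerprints of its inputs so that the derivation chain can be checked later. This is an illustration using only the Python standard library, not the mechanism of any system named in this talk; `fingerprint`, `ProvenanceRecord`, `ingest`, `derive`, and `verify` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(data: bytes) -> str:
    """Content fingerprint: SHA-256 digest of the raw bytes, as hex."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record: what the data is and where it came from."""
    name: str
    digest: str
    process: str = "ingest"
    inputs: tuple = ()  # fingerprints of the datasets this one was derived from


def ingest(name: str, data: bytes) -> ProvenanceRecord:
    """Register raw data with its fingerprint at ingestion time."""
    return ProvenanceRecord(name=name, digest=fingerprint(data))


def derive(name: str, data: bytes, process: str, sources: list) -> ProvenanceRecord:
    """Register a derived product, linking it to its source fingerprints."""
    return ProvenanceRecord(name, fingerprint(data), process,
                            tuple(s.digest for s in sources))


def verify(record: ProvenanceRecord, data: bytes) -> bool:
    """True if the bytes still match the fingerprint stored in the record."""
    return record.digest == fingerprint(data)


if __name__ == "__main__":
    raw = ingest("sensor-run-1", b"12.1,12.4,12.9")
    mean = derive("sensor-run-1-mean", b"12.47", "average", [raw])
    print(verify(raw, b"12.1,12.4,12.9"))  # unchanged data
    print(verify(raw, b"12.1,12.4,13.9"))  # corrupted data
    print(mean.inputs == (raw.digest,))
```

Keeping the fingerprint in the record, rather than recomputing it on demand, is what lets a later reader detect that a file changed after ingestion.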
Middlewares and Tools
• Data Management
– Storage Resource Broker (SRB)
– Globus Data Management
– L-Store
– IBP
– Storage Resource Manager (SRM)
• Data Representation Libraries
– HDF5
– NetCDF
• Portals
– OGCE
– JSR 168
Today’s Reality
• Exceptional achievements by early adopters
• Integration between domain scientists – data users
and producers still a challenge
– Need much more cross-disciplinary interaction
• Emphasis on scale and performance
• Failures are still a taboo
– Frustration factor should be addressed in partnership with
users
– Focus on failure recovery and quality of service getting more
attention
Grid Initiatives around the World
e-Infrastructure Workshop, NUDI/USP, São Paulo, 07.05.2007
[Figure labels: UNAM, OurGrid, EELA, SINAPAD, SPRACE, HEPGrid, Ringrid, CL Grid, UCRAV]
Networking in Latin America
[Figure labels: CUDI-MX, REACCIUN-VE, RAAP-PE, RNP-BR, REUNA-CL]
Brazilian National Research and Education Network - RNP
• In November 2005 the RNP networking infrastructure was entirely renovated. It consists of:
– A multigigabit core connecting 10 capitals at 2.5 and 10 Gbps
– Connections at 34 Mbps to 11 capitals
– Connections of up to 16 Mbps to 6 capitals
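The link speeds above translate into very different waiting times for a large dataset. The sketch below computes the ideal time to move 1 TByte at each speed quoted on the slide, assuming decimal units, a fully dedicated link, and no protocol overhead, so real transfers would be slower. `LINK_SPEEDS_MBPS` and `transfer_time_hours` are illustrative names.

```python
# Link speeds quoted on the slide, in megabits per second.
LINK_SPEEDS_MBPS = {
    "core at 10 Gbps": 10_000,
    "core at 2.5 Gbps": 2_500,
    "34 Mbps connection": 34,
    "16 Mbps connection": 16,
}

TBYTE = 10**12  # assumption: decimal units


def transfer_time_hours(num_bytes: float, link_mbps: float) -> float:
    """Ideal transfer time in hours: no overhead, link fully dedicated."""
    seconds = num_bytes * 8 / (link_mbps * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    for name, mbps in LINK_SPEEDS_MBPS.items():
        print(f"1 TByte over {name}: {transfer_time_hours(TBYTE, mbps):.1f} h")
```

Under these assumptions 1 TByte takes about 13 minutes at 10 Gbps and close to six days at 16 Mbps, which illustrates why the access speed of each site matters as much as the core.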
Community Metropolitan Networks
• It is not enough to bring high-speed connectivity to each city; it must also reach the university campus and the research lab.
• The metropolitan network is the solution
– Infrastructure sharing to support:
• Interconnection of the campuses of each partner institution
• Access to the RNP national network backbone
– This sharing substantially reduces deployment costs
– Preferably, the infrastructure will be owned by the partners themselves (reducing operating costs)
• Pilot: the Metrobel project in the city of Belém do Pará in the Amazon region
Infra-estrutura para e-Ciência (Infrastructure for e-Science)
Metrobel – Belém Metropolitan Network
Redecomep Project (2005-7)
• Following Metrobel, the Brazilian Ministry of Science and Technology is supporting the Community Networks for Education and Research (Redecomep) Project with R$ 39.7 M (~ US$ 19.0 M) through Finep (Dec. 2004)
• Goals:
– Extend the metropolitan optical network to 26 other cities with RNP points of presence
– Promote integration within each metropolitan area
– Provide high-speed access to the RNP point of presence
Next steps
• Integration between network, data repositories,
compute, storage resources and applications
– Identify who needs better connectivity
– Develop Brazilian cyberinfrastructure
– Funding for infrastructure resources is generally uncoordinated
– Funding agencies and partners need a broad view of application requirements and cyberinfrastructure integration
• RNP is coordinating with scientific communities and infrastructure providers on an e-Science/Infrastructure initiative in Brazil
JRU-Brazil: 22 members in EELA-2
#   STATE  INSTITUTION                    E-SCIENCE COMMUNITIES
1   SP     CCE / USP                      (e-INFRASTRUCTURE only)
2   RJ     CEFET-RJ                       e-GOVERNMENT, E-INDUSTRY
3   RJ     FCM / UERJ                     BIOMED
4   RJ     FIOCRUZ                        BIOMED, e-EDUCATION
5   SP     IAG / USP                      CLIMATE
6   RJ     IME                            BIOMED
7   SP     INCOR / USP                    BIOMED
8   SP     INPE                           CLIMATE
9   RJ     LNCC                           BIOMED
10  RJ     ON                             PHYSICS
11  BR     RNP (NREN)                     (e-INFRASTRUCTURE only)
12  SP     SPRACE / UNESP                 PHYSICS
13  PB     UFCG                           CLIMATE, EARTH-SCIENCE
14  RJ     UFF                            (e-INFRASTRUCTURE only)
15  MG     UFJF                           BIOMED
16  MS     UFMS                           BIOMED
17  RS     UFRGS                          CLIMATE
18  RJ     UFRJ (coordinator for EELA-2)  BIOMED, PHYSICS, e-EDUCATION, CLIMATE
19  RS     UFSM                           CLIMATE
20  DF     UnB                            BIOMED
21  RJ     UNILASALLE                     e-EDUCATION
22  SP     UNISANTOS                      BIOMED, E-LEARNING, e-GOVERNMENT
Developing Together
• Information infrastructure is being redefined in Brazil
and Latin America
• Now is the time to have as much cross-disciplinary
interaction as possible to define needs, partnerships
and investments
• Please contact us
THANK YOU!