Data Management

David Nathan & Peter Austin & Robert Munro


Presentation Transcript


  1. Data Management David Nathan & Peter Austin & Robert Munro

  2. This section • Data management • Properties of data • Relational data model • XML • Example

  3. Workflows - description vs documentation [two diagrams; labels only] • Description: something happened; something inscribed; cleaned up, selected, analysed; you applied knowledge, made decisions (NOT OF INTEREST!); representations, lists, summaries, analyses; archived, presented, published • Documentation: something happened; recording; you applied knowledge, techniques, made decisions, applied linguistic knowledge (FOCUS OF INTEREST!); recapitulates; representations, eg transcription, annotation; archived & ... ??

  4. Data? • What is data? • Documentation data?

  5. What is data management? • use appropriate and shared-standard data encoding methods (e.g. Unicode) • model the data domains (units, processes) • use appropriate and standard data structure methods (= knowledge representation) • capture and document steps, decisions, conventions, structures • be consistent (= machine readable) • be aware of planning and flow of data • work with others and across systems • cater for archiving

  6. Example • e.g. if you need to collect/compare linguistic material according to speaker, then you need not only to make suitable recordings, but also to create suitable labels, metadata, annotation etc.

  7. Documenter & archive interactions

  8. Documenter & archive interactions

  9. Choosing values/priorities • Standards & compliance • Adeptness with tools • Modelling of phenomena, architecture of data • Dissemination/publishing • Preserving • Ethics, responsibility, protocol • Range, comprehensiveness • Intellectual rigour • Which are priorities? • Which are dispensable?

  10. A (thought) provoking example • \Indigenous title < > • \English title <The angry daughter> • \Language <Betta Kurumba> • \Duration <0:11:56> • \Description <A story about a king who had six sons and one daughter. The daughter, who has supernatural powers, gets angry with the family over an injust act done to her and runs away from the family, concealing herself as a spirit living in a well.> • \Rec_date <1999-11-20> • \Rec_location <Theppakkadu, Tamil Nadu, India> • \Indigenous speaker <B. Badsi, wife of BNHS Bomman> • \Collector <Gail Coelho> • \Genre <Folktale> • \Tape_medium <DAT> • \media_file <AngryD-Bsi.wav> • \annotation_file <AngryD-Bsi.pdf>

  11. A (thought) provoking example • 1. What is this? • 2. Where would it be found? • 3. What is it for? • 4. Why does it look like this? • 5. List two good points about it. • 6. List at least 2 bad points about it

  12. Data should be: • explicit • consistent • robust • meaningful • conventional • adaptable, convertible, machine readable etc • useful!
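
One way to test a record such as the one on slide 10 against these criteria is to try to read it mechanically. The following is a minimal sketch, not the presenters' method: the function names and the naming convention it checks (capitalised field names without spaces) are assumptions made for illustration, and only four of the slide's fields are copied in.

```python
import re

# Four fields copied from the slide 10 record; the other fields are omitted for brevity.
RECORD = r"""
\Indigenous title < >
\English title <The angry daughter>
\Rec_date <1999-11-20>
\media_file <AngryD-Bsi.wav>
"""

# One field per line: a backslash, a field name, then the value in angle brackets.
FIELD = re.compile(r"^\\(?P<name>[^<]+?)\s*<(?P<value>.*)>\s*$")


def parse_record(text):
    """Return a dict mapping field names to whitespace-trimmed values."""
    fields = {}
    for line in text.strip().splitlines():
        match = FIELD.match(line)
        if match is None:
            raise ValueError(f"not a field line: {line!r}")
        fields[match.group("name")] = match.group("value").strip()
    return fields


def naming_problems(fields):
    """Names that break the assumed convention: capitalised, no spaces."""
    return [name for name in fields if " " in name or not name[0].isupper()]


fields = parse_record(RECORD)
empty_fields = [name for name, value in fields.items() if not value]

print(empty_fields)             # ['Indigenous title']
print(naming_problems(fields))  # ['Indigenous title', 'English title', 'media_file']
```

The record is explicit enough to parse, but the check shows it is not fully consistent: one field is empty and the field names follow more than one convention.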

  13. Where do word processors fit in? • MS Word is not a good data management tool - although if used well, it can play a role. • Displayed text (with "Look at me!" in bold): Fragment of an attention-seeking Look at me! MS Word file • Underlying RTF: \pard\plain \s55\widctlpar \f4 Fragment of an attention-seeking {\b Look at me!} MS Word file \par \pard\plain • dual representations • WYSIWYG or WYSIAYG • distinguish structure and representation from presentation • ambiguities of typography: possible solution with styles • how would styles work in an RTF doc?

  14. “Portability” • Bird and Simons 2003: language documentation data needs to have integrity, flexibility, longevity

  15. “Portability” • complete • explicit • documented • preservable • transferable • accessible • adaptable • not technology-specific • (also appropriate, accurate, useful etc!!)

  16. Data management • the way that data is structured is also information, that may be complex • properly structured data allows: • usage including manipulation, conversion, derivation • preservation • machine readability

  17. Data management systems • a data management system is a system you design for storing data and metadata: • information about content and structures • relationship between units of information • it is not necessarily tied to any particular software, or even a computer

  18. Naive management using filenames • a (too) simple management system: • information about a recording is captured in the filenames: 1st_int_john_5Aug.wav market_conv_mj.wav …. • what does ‘int’ mean? • what information about the recording is missing?
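
The weakness of the filename approach shows up as soon as a program tries to read the names. The sketch below simply splits the two filenames from the slide on underscores; `split_filename` is a hypothetical helper, not part of any tool mentioned in the presentation.

```python
# The two filenames shown on the slide.
filenames = ["1st_int_john_5Aug.wav", "market_conv_mj.wav"]


def split_filename(filename):
    """Split a filename into its underscore-separated parts and its extension."""
    stem, _, extension = filename.rpartition(".")
    return stem.split("_"), extension


for name in filenames:
    parts, extension = split_filename(name)
    print(name, "->", parts)

# 1st_int_john_5Aug.wav -> ['1st', 'int', 'john', '5Aug']
# market_conv_mj.wav -> ['market', 'conv', 'mj']
```

The two names yield different numbers of parts, and nothing in the names says which part is a speaker, a date or a genre code, so the structure cannot be recovered reliably.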

  19. Data modeling • World/universe • Domain • Relevant entities, properties, relationships • We also need formal ways to represent these

  20. Data modeling • data modelling is the process of designing your data management system: • what information do you need to record? • what are the units of information? • what are their properties (attributes)? • what are the relationships between the units of information? • how is the information etc likely to change in the future? • how can all this be represented?
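
As an illustration of these questions, the sketch below writes down one possible model for recordings and speakers. The entity names, attributes and identifiers are assumptions chosen for the example (the file name, date and speaker name are taken from the slide 10 record); a real project would arrive at its own units and properties.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speaker:
    """One unit of information: a person who speaks on a recording."""
    speaker_id: str  # hypothetical identifier
    name: str


@dataclass
class Recording:
    """Another unit of information, related to one or more speakers."""
    recording_id: str  # hypothetical identifier
    media_file: str
    rec_date: str      # ISO 8601 date string, e.g. "1999-11-20"
    speakers: List[Speaker] = field(default_factory=list)


badsi = Speaker("spk1", "B. Badsi")
story = Recording("rec1", "AngryD-Bsi.wav", "1999-11-20", [badsi])

print([speaker.name for speaker in story.speakers])  # ['B. Badsi']
```

Writing the model down like this makes the decisions visible: what counts as a unit, which properties each unit has, and how the units are related.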

  21. Data management • two well-known formats for structured data: • relational database • eXtensible Markup Language (XML) • these are methods, not software or hardware • any other system for well-structured data could be OK, but generally it has: • a smaller community of users, so fewer tools and less support • ... so errors are more likely

  22. Databases • Note that database has 3 senses: • a body of related information • a type of software (eg Oracle, Access, Filemaker) • a model for the domain of information (i.e. a formulation of entities and relationships)

  23. Relational format • Uses tables • Table rows represent entities in a domain • Table columns represent properties/attributes of entities • Each cell represents one atomic unit of data • The order of rows and columns has no significance
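
These properties can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `speaker` table, its columns and the placeholder rows ("Speaker A", "Speaker B") are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Columns are the properties of the entity; each row will be one entity.
conn.execute(
    "CREATE TABLE speaker ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " language TEXT NOT NULL)"
)

# Rows are inserted in an arbitrary order; each cell holds one atomic value.
conn.executemany(
    "INSERT INTO speaker (id, name, language) VALUES (?, ?, ?)",
    [(2, "Speaker B", "Betta Kurumba"), (1, "Speaker A", "Betta Kurumba")],
)

# Row and column order carry no meaning, so a query names the columns it wants
# and states explicitly how the result should be sorted.
by_name = conn.execute("SELECT name, id FROM speaker ORDER BY name").fetchall()
print(by_name)  # [('Speaker A', 1), ('Speaker B', 2)]
```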

  24. Representing a relational design • simplest example: TABLE NAME (field name)

  25. Representing a relational design • less trivial entity: TABLE NAME (field 1, field 2)

  26. Representing a relational design • less trivial domain = one to many: CONTINENT (name), COUNTRY (name)

  27. Non-trivial domains • non-trivial domains have many-to-many relationships: AUTHOR (name, .....), SUBJECT (name, .....)
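
A common way to implement a many-to-many relationship is a third, linking table that holds pairs of identifiers. The slide does not show an implementation, so the sketch below is an assumption: the `author_subject` table, the `id` columns and the sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Linking table: one row per (author, subject) pair.
CREATE TABLE author_subject (
    author_id  INTEGER NOT NULL REFERENCES author(id),
    subject_id INTEGER NOT NULL REFERENCES subject(id),
    PRIMARY KEY (author_id, subject_id)
);

INSERT INTO author  VALUES (1, 'Author A'), (2, 'Author B');
INSERT INTO subject VALUES (1, 'Phonology'), (2, 'Syntax');
INSERT INTO author_subject VALUES (1, 1), (1, 2), (2, 2);
""")


def subjects_of(author_name):
    """Subjects linked to one author, via the linking table."""
    rows = conn.execute(
        "SELECT subject.name FROM subject"
        " JOIN author_subject ON subject.id = author_subject.subject_id"
        " JOIN author ON author.id = author_subject.author_id"
        " WHERE author.name = ? ORDER BY subject.name",
        (author_name,),
    ).fetchall()
    return [row[0] for row in rows]


def authors_of(subject_name):
    """Authors linked to one subject, via the same linking table."""
    rows = conn.execute(
        "SELECT author.name FROM author"
        " JOIN author_subject ON author.id = author_subject.author_id"
        " JOIN subject ON subject.id = author_subject.subject_id"
        " WHERE subject.name = ? ORDER BY author.name",
        (subject_name,),
    ).fetchall()
    return [row[0] for row in rows]


print(subjects_of("Author A"))  # ['Phonology', 'Syntax']
print(authors_of("Syntax"))     # ['Author A', 'Author B']
```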

  28. From model to implementation • implementing table relationships: CONTINENT (id, name), COUNTRY (id, name, continent_id)
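
The one-to-many design can be sketched in SQLite as follows. The table and field names follow the slide; the sample rows and the `countries_in` helper are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;  -- SQLite only enforces foreign keys when asked to

CREATE TABLE continent (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE country (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    continent_id INTEGER NOT NULL REFERENCES continent(id)
);
""")

conn.execute("INSERT INTO continent (id, name) VALUES (1, 'Asia')")
conn.execute("INSERT INTO country (id, name, continent_id) VALUES (1, 'India', 1)")

# A country that points at a continent which does not exist is rejected.
try:
    conn.execute("INSERT INTO country (id, name, continent_id) VALUES (2, 'Nowhere', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True


def countries_in(continent_name):
    """Follow continent_id from country back to continent."""
    rows = conn.execute(
        "SELECT country.name FROM country"
        " JOIN continent ON continent.id = country.continent_id"
        " WHERE continent.name = ? ORDER BY country.name",
        (continent_name,),
    ).fetchall()
    return [row[0] for row in rows]


print(rejected)              # True
print(countries_in("Asia"))  # ['India']
```

The relationship lives in a single field, `continent_id`: each country row stores the `id` of exactly one continent, while one continent can be referenced by many countries.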

  29. Designing a database • Determine the domain, entities and relationships • Experiment with scenarios • Any non-trivial model will evolve as it is thought out and tested • Normalisation is the process of refining models

  30. Practical example • Create a database model to record bicycle owners • Populate your database with 3 bicycle owners: • Alf • Betty • Cherie

  31. Extending ... • Cater for the brands of bicycle they own: • Alf: Dawes • Betty: Giant • Cherie: Malvern Star

  32. Testing ... • Dennis also has a Malvern Star

  33. Testing ... • Alf has two bicycles
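
One possible outcome of the bicycle exercise is sketched below; it is not the only valid model. A second owner of a Malvern Star suggests giving brands their own table, and an owner with two bicycles suggests a separate `bicycle` table. The table names are assumptions, and the brand of Alf's second bicycle is not given on the slides, so "Giant" is assumed here purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE owner (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE brand (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);

-- One row per bicycle, so an owner can have several and a brand can recur.
CREATE TABLE bicycle (
    id       INTEGER PRIMARY KEY,
    owner_id INTEGER NOT NULL REFERENCES owner(id),
    brand_id INTEGER NOT NULL REFERENCES brand(id)
);

INSERT INTO owner VALUES (1, 'Alf'), (2, 'Betty'), (3, 'Cherie'), (4, 'Dennis');
INSERT INTO brand VALUES (1, 'Dawes'), (2, 'Giant'), (3, 'Malvern Star');

-- The last row is Alf's second bicycle; its brand is an assumption.
INSERT INTO bicycle (owner_id, brand_id) VALUES (1, 1), (2, 2), (3, 3), (4, 3), (1, 2);
""")


def owners_of_brand(brand_name):
    """Names of everyone who owns at least one bicycle of the given brand."""
    rows = conn.execute(
        "SELECT DISTINCT owner.name FROM owner"
        " JOIN bicycle ON bicycle.owner_id = owner.id"
        " JOIN brand ON brand.id = bicycle.brand_id"
        " WHERE brand.name = ? ORDER BY owner.name",
        (brand_name,),
    ).fetchall()
    return [row[0] for row in rows]


def bicycle_count(owner_name):
    """Number of bicycles recorded for one owner."""
    row = conn.execute(
        "SELECT COUNT(*) FROM bicycle"
        " JOIN owner ON owner.id = bicycle.owner_id"
        " WHERE owner.name = ?",
        (owner_name,),
    ).fetchone()
    return row[0]


print(owners_of_brand("Malvern Star"))  # ['Cherie', 'Dennis']
print(bicycle_count("Alf"))             # 2
```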

  34. Simple relational example • don’t need to pack information into filenames: 1st_int_john_5Aug.wav market_conv_mj.wav • use a table in MS Word, Excel, Filemaker etc

  35. Structured data management • some information is about the data • some is about relationships between data

  36. Structured data management • a separate table should define these codes

  37. Structured data management • formalise the relationships within the data: • need unique identifiers

  38. Structured data management • formalise the relationships within the data: • need unique identifiers
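
The same ideas work without database software. The sketch below keeps the recordings in a plain CSV table, defines the codes in a separate table, and checks that identifiers are unique. Everything beyond the two filenames is an assumption: the column names, the `rec001`-style identifiers, the reading of `john` and `mj` as speakers, and the meanings given to `int` and `conv`, which the slides deliberately leave open.

```python
import csv
import io

# Main table: one row per recording, each with a unique identifier.
RECORDINGS = """id,file,type,speaker
rec001,1st_int_john_5Aug.wav,int,john
rec002,market_conv_mj.wav,conv,mj
"""

# Separate table defining the codes (the meanings are assumed for illustration).
TYPE_CODES = {"int": "interview", "conv": "conversation"}


def load_table(text):
    """Read CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def find_problems(rows, codes):
    """Report duplicate identifiers and codes missing from the code table."""
    problems = []
    seen_ids = set()
    for row in rows:
        if row["id"] in seen_ids:
            problems.append(f"duplicate id: {row['id']}")
        seen_ids.add(row["id"])
        if row["type"] not in codes:
            problems.append(f"undefined type code: {row['type']}")
    return problems


rows = load_table(RECORDINGS)
print(find_problems(rows, TYPE_CODES))  # []
```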

  39. DBMS software also handles • entry • value checking • deletion • manipulation • querying

  40. DBMS software • most use the ‘tables and keys’ model described here: • MS Access, Oracle, MySQL, Filemaker • they differ in what they additionally offer: • user interfaces (MS Access) • scalability, enforcement of data integrity (Oracle) • no licence cost (MySQL) • ease of manipulation (Filemaker)

  41. What does all this achieve? • conceptual/intellectual validity • scalable, searchable, modular • machine readable • in fact, portable: • complete • explicit • documented • preservable • transferable • accessible • adaptable • not technology-specific

  42. XML

  43. XML history • XML came out of SGML - a system for incremental and collaborative “enrichment” of texts • XML design principles • 1. XML shall be straightforwardly usable over the Internet. • 2. XML shall support a wide variety of applications. • 3. XML shall be compatible with SGML. • 4. It shall be easy to write programs which process XML documents. • 5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero. • 6. XML documents should be human-legible and reasonably clear. • 7. The XML design should be prepared quickly. • 8. The design of XML shall be formal and concise. • 9. XML documents shall be easy to create. • 10. Terseness is of minimal importance.

  44. XML • An in-line markup system • Single sequence of text only (but can be Unicode) • Reserved characters < > & " ' • Tag syntax • Entity syntax • Elements
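
Because these characters are reserved, they have to be written as entities when they occur in the data itself. The sketch below uses the standard-library helpers `xml.sax.saxutils.escape` and `quoteattr`; the `note` element, the `label` attribute and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw_text = "fish & chips: 1 < 2"
label = 'the "angry" daughter'

# escape() replaces &, < and > in text content with entities.
escaped = escape(raw_text)
print(escaped)  # fish &amp; chips: 1 &lt; 2

# quoteattr() escapes an attribute value and wraps it in suitable quotes.
document = "<note label=" + quoteattr(label) + ">" + escaped + "</note>"

# Parsing turns the entities back into the original characters.
note = ET.fromstring(document)
print(note.text == raw_text)       # True
print(note.get("label") == label)  # True
```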

  45. Definitions • What is an XML document? An XML document consists of sequences and hierarchies of elements and text • What does XML do? XML is a method for expressing languages (knowledge representation languages)

  46. Like HTML, except • emphasis on logical structure, not display properties • encourages human readability • documents must be well formed • no predefined elements - open and extensible

  47. XML concepts • XML can be thought of: • as a stream (eg: a stream of text) and/or • as a tree structure
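
Both views are available in Python's standard library, which makes the difference easy to see. In this sketch the sample sentence and the `s` wrapper element are assumptions; `iterparse` reports tags as they go past in the stream, while `fromstring` builds the whole tree first.

```python
import io
import xml.etree.ElementTree as ET

XML = b"<s>the <noun>dog</noun> chased the <noun>cat</noun></s>"

# Stream view: a sequence of start and end events, in document order.
events = [
    (event, element.tag)
    for event, element in ET.iterparse(io.BytesIO(XML), events=("start", "end"))
]
print(events)
# [('start', 's'), ('start', 'noun'), ('end', 'noun'),
#  ('start', 'noun'), ('end', 'noun'), ('end', 's')]

# Tree view: a root element whose children can be navigated and searched.
root = ET.fromstring(XML)
nouns = [noun.text for noun in root.findall("noun")]
print(nouns)  # ['dog', 'cat']
```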

  48. Elements • XML is a way of creating structures or “elements” using only plain text • elements are written via tags in angle brackets: eg: <noun> • tags are usually in pairs: • a start/open tag, and an end/close tag: the <noun> dog </noun> chased ... • but can also be single and self-closing (an empty element): the dog <pause /> sat down
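
Elements can also be built programmatically and then written out as plain text. The sketch below uses `xml.etree.ElementTree`; the `s` wrapper element and the sentence are assumptions, while `noun` and `pause` follow the slide.

```python
import xml.etree.ElementTree as ET

# A wrapper element containing text, a paired element and an empty element.
sentence = ET.Element("s")
sentence.text = "the "

noun = ET.SubElement(sentence, "noun")  # written as <noun>...</noun>
noun.text = "dog"
noun.tail = " chased "                  # text that follows the closing tag

ET.SubElement(sentence, "pause")        # no content, so written as <pause />

serialised = ET.tostring(sentence, encoding="unicode")
print(serialised)  # <s>the <noun>dog</noun> chased <pause /></s>
```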

  49. Attributes • tags can have attributes with values: the <noun num="1"> dog </noun> sat down • you can name your tags, attributes or values (almost) anything • there are some restrictions: • you can have hierarchies, but not overlaps: • well formed: <a>the<b><c>cat</c> sat</b> on the mat</a> • not well formed (overlap): <a>the<b><c>cat</b> sat</c> on the mat</a>
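
An XML parser enforces the no-overlap restriction. The sketch below feeds the slide's two bracketings to Python's standard parser; `is_well_formed` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


nested = "<a>the<b><c>cat</c> sat</b> on the mat</a>"       # proper hierarchy
overlapping = "<a>the<b><c>cat</b> sat</c> on the mat</a>"  # <b> closes before <c>

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False

# Attributes are available as name/value pairs on the parsed element.
noun = ET.fromstring('<noun num="1">dog</noun>')
print(noun.attrib)  # {'num': '1'}
```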

  50. Creating XML documents • You need to design/define/model the domain • Your design is a grammar for a particular type of XML document • The grammar can be expressed: • together with the data representation (inside the document itself) • independently, using a DTD or an XML schema
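
As a sketch of the first option, the document below carries its grammar with it as an internal DTD subset. The `utterance` grammar is invented for illustration. Note that Python's `xml.etree.ElementTree` reads the declarations but does not validate against them, so the final check of allowed child elements is a hand-written stand-in for what a validating parser would do.

```python
import xml.etree.ElementTree as ET

# The DOCTYPE block is the grammar; the utterance element is the data.
DOCUMENT = """<!DOCTYPE utterance [
  <!ELEMENT utterance (#PCDATA | noun | pause)*>
  <!ELEMENT noun (#PCDATA)>
  <!ATTLIST noun num CDATA #IMPLIED>
  <!ELEMENT pause EMPTY>
]>
<utterance>the <noun num="1">dog</noun> <pause/> sat down</utterance>"""

root = ET.fromstring(DOCUMENT)

# Hand-written stand-in for validation: ElementTree does not enforce the DTD.
ALLOWED_CHILDREN = {"noun", "pause"}
unexpected = [child.tag for child in root if child.tag not in ALLOWED_CHILDREN]

print(root.tag)    # utterance
print(unexpected)  # []
```

Keeping the grammar in a separate DTD or schema file instead lets many documents share one definition.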
