1 / 30

Core Information Processing Technologies Technical Presentation & Demos

Core Information Processing Technologies Technical Presentation & Demos. Miha Gr čar ( Dep artment of Knowledge Technologies, Jožef Stefan Institute ) Achim Klein (University of Hohenheim). Technical WPs. WP1 & WP8. WP2 & WP7. Architecture, Integration & Scaling Strategy. UC#1

joylyn
Download Presentation

Core Information Processing Technologies Technical Presentation & Demos

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Core Information Processing TechnologiesTechnical Presentation & Demos Miha Grčar(DepartmentofKnowledge Technologies, Jožef Stefan Institute) Achim Klein (University of Hohenheim) Luxembourg, November 2011

  2. Technical WPs WP1 & WP8 WP2 & WP7 Architecture, Integration & Scaling Strategy UC#1 Market Surveillance UC#2 Reputational Risk management UC#3 Online Retail Brokerage Domain-independent GUI (Open Source) WP3 WP4 WP6 Management Dissemination & Exploitation Data Acquisition Data Acquisition Ontology Infrastructure Information Extraction Sentiment Analysis Decision Support Infrastructure We are here WP5 Information Integration Data, Information & Knowledge Base WP9 WP10 FIRST Y1 Review Meeting

  3. Data acquisition pipeline (Dacq) Boilerplate remover Boilerplate remover Boilerplate remover ZeroMQ emitter Semantic annotator Duplicate detector Natural language preproc. Language detector Language detector Duplicate detector Natural language preproc. Semantic annotator ZeroMQ emitter Language detector Duplicate detector Semantic annotator ZeroMQ emitter Natural language preproc. Loadbalancing RSS reader RSS reader . . . . . . processingpipelines RSS reader One readerper site FIRST Y1 Review Meeting

  4. Data acquisition pipeline (Dacq) Demo video (3:20) FIRST Y1 Review Meeting

  5. Data acquisition pipeline Boilerplate remover Boilerplate remover Boilerplate remover Duplicate detector Language detector Duplicate detector Semantic annotator ZeroMQ emitter Semantic annotator Language detector ZeroMQ emitter Language detector Duplicate detector Natural language preproc. Semantic annotator ZeroMQ emitter Natural language preproc. Natural language preproc. RSS reader RSS reader . . . . . . RSS reader FIRST Y1 Review Meeting

  6. Boilerplate removal Demo video (1:30)

  7. Data acquisition pipeline Boilerplate remover Boilerplate remover Boilerplate remover Duplicate detector Language detector Duplicate detector Semantic annotator ZeroMQ emitter Semantic annotator Language detector ZeroMQ emitter Language detector Duplicate detector Natural language preproc. Semantic annotator ZeroMQ emitter Natural language preproc. Natural language preproc. RSS reader RSS reader . . . . . . RSS reader FIRST Y1 Review Meeting

  8. Language detection • Motivation: language-specific text analysis components • Relatively simple problem • Solutions based on word or character sequences (language models) • Side effects: removes “garbage” and can be used to identify code page • Our implementation based on frequencies of character sequences Demo video (0:45) FIRST Y1 Review Meeting

  9. Data acquisition pipeline Boilerplate remover Boilerplate remover Boilerplate remover Duplicate detector Language detector Duplicate detector Semantic annotator ZeroMQ emitter Semantic annotator Language detector ZeroMQ emitter Language detector Duplicate detector Natural language preproc. Semantic annotator ZeroMQ emitter Natural language preproc. Natural language preproc. RSS reader RSS reader . . . . . . RSS reader FIRST Y1 Review Meeting

  10. Near-duplicate detection • Why is this a difficult problem? • We are dealing with millions of documents – cannot afford to compare every document with every document • We are also looking for near-duplicates, not only exact matches • Overlooked boilerplate “produces” false near-duplicates Demo video (1:00) FIRST Y1 Review Meeting

  11. Near-duplicate detection • Existing approaches like SimHash, shingling and sketching, SpotSigs… • Apart from SpotSigs, they require “clean” documents • Hard to interpret similarity value (how many characters, words, sentences?) • Developing a novel solution to remove boilerplate and detect duplicates [with clear interpretation] in the same framework FIRST Y1 Review Meeting

  12. Technical WPs WP1 & WP8 WP2 & WP7 Architecture, Integration & Scaling Strategy UC#1 Market Surveillance UC#2 Reputational Risk management UC#3 Online Retail Brokerage Domain-independent GUI (Open Source) WP3 WP4 WP6 Management Dissemination & Exploitation Information Extraction Data Acquisition Ontology Infrastructure Ontology Infrastructure Information Extraction Sentiment Analysis Decision Support Infrastructure We are here WP5 Information Integration Data, Information & Knowledge Base WP9 WP10 FIRST Y1 Review Meeting

  13. FIRST ontology SentimentObject FinancialInstrument Index  Stock_Index Stock Company Country Countries Companies Constituents (stocks) Seedindices FIRST Y1 Review Meeting

  14. FIRST ontology :NASDAQ_100 a :Stock_Index ; rdfs:label "NASDAQ-100" . :MICROSOFT a :Stock ; rdfs:label "MICROSOFT CORP COM USD0.00000625" ; :memberOf :NASDAQ_100 . :MICROSOFT_CORP a :Company ; rdfs:label "Microsoft Corp." ; :issues :MICROSOFT . :USA a :Country ; rdfs:label "USA" . :MICROSOFT_CORP :locatedIn :USA . :MICROSOFT_CORP :hasGazetteer :MICROSOFT_CORP_Gazetteer . :MICROSOFT_CORP_Gazetteer :hasTerm "Microsoft Corp" ; :hasTerm "Microsoft Corporation" ; :hasStopWord "CORP" ; :hasStopWord "CORPORATION" ; a :Gazetteer . Microsoft Corporation is engaged in developing, licensing and supporting a range of software products and services. Microsoft also designs and sells hardware, and delivers online advertising to the customers. Microsoft Corp FIRST Y1 Review Meeting

  15. FIRST ontology OrientationPhrase Financial Instrument correlationDefinition InfluencesObject Correlation Definition Sentiment Object objectHasCorrelationDefinition Company correlationDefinitionInfluencesIndicator correlationDefinition InfluencesFeature featureHasCorrelationDefinition indicatorHas CorrelationDefinition Technical Indicator MacroIndicator Price Feature Fundamental Volatility MicroIndicator Reputation

  16. Annotation pipeline Boilerplate remover Boilerplate remover Boilerplate remover Duplicate detector Language detector Duplicate detector Semantic annotator Language detector ZeroMQ emitter ZeroMQ emitter Semantic annotator Language detector Duplicate detector Natural language preproc. Semantic annotator ZeroMQ emitter Natural language preproc. Natural language preproc. Ontology-basedsemanticannotation RSS reader Demo video (3:00) RSS reader . . . . . . RSS reader FIRST Y1 Review Meeting

  17. Technical WPs WP1 & WP8 WP2 & WP7 Architecture, Integration & Scaling Strategy UC#1 Market Surveillance UC#2 Reputational Risk management UC#3 Online Retail Brokerage Domain-independent GUI (Open Source) WP3 WP4 WP6 Management Dissemination & Exploitation Data Acquisition Ontology Infrastructure Information Extraction Sentiment Analysis Sentiment Analysis Decision Support Infrastructure We are here WP5 Information Integration Data, Information & Knowledge Base WP9 WP10 FIRST Y1 Review Meeting

  18. Sentiment Analysis • Object: Sentiment in financial web texts • Problem: Classificationofsentimentorientationwithrespecttoexpectedfuture … • pricechangeoffinancialinstruments • volatilitychangeoffinancialinstruments • reputationchangeofcompanies • Approach: Knowledge-basedsentimentclassification • Startingatthesentence-level • Specifictofeaturesofobjects(e.g., reputationof a company)

  19. Example Shortterm: uptrendSupport for the SPX remains at 848 and then 789, with resistance at 912 and then 935. Short term momentum was overbought during the rally early in the week and is now displaying a positive divergence at friday's lows. Should the market fail to hold this pivot (SPX 840) in the days and weeks ahead the uptrend is likely over. Longterm: bear marketThe Cycle wave bear market of October 2007 continues. Thus far, equity markets worldwide have declined on average about 50%. The opportunity still remains for the US and World economies to avoid a devastatingSupercycle bear market like that of 1929-1932. • Ambiguity: ”The low clarity of messages implies that quite often people would be likely to disagree on the classification” [Das and Chen 2007]. • Identificationanddifferentiationofobjects (andfeatures) • Relationshipsofindicators (e.g., earnings) andobjects http://caldaroew.spaces.live.com/Blog/cns!D2CB8C5EBA2ADE86!27847.entry

  20. Manual Sentiment Annotation Topic FIRST Y1 Review Meeting

  21. FIRST Knowledge-basedSentiment Analysis Approach 1. Identify 2. Extract 3. Classifysentimentorientation {positive, negative} forall sentence-levelsentiments 4. Aggregate Rules,Ontology Allsentimentobjectsandfeatures Support for the SPX remains at 848 and then 789, with resistance at 912 … Support for the SPX remains at 848 and then 789, with resistance at 912 … Rules,Ontology All sentence-levelsentiments Scoring All sentiments in onedocument Document-levelsentiment score Averaging All sentimentscoresfor a givenday Sentiment Index  [-1,1]

  22. Sentence-levelSentiment Classification a) directly Example: „I expecttheS&P 500 torise“ positivesentiment  Addressedbyrules b) indirectly,via an indicator Example: „I thinkU.S. interestrateswill rise“ negativesentiment  Addressedbyontology

  23. Example Text: S&P 500 http://business.financialpost.com/2011/10/04/economic-uncertainty-could-fan-volatility/ Oct 4, 2011 – 3:24 PM ET The fourth quarter began on Monday with the broad S&P 500 on the precipice of a bear market and investors lacking confidence in either European or U.S. policymakers being able to stem the disquiet surrounding the debt crisis.Wall Street typically defines a bear market as a drop of 20 percent or more from a recent high. Volatility is at its most persistently elevated since the financial crisis of 2008, as measured by the popular VIX , or CBOE Volatility Index. Barring a knock-out U.S. earnings period in the next month, it could remain high, and investors should brace for wild swings and more down days.

  24. Sentiment Sentences on Price Change of S&P 500 Negative sentiment about the future price change of the S&P 500 FIRST Y1 Review Meeting

  25. Sentiment Sentences on Volatility Change of S&P 500 Positive sentiment about the future volatility change of the S&P 500 FIRST Y1 Review Meeting

  26. Document-level Sentiment with respect to multiple objects/features

  27. Initial Experiment Results • Accuracyofknowledge-basedsentimentclassification vs. standardmachinelearningmethods • Small manuallyclassifiedcorpus • Result: 7% moreaccurate • Portfolio selectionexperiment • Usesentimenttoselect Dow Jones stocks • Result: Excessreturnsseempossible  More information in paper: „Extracting Investor Sentiment from Weblog Texts: A Knowledge-basedApproach“, published in IEEE CEC 2011 conferenceproceedings FIRST Y1 Review Meeting

  28. Main Y1 Achievements • Data acquisitionsoftwarerunning • Sentiment analysisfor web texts • Sentence-level, andspecifictofeaturesofobjects • Initial experimentresultsare promising • Ontologyavailable (~4000 instances) • Sentence-levelannotatedcorpusavailable(900 documentsandgrowing) • DeliveredasD3.1andD4.1 • Bookchapteron dataacquisition in preparation • Paper on sentimentextraction (bestpaperawardat CEC 2011 conference) FIRST Y1 Review Meeting

  29. Next Steps • Improveontologyandgazetteers • Usecorpustoimprovesentimentclassification • Increasethroughputofsentimentextraction FIRST Y1 Review Meeting

  30. Thankyou 30

More Related