Giga-Mining

Corinna Cortes and Daryl Pregibon, AT&T Labs-Research. Presented by: Kevin R. Gee, 28 October 1999.

Presentation Transcript


  1. Giga-Mining Corinna Cortes and Daryl Pregibon AT&T Labs-Research Presented by: Kevin R. Gee 28 October 1999

  2. Case Study • Statistical modeling • Processing of multi-GB databases • Data warehousing • Prediction and classification • User interfaces

  3. Three Goals • Perform meaningful mining on multiple GB of data every day • Classify telephone numbers (TNs) as business or residential (pattern deviation, etc.) • Maintain operational data for each phone number.

  4. Quantity of data • 1997: 275 million phone calls per weekday, for a total of 76 billion over the whole year • 65M unique TNs per weekday • 350M unique TNs over a 40-day period • “Universe list”: Set of all TNs observed on the network, each with a 7-byte profile

  5. Contents of each profile • Inactivity: number of days since the TN was last used • Minutes of use: average daily minutes the TN is observed on the network • Frequency: estimated number of days between observations of the TN • “Bizocity”: how business-like the TN's behavior is • Stored for inbound/outbound, toll/toll-free

  6. Calculation of each variable • Inactivity: Set to 0 if the TN is observed today, and incremented by 1 if it is not. • Other variables are calculated via an exponentially weighted average: • $X(\mathrm{TN})_{\text{new}} = \lambda\, X(\mathrm{TN})_{\text{today}} + (1-\lambda)\, X(\mathrm{TN})_{\text{old}}$, with $0 < \lambda < 1$
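
The two update rules on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names and the value `LAMBDA = 0.15` are assumptions made for the example, since the slides only state that λ lies strictly between 0 and 1.

```python
LAMBDA = 0.15  # assumed value; the slides only require 0 < lambda < 1


def update_inactivity(inactivity: int, observed: bool) -> int:
    """Reset to 0 when the TN is seen today, otherwise count one more idle day."""
    return 0 if observed else inactivity + 1


def ewma_update(old: float, today: float, lam: float = LAMBDA) -> float:
    """Blend today's value into the stored one: new = lam*today + (1-lam)*old."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lambda must lie strictly between 0 and 1")
    return lam * today + (1.0 - lam) * old


# Hypothetical TN: stored average of 10 minutes per day, 30 minutes seen today.
minutes = ewma_update(old=10.0, today=30.0)
idle_days = update_inactivity(inactivity=4, observed=True)
```

Only the previous value and today's value are needed for each update, so a profile never has to keep a history of daily values.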

  7. Provides an estimate as a weighted sum of all previous daily values, where the weights decrease smoothly over time. • The most recent day’s activity is weighted more heavily than activity from 2 weeks ago. • The weight of a value observed k days ago is $w_k = \lambda (1-\lambda)^k$ • Old data is “aged out” as new data is “blended in” • $\lambda$ is the aging factor
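
The claim that the recursive update equals a weighted sum with geometrically decaying weights can be checked numerically. The sketch below is an illustration under assumptions: λ = 0.15, the daily values, and the empty starting profile (value 0) are all made up for the example.

```python
def ewma_update(old, today, lam):
    """One daily step of the exponentially weighted average."""
    return lam * today + (1.0 - lam) * old


def weight(k, lam):
    """Weight of the value observed k days ago (k = 0 is today)."""
    return lam * (1.0 - lam) ** k


lam = 0.15                            # assumed aging factor
daily = [12.0, 0.0, 7.5, 30.0, 3.0]   # made-up daily values, oldest first

# Recursive form: one update per day, starting from an empty profile.
recursive = 0.0
for value in daily:
    recursive = ewma_update(recursive, value, lam)

# Explicit form: weighted sum over all days, newest day has k = 0.
n = len(daily)
explicit = sum(weight(n - 1 - i, lam) * v for i, v in enumerate(daily))

# How much less a value from 14 days ago counts than today's value.
two_week_ratio = weight(14, lam) / weight(0, lam)
```

Here `recursive` and `explicit` agree up to floating-point rounding, and with the assumed λ a value from two weeks ago carries roughly a tenth of the weight of today's value.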

  8. “Bizocity” • Concerns over whether a TN is residential or business. • Different operations for residences and businesses for customer care, billing, collections, fraud detection, etc.

  9. “Bizocity” continued • AT&T has confirmed residential/business status for 30% of the 350M TNs. • The data is incomplete because of a lack of communication with local companies, additional lines, and out-of-date information. • A behavioral estimate is generated by observing the behavior of all 350M TNs, generating a bizocity score, and combining it with previous days’ totals.

  10. Generating “Bizocity” • When a call completes, data such as originating TN, dialed TN, connect time, and call duration is recorded (note that callers are not identified, just phone numbers). • TNs with known biz/res status are flagged, and training sets are generated from them. • Noise and outliers are usually swamped by the sheer volume of data.

  11. Generating “Bizocity”: examples • Example: Long calls originating at night are usually residential, not business. • Example: Residential calls peak in the evening; business calls peak between 9am and 5pm. • Example: Business calls are generally shorter, and tend to go to other businesses or to 800 services.

  12. Processed every 24 hours • Provides better aggregate data for each TN • Reduces I/O by 75% • Have to store all call details and sort them. • Each call is reduced to a 32-byte binary record, resulting in 8GB daily. • Sorting takes 30 min. (3GB RAM, 1 processor)
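
As an illustration of a fixed-width 32-byte call record and the daily sort, here is a sketch using Python's `struct` module. The field layout is hypothetical: the slides give only the record size, so the choice of fields, their widths, the flag byte, the padding, and the sample TNs are all assumptions.

```python
import struct

# Hypothetical layout (not from the slides): originating TN, dialed TN,
# connect time (epoch seconds), duration (seconds), one flag byte, 7 pad bytes.
RECORD = struct.Struct(">QQIIB7x")
assert RECORD.size == 32


def pack_call(orig_tn, dialed_tn, connect_ts, duration_s, flags=0):
    """Encode one call as a 32-byte binary record."""
    return RECORD.pack(orig_tn, dialed_tn, connect_ts, duration_s, flags)


def unpack_call(raw):
    """Decode a record back into (orig, dialed, connect_ts, duration_s, flags)."""
    return RECORD.unpack(raw)


calls = [
    pack_call(9735550123, 8005550100, 941100000, 95),
    pack_call(2125550188, 9735550123, 941090000, 1260),
    pack_call(9735550123, 2125550188, 941080000, 40),
]

# Big-endian unsigned fields make byte order match numeric order, so sorting
# the raw records groups them by originating TN (ties broken by dialed TN).
calls.sort()
by_origin = [unpack_call(c)[0] for c in calls]

# Daily volume implied by the figures on the slides.
CALLS_PER_WEEKDAY = 275_000_000
daily_bytes = CALLS_PER_WEEKDAY * RECORD.size  # 8.8e9 bytes
```

At 275 million calls per weekday, 32-byte records come to 8.8 billion bytes, in line with the roughly 8GB quoted on the slide. Sorting by TN is what lets all of a TN's calls for the day be aggregated in one pass.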

  13. Processing, continued • A 4-D data cube is generated • Dimensions are day-of-week, time-of-day, duration, and biz/res/800 status (7x6x5x3) • Logistic regression models were developed beforehand for scoring TNs based on each profile (to estimate “Bizocity”) • $\mathrm{Biz}(\mathrm{TN})_{\text{new}} = \lambda\, \mathrm{Biz}(\mathrm{TN})_{\text{today}} + (1-\lambda)\, \mathrm{Biz}(\mathrm{TN})_{\text{old}}$, with $0 < \lambda < 1$
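
A 7x6x5x3 cube has 630 cells, and one way to picture it is as a flat array of call counts. The sketch below is an assumption-laden illustration: the slides give only the dimension sizes, so the 4-hour time-of-day blocks, the duration bucket edges, and the reading of "status" as the category of the dialed number are all hypothetical.

```python
from bisect import bisect_right

N_DOW, N_TOD, N_DUR, N_STATUS = 7, 6, 5, 3
STATUS = {"biz": 0, "res": 1, "800": 2}
DURATION_EDGES_S = [60, 300, 900, 3600]  # assumed edges, giving 5 buckets


def new_cube():
    """Flat list of 7*6*5*3 = 630 call counts, all zero."""
    return [0] * (N_DOW * N_TOD * N_DUR * N_STATUS)


def cell_index(dow, hour, duration_s, status):
    """Map one call to its cell in the flattened 4-D cube."""
    tod = hour // 4                                   # assumed 4-hour blocks
    dur = bisect_right(DURATION_EDGES_S, duration_s)  # bucket 0..4
    return ((dow * N_TOD + tod) * N_DUR + dur) * N_STATUS + STATUS[status]


def add_call(cube, dow, hour, duration_s, status):
    """Count one call in the cube."""
    cube[cell_index(dow, hour, duration_s, status)] += 1


# Made-up calls for one TN: two short weekday calls to 800 numbers,
# one long weekend evening call to a residence.
cube = new_cube()
add_call(cube, dow=1, hour=10, duration_s=45, status="800")
add_call(cube, dow=1, hour=10, duration_s=50, status="800")
add_call(cube, dow=5, hour=21, duration_s=2400, status="res")
```

The cell counts for a TN are the kind of features a scoring model could consume; how the authors actually feed the cube into their logistic regression is not stated on the slides.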

  14. Processing -- continued • Training set is used to classify TNs with unknown status based on probabilities • Inactive TNs are not updated • “Bizocity” scores for unknown TNs are generated using probabilities
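
The daily bizocity update described on these slides (score today's behavior, blend it with the stored score, leave inactive TNs alone) might look like the sketch below. The features, weights, bias, and λ are invented for illustration; the slides do not give the fitted model.

```python
import math

LAMBDA = 0.15  # assumed aging factor


def logistic_score(features, weights, bias):
    """Probability-like score in (0, 1) from a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def update_bizocity(old_biz, todays_features, weights, bias, lam=LAMBDA):
    """Blend today's score into the stored one; inactive TNs stay unchanged."""
    if todays_features is None:  # TN not observed today
        return old_biz
    today = logistic_score(todays_features, weights, bias)
    return lam * today + (1.0 - lam) * old_biz


# Hypothetical features: share of calls in business hours, share of calls
# to 800 numbers, share of long night-time calls. Weights are made up.
WEIGHTS = [2.0, 1.5, -3.0]
BIAS = -1.0

active = update_bizocity(0.5, [0.9, 0.3, 0.0], WEIGHTS, BIAS)
inactive = update_bizocity(0.5, None, WEIGHTS, BIAS)
```

A TN whose calls fall mostly in business hours drifts toward a higher score over successive days, while an unobserved TN keeps its previous score.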

  15. Accuracy • Accuracy of status prediction is 75% • Failures are due to incorrectly provided status or shifting status (e.g., home businesses, cell phones, etc.)

  16. Data Structures • Exploit the “exchange” concept (the first 6 digits of a TN form an exchange) • Only about 150,000 of the 1M possible exchanges are in use • All 10,000 TNs for each exchange are stored sequentially, whether used or not • Each data structure is 2GB for each variable (lower bound is 1.5GB)
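
The exchange-based layout can be sketched as one contiguous byte array with a 10,000-entry block per in-use exchange. This is an illustration, not the authors' structure: the class name and methods are hypothetical, and the one-byte-per-TN size is inferred from the 1.5GB lower bound (150,000 exchanges x 10,000 TNs x 1 byte).

```python
TNS_PER_EXCHANGE = 10_000


class ExchangeStore:
    """One byte per TN; each in-use exchange owns a contiguous block."""

    def __init__(self, exchanges_in_use):
        # Map each 6-digit exchange to the position of its block.
        self._slot = {ex: i for i, ex in enumerate(sorted(exchanges_in_use))}
        self._data = bytearray(len(self._slot) * TNS_PER_EXCHANGE)

    def _offset(self, tn):
        # First 6 digits select the block, last 4 digits index into it.
        exchange, line = divmod(tn, TNS_PER_EXCHANGE)
        return self._slot[exchange] * TNS_PER_EXCHANGE + line

    def get(self, tn):
        return self._data[self._offset(tn)]

    def set(self, tn, value):
        self._data[self._offset(tn)] = value  # value must fit in 0..255


# Two made-up exchanges; every TN in them gets a slot, used or not.
store = ExchangeStore([973555, 212555])
store.set(9735550123, 200)

# Size implied by the slide's figures at one byte per TN.
lower_bound_bytes = 150_000 * TNS_PER_EXCHANGE  # 1.5e9 bytes
```

Because the position of a TN is computed from its digits, no per-TN key needs to be stored, at the cost of reserving space for unused numbers in each exchange.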

  17. Interface • Variety of visualization tools (start at top, drill-down) • Web interface with password protection • Images are computed on the fly • C-code directly computes images in gif format

  18. Toll Fraud Detection • Same methodology, but event-driven • Only have to track about 15M TNs. • Profiles are about 512 bytes each (7.5GB)
