
Name matching for PATSTAT data

Name matching for PATSTAT data. Gianluca Tarasconi KITeS Database Administrator. Website: rawpatentdata.blogspot.com. KITeS Knowledge, Internationalization and Technology Studies.


Presentation Transcript


  1. Name matching for PATSTAT data. Gianluca Tarasconi, KITeS Database Administrator. Website: rawpatentdata.blogspot.com

  2. KITeS: Knowledge, Internationalization and Technology Studies. KITeS's mission is to understand the relationship between innovation, technology management, firms' competitiveness and economic growth in the global economy. KITeS's research aims to be rigorous, relevant and interdisciplinary. It focuses on three main areas: innovation, technology management and trade.

  3. KITeS: the centre
     • KITeS was founded in 2008, building upon the experience of research centres such as CESPRI and CRITOM. It is hosted at Bocconi University.
     • KITeS is an inter-departmental research centre, integrating researchers from the Economics, Management and Institutional Analysis departments. KITeS researchers hold doctoral degrees from Yale, Stanford, London School of Economics, Bocconi, Manchester, Leuven, Sussex, Maastricht, and others.
     • Patent statistics have been widely used at KITeS for many years, dating back to CESPRI's early research in industrial dynamics.
     • This tradition has led to the cumulative creation and updating of a large database, known as EP-CESPRI. The inventor data used so far are organized in a subsection of this database, known as EP-INV.
     • Who's who: www.kites.unibocconi.it

  4. The EP-CESPRI Database (i). The EP-CESPRI database contains information on patents applied for at the European Patent Office (EPO) from 1978 to October 2009. It was first built from information downloaded regularly from EPO Bulletins. Since October 2007 it has been based on the applications that the EPO publishes regularly in PATSTAT; at present it contains about 2,090,000 patent applications. A beta version for the USPTO was released in 2009, and a version for SIPO (the Chinese patent office) is planned for 2010.

  5. The EP-CESPRI Database (ii). EP-CESPRI data fall into three broad categories:
     1. Patent data, such as the patent's publication number, its priority/application date, and its main/secondary technological class (12-digit IPC).
     2. Applicant data, such as a unique code assigned by KITeS to each applicant after cleaning the applicant's data, plus the applicant's name and address.
     3. Inventor data, such as name, surname, address and a unique code (CODINV) assigned by KITeS to all inventors found to be the same person. This section of EP-CESPRI is also known as EP-INV, and it is the one of major interest for today's seminar.

  6. EP-INV: From raw data to structured data
     • Data coming from PATSTAT are cleaned, standardized and restructured → CODINV2 code
     • Eventually, a similarity score is calculated for pairs of inventors who have the same name and surname but different addresses → CODINV code

  7. Standardization of inventors' names and addresses. Original EPO data on inventors come from PATSTAT table TLS206_ASCII, where data are only partially parsed into names, addresses, cities and zip codes. Further steps are as follows:
     • Cleaning of address data
     • Cleaning of names → CODINV2 codes
     • Computation of similarity scores → CODINV codes

  8. Cleaning of address data. Parsed data are given a unique code (CODINV2) and (iteratively) cleaned by:
     • moving information that sits in the wrong field (such as zip code, county…) to the right one;
     • standardizing city names or parts of names (e.g. "Saint" is turned into "St.");
     • fixing mistakes in zip codes, according to national post office tables.
     In the 10/2007 data there were 2,381,991 CODINV2 codes in the EP-INV DB, against 3,278,486 PATSTAT person_id values (roughly 27% fewer).
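
A minimal Python sketch of the city-name standardization step, for illustration only. The slides state that the real cleaning was done in MySQL with lookup tables that are not shown; the rule list `CITY_RULES` and the function `standardize_city` are hypothetical names, and only the "Saint" → "St." rule comes from the slide.

```python
import re

# Hypothetical rule table: only the "Saint" -> "St." rule is taken from the slide.
CITY_RULES = [
    (r"\bSAINT\b", "ST."),
]

def standardize_city(raw: str) -> str:
    """Upper-case the city, collapse whitespace, then apply each rule."""
    city = " ".join(raw.upper().split())
    for pattern, replacement in CITY_RULES:
        city = re.sub(pattern, replacement, city)
    return city

print(standardize_city("  Saint   Louis "))  # ST. LOUIS
```

The word boundaries keep the rule from touching names that merely start with the same letters, such as "Saintes".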

  9. Example of city cleaning

  10. Cleaning of names. The "name+surname" field was parsed into the following fields: first, second and third name, extension (e.g. Jr, Sr, III), surname, and academic title (e.g. Dr., Prof., Ing.…). This operation was mainly based on two iterative steps:
     • Pairs of inventors with the same address and equal first name, surname, extension and initial of the second or third name are harmonized on the abbreviated form of that name (e.g. "Rossi Giovanni Paolo" is turned into "Rossi Giovanni P.");
     • Pairs of inventors' records where 2 out of the 3 fields city, address and name are the same, and the remaining one has a low edit distance (Levenshtein/alfanum), are updated to match the record of the inventor with the higher number of patents.
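
The second step can be sketched in Python as follows. This is an illustration under assumptions, not the MySQL code used for EP-INV: the record layout, the function names and the cut-off `max_dist=2` are hypothetical, and the "alfanum" variant mentioned on the slide is not covered.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def near_duplicate(rec_a: dict, rec_b: dict, max_dist: int = 2) -> bool:
    """True when exactly one of city/address/name differs, and only slightly.

    max_dist is an assumed cut-off; the slides only say "low edit distance".
    """
    fields = ("city", "address", "name")
    differing = [f for f in fields if rec_a[f] != rec_b[f]]
    if len(differing) != 1:
        return False
    field = differing[0]
    return levenshtein(rec_a[field], rec_b[field]) <= max_dist

a = {"city": "MILANO", "address": "VIA ROMA 1", "name": "ROSSI GIOVANNI"}
b = {"city": "MILANO", "address": "VIA ROMA 1", "name": "ROSSI GIOVANI"}
print(near_duplicate(a, b))  # True
```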

  11. An example

  12. Further info on cleaning names and addresses
     • Cleaning of names and addresses was implemented in MySQL;
     • The SQL code is based on 25 lookup tables and 950 recursive queries;
     • The aggregation algorithm was quite conservative (to allow "new entries" to be quickly linked).

  13. Computation of similarity score
     • Inventor data are restructured as person (CODINV) vs. person@location (CODINV2);
     • All inventor records that share name and surname but differ in anything else are compared in pairs, through the MassacratorSQL routine.
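
As a rough illustration of the pairing step, the sketch below groups person@location records by name and surname and emits every pair within a group. It is not the MassacratorSQL routine itself, which the slides do not show; the field names and `candidate_pairs` are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records):
    """Return all pairs of CODINV2 codes that share the same name and surname."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["surname"], rec["name"])].append(rec["codinv2"])
    pairs = []
    for codes in groups.values():
        # A group of n homonyms yields n * (n - 1) / 2 pairs to score.
        pairs.extend(combinations(sorted(codes), 2))
    return sorted(pairs)

records = [
    {"codinv2": 1, "surname": "ROSSI", "name": "GIOVANNI"},
    {"codinv2": 2, "surname": "ROSSI", "name": "GIOVANNI"},
    {"codinv2": 3, "surname": "ROSSI", "name": "GIOVANNI"},
    {"codinv2": 4, "surname": "BIANCHI", "name": "PAOLO"},
]
print(candidate_pairs(records))  # [(1, 2), (1, 3), (2, 3)]
```

Because the number of pairs grows quadratically with the number of homonyms, there can be more pairs than inventors.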

  14. Introduction of CODINV

  15. Computation of similarity score. The similarity score combines six categories of evidence:
     • Social networks: co-inventors in common, 3 degrees of distance in co-inventorship
     • Toponymic permanence: same address, town, county…
     • Workplace: same applicant / company / group
     • IPC: patenting in the same tech fields
     • Citation linkages: (self-)citing or cited
     • Time lag: how long since the last patent?
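
One plausible way to turn these categories into a single number is an additive score, sketched below. The transcript does not preserve the point values per category (the "Scores by category" slide was an image), so every number in `PLACEHOLDER_POINTS` is a placeholder and the category keys are hypothetical.

```python
# Placeholder point values: NOT the weights used in EP-INV, which the
# transcript does not preserve.
PLACEHOLDER_POINTS = {
    "social_network": 5,
    "toponymic": 5,
    "workplace": 5,
    "ipc": 5,
    "citation": 5,
    "time_lag": 5,
}

def similarity_score(evidence: dict, points: dict = PLACEHOLDER_POINTS) -> int:
    """Sum the points of every category for which the pair shows evidence."""
    return sum(points[category] for category, hit in evidence.items() if hit)

pair_evidence = {"workplace": True, "ipc": True, "citation": False}
print(similarity_score(pair_evidence))  # 10
```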

  16. Scores by category

  17. Update of CODINV using similarity score. Intuitively, a high similarity score can be taken as an indication of a high probability that the two inventors in the pair are the same person. Whenever two inventors in a pair are found to be the same person, the lowest CODINV code is assigned to both. The algorithm should be run recursively.

  18. Finding a threshold value (I). Manual checking of EP-INV records suggests that a large number of paired inventors with a total score higher than 20 are indeed the same person. Percentages vary across countries, largely because of the different distribution of frequent surnames. Therefore, no automatic re-assignment of CODINV codes has been performed so far. In the KEINS research, data have been extensively checked for IT, FR and SE; the threshold value of the similarity score was set at 15 (median value): inventors in pairs with score >= 15 are then presumed to be the same person, and are assigned the same CODINV code.
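
The merge rule (lowest CODINV wins, applied repeatedly) together with the threshold of 15 can be sketched with a union-find structure, which reaches in a single pass the same fixed point that re-running the assignment would. This is a swapped-in technique for illustration, not the procedure described in the slides; `merge_codinv` and its input layout are assumptions.

```python
def merge_codinv(codes, scored_pairs, threshold=15):
    """Map each CODINV to the lowest CODINV of its merged group.

    scored_pairs holds (codinv_a, codinv_b, score) tuples; pairs scoring
    below the threshold are left unmerged.
    """
    parent = {code: code for code in codes}

    def find(code):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[code] != code:
            parent[code] = parent[parent[code]]
            code = parent[code]
        return code

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Always attach the higher root to the lower one,
                # so the root is the lowest code in the group.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    return {code: find(code) for code in codes}

print(merge_codinv([1, 2, 3, 4], [(2, 3, 18), (1, 2, 15), (3, 4, 9)]))
# {1: 1, 2: 1, 3: 1, 4: 4}
```

Inventor 3 ends up with code 1 even though it was never compared directly with inventor 1, which is the transitive effect a recursive run is meant to capture.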

  19. Finding a threshold value (II). Manual checking suggests that:
     • no Type 2 errors (false positives) are introduced with this choice, i.e. no pair of inventors is erroneously assigned the same CODINV code;
     • several Type 1 errors remain, i.e. pairs of inventors who are indeed the same person have scores < 15 and are not given the same CODINV code.

  20. Applying Massacrator to all EPO (I)
     • As of 10/2007 we get 2,672,671 pairs from 2,363,501 inventors
     • The modal score is 0 points (764,946 pairs), but 758,471 pairs score >= 15 points

  21. Applying Massacrator to all EPO (II)
     • 16.78% of pairs score >= 20 points
     • 22.72% of pairs score >= 15 points

  22. Applying Massacrator to all EPO (III). A rough version of the algorithm, useful as a proxy for the possible reduction, treats a pair as the same person when any of the following holds: same IPC (12 digits) OR same applicant OR same address OR 3 degrees of distance OR 1 co-inventor in common OR citation linkage OR (same IPC (6 digits) AND same country). This compresses 571,970 CODINVs out of 2,363,501 (-24%).
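
The rough rule is a plain boolean disjunction, and the reported reduction can be checked with one division. In the sketch below the flag names are hypothetical and the grouping of the last clause (6-digit IPC together with country) is an assumed reading of the slide; the two counts are the ones given there.

```python
def rough_same_person(pair: dict) -> bool:
    """Rough rule: any single strong link is enough; a 6-digit IPC match
    counts only together with the same country (assumed grouping)."""
    flag = lambda name: bool(pair.get(name, False))
    return (flag("same_ipc12") or flag("same_applicant") or flag("same_address")
            or flag("within_3_degrees") or flag("coinventor_in_common")
            or flag("citation_link")
            or (flag("same_ipc6") and flag("same_country")))

print(rough_same_person({"same_ipc6": True}))                        # False
print(rough_same_person({"same_ipc6": True, "same_country": True}))  # True

# Share of CODINVs compressed by the rough rule, from the slide's counts.
compressed, total = 571_970, 2_363_501
print(f"{compressed / total:.1%}")  # 24.2%
```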

  23. Some publications using the EP-INV data
     • Lissoni F., Llerena P., McKelvey M., Sanditov B. "Academic Patenting in Europe: New Evidence from the KEINS Database". Research Evaluation, 17(2): 87-102.
     • Bacchiocchi E., Montobbio F. (2009). Knowledge Diffusion from University and Public Research. A Comparison between US, Japan and Europe using Patent Citations. Journal of Technology Transfer, 34(2): 169-181.
     • Breschi S., Lissoni F., Montobbio F. (2008). University patenting and scientific productivity. A quantitative study of Italian academic inventors. European Management Review. The Journal of the European Academy of Management, 5(2): 91-109.
     • Corrocher N., Malerba F., Montobbio F. (2007). Schumpeterian Patterns of Innovative Activity in the ICT Field. Research Policy, 36: 418-432.
     • Breschi S., Lissoni F., Montobbio F. (2007). The Scientific Productivity of Academic Inventors: New Evidence from Italian Data. Economics of Innovation and New Technology, 16(2): 101-118.
     • Della Malva A., Breschi S., Lissoni F., Montobbio F. (2007). L'attivita' brevettuale dei docenti universitari: L'Italia in un confronto internazionale. Economia e Politica Industriale, 2: 43-70.
     • Montobbio F. (2008). Patenting Activity in Latin American and Caribbean Countries. In World Intellectual Property Organization (WIPO) - Economic Commission for Latin America and the Caribbean (ECLAC), "Study on Intellectual Property Management in Open Economies: A Strategic Vision for Latin America". Forthcoming.

  24. Future uses of the algorithm (I)
     • Cross-patent-office match: is J. Smith at the EPO the same person as J. Smith at the USPTO?
     • Decompression: where toponymic data are scarce (USPTO data, for instance), mere data cleaning would group inventors who are not the same person; the algorithm could help to avoid Type 2 errors.

  25. Future uses of the algorithm (II)
     • Company match: identifying applicants with similar company names as the same applicant;
     • NPL match: helping to deduplicate authors / affiliations.
