Named Entity Recognition for Sensitive Data Discovery in Portuguese
- Source :
- Repositório Científico de Acesso Aberto de Portugal (RCAAP), instacron:RCAAP; Applied Sciences, Volume 10, Issue 7, p 2303 (2020)
- Publication Year :
- 2020
- Publisher :
- MDPI AG, 2020.
Abstract
- Protecting sensitive data is becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. The effort to create automatic systems is continuous, but, in most cases, the processes behind them are still manual or semi-automatic. In this work, we developed a component that can extract and classify sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and meet legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques: rule-based and lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a set of specific classes. For the remaining classes of entities, two statistical models were tested, Conditional Random Fields and Random Forest, and, finally, a Bidirectional-LSTM (Bi-LSTM) approach was tested. Of the statistical models, Conditional Random Fields obtained the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.
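The hybrid idea in the abstract (rules for a few pattern-like classes, a learned model for the rest) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the abstract does not say which classes are rule-based, so the `EMAIL` and `POSTAL_CODE` patterns, the helper names `rule_based_entities` and `merge`, and the stand-in model output are hypothetical and are not the authors' implementation.

```python
import re

# Hypothetical rule set: the abstract does not list the rule-based classes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "POSTAL_CODE": re.compile(r"\b\d{4}-\d{3}\b"),  # NNNN-NNN postal code shape
}


def rule_based_entities(text):
    """Return (start, end, label, surface) tuples matched by the regex rules."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


def merge(rule_spans, model_spans):
    """Keep every rule span; add model spans that overlap none of them."""
    merged = list(rule_spans)
    for span in model_spans:
        if all(span[1] <= r[0] or span[0] >= r[1] for r in rule_spans):
            merged.append(span)
    return sorted(merged)


text = "Contacte ana.silva@example.pt, Rua X, 4200-465 Porto."
rules = rule_based_entities(text)

# Stand-in for the output of a statistical or neural tagger (assumed, not real).
start = text.index("Porto")
model = [(start, start + 5, "LOCATION", "Porto"), (9, 18, "PERSON", "ana.silva")]

combined = merge(rules, model)
for span in combined:
    print(span)
```

In this sketch the rule spans take priority, so the model's `PERSON` guess inside the e-mail address is discarded while its `LOCATION` span is kept. That priority order is a design choice of the example, not something the abstract specifies.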
- Subjects :
- Conditional random field
Computer science
02 engineering and technology
computer.software_genre
lcsh:Technology
lcsh:Chemistry
0302 clinical medicine
0202 electrical engineering, electronic engineering, information engineering
General Materials Science
lcsh:QH301-705.5
Instrumentation
media_common
Fluid Flow and Transfer Processes
sensitive data
General data protection regulation
General Engineering
lcsh:QC1-999
Computer Science Applications
Random forest
language
Sensitive data
020201 artificial intelligence & image processing
Natural language processing
Process (engineering)
named entity recognition
general data protection regulation
03 medical and health sciences
Named-entity recognition
European Portuguese
media_common.cataloged_instance
natural language processing
European union
lcsh:T
business.industry
Process Chemistry and Technology
Data discovery
Statistical model
language.human_language
Named entity recognition
lcsh:Biology (General)
lcsh:QD1-999
lcsh:TA1-2040
Portuguese language
030221 ophthalmology & optometry
Artificial intelligence
lcsh:Engineering (General). Civil engineering (General)
business
computer
lcsh:Physics
Details
- ISSN :
- 20763417
- Volume :
- 10
- Database :
- OpenAIRE
- Journal :
- Applied Sciences
- Accession number :
- edsair.doi.dedup.....26593eb5d93f8df330f68c59bce25625
- Full Text :
- https://doi.org/10.3390/app10072303