Language Identification and Modelling in Specialized Hardware

Ivan Zapreev
Mind Map by Ivan Zapreev, updated more than 1 year ago
Created by Ivan Zapreev about 6 years ago


Paper by Kenneth Heafield, Rohan Kshirsagar, Santiago Barona

Resource summary

Language Identification and Modelling in Specialized Hardware
  1. Introduction
    1. Natural language processing
      1. Tasks
        1. Language identification
          1. Is essential to using the Web as a corpus
          2. Language Modelling
          3. Challenge
            1. Large data sizes
              1. Solution
                1. Use specialized hardware
                  1. Graphics processing units - GPUs
                    1. Application
                      1. Neural Networks
                        1. Parsing
                      2. Field-programmable gate arrays - FPGAs
                        1. Features
                          1. Fast
                            1. Customizable
                            2. Application
                              1. Encoding Grammars
                  2. Research
                    1. Idea
                      1. Repurpose network security hardware
                        1. Application-specific integrated circuit (ASIC) for network monitoring
                          1. Deterministic pushdown transducer
                            1. A finite state transducer (FST) is a finite state automaton (FSA) that produces output as well as reading input: a finite state machine with output.
                              1. Recursive finite-domain programs can be characterized by finite-state transducers that are augmented with a pushdown store. Such transducers are called pushdown transducers.
                            2. Has a stack
                              1. Programmable
                                1. Executes regular expressions
                                  1. POSIX
                                    1. When matched
                                      1. Outputs constant to CPU
                                        1. Use stack
                                          1. Push
                                            1. Pop
                                            2. Output matched span
                                              1. Halt
                                          2. No user accessible arithmetic
                                          3. Applications
                                            1. Reason
                                              1. Do not easily map to regular expressions
                                              2. Tasks
                                                1. Language Identification
                                                  1. Use the model of Lui and Baldwin (2012) for 97 languages
                                                    1. Naive Bayes model
                                                      1. Feature strings are converted to literal regular expressions
                                                        1. Collect feature counts
                                                        2. Emulate automata on CPU
                                                      2. Language modelling
                                                        1. Using back-off models
                                                          1. Using telescoping series
                                                            1. A telescoping series is any series where nearly every term cancels with a preceding or following term.
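
As an illustration of the cancellation described above, here is a standard textbook telescoping sum. It is not necessarily the example the original note displayed; the symbols n and N are chosen here for illustration only.

```latex
\[
\sum_{n=1}^{N}\left(\frac{1}{n}-\frac{1}{n+1}\right)
  = \left(1-\frac{1}{2}\right)+\left(\frac{1}{2}-\frac{1}{3}\right)+\dots+\left(\frac{1}{N}-\frac{1}{N+1}\right)
  = 1-\frac{1}{N+1}
\]
```

Every intermediate term appears once with a plus sign and once with a minus sign, so only the first and last terms survive. The same kind of cancellation across neighbouring tokens is what allows a per-token value to absorb the back-off terms while leaving the sentence-level total unchanged.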


                                                              1. Collapse probability and back-off into a single function
                                                                1. Preserves sentence-level probabilities
                                                                  1. Sends just one value Q per token instead of a probability and separate back-offs
                                                                    1. Saves CPU work and communication
                                                                2. Simplified query
                                                                  1. For each word, match as much context as possible
                                                                    1. Sends just one value Q per token instead of a probability and separate back-offs
                                                                3. Use greedy matching
                                                                  1. Match as much leading context as possible
                                                                    1. Scanning until a match is found
                                                                      1. Report the longest match
                                                                        1. Resume scanning
                                                                          1. The longest matching N-gram will be reported
                                                                      2. Use "greedy" regular expressions
                                                            2. Experiments
                                                              1. Performance Evaluation
                                                                1. One core
                                                                  1. Language identification
                                                                    1. 2.4 times faster
                                                                      1. Than the fastest CPU program
                                                                        1. More details in the paper
                                                                          1. tested against some existing models
                                                                            1. CLD2
                                                                              1. C++
                                                                              2. Original
                                                                                1. Python
                                                                                  1. Java
                                                                          2. Language modelling
                                                                            1. 1.8 to 6 times faster
                                                                              1. Than CPU programs
                                                                                1. KenLM
                                                                                  1. DALM
                                                                                2. Part of speech
                                                                                  1. In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
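
The simplified language-model query from the "Language modelling" branch (match as much leading context as possible, report the longest match, send one value Q per token) can be emulated on a CPU roughly as follows. This is a minimal sketch under stated assumptions, not the hardware's regular-expression interface: `Q_TABLE`, `UNKNOWN_Q`, `MAX_ORDER`, `longest_match` and `sentence_score` are hypothetical names, and the scores are made-up numbers.

```python
# Hypothetical toy table: maps an n-gram (tuple of tokens) to a single
# collapsed log-score Q. All values are invented for illustration.
Q_TABLE = {
    ("the",): -1.0,
    ("cat",): -3.0,
    ("sat",): -3.5,
    ("the", "cat"): -1.2,
    ("cat", "sat"): -0.9,
    ("the", "cat", "sat"): -0.4,
}
UNKNOWN_Q = -10.0  # assumed score for a token with no entry at all
MAX_ORDER = 3      # assumed maximum n-gram length


def longest_match(tokens, i, table=Q_TABLE, max_order=MAX_ORDER):
    """Return (n-gram, Q) for the longest stored n-gram ending at position i."""
    start = max(0, i - max_order + 1)
    for s in range(start, i + 1):  # try the longest context first
        ngram = tuple(tokens[s:i + 1])
        if ngram in table:
            return ngram, table[ngram]
    return (tokens[i],), UNKNOWN_Q


def sentence_score(tokens):
    """Sum one Q value per token; no separate back-off lookups are made."""
    return sum(longest_match(tokens, i)[1] for i in range(len(tokens)))


if __name__ == "__main__":
    print(longest_match(["the", "cat", "sat"], 2))
    print(sentence_score(["dog", "sat"]))
```

The point of the design is that the caller receives exactly one number per token, so the per-sentence score is a plain sum. Whether that sum equals the back-off model's sentence probability depends on how the Q values were precomputed, which this sketch does not cover.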


                                                                            2. Tarari T2540 PCI express
                                                                              1. Controlled by a single-threaded CPU program
                                                                                1. Performs the arithmetic
                                                                                  1. Scales up to 4 devices
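
The language identification branch above (feature strings converted to literal regular expressions, feature counts collected, Naive Bayes scoring) can be sketched on a CPU as below. This is a toy illustration, assuming two languages and four hand-picked features with invented probabilities; the model of Lui and Baldwin (2012) referenced in the map covers 97 languages with learned features. The names `FEATURES`, `LOG_PRIOR`, `LOG_LIKELIHOOD`, `feature_counts` and `identify` are hypothetical.

```python
import math
import re

# Hypothetical feature strings and made-up model parameters.
FEATURES = ["the", "und", "ing", "sch"]
LOG_PRIOR = {"en": math.log(0.5), "de": math.log(0.5)}
LOG_LIKELIHOOD = {
    "en": {"the": math.log(0.5), "und": math.log(0.1),
           "ing": math.log(0.3), "sch": math.log(0.1)},
    "de": {"the": math.log(0.1), "und": math.log(0.4),
           "ing": math.log(0.1), "sch": math.log(0.4)},
}

# Each feature string becomes a literal regular expression. The lookahead
# makes findall report every start position, including overlapping ones.
PATTERNS = {f: re.compile("(?=" + re.escape(f) + ")") for f in FEATURES}


def feature_counts(text):
    """Count how often each literal feature occurs in the text."""
    return {f: len(p.findall(text)) for f, p in PATTERNS.items()}


def identify(text):
    """Return the language with the highest Naive Bayes log-score."""
    counts = feature_counts(text)
    scores = {
        lang: LOG_PRIOR[lang]
        + sum(n * LOG_LIKELIHOOD[lang][f] for f, n in counts.items())
        for lang in LOG_PRIOR
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(feature_counts("the thing is singing"))
    print(identify("und schon schnell"))
```

In this sketch the pattern matching only produces counts, and the multiplication and summation happen afterwards in ordinary code. That mirrors the division of labour in the map, where the device has no user-accessible arithmetic and the controlling CPU program performs it.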